Confluence might misreport space contents

This post highlights a challenge that some Confluence instances pose: they misreport the contents of spaces.

By Heinrich Ulbricht | Thursday, August 22, 2024

There is one specific issue that customers have now and then: not all pages of a space are being migrated. Some pages will be missing in SharePoint.

Why is that?

While there can be multiple reasons, one is Confluence itself misreporting the contents of spaces.

Let’s look at the root issue and how WikiTraccs tries to work around it.

The root issue: Confluence lies about its space contents

If the Confluence issue is present, it affects the most convenient type of WikiTraccs’ source content selectors, the Space Selector.

When WikiTraccs starts migrating a space it asks Confluence about the contents of this space, like “give me all page IDs in this space”.

Confluence will happily answer and the list of page IDs it returns might look like this: 00001, 00002, 00003, 00004, 00005, 00006. This would be 6 pages to migrate. That’s what we expect.

But sometimes the result looks different, although all 6 pages are definitely there. Confluence might report the list of page IDs like this: 00001, 00002, 00002, 00002, 00005, 00006.

Notice the difference? Page ID 00002 is listed three times, while 00003 and 00004 are missing.

This is a problem. Why is Confluence lying to us? I don’t know.

Note

Over the course of two years I got the impression that at least 10% of Confluence on-premises instances are affected. Not sure about cloud, yet.

WikiTraccs tries to work around this issue

The latest release of WikiTraccs contains a workaround for this Confluence issue.

WikiTraccs detects duplicate page IDs and will take that as a hint that page IDs will be missing as well. It will then use a different method to retrieve the page IDs for a space.
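To make the detection idea concrete, here is a minimal sketch in Python. It is not WikiTraccs' actual implementation; the function names are made up for illustration, and the page IDs are the example values from above.

```python
from collections import Counter


def find_duplicate_ids(page_ids):
    """Return the IDs that occur more than once, in order of first appearance."""
    counts = Counter(page_ids)
    return [page_id for page_id in counts if counts[page_id] > 1]


def surplus_entries(page_ids):
    """Count how many list entries are repeats of an ID already seen."""
    return len(page_ids) - len(set(page_ids))


# The misreported list from the example above (hypothetical IDs).
reported = ["00001", "00002", "00002", "00002", "00005", "00006"]

duplicates = find_duplicate_ids(reported)  # ['00002']
surplus = surplus_entries(reported)        # 2

if duplicates:
    # Duplicates are only a hint that other IDs may be missing;
    # this is the point where a slower, different retrieval method would kick in.
    print(f"Duplicate IDs {duplicates}; {surplus} list entries are repeats")
```

In this example the two surplus entries line up with the two missing pages (00003 and 00004), but the duplicate check alone cannot tell which pages are missing. It only signals that the list should not be trusted.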

One caveat of this workaround is that it’s significantly slower than just taking the list of page IDs that Confluence hands over. But at least it should retrieve a complete list.

How to verify that all pages have been migrated?

When Confluence is not lying to us, WikiTraccs’ progress log files are the way to go. The __30-aggregated-info file shows a summary of a space’s migration progress. This is the happy path.

If you have a hunch (or see it in the logs) that Confluence might be lying about the contents of a space, your only option is to compare against the Confluence database directly.

Here’s how to get a list of page IDs from the Confluence database, for a given space: Getting a list of pages per space.

Compare the list of page IDs you got from the database with the list of pages WikiTraccs got handed by Confluence. The pages WikiTraccs knows about for a space can be seen in the __25-update-state-of-migrated-pages progress log file.
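The comparison itself boils down to a set difference. The following Python sketch assumes you have already exported both lists as plain lists of ID strings; how to extract the IDs from the progress log file is not shown, and the function name and sample IDs are hypothetical.

```python
def compare_page_ids(database_ids, reported_ids):
    """Compare page IDs from the database with the IDs WikiTraccs knows about."""
    db = set(database_ids)
    reported = set(reported_ids)
    return {
        # Pages that exist in the database but never reached WikiTraccs.
        "missing_from_wikitraccs": sorted(db - reported),
        # Pages WikiTraccs was told about that the database query did not return.
        "unknown_to_database": sorted(reported - db),
    }


# Hypothetical sample data, matching the example from earlier in this post.
database_ids = ["00001", "00002", "00003", "00004", "00005", "00006"]
reported_ids = ["00001", "00002", "00002", "00002", "00005", "00006"]

result = compare_page_ids(database_ids, reported_ids)
print(result["missing_from_wikitraccs"])  # ['00003', '00004']
print(result["unknown_to_database"])      # []
```

An empty "missing_from_wikitraccs" list means every page the database knows about was also seen by WikiTraccs.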

Hints about duplicates in the WikiTraccs log file

The WikiTraccs common log files contain information about duplicate page detection and applied workarounds.

You want to find this message for each space that has been selected for migration:

No duplicate IDs found for selector (Type=ConfluenceSpaceKey; Query=GOOD)

The log message above says that all is fine for the space with key GOOD, as there were no duplicate page IDs. WikiTraccs also takes this as a hint that no page IDs will be missing (note: this is an assumption that is yet to be proven wrong).

In contrast, the following log message indicates that duplicates were found for the space with key DUPE:

Duplicate content IDs found for selector (Type=ConfluenceSpaceKey; Query=DUPE)

Searching for the text [DUPLICATES] in the WikiTraccs log files will surface further details about the affected spaces, like which pages are affected and which pages could only be retrieved via the built-in workaround.
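If many spaces were selected, checking the log files by hand gets tedious. Here is a small Python sketch that sorts space keys into "clean" and "affected" based on the two messages quoted above. It assumes each message sits on a single line and that nothing else about the log line format matters; the function name is made up, and reading the actual log files from disk is left out.

```python
import re

# Both patterns are built from the two log messages quoted above.
SELECTOR = r"\(Type=ConfluenceSpaceKey; Query=(?P<key>[^)]+)\)"
CLEAN = re.compile(r"No duplicate IDs found for selector " + SELECTOR)
DUPES = re.compile(r"Duplicate content IDs found for selector " + SELECTOR)


def classify_spaces(log_lines):
    """Group space keys by whether duplicates were reported for them."""
    clean, affected = set(), set()
    for line in log_lines:
        match = DUPES.search(line)
        if match:
            affected.add(match.group("key"))
            continue
        match = CLEAN.search(line)
        if match:
            clean.add(match.group("key"))
    # A space that was ever reported with duplicates counts as affected.
    return {"clean": sorted(clean - affected), "affected": sorted(affected)}


log_lines = [
    "No duplicate IDs found for selector (Type=ConfluenceSpaceKey; Query=GOOD)",
    "Duplicate content IDs found for selector (Type=ConfluenceSpaceKey; Query=DUPE)",
]
print(classify_spaces(log_lines))
# {'clean': ['GOOD'], 'affected': ['DUPE']}
```

Spaces that were selected for migration but show up in neither list deserve a closer look as well.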

Working around the issue with the Content ID Selector

The issue only affects Space Selectors, because for those WikiTraccs asks Confluence about the space’s contents. Consequently, Confluence might choose to lie to us.

To prevent this kind of issue, you might choose Content ID Selectors instead to tell WikiTraccs the page IDs it should migrate. With this type of selector you take the “page ID retrieval” part into your own hands. Have a look at the documentation for the details.

In general, Space Selectors are easier to handle than Content ID Selectors. So in an ideal world, there would be no need to choose one over the other to work around Confluence issues.

How to fix this Confluence issue?

I don’t know. Let me know if you find a solution.
