Migrating large Confluence spaces to SharePoint
How does WikiTraccs determine which pages to migrate for a space?
When WikiTraccs starts to migrate a space to SharePoint it retrieves the full list of pages for this space.
Say a space contains 20000 pages. WikiTraccs retrieves basic information about those 20000 pages and adds them to a migration queue. (Later, when migrating each page, WikiTraccs retrieves the contents of this page and creates the corresponding SharePoint page.)
Confluence doesn’t allow retrieving information about more than 200 pages at once. So to retrieve information about 20000 pages, WikiTraccs needs to request 100 batches of 200 pages each.
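To make the batching concrete, here is a minimal sketch of a paged retrieval loop. It is not WikiTraccs code: `iter_page_batches` and `make_fake_fetcher` are hypothetical names, and the fetcher is a stand-in that returns dummy page info instead of calling Confluence. Only the limit of 200 pages per request comes from the text above.

```python
from typing import Callable, Iterator, List

BATCH_LIMIT = 200  # maximum number of pages per request, as described above


def iter_page_batches(
    fetch_batch: Callable[[int, int], List[dict]],
    limit: int = BATCH_LIMIT,
) -> Iterator[List[dict]]:
    """Yield non-empty batches until a batch comes back smaller than `limit`."""
    start = 0
    while True:
        batch = fetch_batch(start, limit)
        if batch:
            yield batch
        if len(batch) < limit:
            # A short (or empty) batch means the end of the list was reached.
            break
        start += len(batch)


def make_fake_fetcher(total_pages: int) -> Callable[[int, int], List[dict]]:
    """Hypothetical stand-in for a Confluence request; returns dummy page info."""
    def fetch(start: int, limit: int) -> List[dict]:
        end = min(start + limit, total_pages)
        return [{"id": i, "title": f"Page {i}"} for i in range(start, end)]
    return fetch


batches = list(iter_page_batches(make_fake_fetcher(20000)))
print(len(batches))  # 100
```

Note that with exactly 20000 pages this loop issues one extra request that returns an empty batch, which is how it detects the end of the list.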
This paged retrieval is where things get interesting for larger spaces.
Why can large Confluence spaces pose a challenge?
The more pages a space contains, the longer it takes to retrieve a batch of pages, and later batches take longer than earlier ones.
Here are numbers from a Confluence 6 test migration of a space with about 23000 pages:
- retrieving batch 1 takes < 1 second
- retrieving batch 10 takes about 2 seconds
- retrieving batch 25 takes about 4 seconds
- retrieving batch 50 takes about 6 seconds
- retrieving batch 75 takes about 10 seconds
- retrieving batch 100 takes about 14 seconds
Assuming that retrieving each batch of 200 pages takes a mean of 8 seconds, retrieving 20000 space pages would take 100*8=800 seconds, which is about 13 minutes.
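The same back-of-the-envelope estimate can be written as a small helper, which is handy for plugging in the page counts of your own spaces. The function name and its defaults are assumptions for illustration; the mean of 8 seconds per batch is the rough figure from the test migration above and will differ between environments.

```python
import math


def estimate_listing_seconds(
    total_pages: int,
    batch_size: int = 200,
    mean_seconds_per_batch: float = 8.0,  # assumed mean, taken from the test migration above
) -> float:
    """Estimate how long it takes to retrieve the page list of one space."""
    batches = math.ceil(total_pages / batch_size)
    return batches * mean_seconds_per_batch


seconds = estimate_listing_seconds(20000)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # 800 s = 13.3 min
```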
Exporting this space (with empty dummy pages) via the Confluence space export function takes about 16 minutes.
(Screenshot: Successful page migrations)
Unfortunately, there is not much that can be done about the time it takes to retrieve the list of pages. With release v1.6.8, WikiTraccs started to migrate pages while still retrieving the list of pages for a space. This at least allows the waiting time to be used to migrate the first pages.
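The idea of migrating while the list is still being retrieved can be pictured as a producer and a consumer sharing a queue. The following is only a sketch of that general pattern, not how WikiTraccs implements it; all names (`list_pages`, `migrate_space`, `fake_fetch`) are hypothetical, and the fetcher is a dummy that returns page numbers.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the page list


def list_pages(fetch_batch, limit, page_queue):
    """Producer: enqueue pages batch by batch, then signal completion."""
    try:
        start = 0
        while True:
            batch = fetch_batch(start, limit)
            for page in batch:
                page_queue.put(page)
            if len(batch) < limit:
                break
            start += len(batch)
    finally:
        # Always signal the end, so the consumer cannot block forever.
        page_queue.put(_DONE)


def migrate_space(fetch_batch, migrate_page, limit=200):
    """Consumer: migrate pages as they arrive instead of waiting for the full list."""
    page_queue = queue.Queue()
    producer = threading.Thread(
        target=list_pages, args=(fetch_batch, limit, page_queue)
    )
    producer.start()
    migrated = 0
    while True:
        page = page_queue.get()
        if page is _DONE:
            break
        migrate_page(page)
        migrated += 1
    producer.join()
    return migrated


def fake_fetch(start, limit, total=450):
    """Hypothetical stand-in for a paged Confluence request."""
    return list(range(start, min(start + limit, total)))


done = []
count = migrate_space(fake_fetch, done.append)
print(count)  # 450
```

The first page can be migrated as soon as the first batch arrives, while later (and slower) batches are still being requested in the background.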
WikiTraccs logs the time it takes to retrieve a batch of pages once retrieval starts to take longer. Look out for messages like this one in the WikiTraccs.Console log or the common log file:
Handled a batch of 200 pages for GIANT... (so far handled 18321) Took 10.601631s to get space GIANT content (paged) from endpoint URL <...>
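If you want to scan a log file for these messages, a small script can pull out the space key and the duration. This is a sketch under assumptions: the regular expression is derived only from the sample message above, so the actual log format may differ between versions, and both the helper name `slow_batches` and the 5-second threshold are made up for illustration.

```python
import re

# Pattern derived from the sample log message above (an assumption, not a documented format).
SLOW_BATCH = re.compile(
    r"Took (?P<seconds>\d+(?:\.\d+)?)s to get space (?P<space>\S+) content"
)


def slow_batches(lines, threshold_seconds=5.0):
    """Yield (space key, seconds) for batch retrievals at or above the threshold."""
    for line in lines:
        match = SLOW_BATCH.search(line)
        if match and float(match["seconds"]) >= threshold_seconds:
            yield match["space"], float(match["seconds"])


sample = (
    "Handled a batch of 200 pages for GIANT... (so far handled 18321) "
    "Took 10.601631s to get space GIANT content (paged) from endpoint URL <...>"
)
print(list(slow_batches([sample])))  # [('GIANT', 10.601631)]
```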
Confluence 7.18 introduced an API that allows for faster page retrieval, but so far there has been no demand for it that I am aware of, as the migrated environments ran older versions. If you are migrating from Confluence 7.18 or newer, please comment here to make this demand visible: User faster page retrieval API (Confluence 7.18 and up).
Large Confluence spaces (more than 10000 pages) can add significantly to the migration time.
When WikiTraccs starts migrating a space it retrieves a list of all space pages, which can take several minutes for those large spaces. The list of pages is retrieved multiple times, for example after migrating the space to check if all pages were migrated.
Those times can add up.
So it’s good to know which spaces have large page counts beforehand, or to learn about them during a test migration.