Best Practices When Migrating To and From Boomi DataHub (Part One)

by Boomi
Published Jan 27, 2021

I’m an integration specialist at Solita Oy, an IT service management company, and a Boomi implementation partner. Our team was given the task of migrating master data from one Boomi DataHub cloud to another Boomi DataHub cloud location. Migrating data is often a complex task, and every platform and tool provides its own challenges. DataHub has its own quirks that are good to know when performing migrations.

During the migration process we encountered a number of these challenges and found solutions. Now we want to share our lessons learned and best practices so others might benefit.

What is Boomi DataHub, and Why Does it Matter?

Boomi DataHub is a versatile master data management (MDM) solution that is part of the Boomi AtomSphere Platform. It enables a quick and intuitive way to manage, monitor and configure master data. This, combined with Boomi Integration within the AtomSphere Platform, provides a unified solution for maintaining the data between surrounding systems.

Background

Our team was handling a Boomi DataHub instance with two repositories and several thousand golden records. Due to a strategy to consolidate data in preferred geographical locations, our customer wanted to relocate the data and functionality to another Boomi DataHub cloud.

We asked Boomi whether the data could be migrated in the background, but unfortunately no ready-made solution was available. There are ways to back up and restore data within a single cloud environment, but migrating from one cloud to another, even within the Boomi AtomSphere environment, was a new challenge.

The MDM implementation style in our case followed the coexistence model, where multiple databases and systems containing master data must coexist. This added complexity, because governance is distributed and it is important to keep track of which systems have contributed to each golden record.

As no ready-made solution was available, we decided to build our own. During this process we discovered multiple things that don’t work, and some things that do.

Things To Consider Before Migration

Prepare your integration processes

We found that transferring a DataHub production environment was not feasible in a single go, as any errors during the migration might cause issues within production.

Instead, we chose to set up the new DataHub repository in the new cloud and change our integrations in a way that all incoming data could be sent to either or both of the DataHub environments.

We controlled the flow of data from the integrations to DataHub through environment-extendable properties. This gave us fine control over the data flow without unnecessary deployments. The idea was to write all incoming data to both DataHubs, the old and the new, at the same time. Any data read from DataHub would come from one DataHub or the other, and this too could be controlled via the environment extensions.

This setup let us perform the migration handover smoothly by monitoring both the old and new DataHubs for some time and comparing their data periodically. This revealed errors in the migration, allowed us to fix data issues, and, when the time came, let us switch the DataHub repositories with minimal impact on production usage.

We of course rehearsed the process multiple times with test environments before and during the migration of actual production data. Test data revealed only part of the issues that complex production data with a rich history would introduce, so we had to return to the test environments to verify that our fixes for the newly revealed errors worked.
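The routing idea can be pictured outside Boomi in a few lines of Python. This is only an illustration: in our setup the switches were environment extension properties in Boomi Integration, and the names `WRITE_TARGETS`, `READ_SOURCE`, `old` and `new` are made up for this sketch.

```python
import os

# Hypothetical hub names; the real switches were Boomi environment extensions.
VALID_HUBS = {"old", "new"}


def write_targets(env=os.environ):
    """Return the DataHubs that incoming data should be written to."""
    raw = env.get("WRITE_TARGETS", "old")
    targets = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = set(targets) - VALID_HUBS
    if unknown:
        raise ValueError(f"unknown DataHub name(s): {sorted(unknown)}")
    return targets


def read_source(env=os.environ):
    """Return the single DataHub that reads should be served from."""
    source = env.get("READ_SOURCE", "old")
    if source not in VALID_HUBS:
        raise ValueError(f"unknown DataHub name: {source}")
    return source
```

Setting `WRITE_TARGETS=old,new` and `READ_SOURCE=old` would correspond to the overlap phase, and flipping `READ_SOURCE` to `new` to the final switch, all without redeploying anything.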
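A periodic comparison of this kind can be sketched as follows. The sketch assumes that both DataHubs have been exported into dictionaries keyed by an identifier that is the same on both sides (for example a source entity ID); the function name and the sample data are hypothetical.

```python
def compare_records(old, new):
    """Compare two {record_id: fields} snapshots and report the differences."""
    missing_in_new = sorted(old.keys() - new.keys())
    missing_in_old = sorted(new.keys() - old.keys())
    changed = sorted(
        record_id
        for record_id in old.keys() & new.keys()
        if old[record_id] != new[record_id]
    )
    return {
        "missing_in_new": missing_in_new,
        "missing_in_old": missing_in_old,
        "changed": changed,
    }


# Made-up sample snapshots of the two hubs.
old_hub = {"c-1": {"name": "Ada"}, "c-2": {"name": "Bob"}, "c-3": {"name": "Cy"}}
new_hub = {"c-1": {"name": "Ada"}, "c-2": {"name": "Rob"}}
report = compare_records(old_hub, new_hub)
```

Here `report` lists `c-3` as missing from the new hub and `c-2` as changed, which is the kind of output that tells you where to look before switching over.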

Find the correct tool for your migration process

Depending on your repository setup and migration aims, you will have to perform multiple requests to fetch data, upsert data, monitor the upsert process, and compare the data states. You will also need to transform, store, compare, and enrich the data.

Our first choice was to use Boomi Integration as the tool for migrating the data. During the migration we found out that it wasn’t the optimal choice for our job. Requesting data from DataHub had to be done through the REST API, as the DataHub connector couldn’t provide us with all the details we needed, like record history, metadata, and advanced queries. Because of this we used the HTTP Client connector, which turned out to be stateless, so the connector had to authenticate again on every request.

Of course, if only the latest state of the data is relevant, and the history of records or links to surrounding systems are not important, the process becomes a lot simpler. Then you could get away with getting a .csv dump of the database, doing a quick transformation, and uploading the data to the new DataHub.

The REST API returns at most 200 golden records per query. To fetch more, the query needs to be repeated with the offset token provided in the previous reply. This prevents the API from choking under too large loads, but it also forces users to make repeated queries to get all their records.
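The token-driven loop looks roughly like this in Python. The sketch deliberately keeps the HTTP call out of the picture: `fetch_page` is a hypothetical callable that wraps the actual query, and `make_fake_api` is a stand-in used here only to make the example runnable.

```python
def fetch_all(fetch_page):
    """Collect every record from a paged query.

    fetch_page(offset_token) must return (records, next_token), where
    next_token is None once the last page has been reached.
    """
    records = []
    token = None
    while True:
        page, token = fetch_page(token)
        records.extend(page)
        if token is None:
            return records


def make_fake_api(total, page_size=200):
    """Stand-in for the real API: serves integers in pages of page_size."""
    def fetch_page(offset_token):
        start = 0 if offset_token is None else int(offset_token)
        end = min(start + page_size, total)
        next_token = str(end) if end < total else None
        return list(range(start, end)), next_token
    return fetch_page


all_records = fetch_all(make_fake_api(total=450))
```

With 450 fake records the loop makes three calls (200, 200 and 50 records) and stops when no further token is returned.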

To get 10,000 golden records, the query has to be performed 50 times. Because we also intended to transfer the histories of the golden records, we needed to fetch the history separately for each record. Add the metadata queries that may be needed to get the status of each record’s source links, and the number of requests grows large quickly.

With a stateless HTTP connection like the HTTP Client connector in Boomi, just fetching the data took too long to maintain a feasible working rhythm and timetable. Fetching the data and all the record histories took over an hour with stateless connections, and about 15 minutes with stateful ones. Together with asynchronous requests, breadth-first data uploads, and some smaller improvements, stateful connections brought the execution time of the whole migration process down from about a week to about three hours.

The combined effect of breadth-first uploads and stateful connections had the most impact. When the running time exceeded a day, we ran into problems caused by natural variations in the load on the systems. Batches weren’t handled in the same order as they were sent. Connections timed out as we waited for batches to finish. Also remember that we had to rerun the process multiple times in order to handle data corruption issues.
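A quick back-of-the-envelope calculation shows how fast the request count grows. The sketch assumes exactly one history request and one metadata request per golden record, which is a simplification; only the page size of 200 comes from the API limit described above.

```python
import math

PAGE_SIZE = 200  # golden records returned per query


def request_count(records, history=True, metadata=True):
    """Estimate the number of REST requests for one full read of a repository."""
    pages = math.ceil(records / PAGE_SIZE)
    # Assumption: one extra request per record for each kind of detail.
    per_record = int(history) + int(metadata)
    return pages + records * per_record


total = request_count(10_000)
```

For 10,000 records this gives 50 paging requests plus 20,000 per-record requests, so 20,050 requests for a single pass.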
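A minimal sketch of the stateful idea in Python, using only the standard library. The `HubClient` name, the Basic authentication scheme and the `Accept` header are assumptions made for illustration; they are not taken from the DataHub API documentation or from our actual scripts.

```python
import base64
import http.client


def basic_auth_header(username, token):
    """Build an HTTP Basic Authorization header value."""
    raw = f"{username}:{token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


class HubClient:
    """Reuse one HTTPS connection for many requests to the same host."""

    def __init__(self, host, username, token, timeout=30):
        # No network traffic happens here; the socket opens on first use.
        self._conn = http.client.HTTPSConnection(host, timeout=timeout)
        self._headers = {
            "Authorization": basic_auth_header(username, token),
            "Accept": "application/xml",  # assumed response format
        }

    def get(self, path):
        # The connection is opened on the first request and then reused,
        # as long as each response is read completely before the next call.
        self._conn.request("GET", path, headers=self._headers)
        response = self._conn.getresponse()
        return response.status, response.read()

    def close(self):
        self._conn.close()
```

The point of the design is that the connection setup is paid once per client rather than once per request, which matters when a single pass needs tens of thousands of requests.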
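Breadth-first uploading can be pictured as sending one version of every record per round, instead of pushing a single record’s entire history before moving on to the next record. The sketch below assumes that reading, and assumes each history is available as a list ordered oldest first; the function name and sample data are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # marks records that have no version in a given round


def breadth_first_rounds(histories):
    """Yield upload rounds: round k holds the k-th version of every record.

    histories maps a record id to its versions, ordered oldest first.
    """
    ids = list(histories)
    columns = (histories[record_id] for record_id in ids)
    for versions in zip_longest(*columns, fillvalue=_MISSING):
        yield [
            (record_id, version)
            for record_id, version in zip(ids, versions)
            if version is not _MISSING
        ]


histories = {"a": ["a1", "a2", "a3"], "b": ["b1"], "c": ["c1", "c2"]}
rounds = list(breadth_first_rounds(histories))
```

Each round can then be sent as one or more batches, and the next round starts only after the previous one has finished, so the versions of any single record still arrive in order.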

We chose to utilize custom Python scripts to perform the data fetching and insertions. We are sure that even better options for tooling would have been available, but with the time limits and our team’s familiarity with Python, it ended up as our choice.

So, before you start your migration process, research your options for the tool to use.

Keep security in mind

Whichever tool you choose, remember not to leave the actual golden records, API keys, or other sensitive data lying around on your migration machines. Forgotten data might not only leave security openings but might also leave you holding a registry of sensitive data that is governed by specific laws you are not even aware of. HR data in particular can contain a lot of sensitive data. Discuss any such situations with your GDPR representative in order to ensure data security.

Clean up your data

Migration is the perfect time to go through your data with varying queries and inspections. Getting your data quality up to standard before the migration will make the process a lot easier and will also minimize the amount of data you need to transfer. Even though duplicate entries shouldn’t occur with good matching rules and data governance, there are situations where they do pop up.

Migrating data with duplicate entries might cause records to end up in quarantine in the middle of the migration process. If you aren’t prepared for sudden stops in the process or for entries getting quarantined, you can end up with corrupted data or have to redo the migration for some of the records. This causes a lot of unnecessary manual work.
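A simple pre-migration check for likely duplicates might look like the sketch below. It only approximates what real matching rules do: it assumes the records are available as dictionaries with an `id` field and compares the chosen fields after trimming and lowercasing. The field names and sample data are hypothetical.

```python
from collections import defaultdict


def find_duplicates(records, match_fields):
    """Group record ids that share the same normalized match-field values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(
            str(record.get(field, "")).strip().lower() for field in match_fields
        )
        groups[key].append(record["id"])
    # Keep only the keys that more than one record maps to.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


records = [
    {"id": "1", "email": "ada@example.com"},
    {"id": "2", "email": " ADA@example.com"},
    {"id": "3", "email": "bob@example.com"},
]
duplicates = find_duplicates(records, ["email"])
```

In this sample, records `1` and `2` are reported together because their email addresses differ only in case and whitespace.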

Plan the migration

Proper preparation won’t stop all the errors or prevent all the surprises, but it will minimize them.

Proper research into tool options saves you a lot of time during the process.

Reserve enough time for the project. Depending on MDM solutions used and tool options, you might have to create your own tools or spend time on automating parts of the process like data comparisons.

Plan the switch from the old MDM solution to the new one. An overlap between the old and new solutions gives you the chance to verify the success of the data migration and configurations. Sending data to both the old and new MDMs lets you compare how they react to the data and check that they stay in sync.

Prepare for rollbacks. With Boomi DataHub you should contact your Boomi representative to discuss rollback possibilities. You also need to take into account any rollback possibilities in surrounding systems. A data error in an MDM solution is quick to spread to surrounding systems.

Your old system connections might look a bit different after the migration. What should manually modified golden records look like in the new MDM? With Boomi DataHub, you cannot use the “MDM” source that is shown on manually edited records. You have to either change the source of these manual changes to one of your own systems or come up with an extra source that takes responsibility for them.

Are you also going to migrate archived data? Including end-dated records changes the process a bit, as the chance of duplicates increases. A record might have had multiple instances in the past, and trying to migrate these might cause duplicate errors during the migration.
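As an illustration of the second option, the remapping step could look like this. The record layout (a list of dictionaries with a `source` key) and the stand-in source name `MANUAL_EDITS` are assumptions made for the sketch, not DataHub structures.

```python
def remap_manual_source(contributions, replacement="MANUAL_EDITS"):
    """Return a copy of the contributions with the "MDM" source renamed."""
    return [
        {**entry, "source": replacement} if entry.get("source") == "MDM" else dict(entry)
        for entry in contributions
    ]


# Made-up history of one golden record: one system update, one manual edit.
history = [
    {"source": "CRM", "field": "name", "value": "Ada"},
    {"source": "MDM", "field": "name", "value": "Ada L."},
]
remapped = remap_manual_source(history)
```

The stand-in source would of course also have to exist in the new repository’s configuration before the remapped records are uploaded.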

In the second blog of this series, I cover MDM configuration.

Boomi DataHub is a cloud-native master data management (MDM) solution that sits at the center of the various data silos within your business – including your existing MDM solution, to provide you an easy to implement, scalable, flexible, and secure master data management hub as a service. For more information, go here or contact a Boomi expert.
