During my time collaborating with Humly, an agency responsible for managing educator vacancies, I had the opportunity to work on several data migration projects. The company operates in both Sweden and the United Kingdom and, over the course of my collaboration, acquired seven other companies, merging them all under the Humly brand.
Since the company's domain is human resources management, particularly in the education sector, my recommendations will be skewed towards it but I think that a huge part of it will be universal.
Technical background
The application was initially built with Ruby on Rails, using MongoDB as the database. However, during the seventh acquisition, the database was migrated to PostgreSQL.
The data sources were as followed:
MongoDB database a separate version and provider from what was currently used in our application
Microsoft SQL Server a standalone database and also a part of commercial HRM platform software
CSV files exported from the database or HRM system
Preparation & planning
The migration process requires detailed planning as it can take up to a couple of months depending on the volume of data and the information available to the team.
Here are some main points to consider:
Work with your Product Owner to establish a clear go-live date for the data migration.
The final stage of migration typically happens the weekend before the go-live date. Make sure this timing works for you and other developers involved.
Whenever possible, divide the process into stages with specific deadlines. This allows the operational team to review each portion of migrated data and prepare it for final launch. If issues arise, you’ll be able to address them early.
Arrange to receive the most recent data copy on the evening before the migration weekend, unless you have direct access. This might require coordination with someone who can provide the data outside of normal working hours.
Determine who will handle reverse-engineering the data schema. In some cases, I’ve done this myself; in others, I received valuable assistance from the Product Owner, who provided the schema mapping.
If possible, have a developer or IT contact from the acquired company available to answer questions, particularly during the initial stages. Their insights can be crucial to avoiding misunderstandings.
Ask your Product Owner to designate a Migration Manager to oversee business-related aspects of the migration. This could be the Product Owner, an acquisitions manager, or a manager from the acquired company. This person will be your main contact for operational questions and will help ensure smooth integration.
For critical processes such as educator salaries or client invoicing, develop a transition plan that ensures continuity. Verify that your systems and integrations can handle these functions seamlessly within the acquired company.
After data mapping is complete, establish how you’ll access it. This might involve direct database connections, database dumps, or CSV files.
Plan how and when you’ll invite newly imported users, ensuring they have a smooth onboarding experience.
Starting up
Begin by reviewing the source code and breaking it down into specific components that need to be imported. In my case, these key components included:
educators and their compliance profiles,
clients (e.g. schools, kindergartens),
client contacts (e.g. managers at the schools, kindergartens),
assignments (during which time the selected educator should be teaching a specific subject).
For each component, identify the minimum data requirements—the essential database fields necessary to ensure records are technically valid and provide adequate operational value. Then, compare this information to the system mapping in the target environment, noting any gaps or missing fields.
Secondly, organise a meeting with the Migration Manager to discuss the base terms for the migration.
Main agenda points for the meeting:
If he/she was working for an acquired company, then first give them a tour of your application, focus on how you are presenting different sets of data. Ensure they have access to staging or even a live system to look around on their own.
Show him/her the list from the first step but rewrite it to use the business terms from your application instead of database fields or model names. Highlight on the list places where you are missing data - you need to decide how you want to handle it.
Ask if there is any extra information that should be included in the import, you may need to introduce some new fields (they sometimes don't need to be shown but might be required later for manual reports and such).
Check about any assets like images or files that may need to be migrated.
Decide the scope of import. Maybe you don't need all the data? In our case we usually migrated past 2 years and skipped: inactive educator accounts, clients assignments that were cancelled
The last part is accounting. Check that the details agreed beforehand are still valid. Set a specific date up to when the imported data should be considered as paid and invoiced on the previous system.
Check external processes that may access the new data and trigger undesired notifications or other actions
Writing the Migration Scripts: Choosing Between Script or API, and Remote vs. Local Execution
The migration script can be just an additional piece of code that iterates over the records from the source system and creates them in the target database. It is executed just once or twice and then discarded. It also could be an advanced private API that each endpoint is responsible for creating a specific type of data and your script just hands over bulks of records to the endpoint and then it gets processed in the background. The code can be run locally connecting to an external production database or you may deploy it on a cloud service. Due to a time constraint we chose the local execution of plain scripts.
Avoiding duplicates
When designing the migration script, it’s essential to account for potential re-runs, as interruptions can happen when processing large data volumes.
Common issues include:
unforeseen error due to corner-case in the migration script
unexpected closing of connection to the database
networking issues: lack of internet, DNS cache problems
crash of the computer running the script.
To prevent duplicates across multiple runs, consider the following approach:
Use External References for Imported Records
Mark all imported records with a unique identifier derived from the source data. I used a formula combining the company name and the primary key (or an external reference if no primary key was available), such as "foo-company/495732". For records that map to multiple database entries, add prefixes to distinguish them (e.g., "foo-company/health/495732" or "foo-company/references/495732"). Each table involved in the migration should have a string column called external_ref to store these identifiers, linking back to the original record.
Implement a Two-Step Process in the Script
Structure the script to first gather existing external_ref values from the target database, then exclude them from the import process. Performing this exclusion early on (ideally at the SQL level) allows the script to resume quickly from the last successful point, avoiding reprocessing of previously imported data.
Hire team to Your HRM Software Development
Use CSV when possible
This is only for small and medium sized data collection: when possible it is useful to export the records (only needed columns) from the source database to an intermediate CSV file and then based on that create actual records. The benefits are:
you can upload the CSV to google drive as spreadsheet and allow operation team to look at the data & review it before you import it
they can make necessary changes to the data by removing or adjusting the rows then you can easily export it back to CSV and use it in your script
the migration script is easier to write - you don't need to connect to specific database
Don't trust the data
When importing any piece of data you need to make sure that it conforms to the types and restrictions you have on your system.
I have learnt it the hard way: I imported the user accounts and some email addresses were malformatted - it turned out that in their system they didn't have any validation on that field which resulted in emails like just names of people or multiple valid emails joined by semicolons in one field. It did not cause problems for the applications I maintained but it affected the software that facilitated the integration with the Hubspot.
The following are the key guards I use:
for common strings or texts always strip whitespaces at both ends
make sure the numbers are converted to proper values - this is especially important for floats
for emails check that they have valid structure, apply downcasing, raise error otherwise
parse dates and times to make sure they are valid and you are assuming the correct format (for time ensure you are presupposing the correct timezone)
for any string values that are mapped to some well defined values on your system (for example those that are represented as dropdowns in the UI) it's best to throw error when the imported value is outside the mapping to prevent invalid values
some bigger text fields may contain rich text like HTML so it might be good to strip the tags if you handle it differently in your application
Speed it up
When you have tens of thousands of records then it will not be feasible to import them one by one. It will take a long time. I had once a large collection of comments that I calculated it would iteratively take 2 days to import. In those cases bulk import is the solution. Both Mongoid and Active Record provide a way to pass an array of records and save them in the database in a matter of seconds. The only downside is that you need to handle validation and run any callbacks manually. It is worth to note that turning off some callbacks on the model also quickens the iterative approach as well.
Files import
I usually write a separate script for the files import. The sources can be different: binary data from the database or google drive (or some other cloud storage). In case of the former it is best to ask the people that prepare it to create some sort of structure - group the documents in the subdirectories named with some unique identifier so that you could deduce it from the path of the currently processed file during the script execution.
Reporting the results
I found it most helpful to log to a file any records I'm skipping and the reason why I did it. The more details you provide the easier it will be to answer any questions later about any missing records. That way you will be sure that skipped records were actually excluded and not omitted due to a bug in the script. It is best to save the report in some parsable format like for example: JSON then you can easily write some code to check/process it later if needed. I save the result reports on google drive for future reference.
Testing on staging
When everything is ready I first test it on the staging environment. Since some data is sensitive I strip it or mask it (for example leaving the first 2 characters but replacing others with some arbitrary letter like 'X') on the staging import. I wrap assignments of such data in a special function that checks the global variable in the script and when it is enabled it obscures or removes the pieces of information. This is useful for security numbers, names, comments, etc. For emails I usually leave the first 4 characters and append part of the record's unique ID in case of UUIDs, then replace the host with example.com domain.
After all data is imported into staging I ask the Migration Manager to review it. For this process I pick a couple of records from each category and I write them down with corresponding reference ID from the original system. That way he/she will be able to compare/contrast the data.
It is also useful to play around with the imported data - pick some random ones and try to modify some insignificant information and see if it validates correctly in the UI and lets you save it.
Perform migration
Before running the production migration write down the steps that you need to follow.
Include in your plan steps to clean up:
remove credentials to production database,
remove local database with the original system data.
Persist the report files on a cloud or GitHub repository.
Summary
Data migration is a complex process requiring careful planning and execution to ensure data accuracy, efficiency, and alignment with the target system’s requirements. By preparing well—reviewing data for validity, using CSV files for intermediate steps, avoiding duplicates, and bulk-importing large datasets—you can simplify the process and prevent issues from surfacing later. Always perform thorough testing in a staging environment, involve stakeholders for data review, and document each step meticulously. A clear, step-by-step production migration plan, combined with logging and report generation, ensures transparency and traceability for future reference.