How we migrated TBs of student session data in AWS with zero downtime

Kenil Domadia
Feb 13, 2024

This article delves into the migration process of terabytes of student data, encompassing billions of sessions, from self-managed Cassandra within AWS to a more efficient database system. It highlights the challenges faced, the strategic approach adopted, and the post-migration benefits.

The primary objective was to move billions of data points with zero downtime on a tight timeline. The existing Cassandra system cost approximately $420,000 annually, so the migration strategy had to be swift and effective. The migration was also a pivotal step in strengthening the company's data handling for the future, keeping its data-driven operations robust and agile.

Problem Analysis

The challenge was formidable: migrating billions of student session records from Cassandra to a new database system with zero downtime. The existing self-managed Cassandra database was costly, at around $420,000 a year, and lacked the scalability needed for the traffic forecast for the coming years. The migration had to be executed swiftly and efficiently, without the luxury of the direct database connection that AWS Database Migration Service (DMS) jobs typically use.

Connecting DMS jobs directly to a decade-old, self-managed Cassandra cluster that was serving hundreds of thousands of concurrent users presented considerable risks. We also evaluated which Cassandra tables to migrate in light of our recent shift from a monolith to microservices: some functions had moved to other microservices, so only the essential tables needed to move. Thankfully, the Cassandra service's GET APIs gave us access to the data we needed, which proved crucial to our migration plan.

The goal was not just to migrate the data but also to enhance the database's performance, ensuring it could handle the increased load and provide better service in the long term.

Choosing a new database within AWS

DynamoDB was our first choice because it promised seamless scalability, cost-effectiveness, and lightning-fast responses, all familiar territory for our developers. Excitedly, we embarked on a proof of concept (POC), and everything was going smoothly; it felt like a celebration was just around the corner.

But then, we hit a snag: DynamoDB has an item size limit of 400 KB, and a significant portion of our records exceeded it. We rolled up our sleeves and tried our first trick: compressing the data before inserting it into the table. We thought, "A little extra CPU usage is a small price to pay." But even after compression, many records stubbornly remained over the 400 KB limit. Restructuring the table wasn't an option. Our data was like a treasure chest of student responses, packed with images and complexities, much like a puzzle that refused to be simplified.
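The size check behind that compression experiment can be sketched roughly as follows. This is a minimal illustration, not the code we ran: the record shape, the `sample_record` helper, and the choice of gzip over JSON are all assumptions, and random bytes stand in for image data that is already compressed and therefore shrinks very little.

```python
import base64
import gzip
import json
import os

# DynamoDB's documented maximum item size is 400 KB (taken here as 400 * 1024 bytes).
DYNAMODB_ITEM_LIMIT_BYTES = 400 * 1024


def compressed_size(record: dict) -> int:
    """Return the gzip-compressed size, in bytes, of a JSON-serialized record."""
    raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(raw))


def fits_after_compression(record: dict) -> bool:
    """Rough check: is the compressed payload alone within the item limit?"""
    return compressed_size(record) <= DYNAMODB_ITEM_LIMIT_BYTES


def sample_record(image_bytes: int) -> dict:
    """Hypothetical session record: repetitive answers plus an image-like blob."""
    blob = base64.b64encode(os.urandom(image_bytes)).decode("ascii")
    return {
        "session_id": "demo-session",      # hypothetical identifier
        "answers": ["A", "C", "B"] * 50,   # highly compressible text
        "image": blob,                     # barely compressible payload
    }


if __name__ == "__main__":
    for size_kb in (10, 600):
        record = sample_record(size_kb * 1024)
        print(size_kb, "KB image ->", compressed_size(record), "bytes compressed,",
              "fits" if fits_after_compression(record) else "too large")
```

Text-heavy records shrink well under this kind of check, while records dominated by image data stay close to their original size, which matches what we saw.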

We moved on to the next workaround: storing large objects in Amazon S3 with a pointer in DynamoDB. This introduced significant complexity in keeping data synchronized across two different services, and the additional latency and cost of frequently reading large objects from S3 did not align with our efficiency goals.
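To make the synchronization problem concrete, here is a small sketch of the pointer pattern. It uses plain dictionaries as stand-ins for the DynamoDB table and the S3 bucket, so the class, key layout, and attribute names are hypothetical and no AWS calls are made.

```python
import uuid

ITEM_LIMIT_BYTES = 400 * 1024  # DynamoDB item size limit


class PointerStore:
    """In-memory stand-in for 'large objects in S3, pointer in DynamoDB'."""

    def __init__(self) -> None:
        self.table = {}   # stand-in for the DynamoDB table
        self.bucket = {}  # stand-in for the S3 bucket

    def put(self, session_id: str, payload: bytes) -> None:
        if len(payload) <= ITEM_LIMIT_BYTES:
            # Small records are stored inline in a single write.
            self.table[session_id] = {"inline": payload}
            return
        # Large records need two writes that are not atomic together.
        key = f"sessions/{session_id}/{uuid.uuid4()}"
        self.bucket[key] = payload                  # write 1: the object
        self.table[session_id] = {"s3_key": key}    # write 2: the pointer

    def get(self, session_id: str) -> bytes:
        item = self.table[session_id]
        if "inline" in item:
            return item["inline"]
        # Large records cost a second round trip on every read.
        return self.bucket[item["s3_key"]]
```

If the second write fails, or a record is overwritten, an object is left behind in the bucket with no pointer to it. Cleaning up such orphans, plus the extra read for every large record, is the kind of overhead that made this option unattractive for us.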

Large object storage strategies for Amazon DynamoDB

We considered various options, including DynamoDB, DocumentDB, and Aurora Serverless PostgreSQL, ultimately selecting DocumentDB for its alignment with the company's specific needs.

Zero Downtime Migration Plan

The migration was executed in phases:

Phase One: Implement dual-writing of new data to both Cassandra and DocumentDB, ensuring a seamless transition for ongoing operations.

Phase Two: Migrate historical data from Cassandra to DocumentDB, a critical step in preserving the integrity of past records.

Phase Three: Route production traffic to the new DocumentDB database along with dual writing to the old Cassandra cluster, allowing for a fallback to the Cassandra system if any issues were identified with DocumentDB.

Phase Four: The final cutover, where production traffic was routed exclusively to the new DocumentDB database, marking the sunset of the old Cassandra cluster.
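The four phases can be pictured as a single switch that decides where reads and writes go. The sketch below is an illustration under assumptions: the `Phase` names and the `SessionRepository` class are hypothetical, dictionaries stand in for the two databases, and it assumes reads stayed on Cassandra until Phase Three.

```python
from enum import Enum
from typing import Any, Dict, List


class Phase(Enum):
    CASSANDRA_ONLY = "cassandra_only"            # before the migration
    DUAL_WRITE = "dual_write"                    # phases one and two
    READ_NEW_DUAL_WRITE = "read_new_dual_write"  # phase three
    NEW_ONLY = "new_only"                        # phase four


class SessionRepository:
    """Routes reads and writes according to the current migration phase."""

    def __init__(self, cassandra: Dict[str, Any], documentdb: Dict[str, Any],
                 phase: Phase) -> None:
        self.cassandra = cassandra
        self.documentdb = documentdb
        self.phase = phase  # in practice this would come from a feature flag

    def _write_targets(self) -> List[Dict[str, Any]]:
        if self.phase is Phase.CASSANDRA_ONLY:
            return [self.cassandra]
        if self.phase is Phase.NEW_ONLY:
            return [self.documentdb]
        return [self.cassandra, self.documentdb]

    def _read_source(self) -> Dict[str, Any]:
        if self.phase in (Phase.READ_NEW_DUAL_WRITE, Phase.NEW_ONLY):
            return self.documentdb
        return self.cassandra

    def write(self, session_id: str, data: Any) -> None:
        # Errors simply propagate here; real dual-writing needs a failure policy.
        for store in self._write_targets():
            store[session_id] = data

    def read(self, session_id: str) -> Any:
        return self._read_source()[session_id]
```

Because the old cluster keeps receiving writes through Phase Three, moving the flag back one step returns reads to a database that is still up to date.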

The second phase of our migration, involving the transfer of historical data, required a careful, controlled approach since DocumentDB was already handling live production traffic from Phase 1. Our strategy was to develop a Lambda script for migration that would use a unique student session identifier to retrieve data from the Cassandra service using GET APIs and then insert it into DocumentDB using PUT APIs.
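The per-session step can be sketched as a small function that reads through one API and writes through the other. The function name and the injected `fetch_session` and `put_session` callables are hypothetical; in a real Lambda they would wrap HTTP calls to the Cassandra service's GET API and the new service's PUT API.

```python
from typing import Any, Callable, Optional


def migrate_session(
    session_id: str,
    fetch_session: Callable[[str], Optional[Any]],
    put_session: Callable[[str, Any], None],
) -> bool:
    """Copy one session from the old service to the new one.

    Returns True if the session was copied, False if the old service
    had no data for this identifier.
    """
    record = fetch_session(session_id)  # GET from the Cassandra-backed service
    if record is None:
        return False
    # PUT is keyed by session ID, so retrying the same session overwrites
    # the same document instead of creating a duplicate.
    put_session(session_id, record)
    return True


if __name__ == "__main__":
    old_store = {"session-1": {"score": 9}}  # stand-in for the old service
    new_store = {}                           # stand-in for the new service
    print(migrate_session("session-1", old_store.get, new_store.__setitem__))
    print(new_store)
```

Keying the write by the session identifier keeps the step safe to retry, which matters when a queue may deliver the same message more than once.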

We gathered the session identifiers of the millions of historical records, representing sessions created before Phase 1, and implemented a throttling mechanism through SQS and Lambda that processed 20 rows in parallel per Lambda execution. To manage the rate of requests to the service, we limited the number of concurrent Lambda executions. Our performance tests showed that with a Lambda concurrency of 10, and therefore 200 sessions being migrated in parallel, we could complete the migration in a week. This timeline gave us a comfortable window to monitor the migration process and gear up for the subsequent phase.
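A rough sketch of that throttling setup is shown below. The batch size of 20 and concurrency of 10 come from our tests; everything else is an assumption for illustration, including the message format (one SQS message carrying a batch of session IDs), the function names, and the injected `migrate_one` callable.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List

BATCH_SIZE = 20          # sessions processed in parallel per Lambda execution
LAMBDA_CONCURRENCY = 10  # Lambda executions allowed to run at once
MAX_PARALLEL_SESSIONS = BATCH_SIZE * LAMBDA_CONCURRENCY  # 200 sessions in flight


def chunk(ids: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Split session IDs into batches of at most `size`."""
    batch: List[str] = []
    for session_id in ids:
        batch.append(session_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_sqs_bodies(ids: Iterable[str]) -> List[str]:
    """Build one SQS message body per batch (assumed message format)."""
    return [json.dumps({"session_ids": batch}) for batch in chunk(ids)]


def handle_event(event: Dict, migrate_one: Callable[[str], bool]) -> Dict[str, int]:
    """Process an SQS-triggered invocation, migrating its sessions in parallel."""
    session_ids: List[str] = []
    for record in event["Records"]:  # standard shape of an SQS Lambda event
        session_ids.extend(json.loads(record["body"])["session_ids"])
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        results = list(pool.map(migrate_one, session_ids))
    migrated = sum(1 for ok in results if ok)
    return {"migrated": migrated, "skipped": len(results) - migrated}
```

With this shape, the load on the old service is bounded by two numbers that can be tuned independently: how many sessions each execution handles and how many executions may run at the same time.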

Migration plan for pre-stored historical data using AWS Lambda and SQS

We scheduled each new phase to remain in production for a week for observation before progressing to the subsequent phase. The implementation of feature flags in each phase allowed us the flexibility to switch between phases as needed. Additionally, we conducted performance tests for each phase to verify that the service could effectively handle the traffic.

Migration Benefits

Cost Efficiency: The shift to DocumentDB resulted in substantial cost savings, reducing annual expenses by about $300K compared to the Cassandra system.
Scalability: DocumentDB offered enhanced scalability, adeptly handling the increasing volume of data and user traffic, a critical factor for a growing user base.
System Stability: Post-migration, the new database environment exhibited good stability and performance, with significantly reduced latency. It’s been a quiet, well-behaved database for the past 6 months without any weekend-long firefights.