Today I have an interesting task: one of our DynamoDB tables needs a sort key (formerly called a range key) added. Unlike a secondary index, I obviously cannot alter the table to add a sort key, because that would change the partitioning, and DynamoDB does not allow such an operation. The easiest approach seems to be:
1. Create a new table B with the desired primary key and sort key, and copy the data from table A to table B;
2. Drop table A, recreate it with the same schema as table B, then copy the data from table B back to table A.
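The steps above can be sketched with boto3. The table and attribute names (`TableA`, `TableB`, `userId`, `createdAt`) are hypothetical, and the copy loop is single-threaded, so this only illustrates the shape of the operation, not something that scales to our data size:

```python
def key_schema(hash_attr, range_attr):
    """Build the KeySchema/AttributeDefinitions for a table with a sort key."""
    return {
        "KeySchema": [
            {"AttributeName": hash_attr, "KeyType": "HASH"},
            {"AttributeName": range_attr, "KeyType": "RANGE"},
        ],
        "AttributeDefinitions": [
            {"AttributeName": hash_attr, "AttributeType": "S"},
            {"AttributeName": range_attr, "AttributeType": "S"},
        ],
    }


def copy_table(src_name, dst_name):
    """Scan src and batch-write every item into dst (one worker, no retry logic)."""
    import boto3  # requires AWS credentials; illustrative only

    dynamodb = boto3.resource("dynamodb")
    src, dst = dynamodb.Table(src_name), dynamodb.Table(dst_name)
    scan_kwargs = {}
    with dst.batch_writer() as batch:  # batches and retries unprocessed items
        while True:
            page = src.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.put_item(Item=item)
            if "LastEvaluatedKey" not in page:  # no more pages
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```

For example, step 1 would be `dynamodb.create_table(TableName="TableB", **key_schema("userId", "createdAt"), ...)` followed by `copy_table("TableA", "TableB")`.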
The downside is that there will be an outage, but we are okay with that as long as we can keep the duration down to a couple of hours.
Another challenge is that the table has more than 250MM rows, so the copy in step 2 needs to be distributed.
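DynamoDB's Scan API supports a parallel mode via the `Segment`/`TotalSegments` parameters, which is how any distributed copy would split the work. A minimal sketch, assuming ten workers and a hypothetical table name:

```python
import threading


def segment_args(total_segments):
    """Scan kwargs for each parallel worker: segment i of total_segments."""
    return [
        {"Segment": s, "TotalSegments": total_segments}
        for s in range(total_segments)
    ]


def scan_segment(table_name, kwargs, sink):
    """One worker: scan a single segment of the table and feed pages to sink."""
    import boto3  # requires AWS credentials; illustrative only

    table = boto3.resource("dynamodb").Table(table_name)
    kwargs = dict(kwargs)
    while True:
        page = table.scan(**kwargs)
        sink(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def parallel_scan(table_name, total_segments, sink):
    """One thread per segment; DynamoDB partitions the key space across segments."""
    threads = [
        threading.Thread(target=scan_segment, args=(table_name, args, sink))
        for args in segment_args(total_segments)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Usage would look like `parallel_scan("TableA", 10, write_batch)`, where `write_batch` is whatever pushes items into the destination table.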
I have used Data Pipeline before to back up data from DynamoDB to S3 on a weekly schedule, so Data Pipeline seems like a feasible choice for this job (see the last paragraph for its limitations). Unfortunately there is no template I can reuse, so here is the configuration I set up:
The CopyActivity (HiveCopyActivity) has input DynamoDBFrom (data format: DynamoDBExportDataFormat) and output DynamoDBTo. It runs on EMRResource (EmrCluster) with the following parameters:
- Master Instance Type: m3.2xlarge
- Core Instance Type: m3.2xlarge
- Core Instance Count: 10
- Ami Version: 3.11.0
- Terminate After: 24 hours
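In pipeline-definition form, those objects look roughly like the sketch below. The table names and references are hypothetical, and the field names follow the Data Pipeline object reference as I understand it, so treat this as a shape to adapt rather than a drop-in definition:

```python
# Sketch of the pipeline objects; pass to put_pipeline_definition after
# adding a Schedule/Default object and real table names.
pipeline_objects = [
    {
        "id": "EMRResource",
        "type": "EmrCluster",
        "masterInstanceType": "m3.2xlarge",
        "coreInstanceType": "m3.2xlarge",
        "coreInstanceCount": "10",
        "amiVersion": "3.11.0",
        "terminateAfter": "24 Hours",
    },
    {
        "id": "ExportFormat",
        "type": "DynamoDBExportDataFormat",
    },
    {
        "id": "DynamoDBFrom",
        "type": "DynamoDBDataNode",
        "tableName": "TableA",  # hypothetical source table
        "dataFormat": {"ref": "ExportFormat"},
    },
    {
        "id": "DynamoDBTo",
        "type": "DynamoDBDataNode",
        "tableName": "TableB",  # hypothetical destination table
    },
    {
        "id": "CopyActivity",
        "type": "HiveCopyActivity",
        "input": {"ref": "DynamoDBFrom"},
        "output": {"ref": "DynamoDBTo"},
        "runsOn": {"ref": "EMRResource"},
    },
]
```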
That's it! It saves with no errors, so activate the job!
Edit: Apparently this solution does not scale either. When I compare the table sizes between the source and destination, there is a significant difference. I tested with a small table of 1MM records and it worked fine. My guess is that the way the distributed scan operation works on DynamoDB is not consistent.