Data is akin to code - the best data is data that is never stored. Unfortunately, this is not how products operate, and you might find yourself in a situation where data becomes a liability and a burden.

Recently, I have been working on a side project where the majority of the data in the app was stored in a single table. That table grew beyond a maintainable point; major pain points are nothing new, and in my case, it boiled down to:

So how to deal with it? As usual - it depends; the natural solution would be sharding/partitioning or even a mix of both. Then with partitioning, you might look for native solutions (provided by your RDBMS) or do it on the application level - which also can make sense.

I was working with MariaDB and Rails and in the end, I chose partial sharding (moving majority of the data into brand new DB with MariaDB’s page compression enabled) and partitioning on application level based on a range - I’m not going to go into technical details here as it’s very application specific, but majority of the new data lands in new partitions (or in my case - physical tables).

As a side product of this decision - separating tables physically allowed me to deal with some minor optimizations per table level. The data was 14y old and there were some inconsistencies caused by various bugs introduced (and then fixed) over the years.

How to do it safely?

I have no idea, but this is how I did it:

First I updated rails, d’oh. Rails 6.1 ships with REALLY good support for multiple databases; you don’t have to think about migrations, or managing multiple schemas - everything just works. Kudos to the Rails team for this feature.

Then - no magical solution - I had to wrap every call that touched that table in some abstraction class. That allowed me to put the read/write from old table/new tables behind a global feature flag. I was also able to control read/writes in a controlled manner so the app could continue to run while we worked on the data migration plan.

As the application was relatively simple narrowing down the entry points was relatively easy, but work itself was still painful. Obviously, with sharding - everything that does joins becomes problematic, and up to this day I still have to rethink few problems and how to solve them given the new database structure.

Moving the data was also quite tricky, Ruby quickly crumbled under the amount of data (hello strings allocation) and in the end I had to hack spaghetti code CLI tool in golang; few takes away for future self:

Was MariaDBs’s page compression any good?

Yes! We managed to fit all the stuff along with some relationships within ~600GB, so in our case compression ratio was quite pleasing.