Data is akin to code - the best data is data that is never stored. Unfortunately, this is not how products operate, and you might find yourself in a situation where data becomes a liability and a burden.
Recently, I have been working on a side project where the majority of the data in the app was stored in a single table. That table grew beyond a maintainable point; major pain points are nothing new, and in my case, it boiled down to:
except single index I was screwed when it comes to querying it; every query becomes gymnastic where I had to iterate over your collection as a straightforward query would probably hang for days in the database
forget about adding new columns or modifying the structure; if you’re lucky you might run on a recent postgres/mariadb/whatever engine that allows you to add columns in an instant manner (or you are living the NoSQL live and have a different set of problems); given data growth you’re probably running some legacy version and you’re screwed again
assuming you can schedule a maintenance window it would probably take days and you have to have another table-size-TB free space for potential table rebuild
operational-wise it’s another rotten egg - think backups/restore pain
So how to deal with it? As usual - it depends; the natural solution would be sharding/partitioning or even a mix of both. Then with partitioning, you might look for native solutions (provided by your RDBMS) or do it on the application level - which also can make sense.
I was working with MariaDB and Rails and in the end, I chose partial sharding (moving majority of the data into brand new DB with MariaDB’s page compression enabled) and partitioning on application level based on a range - I’m not going to go into technical details here as it’s very application specific, but majority of the new data lands in new partitions (or in my case - physical tables).
As a side product of this decision - separating tables physically allowed me to deal with some minor optimizations per table level. The data was 14y old and there were some inconsistencies caused by various bugs introduced (and then fixed) over the years.
How to do it safely?
I have no idea, but this is how I did it:
First I updated rails, d’oh. Rails 6.1 ships with REALLY good support for multiple databases; you don’t have to think about migrations, or managing multiple schemas - everything just works. Kudos to the Rails team for this feature.
Then - no magical solution - I had to wrap every call that touched that table in some abstraction class. That allowed me to put the read/write from old table/new tables behind a global feature flag. I was also able to control read/writes in a controlled manner so the app could continue to run while we worked on the data migration plan.
As the application was relatively simple narrowing down the entry points was relatively easy, but work itself was still painful. Obviously, with sharding - everything that does joins becomes problematic, and up to this day I still have to rethink few problems and how to solve them given the new database structure.
Moving the data was also quite tricky, Ruby quickly crumbled under the amount of data (hello strings allocation) and in the end I had to hack spaghetti code CLI tool in golang; few takes away for future self:
using SQLite as a local cache layer for speeding up batching over IDs was a nice idea; the cache grew to a few GB which was a little bit too much for Redis, but SQLite took it without a problem
trying to use Ruby for this particular problem wasn’t the best idea to begin with - I had a good experience with postgres_fwd but most heavy-lifting was done on the DB level there; in the end I re-used some tests written in rspec and had some integration tests in a way for the whole thing
I wish coded the ability for graceful resume sooner; thankfully I was able to run the process for the given date range (with some limitations)
I wish that I added simple Slack notifications sooner - were extra useful just to monitor the process as everything was ad-hoc running on a single node in a screen (welp, I got spoiled with the simplicity of “deployment” - or rather a lack of it - when it comes to shipping small golang apps)
Was MariaDBs’s page compression any good?
Yes! We managed to fit all the stuff along with some relationships within ~600GB, so in our case compression ratio was quite pleasing.