What have I done so far? Part 4

Disaster recovery - ~8h rewrite from MongoDB to Postgres

I recall we once had a huge problem with our MongoDB cluster. Without going into too many details, it was a 3TB+ beast that we used mostly as a kind of cache layer. Unfortunately, one of the replica servers decided to die on us. As you can imagine, plugging a new server into the cluster slowed everything down almost to a halt: the new server tried to pull its data from the main server, which was already under heavy load, and we ended up in a weird spot.

I decided to draft a new cache layer, this time based on Postgres. Using Mongo for this purpose had not been the best decision anyway: it turned out not to be the right tool for the problem.

In the end, we abandoned the existing cluster and let our background workers start filling a fresh Postgres instance with new data. We were very lucky that we could do that in the first place; otherwise things could have gotten ridiculously complex! We have used Postgres ever since.

Much later, as the new Postgres database got bigger and bigger, we decided to follow the hot/cold storage approach and split it into two separate instances. But that’s a different story, done by another team.

Again, I think the most difficult part was making the call. It was a high-urgency, high-risk decision, but in the end everything turned out nicely.