I will give you some of my thoughts about migrating legacy database to Ruby On Rails framework. Welcome to part one.
It’s 32°C outside (and apparently inside), I’m trying to focus and finish my old project that I started after-hours. Being a programmer ain’t easy :P. But let’s get back on track. So to happens I have this legacy, really old, not maintained for years database. It comes from some old custom php cms, that is you might say - far from perfect. It doesn’t have any logical constraints (no data integrity) and validation is a mess.
I have started from rough PHP script that tries to fix what can be fixed in a one way or another, purges database from spam (hello there web bots), applies some ‘static’ fixes (like rewriting some custom prepared data) and leaves the rest as-it. I won’t post the code here, because it’s just written Rambo style - I didn’t payed attention to the optimization or anything as I had to run it basically once.
After that it was time to cook rake tasks that will migrate table by table this data into my new app. I have setup a legacy database in database.yml and created bunch of legacy classes in external file like that:
class LegacyPost < ActiveRecord::Base
establish_connection :legacy
self.table_name = 'board_posty'
self.primary_key = "odpid"
end
Then was the time to start working or rake task, by trial and error I ended up with a function that could be applied for most of my cases.
def run_migrate_task(new_model, legacy_model, rewrite_hash, broken = [])
error_count = migrate_count = 0
p_key = legacy_model.primary_key.to_sym unless legacy_model.primary_key.nil? # get primary key of legacy table
id_important = rewrite_hash.keys.include?(:id) # we are rewriting old ids
legacy_model.find_in_batches(:batch_size => 2000) do |old_objects|
existing_ids = id_important ? new_model.select(:id).where(:id => old_objects.map(&p_key)).map(&:id).to_set : {}
ActiveRecord::Base.transaction do
old_objects.each do |old_object|
begin
if (!id_important || !existing_ids.include?(old_object[p_key])) && !broken.include?(old_object[p_key])
new_object = new_model.new
rewrite_hash.each { |key, value| eval("new_object.#{key} = #{value}") } # setup new values by eval'ing passed hash
if new_object.save
migrate_count+=1
else
puts "ERROR | ID: #{old_object[p_key]} | MSG: #{new_object.errors.full_messages.to_sentence}"
error_count+=1
end # if save
end # if important
rescue Exception => exception
puts "EXCEPTION | ID: #{old_object[p_key]} | #{exception.message}"
error_count+=1
end # begin
end # transaction commit
end # each old objects
end # each batch
puts "DONE | Migrated: #{migrate_count} | Error: #{error_count}"
end
What this function does is takes new and old model and hash that consists of keys corresponding to the new model and raw code to apply for that value, optional I’m passing array if IDs of totally broken objects or object that I decided that I don’t want to import for some reason. I’m checking if object was properly saved (validation message) and rescuing any exceptions as I just let it fail if I couldn’t fix it in php script. Few notices:
if I’m rewriting old IDs (and most of the time I am) first I’m fetching stuff I have already imported (maybe in previous run before some additional tweaks) - I don’t want to import 200k records just because two were broken
using
find_in_batchesis quite useful here, apparentlyallworked fine until I got to table that had half a million rows and then ran out of memory, my bad :Pwrap it inside a transaction for some speed improvement, you may want to disable validation or go with raw SQL generation instead - it all depends of your needs
So I can call it like that:
task :posts => :environment do
setup_env # setup configuration, load classes and stuff
rewrite = {
:id => "old_object.odpid",
:topic_id => "old_object.tematid",
:user_id => "old_object.autor",
:content => "html_to_bbc(old_object.tresc, true)", # call custom function here
:updated_at => "old_object.edytowano > 0 ? Time.at(old_object.edytowano) : Time.at(old_object.dopisano)",
:updated_by => "old_object.edytowal > 0 ? Time.at(old_object.edytowal) : nil",
:created_at => "Time.at(old_object.dopisano)"
}
run_migrate_task(Post, LegacyPost, rewrite)
end
Maybe it’s not the perfect solution, not the smartest and quickest - but in my opinion migrations are supposed to be dirty, doable in reasonable time and just should work. You won’t maintain this code in the future anyway, so it’s up to you how much effort you are willing to put it. As it’s my side project I decided to be cheap ;).
In next chapter I will give you some hints how to convert pretty messed up HTML to BBCode (don’t give me that look!) without using regular expressions (that don’t cut it anyway in the end). Cheers!
Updated 14.08.2012
After some investigation of my migration I realized that using include? on huge Array in this case is not the way to go. Instead convert array to Set. Why you ask? Here you can find some useful information. Simple benchmark below.
array = (1..1_000_000).to_a
set = array.to_set
Benchmark.bm(15) do |x|
x.report("Array.include?") { 1000.times { array.include?(500_000) } }
x.report("Set.include?") { 1000.times { set.include?(500_000) } }
end
user system total real
Array.include? 20.940000 0.000000 20.940000 ( 20.991647)
Set.include? 0.000000 0.000000 0.000000 ( 0.000199)