Rails Legacy Database Migration

I will give you some of my thoughts about migrating a legacy database to the Ruby on Rails framework. Welcome to part one.

It’s 32°C outside (and apparently inside), and I’m trying to focus and finish an old project that I started after-hours. Being a programmer ain’t easy :P. But let’s get back on track. So it happens that I have this legacy, really old database that hasn't been maintained for years. It comes from an old custom PHP CMS that is, you might say, far from perfect. It doesn’t have any logical constraints (no data integrity) and validation is a mess.

I started with a rough PHP script that tries to fix what can be fixed one way or another, purges the database of spam (hello there, web bots), applies some ‘static’ fixes (like rewriting some custom prepared data) and leaves the rest as-is. I won’t post the code here, because it’s written Rambo style: I didn’t pay attention to optimization or anything else, as I had to run it basically once.

After that it was time to cook up rake tasks that migrate this data, table by table, into my new app. I set up a legacy database connection in database.yml and created a bunch of legacy classes in an external file, like this:

class LegacyPost < ActiveRecord::Base
  establish_connection :legacy
  self.table_name  = 'board_posty'
  self.primary_key = "odpid"
end
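
For reference, establish_connection :legacy looks the connection settings up by name in config/database.yml. A minimal sketch of such an entry, where the adapter, database name and credentials are all assumptions about the old CMS, could look like this:

legacy:
  adapter:  mysql2       # assumption: the legacy CMS runs on MySQL
  encoding: utf8
  database: old_cms_db   # hypothetical database name
  username: legacy_user  # hypothetical credentials
  password: secret
  host:     localhost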

Then it was time to start working on the rake task; by trial and error I ended up with a function that could be applied to most of my cases.

def run_migrate_task(new_model, legacy_model, rewrite_hash, broken = [])

  error_count = migrate_count = 0
  p_key = legacy_model.primary_key.to_sym unless legacy_model.primary_key.nil? # get primary key of legacy table
  id_important = rewrite_hash.keys.include?(:id) # we are rewriting old ids

  legacy_model.find_in_batches(:batch_size => 2000) do |old_objects|

    existing_ids = id_important ? new_model.select(:id).where(:id => old_objects.map(&p_key)).map(&:id).to_set : Set.new
    ActiveRecord::Base.transaction do
      old_objects.each do |old_object|
        begin
          if (!id_important || !existing_ids.include?(old_object[p_key])) && !broken.include?(old_object[p_key])
            new_object = new_model.new
            rewrite_hash.each { |key, value| eval("new_object.#{key} = #{value}") } # setup new values by eval'ing passed hash

            if new_object.save
              migrate_count += 1
            else
              puts "ERROR | ID: #{old_object[p_key]} | MSG: #{new_object.errors.full_messages.to_sentence}"
              error_count += 1
            end # if save
          end # if important
        rescue Exception => exception
          puts "EXCEPTION | ID: #{old_object[p_key]} | #{exception.message}"
          error_count += 1
        end # begin
      end # each old object
    end # transaction commit
  end # each batch

  puts "DONE | Migrated: #{migrate_count} | Error: #{error_count}"
end

What this function does is take a new model, a legacy model and a hash whose keys correspond to attributes of the new model and whose values are raw code to evaluate for each attribute. Optionally I pass an array of IDs of totally broken objects, or objects that I decided I don’t want to import for some reason. I check whether each object was properly saved (validation messages) and rescue any exceptions, as I just let a record fail if I couldn’t fix it in the PHP script. A few notes:

  • if I’m rewriting old IDs (and most of the time I am), I first fetch what I have already imported (maybe in a previous run, before some additional tweaks) – I don’t want to re-import 200k records just because two were broken

  • using find_in_batches is quite useful here; everything worked fine until I got to a table that had half a million rows and ran out of memory, my bad :P

  • wrap it inside a transaction for some speed improvement; you may also want to disable validations or go with raw SQL generation instead – it all depends on your needs (a sketch of the validation-skipping variant follows below)
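
For example, if the data was already cleaned up by the PHP script and you only care about throughput, the save call in the function above could skip validations. This is just a sketch of that variant, using the standard ActiveRecord save(:validate => false) option and the same variable names as in the function above:

# inside run_migrate_task, replacing the plain new_object.save call
if new_object.save(:validate => false) # skip ActiveRecord validations for speed
  migrate_count += 1
else
  # with validations skipped, failures will mostly surface as exceptions
  # and be caught by the rescue above rather than end up here
  puts "ERROR | ID: #{old_object[p_key]} | MSG: #{new_object.errors.full_messages.to_sentence}"
  error_count += 1
end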

So I can call it like this:

  task :posts => :environment do
    setup_env # setup configuration, load classes and stuff
    rewrite = {
      :id         => "old_object.odpid",
      :topic_id   => "old_object.tematid",
      :user_id    => "old_object.autor",
      :content    => "html_to_bbc(old_object.tresc, true)", # call custom function here
      :updated_at => "old_object.edytowano > 0 ? Time.at(old_object.edytowano) : Time.at(old_object.dopisano)",
      :updated_by => "old_object.edytowal > 0 ? old_object.edytowal : nil",
      :created_at => "Time.at(old_object.dopisano)"
    }
    run_migrate_task(Post, LegacyPost, rewrite)
  end
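
For context, the snippet above is assumed to sit inside a namespace in a .rake file under lib/tasks; the namespace name and file path below are assumptions, shown only to make the snippet runnable end to end:

# lib/tasks/legacy.rake (hypothetical path and namespace)
namespace :legacy do
  desc "Migrate posts from the legacy CMS database"
  task :posts => :environment do
    # ... body as shown above ...
  end
end

# which would then be invoked with: rake legacy:posts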

Maybe it’s not the perfect solution, not the smartest or quickest, but in my opinion migrations are supposed to be dirty, doable in reasonable time, and they just should work. You won’t maintain this code in the future anyway, so it’s up to you how much effort you are willing to put in. As it’s my side project, I decided to be cheap ;).

In the next chapter I will give you some hints on how to convert pretty messed-up HTML to BBCode (don’t give me that look!) without using regular expressions (which don’t cut it in the end anyway). Cheers!

Updated 14.08.2012

After some investigation of my migration I realized that using include? on a huge Array is not the way to go in this case. Instead, convert the array to a Set. Why, you ask? Here you can find some useful information. A simple benchmark is below.

require 'set'
require 'benchmark'

array = (1..1_000_000).to_a
set = array.to_set

Benchmark.bm(15) do |x|
  x.report("Array.include?") { 1000.times { array.include?(500_000) } }
  x.report("Set.include?")   { 1000.times { set.include?(500_000) } }
end

                      user     system      total        real
Array.include?   20.940000   0.000000  20.940000 ( 20.991647)
Set.include?      0.000000   0.000000   0.000000 (  0.000199)
