awesomeprogrammer.com

Rails Legacy Database Migration, Part 2

(Link to Part 1) Now let’s look at some tools that can help you migrate messy HTML into a format that is easier to maintain. The obvious choice for many would be Markdown, but since Markdown is quite geeky, I decided to go with a format that is well known to average users and hated by many programmers: BBCode. BBCode also lets me implement some custom tags that I needed anyway, so from now on we will stick with it.

How to parse HTML to BBCode in Ruby / Rails?

After some unsuccessful attempts with regular expressions, which aren’t really a solution to this problem, I settled on the following tools. Here are my weapons of choice:

  • Loofah: powered by Nokogiri and libxml2, it is an excellent choice
  • BBCoder: an easy-to-configure, clean BBCode-to-HTML parser
  • Tidy: a Ruby interface for HTML Tidy (if your content is really messy, as mine is, you can tidy it up a little before processing)
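
To show why regular expressions fall short here, a small sketch with made-up markup (the method name is hypothetical, not part of any library). A non-greedy pattern pairs the opening tag with the first closing tag it finds, which is the wrong one as soon as tags nest:

```ruby
# Naive attempt: turn bold spans into [b]...[/b] with a single substitution.
def naive_bold_to_bbcode(html)
  html.sub(%r{<span style="font-weight: bold">(.*?)</span>}, '[b]\1[/b]')
end

html = '<span style="font-weight: bold">outer <span>inner</span> tail</span>'

# The match stops at the inner </span>, so the result is broken markup:
puts naive_bold_to_bbcode(html) # prints "[b]outer <span>inner[/b] tail</span>"
```

A real parser knows which closing tag belongs to which opening tag, so it doesn't have this problem.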

Loofah lets you define custom scrubbers that we will later apply to our dirty, dirty HTML. I won’t repeat the documentation here, because you can easily read it yourself, and I really recommend getting to know how Nokogiri works and why it’s so awesome. Let’s cut straight to the problem.

Here is the code for my Loofah scrubber that converts HTML to BBCode:

require 'loofah'

# Loofah scrubber that rewrites HTML elements into their BBCode equivalents.
class HTMLBBCoder < Loofah::Scrubber

  def initialize
    # Visit children before their parents, so inner_html is already BBCode.
    @direction = :bottom_up
  end

  def scrub(node)
    return CONTINUE if node.text?

    case node.name
    when "i", "em"
      node.replace("[i]#{node.inner_html}[/i]")
    when "br"
      node.replace("\n")
    when "strong"
      node.replace("[b]#{node.inner_html}[/b]")
    when "a"
      # inner_html is never nil, so test for actual content instead
      if node['href'] && !node.inner_html.strip.empty?
        node.replace("[url=#{node['href']}]#{node.inner_html}[/url]")
      else
        node.remove # remove links without an href or without content
      end
    when "img"
      node.replace("[img]#{node['src']}[/img]")
    when "table", "td", "tr", "li", "b", "ol", "ul", "u"
      node.replace("[#{node.name}]#{node.inner_html}[/#{node.name}]")
    when "span", "div", "p" # handle text alignment, font-weight etc.
      style = node['style'].to_s # empty string when there is no style attribute
      if style.match(/bold/)
        node.replace("[b]#{node.inner_html}[/b]")
      elsif style.match(/italic/)
        node.replace("[i]#{node.inner_html}[/i]")
      elsif style.match(/underline/)
        node.replace("[u]#{node.inner_html}[/u]")
      elsif style.match(/center/) || node['align'].to_s.match(/center/)
        # align="center" is now honoured even without a style attribute
        node.replace("[center]#{node.inner_html}[/center]")
      else
        node.replace(node.inner_html) # no recognised styling: unwrap
      end
    when "font"
      if node['color']
        node.replace("[color=#{node['color']}]#{node.inner_html}[/color]")
      else
        node.inner_html.empty? ? node.remove : node.replace(node.inner_html)
      end
    else
      STOP
    end # case
  end # scrub
end # class

The scrubber walks the tree from the bottom up, rewrites HTML tags into BBCode tags, and along the way fixes some common problems I have encountered. You will probably need to tweak it for your needs, but at least now you have a base. It’s not the prettiest thing in the world, but it gets the job done.
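
To make the bottom-up idea concrete, here is a minimal sketch that uses only plain Ruby. The `Node` struct, `TAG_MAP` and `to_bbcode` are made up for illustration and are not Nokogiri or Loofah APIs. The point is that children are converted before their parent, so when a parent is rewritten its content is already BBCode:

```ruby
# Toy stand-in for an HTML node; this is NOT Nokogiri's API.
Node = Struct.new(:name, :children, :text)

# Hypothetical mapping from HTML tag names to BBCode tag names.
TAG_MAP = { "strong" => "b", "b" => "b", "em" => "i", "i" => "i", "u" => "u" }.freeze

# Post-order (bottom-up) conversion: render the children first, then wrap
# the already-converted content in the parent's BBCode tag.
def to_bbcode(node)
  return node.text if node.name == "text"

  inner = node.children.map { |child| to_bbcode(child) }.join
  tag = TAG_MAP[node.name]
  tag ? "[#{tag}]#{inner}[/#{tag}]" : inner # unknown tags are unwrapped
end

# Equivalent of: <p><strong>Hello <em>world</em></strong>!</p>
tree = Node.new("p", [
  Node.new("strong", [
    Node.new("text", [], "Hello "),
    Node.new("em", [Node.new("text", [], "world")])
  ]),
  Node.new("text", [], "!")
])

puts to_bbcode(tree) # prints "[b]Hello [i]world[/i][/b]!"
```

The real scrubber gets the same ordering from Loofah by setting the direction to :bottom_up.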

Now let’s get some help from Tidy before calling the scrubber. You can tweak the settings after reading the manual. I decided to go with:

require 'tidy'

clean = nil # declared outside the block so the result stays in scope afterwards
Tidy.open({:show_warnings => false,
           :wrap => 0,
           'char-encoding' => 'utf8',
           'output-bom' => 0,
           'show-body-only' => 1,
           'word2000' => 1,
           'merge-spans' => 1,
           'merge-divs' => 1,
           'drop-empty-paras' => true,
           'tidy-mark' => false,
           'drop-proprietary-attributes' => true,
           'quote-ampersand' => false,
           'force-output' => true}) do |tidy|
  clean = tidy.clean(text)
  # maybe here you'll want to check for errors/warnings
end

After tidying it up, you will probably have to force the encoding like this:

bbcoder = HTMLBBCoder.new
Loofah.fragment(clean.force_encoding('UTF-8')).scrub!(bbcoder).to_s
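
As a side note on what force_encoding actually does, here is a small sketch in plain Ruby. The `as_utf8` helper is hypothetical, and the binary-labelled input only simulates what an external library might hand back. force_encoding relabels the bytes without converting them, so it is worth checking that they really are valid UTF-8:

```ruby
# Relabel a byte string as UTF-8 and verify that the bytes are valid.
# as_utf8 is a made-up helper name, not part of Tidy or Loofah.
def as_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8) # relabels only, no transcoding
  raise ArgumentError, "not valid UTF-8" unless text.valid_encoding?
  text
end

raw = "caf\xC3\xA9".b # UTF-8 bytes for "café", labelled as binary
puts raw.encoding == Encoding::BINARY # prints "true"

fixed = as_utf8(raw)
puts fixed.encoding == Encoding::UTF_8 # prints "true"
puts fixed == "caf\u00E9"              # prints "true"
```

If the bytes are in some other encoding, such as Latin-1, relabelling is not enough and String#encode is the tool to reach for instead.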

Also, depending on your chunks of HTML, you may need to wrap your text in proper HTML tags such as <p> or <div>.
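
One way to do that wrapping, as a rough sketch: the `wrap_fragment` helper and the `BLOCK_WRAPPER` pattern are made up for illustration, and the check only looks at the very start of the string. Fragments that already begin with <p> or <div> are left alone, and everything else gets a <div> around it:

```ruby
# Matches fragments that already start with a <p> or <div> element.
BLOCK_WRAPPER = /\A\s*<(?:p|div)[\s>]/i

# Hypothetical helper: wrap bare text or inline markup in a <div>.
def wrap_fragment(html)
  return html if html =~ BLOCK_WRAPPER
  "<div>#{html}</div>"
end

puts wrap_fragment("hello <b>world</b>") # prints "<div>hello <b>world</b></div>"
puts wrap_fragment("<p>already wrapped</p>") # prints "<p>already wrapped</p>"
```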
