A quick guide to getting Hpricot to work for you


Hpricot is a cool tool for Ruby that performs fast and efficient web page scraping by leveraging the DOM of a web page and the power of XPath. The DOM is just a tree structure (defined by the HTML code), and XPath lets you query this tree structure as if it were XML. (Recall that most web pages are now XHTML compliant.)

The problem with Hpricot is that XPath is not all that fun to work with, but there are some things you can do to make life easier on yourself. For one, there are several tools out there that will give you the explicit (absolute) XPath for any element in the DOM (the Firebug extension for Firefox comes to mind). Something to note when taking this approach is that the resulting XPath query is not always reliable, because browsers do not all interpret HTML the same way. In the case of Firefox, I found that it would transparently insert <tbody> tags under <table> tags in the DOM tree, even though they were not in the HTML code.

Then there is MY solution: make Hpricot work for you. Hpricot has the ability to build your query string for you. The steps are as follows:

  1. First, search for the DOM node
  2. Reverse-generate the XPath query string
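If you want to try the two steps before pointing Hpricot at a live page, here is a minimal sketch that runs offline. It is an assumption-heavy stand-in, not Hpricot itself: it uses Ruby's REXML library, a made-up XHTML snippet, and the standard XPath contains() function, which is not how Hpricot spells a substring match. The idea is the same though: find the node by its text, then ask the node for its own path.

```ruby
require 'rexml/document'

# A tiny well-formed XHTML page, invented for this sketch.
XHTML = <<~PAGE
  <html>
    <body>
      <p>intro</p>
      <p>read <a href="#">more</a> here</p>
    </body>
  </html>
PAGE

doc = REXML::Document.new(XHTML)

# Step 1: search for the DOM node by the text it contains.
# contains() is standard XPath 1.0, used here because REXML
# (the stand-in library) does not know Hpricot's own operators.
matches = REXML::XPath.match(doc, "//*[contains(text(), 'more')]")

# Step 2: reverse-generate the path that leads to each match.
matches.each do |node|
  puts "[#{node.xpath}] => #{node}"
end
```

Each line of output pairs a generated path with the markup of the node it points at, which is the same pairing the Hpricot approach is after.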

To find the DOM node, you will need to use the following query string:

//text()[text()*='text to search for']

The above query string will locate any nodes that contain the text “text to search for” somewhere in their text. (The *= operator is Hpricot's substring match; it is not part of standard XPath.) If you want to search for an exact match, remove the asterisk from the query. The query will return a list of matching nodes, and you can ask each node for its XPath. That’s it. Here is some basic Ruby code to get the job done.

require 'rubygems'
require 'hpricot'
require 'open-uri'

uri = 'http://www.google.ca/'
query = "//text()[text()*='more']"
doc = Hpricot(open(uri))

doc.search(query).each do |row|
    # print the XPath that leads to this node
    puts "[#{row.xpath}] => "
    # print the row data
    puts "#{row.to_html}\n"
end

This will print out a list of the DOM tree nodes that contain the text you're searching for, along with the XPath query it takes to get to each one specifically. Happy scraping!
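P.S. If you do end up starting from a browser-generated XPath after all, the <tbody> problem mentioned earlier can sometimes be patched up by hand. Below is a rough sketch of that idea, using a hypothetical helper named strip_tbody (my name, not part of Hpricot) and a made-up example path. It assumes every table in the path has only the single <tbody> the browser added; if the real HTML contains its own <tbody> tags, or more than one per table, removing the step would change which rows the path points at.

```ruby
# Hypothetical helper: drop "tbody" steps from an absolute XPath
# that was copied out of a browser's DOM inspector.
def strip_tbody(xpath)
  # Match "/tbody" or "/tbody[1]" only when it is a whole path step,
  # i.e. when it is followed by another "/" or by the end of the string.
  xpath.gsub(%r{/tbody(?:\[\d+\])?(?=/|\z)}, '')
end

# Made-up example path of the kind a browser tool might report.
browser_path = '/html/body/table/tbody/tr[2]/td[1]'

puts strip_tbody(browser_path)
# prints "/html/body/table/tr[2]/td[1]"
```

The cleaned-up path can then be handed to doc.search like any other query string, though letting Hpricot generate the path is still the less fragile route.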
