Introducing GScraper 0.1.3

Today GScraper 0.1.3 was released into the wild. GScraper came about after countless sessions of data exploration in IRB, wondering “I wish I could cross-reference this data with Google Search… oh right, they disabled their Search API, because they're evil.” Furthermore, many of the Ruby GData APIs leave a lot of the GData-ness exposed, whereas I would prefer a nicely abstracted data layer. On top of that, GData requires a GMail account to authenticate with; I just want to get some search data, not give them more of my own. Thus GScraper was born: a Ruby web-scraping interface to various Google services.

GScraper currently supports Google's Search service, with support for other search services in the works. It can also access Google services with custom User-Agent strings. GScraper requires Hpricot and Mechanize for its web-scraping functionality.

To install GScraper, simply run the following command:

$ sudo gem install gscraper

Here are some examples of GScraper in action:

  • Basic query:
      q = GScraper::Search.query(:query => 'ruby')
  • Advanced query:
      q = GScraper::Search.query(:query => 'ruby') do |q|
        q.without_words = 'vs.'
        q.within_past_day = true
        q.numeric_range = 2..10
      end
  • Queries from URLs:
      q = GScraper::Search.query_from_url('http://www.google.com/search?as_q=ruby&as_epq=&as_oq=rails&as_ft=i&as_qdr=all&as_occt=body&as_rights=%28cc_publicdomain%7Ccc_attribute%7Ccc_sharealike%7Ccc_noncommercial%29.-%28cc_nonderived%29')
      q.query # => "ruby"
      q.with_words # => "rails"
      q.occurrs_within # => :body
      q.rights # => :cc_by_nc
  • Getting the results:
      q.first_page.select { |result| result.title =~ /Blog/ }
      q.page(2).map { |result| result.title.reverse }
  • A Result object contains the rank, title, summary and URL of the search result.
      q.page(2).urls # => [...]
      q.page(3).summaries # => [...]
      q.first_page.ranks_of { |result| result.url =~ /^https/ } # => [...]
      q.first_page.titles_of { |result| result.summary =~ /password/ } # => [...]
  • Iterating over the results:
      q.each_on_page(2) do |result|
        puts result.title
      end
  • Setting the User-Agent globally:
      GScraper.user_agent # => nil
      GScraper.user_agent = 'Awesome Browser v1.2'
  • Setting the User-Agent per call:
      q.page(3, :user_agent => "I am not a Bot")
      q.page(4, :user_agent_alias => "Windows IE 7")
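Under the hood, query options like those in the advanced example ultimately serialize into Google's `as_*` search parameters, the same ones visible in the query-from-URL example above. Here is a rough, hypothetical sketch of that mapping; `search_url` is an illustrative helper written against Ruby's standard library, not GScraper's actual code:

```ruby
require 'cgi'

# Hypothetical sketch: serialize a few query options into Google's
# advanced-search URL parameters (as_eq, as_qdr, as_nlo/as_nhi).
def search_url(opts)
  params = {'q' => opts[:query]}
  params['as_eq']  = opts[:without_words] if opts[:without_words]
  params['as_qdr'] = 'd'                  if opts[:within_past_day]

  if opts[:numeric_range]
    params['as_nlo'] = opts[:numeric_range].first.to_s
    params['as_nhi'] = opts[:numeric_range].last.to_s
  end

  query = params.map { |name, value| "#{name}=#{CGI.escape(value)}" }.join('&')
  "http://www.google.com/search?#{query}"
end

search_url(:query => 'ruby', :without_words => 'vs.', :within_past_day => true)
# => "http://www.google.com/search?q=ruby&as_eq=vs.&as_qdr=d"
```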

Finally, the documentation can be found at gscraper.rubyforge.org. Enjoy.

