So I decided to write a Web Spider

This month I got an itch to start spidering various web-sites, generating link graphs or just collecting “interesting” URLs. I tried writing a flexible spider with plenty of call-backs and filters, based on Aaron Patterson’s four-line web spider using WWW::Mechanize. Soon I realized a major problem with this approach: WWW::Mechanize caches every page it visits in memory. This slows things down considerably when spidering the likes of www.wired.com (they must be getting tired of me using them as a test-site for all things HTTP/HTML).
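To make the problem concrete, here’s roughly what that kind of Mechanize-based spider looks like. This is my own reconstruction rather than Aaron’s actual code, with example.com standing in for a real site:

```ruby
require 'rubygems'
require 'mechanize'

# Start at a seed URL and keep following every link we find.
# (No visited-URL check here, so a real run will revisit pages endlessly.)
agent = WWW::Mechanize.new
queue = agent.get('http://www.example.com/').links

while link = queue.shift
  begin
    page = link.click          # fetch the linked page
    queue.concat(page.links)   # enqueue every link on that page
  rescue
    next                       # skip dead links, non-HTML responses, etc.
  end
end
```

The catch is that every page fetched this way also lands in the agent’s history, so memory usage climbs steadily on a large site.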

So I started looking for a decent web-spider that someone had previously written and documented. There’s Spider (quite the imaginative name), which has a very powerful call-back simply called on. There’s also SimpleCrawler, which lets you spider an entire domain.

Spidr Banner

Based on the features of these two projects I created Spidr. Now you know why all the new projects are misspelled; all the good names are taken. Spidr provides black-lists and white-lists for hosts, ports, link patterns, and link extensions. Spidr also provides call-backs for every page visited, every URL visited, and every URL that matches a specified pattern. When Spidr fetches a page, it creates a temporary Spidr::Page object, which has many convenience methods for your programming enjoyment.
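To give a rough idea of how those pieces fit together, here’s a minimal sketch. The URL is just a placeholder, and the exact option and call-back names may differ slightly from the released gem, so check the docs rather than copying this verbatim:

```ruby
require 'rubygems'
require 'spidr'

# Spider a whole site, skipping image and stylesheet extensions.
Spidr.site('http://www.example.com/',
           :ignore_exts => ['jpg', 'png', 'gif', 'css', 'js']) do |spidr|

  # Runs with a Spidr::Page object for every page that gets fetched.
  spidr.every_page do |page|
    puts "#{page.url} (#{page.code})"
  end

  # Runs for every URL that matches the given pattern.
  spidr.urls_like(/\.pdf$/) do |url|
    puts "Found a PDF: #{url}"
  end
end
```

The host, port, and link-pattern black-lists and white-lists are passed in the same way as the :ignore_exts option above.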

In the next week I’ll try to post some examples of Spidr doing some common and interesting tasks. Also, if anyone has questions or feature suggestions, please leave a comment.
