IBM recently published an article on creating spiders with Linux. I’ve always been fascinated with how this was done and suspected it couldn’t be that difficult. Reading the article does, however, make you appreciate the logic that has gone into the big search engines’ spiders.
After a uniquely timed request to scrape some data from the web, it was like God was telling me, “Now is the time to learn web scraping and spidering.” So, I jumped at the opportunity.
Fortunately, the first few examples in the article used Ruby. The last one used Python because of its built-in HTML parsing library. I was able to find an HTML parsing gem for Ruby called hpricot, which I found to work very well.
Unfortunately, the request came from a confidential project, so I can’t share my code at this point. Nonetheless, it was quite a fun little project, which included using ActiveRecord from Rails to store my scraped data in a MySQL database.
Check out the article and have fun programming. I know I did. Oh yeah, I looked at MySpace’s robots.txt file, and it only specified limits for a user agent I didn’t recognize. I guess I should do a little digging to see whether my scraper violates any terms of service.
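Checking a robots.txt yourself takes only a few lines of plain Ruby. This is a rough sketch of the idea, not a full implementation of the robots exclusion standard (it ignores wildcards, `Allow` precedence, and grouped `User-agent` lines); the helper names and sample rules are my own.

```ruby
# Collect the Disallow paths that apply to a given user agent.
def disallowed_paths(robots_txt, agent)
  paths = []
  applies = false
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip
    case line
    when /\AUser-agent:\s*(.+)\z/i
      ua = $1
      applies = (ua == '*' || agent.downcase.include?(ua.downcase))
    when /\ADisallow:\s*(\S*)\z/i
      paths << $1 if applies && !$1.empty?
    end
  end
  paths
end

# A path is allowed if no applicable Disallow rule prefixes it.
def allowed?(robots_txt, agent, path)
  disallowed_paths(robots_txt, agent).none? { |p| path.start_with?(p) }
end

robots = <<-TXT
User-agent: *
Disallow: /private/
Disallow: /tmp/
TXT

puts allowed?(robots, 'MyScraper/1.0', '/public/page')
puts allowed?(robots, 'MyScraper/1.0', '/private/page')
```

Of course, robots.txt is only a courtesy convention; a site’s actual terms of service are a separate (and legal) question.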