Crawl the web breadth-first from a seed url, statefully
There are already tens of Node.js scripts that spider / crawl sites, and many more in Python and Ruby. It isn't clear how they work, or whether they even recurse.
This project is a work in progress. I intend to document at least its philosophy, if not methodology, better than the competition.
It's called **ruthless** because it does not respect `robots.txt`.
I wanted a stateful queue, in case of stops or restarts. Redis was the first choice, but I had too much metadata.
So I'm using Postgres right now. Not sure if that's the best idea. The `pages` table has these columns:

```
id, parent_id, url, tag, depth, content, plaintext, queued, fetched, failed, error
```
Each newly discovered link is queued with:

- `parent_id` set to the linking page's `id`
- `depth` set to the linking page's `depth + 1` if they are on the same domain (protocol and subdomain insensitive), or `depth + 100` if they do not share the domain (as defined by the `hostname` field that Node.js's `url.parse` returns); see the sketch below
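As an illustration only, here is a minimal sketch of that depth rule, assuming a naive "last two hostname labels" notion of domain and a hypothetical `childDepth` helper; it is not code from this repo:

```javascript
// Sketch only: a naive, hypothetical version of the depth rule described above.
var url = require('url');

// Ignore protocol and subdomains by comparing only the last two
// dot-separated labels of the hostname (e.g. "blog.example.com" -> "example.com").
function baseDomain(href) {
  var hostname = url.parse(href).hostname || '';
  return hostname.split('.').slice(-2).join('.');
}

function childDepth(parentUrl, parentDepth, childUrl) {
  var sameDomain = baseDomain(parentUrl) === baseDomain(childUrl);
  return parentDepth + (sameDomain ? 1 : 100);
}

// childDepth('http://blog.example.com/', 0, 'https://www.example.com/about') === 1
// childDepth('http://blog.example.com/', 0, 'http://other.org/')             === 100
```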
Here is a sample of depths retrieved for a single seed site (a blog) that I let run for a couple of minutes:
```sql
SELECT depth, COUNT(depth) FROM pages GROUP BY depth ORDER BY depth;
```

| Depth | Count |
|------:|------:|
|     0 |     1 |
|     1 |    81 |
|     2 |   319 |
|     3 |   731 |
|     4 |   851 |
|   100 |    22 |
|   101 |   151 |
|   102 |  1071 |
|   103 |  1593 |
As it was, I hadn't gotten through the 3-deep sites yet.
Question: should I even keep track of sites >10 deep? Do I care about other domains?
TODO: Add more than one worker! It's currently kind of slow, because most things happen in series.
Basically, I think the `seen` cache can handle most locking issues; as soon as `work()` fetches a new url, add the url to `seen`. Even if a page is fetched twice, it's not a big deal!
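Here is a rough sketch of that idea, with more than one worker on top of it; `fetchNextQueued`, `fetchPage`, and `markFetched` are hypothetical helpers around the `pages` table, not functions from this repo:

```javascript
// Sketch only: claim a url in the in-memory `seen` cache before the slow fetch,
// so concurrent workers rarely collide; a rare duplicate fetch is acceptable.
var seen = {};

function work() {
  fetchNextQueued(function(err, page) {              // hypothetical: next queued row, lowest depth first
    if (err || !page) return setTimeout(work, 1000); // nothing queued right now; poll again later
    if (seen[page.url]) return work();               // another worker already claimed this url
    seen[page.url] = true;                           // claim it before the slow fetch
    fetchPage(page, function(err) {                  // hypothetical: fetch, parse, queue outlinks
      markFetched(page, err, work);                  // hypothetical: flip queued/fetched/failed, then loop
    });
  });
}

// The TODO above: run several workers so fetches overlap instead of happening in series.
for (var i = 0; i < 4; i++) work();
```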
If the supplied credentials have superuser privileges, the database and `pages` table will be created automatically.
Otherwise, run the following at your command line to initialize (and reset) everything to the defaults:
```bash
dropdb ruthless
createdb ruthless
psql ruthless < schema.sql
```
Copyright © 2012–2013 Christopher Brown. MIT Licensed.