bot-marvin
A highly scalable, distributed web crawler.
Main features:
- Asynchronous crawling
- Distributed breadth-first crawls
- Scales horizontally as well as vertically
- Url partitioning for better scheduling
- Scheduling using fetch interval and priority
- Supports robots.txt and sitemap.xml parsing
- Uses Apache Tika for file parsing
- Web app for viewing crawled data and analytics
- Fault tolerant, with automatic recovery on failures
- Broad support for meta tags and HTTP status codes
- Support for all the tags recommended in Google's crawl guide
- Builds a web graph
- Collects RSS feeds and author info
- Pluggable parsers
- Pluggable indexers (currently MongoDB supported)
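To make the "URL partitioning" item in the list above more concrete, here is a minimal sketch of grouping URLs by host so that each host can be scheduled separately. The helper name `partitionByHost` is hypothetical, and using the hostname as the partition key is an assumption, not a description of how bot-marvin partitions internally.

```javascript
// Hypothetical helper: group URLs into per-host partitions.
// Assumption: the hostname is the partition key; bot-marvin may use a different key.
function partitionByHost(urls) {
  const partitions = new Map();
  for (const raw of urls) {
    let host;
    try {
      host = new URL(raw).hostname; // WHATWG URL parser built into Node.js
    } catch (err) {
      continue; // skip strings that are not valid absolute URLs
    }
    if (!partitions.has(host)) {
      partitions.set(host, []);
    }
    partitions.get(host).push(raw);
  }
  return partitions;
}

// Example: two hosts, one invalid entry that is skipped.
const partitions = partitionByHost([
  'http://www.imdb.com/chart/top',
  'http://www.elastic.co/products',
  'http://www.imdb.com/title/tt0111161/',
  'not a url',
]);
for (const [host, hostUrls] of partitions) {
  console.log(host, hostUrls.length);
}
```

Keeping one list per host makes it straightforward to apply per-host limits or politeness delays when building a batch.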
Install
sudo npm install bot-marvin
Starting your first crawl
    //You need to create a seed.json file first
    //it looks like this
    [
      {
        "_id": "http://www.imdb.com",
        "parseFile": "nutch",
        "priority": 1,
        "fetch_interval": "monthly",
        "limit_depth": -1
      },
      {
        "_id": "http://www.elastic.co",
        "parseFile": "nutch",
        "priority": 1,
        "fetch_interval": "monthly",
        "limit_depth": -1
      },
      {
        "_id": "http://www.rottentomatoes.com",
        "parseFile": "nutch",
        "priority": 1,
        "fetch_interval": "monthly",
        "limit_depth": 10
      }
    ]
    /*
      _id            : the URL to crawl
      parseFile      : the name of a file present in the parsers dir (default: 'nutch')
      priority       : a value from 1-100 giving the percentage of a single crawl job's
                       urls that come from this domain.
                       Number of urls of a domain in batch = (priority/100) * batch_size
      fetch_interval : the recrawl interval. Supported values: (always|weekly|monthly|yearly).
                       You can add custom time intervals in the config.
      limit_depth    : restricts crawling by depth; -1 means no depth limit
    */
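The priority formula above can be worked through with a short sketch. The function name `urlsPerBatch` is hypothetical, and rounding down to a whole number of URLs is an assumption, since the description does not say how fractional results are handled.

```javascript
// Number of urls of a domain in batch = (priority / 100) * batch_size
// Assumption: fractional results are rounded down.
function urlsPerBatch(priority, batchSize) {
  if (!Number.isInteger(priority) || priority < 1 || priority > 100) {
    throw new RangeError('priority must be an integer from 1 to 100');
  }
  // Multiply before dividing: algebraically the same formula,
  // but it avoids floating-point error from computing priority / 100 first.
  return Math.floor((priority * batchSize) / 100);
}

console.log(urlsPerBatch(1, 1000));  // 10
console.log(urlsPerBatch(25, 1000)); // 250
```

With a batch size of 1000, a seed at priority 1 contributes 10 URLs to each batch, so the three seeds in the example file would together fill only a small part of it.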
    # Step 1 Set your db configuration
    sudo bot-marvin-db
    # Step 2 Set your bot config
    sudo bot-marvin --config
    # Step 3 Load your seed file
    sudo bot-marvin --loadSeedFile <path_to_your_seed_file>
    # Step 4 Run your crawler
    sudo bot-marvin
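Before loading a seed file in Step 3, it can help to sanity-check each entry against the field descriptions given earlier. The sketch below is not part of bot-marvin; `validateSeedEntry` and `BUILT_IN_INTERVALS` are hypothetical names, and the exact rules (for example, treating `parseFile` as optional because it has a default) are assumptions drawn from those descriptions.

```javascript
// Built-in recrawl intervals named in the seed file description.
const BUILT_IN_INTERVALS = ['always', 'weekly', 'monthly', 'yearly'];

// Hypothetical checker: returns a list of problems found in one seed entry.
// extraIntervals holds any custom interval names defined in your config.
function validateSeedEntry(entry, extraIntervals = []) {
  const problems = [];
  try {
    new URL(entry._id); // throws if _id is not a valid absolute URL
  } catch (err) {
    problems.push('_id is not a valid absolute URL');
  }
  // Assumption: parseFile may be omitted because it defaults to 'nutch'.
  if (entry.parseFile !== undefined &&
      (typeof entry.parseFile !== 'string' || entry.parseFile === '')) {
    problems.push('parseFile must be a non-empty string');
  }
  if (!Number.isInteger(entry.priority) || entry.priority < 1 || entry.priority > 100) {
    problems.push('priority must be an integer from 1 to 100');
  }
  const allowed = [...BUILT_IN_INTERVALS, ...extraIntervals];
  if (!allowed.includes(entry.fetch_interval)) {
    problems.push('fetch_interval must be one of: ' + allowed.join(', '));
  }
  // Assumption: -1 (no limit) is the only meaningful negative depth.
  if (!Number.isInteger(entry.limit_depth) || entry.limit_depth < -1) {
    problems.push('limit_depth must be an integer >= -1');
  }
  return problems;
}

console.log(validateSeedEntry({
  _id: 'http://www.imdb.com',
  parseFile: 'nutch',
  priority: 1,
  fetch_interval: 'monthly',
  limit_depth: -1,
})); // []
```

Returning a list of problems instead of throwing on the first one lets you report every issue in a seed file in a single pass.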
Contributing
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D
Documentation is available at http://tilakpatidar.github.io/bot-marvin
Stuff used to make this:
- request for making HTTP requests
- mongodb for MongoDB connectivity
- underscore for JS utility functions
- immutable for advanced data structures
- check-types for strict type checking
- cheerio for parsing HTML pages
- robots for parsing robots.txt files
- colors for colored console output
- crypto for encryption
- death for handling graceful exit
- minimist for command-line argument parsing
- progress for download progress bars
- string-editor for a nano-like editor to edit the config from the terminal
- node-static for serving the web app
- feed-read for parsing RSS feeds