A simple web crawler for Node.js.
- persistent storage on local disk or in a database (currently supports MongoDB)
- resume and continue an interrupted crawl
- multiple paths with content-extraction patterns
- convert pages to Markdown
- auto-detect HTML encoding
- save images
npm install simple-node-crawler
var Crawler = require('simple-node-crawler');
var c = new Crawler({
  host: 'developer.51cto.com',
  patterns: [{ 'path': 'art/', 'pattern': '.m_l' }],
  usedb: true,
  saveImage: true
}).start('http://developer.51cto.com/col/1308/');
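If MongoDB is not installed, the options below describe a file-system-only mode. The following is a minimal sketch of that setup, assuming the same constructor and start() call as above; the host and selector values are illustrative placeholders.

```js
// Minimal sketch of local-disk storage, assuming usedb: false writes crawled
// pages to the file system as described in the options below.
// 'example.com' and the selector are illustrative placeholders.
var Crawler = require('simple-node-crawler');

new Crawler({
  host: 'example.com',                           // only follow URLs on this host
  patterns: [{ 'path': '', 'pattern': 'body' }], // crawl all paths, keep the whole <body>
  usedb: false,                                  // store pages on local disk instead of MongoDB
  saveImage: false                               // skip image downloads
}).start('http://example.com/');
```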
- host - host constraint for the crawl.
- patterns - to crawl a specific path, specify the path name, or leave it as ''; the pattern is the CSS selector for the main body of the page, with id, class, and tag-name selectors supported. If you need the whole HTML body, specify 'body'.
- usedb - set to false to use the local file system; if you have MongoDB installed and want to use it, set to true.
- saveImage - whether to save images to the local file system.
- dbConnectionString - MongoDB connection string. Defaults to 'mongodb://localhost/test'.
- utf8 - whether to convert pages to UTF-8. Defaults to true.
- crawlerNumber - how many crawler threads to run. Defaults to 5.
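Putting the options together, a fuller configuration might look like the sketch below; every value is an illustrative placeholder, and only the options documented above are used.

```js
// Sketch combining the documented options; all values are placeholders.
var Crawler = require('simple-node-crawler');

new Crawler({
  host: 'example.com',
  patterns: [
    { 'path': 'articles/', 'pattern': '.article-content' }, // extract content by class
    { 'path': 'news/', 'pattern': 'body' }                  // keep the whole HTML body
  ],
  usedb: true,
  dbConnectionString: 'mongodb://localhost/crawler', // defaults to 'mongodb://localhost/test'
  saveImage: true,
  utf8: true,                                        // convert pages to UTF-8 (the default)
  crawlerNumber: 10                                  // run 10 crawler threads (default is 5)
}).start('http://example.com/articles/');
```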
- keyword analysis & extraction
License: MIT