A simple web crawler for Node.js.
- persistent storage on local disk or in a database (currently supports MongoDB)
- resume and continue an interrupted crawl
- multiple paths with content-extraction patterns
- convert pages to Markdown
- auto-detect HTML encoding
- save images
npm install simple-node-crawler
var Crawler = require('simple-node-crawler');
var c = new Crawler({
  host: 'developer.51cto.com',
  patterns: [{ 'path': 'art/', 'pattern': '.m_l' }],
  usedb: true,
  saveImage: true
}).start('http://developer.51cto.com/col/1308/');
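If MongoDB is not installed, the options below describe a file-system-only mode. The following is a minimal sketch of that setup, assuming the same constructor and start() call as above; the host and selector values are illustrative placeholders.

```js
// Minimal sketch of local-disk storage, assuming usedb: false writes crawled
// pages to the file system as described in the options below.
// 'example.com' and the selector are illustrative placeholders.
var Crawler = require('simple-node-crawler');

new Crawler({
  host: 'example.com',                           // only follow URLs on this host
  patterns: [{ 'path': '', 'pattern': 'body' }], // crawl all paths, keep the whole <body>
  usedb: false,                                  // store pages on local disk instead of MongoDB
  saveImage: false                               // skip image downloads
}).start('http://example.com/');
```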
- host - host constraint for the crawl.
- patterns - to crawl a specific path, specify the path name, or leave it as ''; the pattern is the CSS selector for the main body of the page, with id, class, and tag-name selectors supported. If you need the whole HTML body, specify 'body'.
- usedb - set to false to use the local file system; if you have MongoDB installed and want to use it, set to true.
- saveImage - whether to save images to the local file system.
- dbConnectionString - MongoDB connection string. Defaults to 'mongodb://localhost/test'.
- utf8 - whether to convert pages to UTF-8. Defaults to true.
- crawlerNumber - how many crawler threads to run. Defaults to 5.
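Putting the options together, a fuller configuration might look like the sketch below; every value is an illustrative placeholder, and only the options documented above are used.

```js
// Sketch combining the documented options; all values are placeholders.
var Crawler = require('simple-node-crawler');

new Crawler({
  host: 'example.com',
  patterns: [
    { 'path': 'articles/', 'pattern': '.article-content' }, // extract content by class
    { 'path': 'news/', 'pattern': 'body' }                  // keep the whole HTML body
  ],
  usedb: true,
  dbConnectionString: 'mongodb://localhost/crawler', // defaults to 'mongodb://localhost/test'
  saveImage: true,
  utf8: true,                                        // convert pages to UTF-8 (the default)
  crawlerNumber: 10                                  // run 10 crawler threads (default is 5)
}).start('http://example.com/articles/');
```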
- keyword analysis & extraction
License: MIT