simple-node-crawler

1.0.5 • Public • Published

A simple web crawler on node.

features

  • persist crawled pages to the local file system or a database (MongoDB is currently supported)
  • resume and continue an interrupted crawl
  • multiple paths with content-extraction patterns
  • convert extracted content to markdown
  • auto-detect HTML encoding
  • save images
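
The encoding auto-detection presumably starts by sniffing the charset declared in the page's own `<meta>` tag. This is only an illustrative sketch of that idea, not the crawler's actual implementation; the function name `sniffCharset` and the `utf-8` fallback are assumptions for this example.

```javascript
// Illustrative sketch: read the charset from an HTML <meta> tag.
// Handles both <meta charset="..."> and the older
// <meta http-equiv="Content-Type" content="...; charset=..."> form.
function sniffCharset(html) {
  var m = /<meta[^>]+charset=["']?([\w-]+)/i.exec(html);
  return m ? m[1].toLowerCase() : 'utf-8'; // assumed fallback
}

console.log(sniffCharset('<meta charset="GBK">'));
console.log(sniffCharset('<p>no meta tag</p>'));
```

A real crawler would then decode the raw response bytes with the detected charset (e.g. via a library such as iconv-lite) before converting to UTF-8.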

install

npm install simple-node-crawler

usage

var Crawler = require('simple-node-crawler');

var c = new Crawler({
	host:'developer.51cto.com',
	patterns: [{'path': 'art/', 'pattern': '.m_l' }],
	usedb: true,
	saveImage: true
}).start('http://developer.51cto.com/col/1308/');

configuration

  • host - restrict crawling to this host.
  • patterns - to crawl a specific path, set path to that path name, or leave it as '' to crawl everything under the host. pattern is a CSS selector for the main body of the page; id, class, and tag-name selectors are supported. Use 'body' if you need the whole HTML body.
  • usedb - set to false to store results on the local file system; set to true to store them in MongoDB (requires a running MongoDB instance).
  • saveImage - whether to save images to the local file system.
  • dbConnectionString - MongoDB connection string. Defaults to 'mongodb://localhost/test'.
  • utf8 - whether to convert pages to UTF-8. Defaults to true.
  • crawlerNumber - how many concurrent crawler workers to run. Defaults to 5.
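
To make the option set above concrete, here is a sketch of how user options might merge with defaults. Only the dbConnectionString, utf8, and crawlerNumber defaults are documented; the false defaults for usedb and saveImage, and the `mergeOptions` helper itself, are assumptions for this illustration.

```javascript
// Illustrative defaults object; values marked "assumed" are not documented.
var defaults = {
  host: '',
  patterns: [{ path: '', pattern: 'body' }],       // assumed default
  usedb: false,                                    // assumed default
  saveImage: false,                                // assumed default
  dbConnectionString: 'mongodb://localhost/test',  // documented default
  utf8: true,                                      // documented default
  crawlerNumber: 5                                 // documented default
};

// Shallow-merge user options over the defaults.
function mergeOptions(userOptions) {
  return Object.assign({}, defaults, userOptions || {});
}

var opts = mergeOptions({ host: 'developer.51cto.com', usedb: true });
console.log(opts.usedb);         // true  (user override)
console.log(opts.crawlerNumber); // 5     (documented default)
```

Note the merge is shallow: passing a patterns array replaces the default array entirely rather than merging entry by entry.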

features to be implemented

  • keyword analysis & extraction

license

MIT

collaborators

  • aaronzhcl