Light Crawler - Directed Crawler
A simplified directed web crawler, easy to use for scraping pages and downloading resources.
English Doc (here) or Chinese Doc.
Install
```shell
npm install light-crawler
```
Example
```js
const Crawler = require('light-crawler');
// create an instance of Crawler
let c = new Crawler();
// add a url or an array of urls to request
c.addTasks('http://www.xxx.com');
// define a scraping rule
c.addRule(function (result) {
    // result.body is the html of the requested page
});
// start your crawler
c.start(function () {
    console.log('Finished!');
});
```
Crawler Property
In light-crawler, requesting a page is called a task. Tasks are put into the task-pool and executed in order.

- `settings`: basic settings of the crawler
	- `id`: id of the crawler, integer or string, default: `null`
	- `interval`: crawling interval, default: `0` (ms), or a random value in a range, e.g. `[200, 500]`
	- `retry`: retry times, default: `3`
	- `concurrency`: an integer determining how many tasks are run in parallel, default: `1`
	- `skipDuplicates`: whether to skip duplicate tasks (same url), default: `false`
	- `requestOpts`: request options of tasks; these are the global request options
		- `timeout`: default: `10000`
		- `proxy`: proxy address
		- `headers`: headers of the request, default: `{}`
		- or other settings in request opts
- `taskCounter`: counts all finished tasks, whether they failed or not
- `failCounter`: counts all failed tasks
- `doneCounter`: counts all tasks which are done
- `started`: boolean
- `finished`: boolean
- `errLog`: records all error info during crawling
- `downloadDir`: downloaded files are saved here, default: `../__dirname`
- `drainAwait`: the crawler finishes when the task-pool is drained; this prop makes the crawler await the adding of new tasks for this long when the task-pool is drained, default: `0` (ms)
- `tasksSize`: size of the task-pool; exceeding tasks wait in the task-pool's buffer, default: `50`
- `logger`: show the console log, default: `false`
Crawler API
Crawler(opts: object)
constructor of Crawler
```js
// e.g.:
let c = new Crawler({
    interval: 1000,
    retry: 5,
    // ... other props of `crawler.settings`
    requestOpts: {
        timeout: 5000,
        proxy: 'http://xxx',
        // ... other props of `crawler.requestOpts`
    }
});
```
tweak(opts: object)
tweak the settings of the crawler
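The merge behavior of `tweak` isn't spelled out here. As a rough, hypothetical sketch (the `mergeSettings` helper below is my own illustration, not the library's code), tweaking can be thought of as new options overriding existing settings while untouched keys survive:

```javascript
// Hypothetical illustration of tweak()'s assumed merge semantics:
// keys in `opts` override `settings`, everything else is kept.
function mergeSettings(settings, opts) {
  return Object.assign({}, settings, opts);
}

const settings = { interval: 0, retry: 3, concurrency: 1 };
const tweaked = mergeSettings(settings, { retry: 5, interval: [200, 500] });
// `concurrency` is untouched; `retry` and `interval` are overridden
```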
addTasks(urls: string or array[, props: object])
add tasks into the task-pool
```js
// e.g.
// add a single task
// input: url
c.addTasks('http://www.google.com');

// input: url, props
// set request options for this task (will override the global ones)
c.addTasks('http://www.google.com', { requestOpts: { timeout: 1 } });

// input: url, next (processor of the task)
// crawler rules will not process this task again
c.addTasks('http://www.google.com', function (result) {
    // process the result here
});

// input: url, props, next
c.addTasks('http://www.google.com', { name: 'google' }, function (result) {
    // process the result here
});

// or input an object
c.addTasks({ url: 'http://www.google.com', name: 'google' });

// add multiple tasks
// input: an array of strings
c.addTasks(['http://www.google.com', 'http://www.yahoo.com']);

// add props for these tasks
c.addTasks(['http://www.google.com', 'http://www.yahoo.com'], { type: 'search-engine' });
// get these props in the processing function
c.addRule(function (result) {
    // the props you added (e.g. `type`) are carried along with the task
});

// input: an array of objects
c.addTasks([
    { url: 'http://www.google.com', name: 'google' },
    { url: 'http://www.yahoo.com', name: 'yahoo' }
]);
```
addRule(reg: string|object, func: function)
define a rule for scraping
```js
// e.g.:
let tasks = [
    'http://www.google.com/123',
    'http://www.google.com/2546',
    'http://www.google.com/info/foo',
    'http://www.google.com/info/123abc'
];
c.addTasks(tasks);
c.addRule('http://www.google.com/**', function (result) {
    // matches all four tasks above
});
c.addRule('http://www.google.com/info/**', function (result) {
    // matches the last two tasks above
});
// or you can define no rule string
c.addRule(function (result) {
    // processes every task
});
// $ (i.e. cheerio.load(result.body)) is an optional arg
c.addRule(function (result, $) {
    // use $ to scrape the page
});
```
Tip: light-crawler will transform all `.` in a rule string, so you can directly write `www.a.com` instead of `www\\.a\\.com`. If you need `.*`, you can use `**`, just like the example above. If you have to use the regex `.`, just write `<.>`.
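The transformation in this tip can be sketched in a few lines (my own illustration of the assumed semantics, not the library's source; `ruleToRegExp` is a hypothetical name):

```javascript
// Sketch of the assumed rule-string transformation:
// plain '.' is escaped, '**' becomes '.*', and '<.>' yields a literal regex '.'.
function ruleToRegExp(rule) {
  const pattern = rule
    .replace(/<\.>/g, '\u0000') // protect <.> with a placeholder
    .replace(/\./g, '\\.')      // escape every plain dot
    .replace(/\*\*/g, '.*')     // expand the ** wildcard
    .replace(/\u0000/g, '.');   // restore <.> as a regex dot
  return new RegExp(pattern);
}
```

So `'www.google.com/**'` becomes the pattern `www\.google\.com/.*`, which matches `http://www.google.com/123` but not `http://wwwxgoogle.com/`.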
start()
start the crawler
```js
// e.g.:
c.start(function () {
    console.log('Crawler is finished');
});
```
pause()
pause the crawler
resume()
resume the crawler
isPaused()
whether the crawler is paused or not
stop()
stop the crawler
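To make the pause/resume semantics concrete, here is a minimal, self-contained sketch of a pausable task queue (purely illustrative; light-crawler manages its task-pool internally):

```javascript
// Toy pausable queue: tasks added while paused wait until resume().
class PausableQueue {
  constructor() {
    this.paused = false;
    this.pending = [];
    this.done = [];
  }
  pause() { this.paused = true; }
  resume() { this.paused = false; this.drain(); }
  isPaused() { return this.paused; }
  add(task) { this.pending.push(task); this.drain(); }
  drain() {
    // run tasks only while not paused
    while (!this.paused && this.pending.length > 0) {
      this.done.push(this.pending.shift());
    }
  }
}

const q = new PausableQueue();
q.pause();
q.add('http://www.example.com'); // queued, not run
q.resume();                      // now it runs
```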
uniqTasks()
remove duplicate tasks (deep comparison)
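"Deep comparison" means tasks count as duplicates when their contents match, not just their references. A rough sketch of that idea (using JSON serialization as a stand-in for a real deep compare; note this simplification is sensitive to key order):

```javascript
// Illustrative dedup-by-value: two tasks with identical contents collapse to one.
function uniqByValue(tasks) {
  const seen = new Set();
  return tasks.filter(task => {
    const key = JSON.stringify(task);   // crude deep "fingerprint"
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const tasks = [
  { url: 'http://a.com', downloadTask: true },
  { url: 'http://a.com', downloadTask: true }, // duplicate by value
  { url: 'http://b.com' }
];
const unique = uniqByValue(tasks); // two tasks survive
```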
log(info: string, isErr: boolean, type: int)
crawler's logger
```js
// e.g.:
// if it's an error, c.errLog will append it
c.log('[Crawl Error] some problems', true);
// console prints:
// [c.settings.id if it has one][Crawl Error] some problems

// type is the color code of the first '[...]', e.g. '[Crawler is Finished]'
// 1 red, 2 green, 3 yellow, 4 blue, 5 magenta, 6 cyan ... and so on
c.log('[Parsed] blahblah~', false, 4);
// console prints:
// [c.settings.id if it has one][Parsed] blahblah~   ([Parsed] will be blue)

// you can do something after log() every time
```
Download Files
just add `downloadTask: true` to tasks you need to download
```js
// e.g.:
// specify the download directory
c.tweak({ downloadDir: 'D:\\yyy' });

let file = 'http://xxx/abc.jpg';
// 'abc.jpg' will be downloaded into 'D:\\yyy'
c.addTasks(file, { downloadTask: true });
// or you can specify its name
c.addTasks(file, { downloadTask: true, downloadFile: 'mine.jpg' });
// or specify a relative dir (to 'D:\\yyy')
// if this directory ('jpg') doesn't exist, crawler will create it
c.addTasks(file, { downloadTask: true, downloadFile: 'jpg/mine.jpg' });
// or specify an absolute dir
c.addTasks(file, { downloadTask: true, downloadFile: 'D:\\zzz\\mine.jpg' });
```
Events
start
after the crawler is started
```js
// e.g.
c.on('start', function () {
    console.log('crawler is started');
});
```
beforeCrawl
task's props: `id`, `url`, `retry`, `working`, `requestOpts`, `downloadTask`, `downloadFile` ... and so on

```js
// e.g.
c.on('beforeCrawl', function (task) {
    console.log(task);
});
```
drain
when task-pool and its buffer are drained
```js
// e.g.
c.on('drain', function () {
    // all tasks and the buffer are drained
});
```
error
emitted when an error occurs during crawling
Utils API
getLinks(html: string, baseUrl: string)
get all links in the element
```js
// e.g.:
let html = `
<div>
    <ul>
        <li>
            <a href="http://link.com/a/1">1</a>
            <a href="a/2">2</a>
            <a href="b/3">3</a>
        </li>
        <li><a href="4">4</a></li>
        <li>foo</li>
    </ul>
</div>`;
let links = Crawler.getLinks(html, 'http://link.com/');
console.log(links);
// ['http://link.com/a/1','http://link.com/a/2','http://link.com/b/3','http://link.com/4']

// you can also use cheerio
let $ = cheerio.load(html);
links = Crawler.getLinks($('ul'), 'http://link.com/');
```
getImages(html: string, baseUrl: string)
like `getLinks`, gets `src` from `<img>`.
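For illustration only, the behavior can be approximated with plain Node (a regex-based sketch of my own; the real utility presumably parses the html properly, like the rest of the library, and resolves relative `src` values against `baseUrl`):

```javascript
// Rough stand-in for getImages: collect <img> src values and absolutize them.
function getImagesSketch(html, baseUrl) {
  const srcs = [];
  const imgTag = /<img\b[^>]*\bsrc="([^"]+)"/g;
  let match;
  while ((match = imgTag.exec(html)) !== null) {
    // new URL(relative, base) resolves relative paths against baseUrl
    srcs.push(new URL(match[1], baseUrl).href);
  }
  return srcs;
}

const srcs = getImagesSketch(
  '<div><img src="a.jpg"><img src="http://cdn.com/b.png"></div>',
  'http://x.com/'
);
// srcs: ['http://x.com/a.jpg', 'http://cdn.com/b.png']
```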
loadHeaders(file: string)
load request headers from file
example.headers
```
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8,en;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Cookie:csrftoken=Wwb44iw
Host:abc
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64)
...
```
load this file and set headers for requesting
```js
let headers = Crawler.loadHeaders('./example.headers');
c.tweak({ requestOpts: { headers: headers } });
```
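The file is just one `Name:value` pair per line, so loading it amounts to parsing those lines into a headers object. A sketch of the assumed parsing (my illustration, not the library's code):

```javascript
// Parse "Name:value" lines into a headers object.
// Only the first ':' splits, so values like "zh-CN,zh;q=0.8" stay intact.
function parseHeaders(text) {
  const headers = {};
  for (const line of text.split('\n')) {
    const sep = line.indexOf(':');
    if (sep > 0) {
      headers[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
    }
  }
  return headers;
}

const headers = parseHeaders('Host:abc\nAccept-Language:zh-CN,zh;q=0.8');
// headers: { Host: 'abc', 'Accept-Language': 'zh-CN,zh;q=0.8' }
```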
getRegWithPath(fromUrl: string)
get reg string with path of fromUrl
```js
let reg = Crawler.getRegWithPath('http://www.google.com/test/abc.html');
// reg: 'http://www.google.com/test/**'
```
Advanced Usage
addRule
```js
// since 1.5.10, the rule for scraping can be an object
c.addRule({ reg: 'www.google.com/**' }, function (result) {
    // ...
});
c.addRule({ reg: 'www.google.com/**', name: 'rule-A' }, function (result) {
    // ...
});
// the following rules have the same reg string, but their names are different
c.addRule({ reg: 'www.google.com/**', name: 'rule-B' }, function (result) {
    // ...
});
c.addRule({ reg: 'www.google.com/**', name: 'rule-C' }, function (result) {
    // ...
});

// using the function `match` can make rules more complex
// boolean match(task)
c.addRule({
    match: function (task) {
        return task.url.endsWith('.html');
    }
}, function (result) {
    // ...
});
```
loadRule
recycle rules
```js
// lc-rules.js
exports.crawlingGoogle = {
    reg: 'www.**.com',
    name: 'google',
    scrape: function (result, $) {
        // ...
    }
};

// crawler.js
let rules = require('./lc-rules');
let c = new Crawler();
c.loadRule(rules.crawlingGoogle);

// or expand the function named 'scrape'
// implement the 'expand' in 'loadRule'
// on the other hand, you can use 'this' (the Crawler) in 'addRule' or 'loadRule'
exports.crawlingGoogle = {
    // ...
    scrape: function (result, $, expand) {
        expand(result);
    }
};

crawlerAAA.loadRule(rules.crawlingGoogle, function (result) {
    // crawlerAAA's own processing
});
crawlerBBB.loadRule(rules.crawlingGoogle, function (result) {
    // crawlerBBB's own processing
});
```
removeRule
remove some rules
```js
// remove a rule by its 'ruleName'
let rule = {
    // ...
    ruleName: 'someone'
    // ...
};
c.loadRule(rule);
c.removeRule('someone');
```