dcrawler

0.0.8 • Public • Published

node-distributed-crawler

Features

  • Distributed crawler
  • Configurable url parser and data parser
  • jQuery selector using cheerio
  • Parsed data insertion in Mongodb collection
  • Domain-wise interval configuration in a distributed environment
  • node 0.8+ support

Note: update to the latest version (0.0.4+); don't use 0.0.1.

I am actively updating this library; feature suggestions and fork requests are welcome :)

Installation

$ npm install dcrawler

Usage

var DCrawler = require("dcrawler");
 
var options = {
    mongodbUri:     "mongodb://0.0.0.0:27017/crawler-data",
    profilePath:    __dirname + "/" + "profile"
};
var logs = {
    dbUri:      "mongodb://0.0.0.0:27017/crawler-log",
    storeHost:  true
};
var dc = new DCrawler(options, logs);
dc.start();

Note: the MongoDB connection URIs (mongodbUri & dbUri) should be the same (queueing of URLs should be centralized).

The DCrawler constructor takes options and log options:

  1. options with the following properties *:
  • mongodbUri: MongoDB connection URI (Eg: 'mongodb://0.0.0.0:27017/crawler') *
  • profilePath: Location of the profile directory, which contains the config files. (Eg: /home/crawler/profile) *
  2. logs, to store logs in a centralized location using winston-mongodb, with the following properties:
  • dbUri: MongoDB connection URI (Eg: 'mongodb://0.0.0.0:27017/crawler')
  • storeHost: Boolean, true or false, whether to store the worker's host name in the log collection.

Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the log options to the DCrawler constructor:

var dc = new DCrawler(options);
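
As a minimal sketch of the two notes above (the same connection URI for both settings, and logs being optional), the constructor arguments could be assembled as follows. The helper name buildCrawlerArgs and its storeLogs flag are hypothetical and not part of dcrawler; only the option names come from this README.

```javascript
// Hypothetical helper (not part of dcrawler): builds the argument list
// for the DCrawler constructor from a single MongoDB URI.
function buildCrawlerArgs(mongoUri, profilePath, storeLogs) {
    var options = {
        mongodbUri:  mongoUri,
        profilePath: profilePath
    };

    // Without centralized logging, only the options object is passed.
    if (!storeLogs) {
        return [options];
    }

    // Reuse the same URI for the log store, following the note above.
    var logs = {
        dbUri:     mongoUri,
        storeHost: true
    };
    return [options, logs];
}

var args = buildCrawlerArgs("mongodb://0.0.0.0:27017/crawler", "/home/crawler/profile", true);

// With the package installed, the crawler would then be created as:
// var DCrawler = require("dcrawler");
// var dc = new DCrawler(args[0], args[1]);
// dc.start();
```

Keeping the URI in one place makes it harder for workers to drift onto different databases.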

Create a config file for each domain inside the profilePath directory. See the example profile example.com, which contains a config with the following properties:

  • collection: Name of the collection to store parsed data in MongoDB. (Eg: 'products') *
  • url: URL to start crawling. String or Array of URLs. (Eg: 'http://example.com' or ['http://example.com']) *
  • interval: Interval between requests in milliseconds. Default is 1000 (Eg: for a 2-second interval: 2000)
  • followUrl: Boolean, true or false, whether to fetch further URLs from the crawled page and crawl them as well.
  • resume: Boolean, true or false, whether to resume crawling from previously crawled data.
  • beforeStart: Function to execute before crawling starts. Function has a config param which contains the particular profile config object. Example function:
beforeStart: function (config) {
    console.log("started crawling example.com");
}
  • parseUrl: Function to get further URLs from the crawled page. Function has error, response object and $ jQuery object params. Function returns an Array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    
    try {
        $("a").each(function(){
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    
    return _url;
}
  • parseData: Function to extract information from the crawled page. Function has error, response object and $ jQuery object params. Function returns a data Object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    
    try {
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;
        
        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        }
    } catch (e) {
        console.log(e);
    }
    
    return _data;
}
  • onComplete: Function to execute when crawling completes. Function has a config param which contains the particular profile config object. Example function:
onComplete: function (config) {
    console.log("completed crawling example.com");
}
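
Putting the properties above together, a profile for example.com might look roughly like the sketch below. Several things here are assumptions rather than documented behavior: that a profile is a CommonJS module exporting the config object, that returning an empty array on error is acceptable, and the helper toAbsolute, which is hypothetical. It uses Node's built-in url.resolve to turn relative links into absolute ones instead of concatenating strings by hand.

```javascript
var url = require("url");

// Assumption: example.com is the crawled domain, as in the examples above.
var BASE_URL = "http://example.com/";

// Hypothetical helper: resolve a possibly relative href against BASE_URL.
// Absolute hrefs are returned unchanged by url.resolve.
function toAbsolute(href) {
    return url.resolve(BASE_URL, href);
}

var config = {
    collection: "products",   // collection that receives parsed data
    url:        BASE_URL,     // start URL
    interval:   2000,         // 2 seconds between requests
    followUrl:  true,         // crawl the URLs returned by parseUrl
    resume:     true,         // continue from previously crawled data

    parseUrl: function (error, response, $) {
        var urls = [];

        // Assumption: on a failed request there is nothing to parse.
        if (error) {
            return urls;
        }

        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                urls.push(toAbsolute(href));
            }
        });

        return urls;
    }
};

// Assumption: profiles are loaded as CommonJS modules.
module.exports = config;
```

Resolving links this way avoids doubled slashes such as "http://example.com//products" when an href already starts with "/".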

Chirag (blikenoother -[at]- gmail [dot] com)
