    crawlerr

    crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only the URLs that match specific wildcard patterns, uses a Bloom filter for caching, and gives you a browser-like feeling through a server-side DOM.


    • Simple: our crawler is simple to use;
    • Elegant: provides a verbose, Express-like API;
    • MIT Licensed: free for personal and commercial use;
    • Server-side DOM: we use JSDOM so you can work with crawled pages as you would in a browser;
    • Configurable pool size, retries, rate limit and more;

    Installation

    $ npm install crawlerr

    Usage

    crawlerr(base [, options])

    You can find several examples in the examples/ directory. Here are some of the most important ones:

    Example 1: Requesting title from a page

    const spider = crawlerr("http://google.com/");
     
    spider.get("/")
      .then(({ req, res, uri }) => console.log(res.document.title))
      .catch(error => console.log(error));

    Example 2: Scanning a website for specific links

    const spider = crawlerr("http://blog.npmjs.org/");
     
    spider.when("/post/[digit:id]/[all:slug]", ({ req, res, uri }) => {
      const post = req.param("id");
      const slug = req.param("slug").split("?")[0];
     
      console.log(`Found post with id: ${post} (${slug})`);
    });

    Example 3: Server side DOM

    const spider = crawlerr("http://example.com/");
     
    spider.get("/").then(({ req, res, uri }) => {
      const document = res.document;
      const elementA = document.getElementById("someElement");
      const elementB = document.querySelector(".anotherForm");
     
      console.log(elementA.innerHTML, elementB.innerHTML);
    });

    Example 4: Setting cookies

    const url = "http://example.com/";
    const spider = crawlerr(url);
     
    spider.request.setCookie(spider.request.cookie("foobar=…"), url);
    spider.request.setCookie(spider.request.cookie("session=…"), url);
     
    spider.get("/profile").then(({ req, res, uri }) => {
      //… spider.request.getCookieString(url);
      //… spider.request.setCookies(url);
    });

    API

    crawlerr(base [, options])

    Creates a new Crawlerr instance for a specific website with custom options. All routes will be resolved to base.

    Option      Default  Description
    concurrent  10       How many requests can be run simultaneously
    interval    250      How often a new request should be sent (in ms)
                null     See request defaults for more information
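
    For instance, a crawler with a larger pool and a slower request rate could be created like this (a minimal sketch using only the two options documented above; the values are arbitrary):

    const crawlerr = require("crawlerr");
     
    const spider = crawlerr("http://example.com/", {
      concurrent: 20, // run up to 20 requests at the same time
      interval: 500   // send a new request every 500 ms
    });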

    public .get(url)

    Requests url. Returns a Promise which resolves with { req, res, uri }, where:

    • req is the Request object for the crawled page (see Request below);
    • res is the Response object for the crawled page (see Response below);
    • uri is the requested URL.

    Example:

    spider
      .get("/")
      .then(({ res, req, uri }) => …);

    public .when(pattern, callback)

    Searches the entire website for URLs which match the specified pattern. The pattern can include named wildcards, whose values can then be retrieved in the callback via req.param. The callback receives { req, res, uri } for each matching page.

    Example:

    spider
      .when("/users/[digit:userId]/repos/[digit:repoId]", ({ res, req, uri }) => …);

    public .on(event, callback)

    Executes a callback for a given event. For more information about which events are emitted, refer to queue-promise.

    Example:

    spider.on("error", …);
    spider.on("resolve", …);

    public .start()/.stop()

    Starts/stops the crawler.

    Example:

    spider.start();
    spider.stop();

    public .request

    A configured request object which is used by retry-request when crawling web pages. It extends request.jar(), so cookies set on it are shared by subsequent requests. It can be configured through options when initializing a new crawler instance. See the crawler options and the request documentation for more information.

    Example:

    const url = "https://example.com";
    const spider = crawlerr(url);
    const request = spider.request;
     
    request.post(`${url}/login`, (err, res, body) => {
      request.setCookie(request.cookie("session=…"), url);
      // Next requests will include this cookie
     
      spider.get("/profile").then();
      spider.get("/settings").then();
    });

    Request

    Extends the default Node.js incoming message.

    public get(header)

    Returns the value of an HTTP header. The Referrer header field is special-cased: Referrer and Referer are interchangeable.

    Example:

    req.get("Content-Type"); // => "text/plain"
    req.get("content-type"); // => "text/plain"

    public is(...types)

    Checks whether the incoming request contains the "Content-Type" header field and whether it matches the given MIME type. Based on type-is.

    Example:

    // Returns true with "Content-Type: text/html; charset=utf-8"
    req.is("html");
    req.is("text/html");
    req.is("text/*");

    public param(name [, default])

    Returns the value of the parameter name when present, or default otherwise. Lookup order:

    • checks route placeholders, ex: user/[all:username];
    • checks body params, ex: id=12, {"id":12};
    • checks query string params, ex: ?id=12;

    Example:

    // .when("/users/[all:username]/[digit:someID]")
    req.param("username");  // /users/foobar/123456 => foobar
    req.param("someID");    // /users/foobar/123456 => 123456

    Response

    public jsdom

    Returns the JSDOM object.
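
    For example, the whole page can be serialized back to HTML (a minimal sketch; serialize() is part of the standard JSDOM API, assuming res.jsdom exposes a JSDOM instance):

    spider.get("/").then(({ res }) => {
      console.log(res.jsdom.serialize()); // full HTML of the crawled page
    });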


    public window

    Returns the DOM window for response content. Based on JSDOM.
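
    For instance (a minimal sketch, assuming window behaves like a standard JSDOM window):

    spider.get("/").then(({ res }) => {
      console.log(res.window.location.href);   // URL of the crawled page
      console.log(res.window.document.title);  // same document as res.document
    });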


    public document

    Returns the DOM document for response content. Based on JSDOM.

    Example:

    res.document.getElementById();
    res.document.getElementsByTagName();
    // …

    Tests

    npm test
