Nine Pedestrians Mesmerized
    Share your code. npm Orgs help your team discover, share, and reuse code. Create a free org »

    fetch-politelypublic

    Fetch Politely

    Build Status Dependency Status js-semistandard-style

    A library for ensuring polite outgoing HTTP requests that respect robots.txt and aren't made too close to each other

    Installation

    npm install fetch-politely --save

    Usage

    Simple:

    var fetchInstance = new FetchPolitely(function (err, url, message) {
      if (err) { return; }
     
      // The URL has been cleared for fetching – the hostname isn't throttled and robots.txt doesn't ban it
    }, {
      // Robots.txt checking requires specification of a User Agent as Robots.txt can contain User Agent specific rules
      // See http://en.wikipedia.org/wiki/User_agent for more info in format
      userAgent: 'Your-Application-Name/your.app.version (http://example.com/optional/full/app/url)',
    });
     
    // When a slot has been reserved the callback sent in the constructor will be called
    fetchInstance.requestSlot('http://foo.example.org/interesting/content/');

    FetchPolitely()

    var fetchInstance = new FetchPolitely(callback, [options]);

    Parameters

    • callbackfunction (err, url, message, [content]) {};, called for each succesful request slot

    Options

    • throttleDuration – for how long in milliseconds to throttle requests to each hostname. Defaults to 10 seconds.
    • returnContent – whether to fetch and return the content with the callback when a URL has received a request slot. Defaults to false.
    • logger – a Bunyan compatible logger library. Defaults to bunyan-duckling which uses console.log()/.error().
    • lookup – an object or class that keeps track of throttled hosts and queued URL:s. Defaults to PoliteLookup.
    • lookupOptions – an object that defines extra lookup options.
    • allowed – a function that checks whether a URL is allowed to be fetched. Defaults to PoliteRobot.allowed().
    • robotCache – a cache method used by PoliteRobot to cache fetched robots.txt. Defaults to wrapped lru-cache.
    • robotCacheLimit – a limit of the number of items to keep in the default lru-cache of PoliteRobot.
    • robotPool – an HTTP agent to use for the request-library of PoliteRobot.
    • userAgentrequired by PoliteRobot and options.returnContent. The User Agent to use for HTTP requests.

    Methods

    • requestSlot – tries to reserve a request slot for a URL.

    Static

    • FetchPolitely.PoliteError – a very polite error object used for eg. informing about denied URL:s
    • FetchPolitely.PoliteLookup – defines the interface for keeping track of throttled hosts and queued URL:s
    • FetchPolitely.PolitePGLookup – alternative lookup that uses PostgreSQL as the backend
    • FetchPolitely.PoliteRobot – checks whether URL:s are allowed to be fetched according to Robots.txt.

    fetchInstance.requestSlot()

    fetchInstance.requestSlot(url, [message], [options]);

    Parameters

    • url – the URL to reserve a request slot for
    • message – a JSON-encodeable optional message containing eg. instructions for the FetchPolitely callback.

    Options

    • allow – if set to true the URL will always be allowd and not be sent to the allowed function.
    • allowDuplicates – if set to false no more than one item of every url + message combination will be queued.

    PoliteLookup

    The simplest of simple implementations for keeping track of throttled hosts and queued URL:s. Handles it all in-memory. Same interface can be used to build a database backend for this though.

    PolitePGLookup

    A PostgreSQL + Knex-driven lookup that throttles hosts and queues URL using database tables.

    Use by setting up the tables in pglookup.sql and include by setting the FetchPolitely options to:

    {
      lookup: FetchPolitely.PolitePGLookup,
      lookupOptions: {
        knex: knexInstance
      }
    }

    Pull Requests are welcome if someone wants to pull out the Knex-dependency. Most projects where this has been used with Postgres has been using Knex so it got used here as well.

    lookupOptions

    • knexrequired – the database connection to use, provided through a Knex object.
    • purgeWindow – the minimum interval in milliseconds between two host purges. Defaults to 500 ms.
    • concurrentReleases – how many parallell database lookups to perform to check for released URL:s. Defaults to 2.
    • releasesPerBatch – how many URL:s to fetch in each database lookup. Defaults to 5.
    • onlyDeduplicateMessages – bool that if set will only deduplicate URL:s with the same message when deduplicating. Defaults to false.

    Lint / Test

    npm test or to watch, install grunt-cli then do grunt watch

    Keywords

    none

    install

    npm i fetch-politely

    Downloadsweekly downloads

    8

    version

    1.1.0

    license

    MIT

    repository

    githubgithub

    last publish

    collaborators

    • avatar