


This project will soon be superseded by node-web-crawler.

Flexible Web Crawler

Easily build flexible, scalable, and distributed web crawlers for node.

var flexible = require('flexible');

// Initiate a crawler. Chainable.
var crawler = flexible('')
    .route('*/search?q=', function (req, res, body, doc, next) {
        console.log('Search results handled for query:', req.params.q);
        next();
    })
    .route('*/users/:name', function (req, res, body, doc, next) {
        crawler.navigate('' + req.params.name);
        next();
    })
    .route('*', function (req, res, body, doc, next) {
        console.log('Every other document is handled by this route.');
        next();
    })
    .on('complete', function () {
        console.log('All of the queued locations have been crawled.');
    })
    .on('error', function (error) {
        console.error('Error:', error.message);
    });
  • Asynchronous-friendly, evented API for easily building flexible, scalable, and distributed web crawlers.
  • An array-based queue for small crawls, and a PostgreSQL-based queue for massive, efficient crawls.
  • Uses a fast, lightweight, and forgiving HTML parser to ensure broad document compatibility when crawling.
  • Component system: use different queues, a router (wildcards, placeholders, etc.), and other components.
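The router's wildcard and placeholder patterns can be pictured along the following lines. This is a minimal sketch of the matching idea only, not the library's actual implementation; `compileRoute` is a hypothetical helper name introduced here for illustration.

```javascript
// Compile a route pattern into a matcher function:
// '*' matches any run of characters, ':name' captures one path segment.
function compileRoute(pattern) {
    var names = [];
    var source = pattern
        .replace(/[.?+^${}()|[\]\\]/g, '\\$&') // escape regex specials
        .replace(/\*/g, '.*')                  // '*' wildcard
        .replace(/:(\w+)/g, function (m, name) {
            names.push(name);                  // remember placeholder name
            return '([^/]+)';                  // ':name' placeholder
        });
    var regex = new RegExp('^' + source + '$');
    return function match(url) {
        var result = regex.exec(url);
        if (!result) return null;
        var params = {};
        names.forEach(function (name, i) {
            params[name] = result[i + 1];
        });
        return params;
    };
}

var match = compileRoute('*/users/:name');
console.log(match('http://example.com/users/alice')); // { name: 'alice' }
console.log(match('http://example.com/search'));      // null
```

A matcher built this way returns the captured placeholder values on a hit and null on a miss, which is how the captured parameters could end up on `req.params` in the route handlers above.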
Install with npm:

npm install flexible

Or from source:

git clone git:// 
cd flexible
npm link
Crawl the web using Flexible for node.
Usage: node [...]/flexible.bin.js
  --url, --uri                  URL of web page to begin crawling on.                        [string]  [required]
  --domains, -d                 List of domains to allow crawling of.                        [string]
  --interval, -i                Request interval of each crawler.                          
  --encoding, -e                Encoding of response body for decoding.                      [string]
  --max-concurrency, -m         Maximum concurrency of each crawler.                       
  --max-crawl-queue-length, -M  Maximum length of the crawl queue.                         
  --user-agent, -A              User-agent to identify each crawler as.                      [string]
  --timeout, -t                 Maximum seconds a request can take.                        
  --follow-redirect             Follow HTTP redirection responses.                           [boolean]
  --max-redirects               Maximum amount of redirects.                               
  --proxy, -p                   An HTTP proxy to use for requests.                           [string]
  --controls, -c                Enable pause (ctrl-p), resume (ctrl-r), and abort (ctrl-a).  [boolean]  [default: true]
  --pg-uri, --pg-url            PostgreSQL URI to connect to for queue.                      [string]
  --pg-get-interval             PostgreSQL queue get request interval.                     
  --pg-max-get-attempts         PostgreSQL queue max get attempts.

Returns a crawler instance that has been configured, navigated, and/or had crawling started, depending on the arguments given.

Returns a new Crawler object.

Configure the crawler to use a component.

Process a location, and have the crawler navigate (queue) to it.

Have the crawler crawl (recursively).
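The array-based queue behind navigating and crawling can be pictured as a simple FIFO with de-duplication. The sketch below illustrates that idea only; `ArrayQueue` is an illustrative name, not the library's internal class.

```javascript
// A minimal FIFO queue that ignores locations it has already seen.
function ArrayQueue() {
    this.items = [];
    this.seen = {};
}
ArrayQueue.prototype.add = function (url) {
    if (this.seen[url]) return false; // already navigated to; skip
    this.seen[url] = true;
    this.items.push(url);
    return true;
};
ArrayQueue.prototype.get = function () {
    return this.items.shift() || null; // null when the queue is drained
};

var queue = new ArrayQueue();
queue.add('http://example.com/');
queue.add('http://example.com/');      // duplicate, ignored
queue.add('http://example.com/about');
console.log(queue.get());              // returns 'http://example.com/'
```

Under this model, navigate() corresponds to add() and the crawl loop repeatedly calls get() until it returns null, at which point the crawl is complete.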

Have the crawler pause crawling.

Have the crawler resume crawling.

Have the crawler abort crawling.

  • navigated (url) Emitted when a location has been successfully navigated (queued) to.
  • document (doc) Emitted when a document is finished being processed by the crawler.
  • paused Emitted when the crawler has paused crawling.
  • resumed Emitted when the crawler has resumed crawling.
  • complete Emitted when all navigated (queued) to locations have been crawled.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see