
CrawlKit

Build status npm npm David node bitHound Overall Score Commitizen friendly semantic-release Code Climate

A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers. For all your ajaxy crawling & scraping needs.

  • Parallel crawling/scraping via Phantom pooling.
  • Custom-defined link discovery.
  • Custom-defined runners (scrape, test, validate, etc.)
  • Can follow redirects (and since it's based on PhantomJS, JavaScript redirects are followed as well as `<meta>` refresh redirects)
  • Streaming
  • Resilient to PhantomJS crashes
  • Ignores page errors

Install

```shell
npm install crawlkit --save
```

Usage

```js
const CrawlKit = require('crawlkit');
const anchorFinder = require('crawlkit/finders/genericAnchors');

const crawler = new CrawlKit('http://your/page');
crawler.setFinder({
    getRunnable: () => anchorFinder
});

crawler.crawl()
    .then((results) => {
        console.log(JSON.stringify(results, null, 2));
    }, (err) => console.error(err));
```
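Custom runners (for scraping, testing, validating, etc.) follow the same shape as the finder above. The sketch below is an assumption based on that pattern, not the library's verbatim API: a runner object exposing `getCompanionFiles()` and a `getRunnable()` that returns a function executed by PhantomJS in the page context, reporting back via `window.callPhantom(err, result)`. The name `titleRunner` is made up for illustration:

```javascript
// Hypothetical runner that scrapes each page's <title>.
// getRunnable() returns a function that runs inside the page,
// so `window` and `document` refer to the crawled page there.
const titleRunner = {
    // no extra scripts need to be injected into the page
    getCompanionFiles: () => [],
    getRunnable: () => function scrapeTitle() {
        // report the result back to the crawler
        window.callPhantom(null, document.title);
    },
};
```

Such a runner would then be registered on the crawler (e.g. via `crawler.addRunner('title', titleRunner);`), and each crawl result would carry that runner's output under its key.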

Also, have a look at the samples.

API

See the API docs (published) or the docs on doclets.io (live).

Debugging

CrawlKit uses debug for debugging purposes. In short, you can set DEBUG="*" as an environment variable before starting your app to get all the logs. A saner configuration for big pages is probably DEBUG="*:info,*:error,-crawlkit:pool*".
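For example, the two configurations mentioned above would be set like this before launching your app:

```shell
# log everything (very noisy on large pages)
export DEBUG="*"

# saner: info and error logs only, excluding the Phantom pool internals
export DEBUG="*:info,*:error,-crawlkit:pool*"
```

Afterwards, start your app as usual (e.g. `node app.js`, assuming that is your entry point) and the matching log namespaces will be printed to stderr.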

Contributing

Please contribute away :)

Please add tests for new functionality and adapt them for changes.

The commit messages need to follow the conventional changelog format so semantic-release picks the semver versions properly. It is probably easiest if you install commitizen via npm install -g commitizen and commit your changes via git cz.

Available runners

Products using CrawlKit

Version

2.0.2

License

MIT

Collaborators

  • joscha