# CrawlKit

A crawler based on PhantomJS. Allows discovery of dynamic content and supports custom scrapers. For all your ajaxy crawling & scraping needs.
- Parallel crawling/scraping via Phantom pooling.
- Custom-defined link discovery.
- Custom-defined runners (scrape, test, validate, etc.)
- Can follow redirects (and because it is based on PhantomJS, JavaScript redirects are followed as well as `<meta>` redirects).
- Streaming.
- Resilient to PhantomJS crashes.
- Ignores page errors.
## Install

```shell
npm install crawlkit --save
```
## Usage

```javascript
const CrawlKit = require('crawlkit');
const anchorFinder = require('crawlkit/finders/genericAnchors');

const crawler = new CrawlKit('http://your/page');
crawler.finder = anchorFinder;

crawler.crawl()
    .then((results) => {
        console.log(JSON.stringify(results, null, 2));
    }, (err) => console.error(err));
```
Also, have a look at the samples.
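Link discovery can be customized by providing your own finder instead of the bundled generic anchor finder. The sketch below is a hypothetical finder that collects image sources; the interface shown (a `getRunnable()` method returning a function that is evaluated inside the page and reports URLs back via `window.callPhantom`) is an assumption modeled on the bundled finders, so verify the exact contract against the API docs.

```javascript
// Hypothetical custom finder sketch -- the getRunnable()/window.callPhantom
// interface is assumed from the bundled finders; check the API docs.
class ImageSourceFinder {
    getRunnable() {
        // This returned function runs in the PhantomJS page context, not in Node,
        // so it may only use browser globals such as document and window.
        return function findImageSources() {
            var urls = [];
            var images = document.getElementsByTagName('img');
            for (var i = 0; i < images.length; i++) {
                urls.push(images[i].src);
            }
            // Hand the discovered URLs back to CrawlKit.
            window.callPhantom(null, urls);
        };
    }
}

module.exports = ImageSourceFinder;
```

Assuming this interface, you would wire it up with `crawler.finder = new ImageSourceFinder();` before calling `crawler.crawl()`.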
## API
See the API docs (published) or the docs on doclets.io (live).
## Debugging

CrawlKit uses the debug module for debugging purposes. In short, you can set `DEBUG="*"` as an environment variable before starting your app to get all the logs. A saner configuration is probably `DEBUG="*:info,*:error,-crawlkit:pool*"` if your page is big.
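For example, following the debug module's convention, the variable is set on the command line when launching your app (here `app.js` is a hypothetical entry point):

```shell
# Everything, including the verbose pool logs:
DEBUG="*" node app.js

# Info and errors only, with the crawlkit pool logs muted:
DEBUG="*:info,*:error,-crawlkit:pool*" node app.js
```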
## Contributing

Please contribute away :)

Please add tests for new functionality and adapt them for changes.

Commit messages need to follow the conventional changelog format so that semantic-release picks the semver versions properly. It is probably easiest if you install commitizen via `npm install -g commitizen` and commit your changes via `git cz`.
## Available runners
- HTML Codesniffer runner: Audit a website with the HTML Codesniffer to find accessibility defects.
- Google Chrome Accessibility Developer Tools runner: Audit a website with the Google Chrome Accessibility Developer Tools to find accessibility defects.
- aXe runner: Audit a website with aXe.
- Yours? Create a PR to add it to this list!
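A custom runner can be sketched along the following lines. This is an assumed shape, not the definitive interface: the `getCompanionFiles()`/`getRunnable()` pair and the `window.callPhantom` callback are modeled on the bundled runners, so check the API docs before relying on them.

```javascript
// Minimal custom runner sketch that scrapes each page's <title>.
// The interface here is assumed from the bundled runners -- verify it
// against the API docs.
class TitleRunner {
    getCompanionFiles() {
        // No extra scripts need to be injected into the page for this runner.
        return [];
    }

    getRunnable() {
        // This returned function is evaluated inside the PhantomJS page
        // context and must report its result via window.callPhantom.
        return function scrapeTitle() {
            window.callPhantom(null, document.title);
        };
    }
}

module.exports = TitleRunner;
```

Assuming runners are registered under a key, usage would look like `crawler.addRunner('title', new TitleRunner());`, with each page's result appearing under that key in the crawl results.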