pauk
Basic Node.js web crawler.
Install
$ npm install pauk
Example:
var Pauk = pauk = maxRequests: 10; pauk { cache;}; pauk;
Test
$ grunt
API
new Pauk(config)
Constructor. Configuration defaults:
www: true // if true, www.example.com will resolve to example.commaxRequests: 5 // maximum number of total requestsignoreQuery: true // if true, query part of the URI will be ignored
Object passed to the constructor will be merged to default configuration object.
Example:
var pauk = maxRequests: 55;
crawl(uri, parent)
Main method that will crawl the webpage.
- uri - URI to be crawled
- parent - URI of the page that contains link to uri
Example:
var pauk = ;pauk;
onFinish
Called when crawling is finished. Parameter passed to this function is object containing the crawling data. Keys of this object are normalized URIs and values are objects containing:
error: 'string' // (optional) in case of an error assets: images: // values of src attributes of 'img' tags scripts: // values of src attributes of 'script' tags css: // values of href attributes of 'link' tags other: // values of href attributes of 'a' tags that are not URIs links: // values of href attributes of 'a' tags (relative are converted to absolute) external: // values of href attributes of 'a' tags that are external links
Example:
var pauk = ;pauk { cache;}pauk;