claw

A simple web scraper chassis: it scrapes fields from a list of web pages and dumps the results to JSON and CSV files.

Takes:

  • a page url
  • a selector to scrape
  • fields to pull out from within that selection
  • an output folder
  • number of seconds to delay

and it creates CSV and JSON files with the results. Claw creates a separate file for each page it scrapes.

// libraries
var claw = require('claw');
    
// get settings
var page = 'http://www.bing.com/search?q=hello';

var selector = 'h3 a';

var fields = {
    "text" : "text()",
    "href" : "attr('href')"
};

claw(page, selector, fields, 'output', 3);
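To make the fields map above concrete, here is a minimal sketch (not claw's actual source) of how strings like "text()" and "attr('href')" could be applied to each element the selector matches. The `extractFields` helper and the mock `link` element are illustrations, not part of claw's API:

```javascript
// Sketch: apply a fields map to one matched element.
// `element` stands in for a cheerio/jQuery-wrapped node.
function extractFields(element, fields) {
    var row = {};
    for (var name in fields) {
        // "text()" -> element.text(); "attr('href')" -> element.attr('href')
        var match = fields[name].match(/^(\w+)\((?:'([^']*)')?\)$/);
        if (match) {
            row[name] = match[2] === undefined
                ? element[match[1]]()
                : element[match[1]](match[2]);
        }
    }
    return row;
}

// Mock element standing in for a scraped <a> node
var link = {
    text: function () { return 'hello world'; },
    attr: function (key) { return key === 'href' ? 'http://example.com' : null; }
};

console.log(extractFields(link, { text: "text()", href: "attr('href')" }));
// → { text: 'hello world', href: 'http://example.com' }
```

Each matched element becomes one row in the output, keyed by the names you chose in the fields map.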

Give it an array of pages, and it will save the results of each page to a separate file.

claw(['http://www.bing.com/search?q=hello', 'http://www.bing.com/search?q=goodbye'], selector, fields, 'output', 3);

Claw can also grab its page list from a JSON file containing an array of urls (or an array of objects with .href properties).

claw("pages.json", selector, fields, 'output', 3);
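For reference, a pages.json in the plain-list shape might look like this (hypothetical contents):

```json
[
    "http://www.bing.com/search?q=hello",
    "http://www.bing.com/search?q=goodbye"
]
```

The object shape would instead hold entries like { "href": "http://www.bing.com/search?q=hello" }.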

Questions? Ideas? Hit me up on Twitter - @dylanized