Simple web scraper chassis: scrapes fields from a list of web pages and dumps the results to JSON/CSV files.


A very simple web scraper chassis. Claw takes a web page (or list of pages), scrapes some info from those pages, then dumps the results to a JSON or CSV file.

Claw accepts the following parameters:

  • a page URL, or an array of URLs
  • a selector to scrape
  • fields to pull out from within each selected element
  • an output folder
  • a number of seconds to delay between requests

For example:

    var claw = require('claw');

    var claw_config = {
        pages : [ 'http://www.google.com/search?q=hello' ],  // page(s) to scrape
        selector : 'h3',
        fields : {
            "text" : "$(sel).find('a').text()",
            "href" : "$(sel).find('a').attr('href')"
        },
        delay : 3
    };
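
Each field value is a string of jQuery-style code, presumably evaluated against every element matched by the selector with $ and sel in scope. Here's a minimal sketch of how such expressions could be evaluated, assuming a cheerio-style $ is available; scrapeFields is a hypothetical helper for illustration, not claw's actual internals:

    var cheerio = require('cheerio');

    // Hypothetical helper (not claw's internals): run each field expression
    // against every element matched by the selector.
    function scrapeFields(html, selector, fields) {
        var $ = cheerio.load(html);
        var results = [];
        $(selector).each(function (i, sel) {
            var row = {};
            Object.keys(fields).forEach(function (name) {
                // Each field value is jQuery-style code with $ and sel in scope
                row[name] = eval(fields[name]);
            });
            results.push(row);
        });
        return results;
    }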

Each page gets saved to a separate output file.

Here's what the exported JSON looks like:

        "text": "HELLO! Online: celeb & royal news, magazine, babies, weddings",
        "href": ""
        "text": "Hello - Wikipedia, the free encyclopedia",
        "href": ""
        "text": "Hello | Define Hello at",
        "href": ""

and here's the CSV:

"HELLO! Online: celeb & royal news, magazine, babies, weddings, �",""
"Hello - Wikipedia, the free encyclopedia",""
"Hello | Define Hello at",""

Claw can also import its configuration from a JSON file. The file mirrors the config object above and looks like this:

    "pages" : [
    "selector" : "h3",
    "fields" : {
        "href" : "$(sel).find('a').attr('href')"
    "delay" : 5

You can also use claw from the command line. First, install it globally:

    npm install -g claw

Then run it in the same folder as your config file:

    claw sample1.json

This will create a folder called sample1 containing your results.
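
Since each page gets its own output file, the sample1 folder might end up looking something like this (the file names here are hypothetical; claw's actual naming may differ):

    sample1/
        page-1.json
        page-1.csv
        page-2.json
        page-2.csv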

Claw can also grab its page list from a JSON file containing an array of URLs (or of objects with .href properties). Instead of hard-coding an array, just set "pages" to a file name and path:

    claw_config.pages = "pages.json";
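
The pages file itself might take either of these shapes (a hypothetical pages.json; the URLs are placeholders):

    ["http://example.com/page1", "http://example.com/page2"]

or

    [
        { "href": "http://example.com/page1" },
        { "href": "http://example.com/page2" }
    ]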

Questions? Ideas? Hit me up on Twitter - @dylanized