A simple web scraper chassis: it scrapes fields from a list of web pages and dumps the results to JSON or CSV files.
Claw takes a web page (or a list of pages), scrapes some info from those pages, and dumps the results to a JSON or CSV file. Pass it a config with:
- a page URL, or an array of URLs
- a selector to scrape
- fields to pull out from within that selection
- an output folder
- number of seconds to delay
```js
var claw = require('claw');

var claw_config = {
  pages : ['', ''],   // page URLs go here
  selector : 'h3',
  fields : {
    "text" : "$(sel).find('a').text()",
    "href" : "$(sel).find('a').attr('href')"
  },
  delay : 3
};

claw.init(claw_config);
```
Each page gets saved to a separate output file.
Here's roughly what the exported JSON looks like (an array of objects keyed by the configured field names, one entry per matched selection):
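```json
[
  {
    "text" : "HELLO! Online: celeb & royal news, magazine, babies, weddings, …",
    "href" : ""
  },
  {
    "text" : "Hello - Wikipedia, the free encyclopedia",
    "href" : ""
  },
  {
    "text" : "Hello | Define Hello at Dictionary.com",
    "href" : ""
  }
]
```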
and here's the CSV:
```
text,href
"HELLO! Online: celeb & royal news, magazine, babies, weddings, …",""
"Hello - Wikipedia, the free encyclopedia",""
"Hello | Define Hello at Dictionary.com",""
```
Claw can import its configuration from a JSON file:
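For example (a sketch, assuming `init()` also accepts a path to the config file; the file name here is just a placeholder):

```js
var claw = require('claw');

// assumption: init() can take a path to a JSON config file instead of an object
claw.init('claw_config.json');
```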
The file looks like this:
"pages" :"""""selector" : "h3""fields" :"href" : "$(sel).find('a').attr('href')""delay" : 5
You can also use claw from the command line. First, install it globally:
```
npm install -g claw
```
Then run it in the same folder as your config file:
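Something like this, assuming the global install exposes a `claw` command that takes your config file name (both the command form and the file name are assumptions):

```
claw claw_config.json
```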
This will create a folder called sample1 containing your results.
Claw can grab its page list from a JSON file containing a list of URLs (or of objects with .href properties). Instead of an array, just set "pages" to a file name and path:
```js
claw_config.pages = "pages.json";
```
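Going by the description above, the pages file could take either of these shapes (the example.com URLs are placeholders):

```json
["http://example.com/a", "http://example.com/b"]
```

or

```json
[
  { "href" : "http://example.com/a" },
  { "href" : "http://example.com/b" }
]
```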
Questions? Ideas? Hit me up on twitter - @dylanized