A page scraping DSL for extracting structured information from unstructured XHTML, built on Node.js and jQuery.
Over my past couple of years in the industry, there have been several times when I have needed to scrape structured information from (relatively) unstructured XHTML websites.
My approach to doing this has gradually evolved to include technologies such as Node.js and jQuery.
I was starting to notice a lot of code duplication in my scraping scripts. Enter jDistiller:
- jDistiller is a simple and powerful DSL for scraping structured information from XHTML websites.
- it is built on jQuery and Node.js.
- it grows out of my experiences, having built several one-off page scrapers.
npm install jdistiller
- first you create an instance of the jDistiller object:

    var jDistiller = require('jdistiller').jDistiller;
    new jDistiller();

- the set() method is used to specify key/css-selector pairs to scrape data from:

    new jDistiller()
        .set('headline', '#article h1.articleHeadline')
        .set('firstParagraph', '#article .articleBody p:eq(0)');

Simple Example (New York Times)
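
    var jDistiller = require('jdistiller').jDistiller;

    new jDistiller()
        .set('headline', '#article h1.articleHeadline')
        .set('firstParagraph', '#article .articleBody p:eq(0)')
        // first argument: URL of the page to scrape
        .distill('', function(err, distilledPage) {
            console.log(JSON.stringify(distilledPage));
        });
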
A closure can optionally be provided as the third parameter for the set() method.
If a closure is given, the return value of the closure will be set as a key's value, rather than the text value of the selector.
DSL Using an Optional Data Processing Closure
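
    var jDistiller = require('jdistiller').jDistiller;

    new jDistiller()
        .set('headline', '#article h1.articleHeadline')
        .set('firstParagraph', '#article .articleBody p:eq(0)')
        // the closure's return value replaces the element's text
        .set('image', '#article .articleBody .articleSpanImage img', function(element, prev) {
            return element.attr('src');
        })
        // first argument: URL of the page to scrape
        .distill('', function(err, distilledPage) {
            console.log(JSON.stringify(distilledPage));
        });
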
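To make the difference between the two forms of set() concrete, here is a minimal sketch that runs without any network access. The names `resolveValue` and `fakeElement` are hypothetical and are not part of jDistiller; the object only mimics the `text()` and `attr()` methods of a jQuery element, and the fallback logic is an assumption based on the description above rather than the library's actual code.

```javascript
// Hypothetical helper (not jDistiller's API): choose the value for one matched element.
function resolveValue(element, closure) {
    if (typeof closure === 'function') {
        // a closure was given: its return value becomes the key's value
        return closure(element);
    }
    // no closure: fall back to the text of the matched element
    return element.text();
}

// Stand-in for a jQuery element, for illustration only.
var fakeElement = {
    text: function() { return 'A headline'; },
    attr: function(name) { return name === 'src' ? '/images/photo.jpg' : undefined; }
};

var withoutClosure = resolveValue(fakeElement);
var withClosure = resolveValue(fakeElement, function(element) {
    return element.attr('src');
});

console.log(withoutClosure); // A headline
console.log(withClosure);    // /images/photo.jpg
```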
The closure will be passed the following values:
- element: a jQuery element matching the CSS selector specified in set().
- prev: if multiple elements on the page match the selector, the closure will be executed once for each. prev can be used to interact with the object created by previous executions of the closure. As an example, we might want to increment a counter if the same link occurs multiple times on the same page.
- this: state stored on this is shared between multiple executions of the same closure (see examples/wikipedia.js to get an idea of why this is useful).

The closure can return any of the following types:
- strings: the last string returned by the closure will be used as the value.
- numbers: the last number returned by the closure will be used as the value.
- arrays: when an array is returned, it will be merged with all other arrays returned for the given key. The final merged array will be set as the value.
- objects: when an object is returned, the object will be merged with all other objects returned. The final object will be used as the value.
- key/object-pair: this special return type allows the value to be populated with an object that has dynamically generated key names.
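The merging rules for strings, numbers, arrays, and objects can be sketched in a few lines of plain JavaScript. `mergeReturned` is a hypothetical function written only to illustrate the rules; jDistiller's real implementation may differ, and the key/object-pair type is left out of this sketch.

```javascript
// Hypothetical merge step (not jDistiller's code): combine the value accumulated
// so far with the value returned by one more execution of the closure.
function mergeReturned(current, returned) {
    if (returned === undefined || returned === null) {
        return current; // the closure returned nothing for this element
    }
    if (typeof returned === 'string' || typeof returned === 'number') {
        return returned; // the last string or number wins
    }
    if (Array.isArray(returned)) {
        // arrays are concatenated in the order they were returned
        return (Array.isArray(current) ? current : []).concat(returned);
    }
    // plain objects: copy the accumulated keys, then the new ones
    var source = (current && typeof current === 'object' && !Array.isArray(current)) ? current : {};
    var merged = {};
    Object.keys(source).forEach(function(k) { merged[k] = source[k]; });
    Object.keys(returned).forEach(function(k) { merged[k] = returned[k]; });
    return merged;
}

// Each input array stands for the values returned by successive closure calls.
var paragraphs = [['first'], ['second']].reduce(mergeReturned, undefined);
var headings = [{ second_heading: 'History' }, { third_heading: 'Usage' }].reduce(mergeReturned, undefined);
var lastString = ['a', 'b'].reduce(mergeReturned, undefined);

console.log(JSON.stringify(paragraphs)); // ["first","second"]
console.log(JSON.stringify(headings));   // {"second_heading":"History","third_heading":"Usage"}
console.log(lastString);                 // b
```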
Array Merging Example
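
    var jDistiller = require('jdistiller').jDistiller;

    new jDistiller()
        // each call returns a one-element array; the arrays are merged
        .set('paragraphs', '#article .articleBody p', function(element) {
            return [element.text()];
        })
        // first argument: URL of the page to scrape
        .distill('', function(err, distilledPage) {
            console.log(JSON.stringify(distilledPage));
        });
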
Object Merging Example
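
    var jDistiller = require('jdistiller').jDistiller;

    new jDistiller()
        .set('headlines', '.mw-headline', function(element) {
            // this.count persists across executions of the closure
            this.count = this.count || 0;
            this.count++;
            if (this.count === 2) {
                return {
                    'second_heading': element.text().trim()
                };
            }
            if (this.count === 3) {
                return {
                    'third_heading': element.text().trim()
                };
            }
        })
        // first argument: URL of the page to scrape
        .distill('', function(err, distilledPage) {
            console.log(JSON.stringify(distilledPage));
        });
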
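Key/Object-Pair Example

    var jDistiller = require('jdistiller').jDistiller;

    new jDistiller()
        .set('links', '#bodyContent p a', function(element, prev) {
            var key = element.attr('href');
            // return a [key, object] pair; prev holds the pairs seen so far
            return [key, {
                title: element.attr('title'),
                href: key,
                occurrences: prev[key] ? prev[key].occurrences + 1 : 1
            }];
        })
        // first argument: URL of the page to scrape
        .distill('', function(err, distilledPage) {
            console.log(JSON.stringify(distilledPage));
        });
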
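The occurrence counter in the links example relies on prev carrying forward what earlier executions produced. The sketch below replays that logic by hand on mocked elements, so it runs without jDistiller or a network connection. `fakeLink`, `countLink`, and `runOverLinks` are hypothetical names, and both the `[key, object]` return shape and the driver loop are assumptions about how the pairs are accumulated, not the library's actual code.

```javascript
// Hypothetical stand-in for a matched <a> element; only attr() is mocked.
function fakeLink(href, title) {
    return {
        attr: function(name) { return name === 'href' ? href : title; }
    };
}

// Counting closure, assumed to return a [key, object] pair.
function countLink(element, prev) {
    var key = element.attr('href');
    return [key, {
        title: element.attr('title'),
        href: key,
        occurrences: prev[key] ? prev[key].occurrences + 1 : 1
    }];
}

// Assumed driver loop: each returned pair is stored and passed back as prev.
function runOverLinks(elements) {
    var result = {};
    elements.forEach(function(element) {
        var pair = countLink(element, result);
        result[pair[0]] = pair[1];
    });
    return result;
}

var links = runOverLinks([
    fakeLink('/wiki/Dog', 'Dog'),
    fakeLink('/wiki/Cat', 'Cat'),
    fakeLink('/wiki/Dog', 'Dog')
]);

console.log(links['/wiki/Dog'].occurrences); // 2
console.log(links['/wiki/Cat'].occurrences); // 1
```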
I'm excited about jDistiller; I think it solves the scraping problem in an elegant way.
Don't be shy with your feedback, and please contribute.
-- Ben @benjamincoe