headcrab

Latch on; devour.

Headcrab is a web scraping toolkit for transforming webpages into structured data. It provides simple but useful features, such as reusable templates (called transformers) built on CSS selectors, rate limiting, routing and automatic pagination, all of which take a lot of the boilerplate and hassle out of scraping.

I'm aiming to cover lots of common patterns, so if you have any feature suggestions please get in touch by filing an issue or by email.

Be respectful when scraping. Rate limit your requests (just use the interval option) and cache your data. If you're going to be scraping heavily, contact the site owner or an admin first. Don't scrape if an API is available, and don't be evil.

Installation

Headcrab can be used in Node or io.js, and installed using npm.

$ npm install --save headcrab

If you want to use the CLI, install it globally too.

$ npm install -g headcrab

Usage

First include it in your code.

var headcrab = require("headcrab");

Every method in Headcrab is bound to a transformer. A Headcrab transformer consists of a CSS selector, a transform function and some useful options. It accepts HTML, parses it using cheerio and then applies the transform to every selected element.

That's a bit wordy; it's simpler in code. Say we want to grab every post on the Hacker News front page, transform each into an object with url and title keys, and return a single array.

First make a 'Hacker' transformer.

var Hacker = headcrab({
    // Find every post title
    selector: "tr .title a",
    // transform is called with every post title element.
    transform: function(link){
        // link is the 'a' object. Use cheerio methods.
        return {
            url: link.attr("href"),
            title: link.html().trim()
        }
    }
});

Now that you have a transformer, you have a bunch of options. The methods will be detailed later, but for now here's a simple scrape:

// Scrape the front page
Hacker.scrape("http://news.ycombinator.com", function(err, posts){
    console.log(posts);
    // posts is an array of transform results
    // => [{ url: "http://www.pluminjs.com/", title: "Create and manipulate fonts using Javascript" }, ...]
});

By separating the transformer from the URL, we can reuse it for every Hacker News list page.
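
For example, the same transformer will happily scrape page two of the listing. A quick sketch, reusing Hacker as defined above:

Hacker.scrape("https://news.ycombinator.com/news?p=2", function(err, posts){
    if(err) throw err;
    // Same shape of results, different page.
    console.log(posts);
});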

Transformer Methods

For all method examples assume the following transformer.

// Gets the textual content of every h1
var transformer = headcrab({
    selector: "h1",
    transform: function(el){
        return el.text();
    }
});

#scrape(url[s], [options], callback)

A transformer can be used to scrape a single webpage and return an array of results.

transformer.scrape("http://example.com", function(err, data){
    if(err) throw err;
    console.log(data); // => ["Example Domain"];
});

You can also pass an array to scrape multiple webpages.

transformer.scrape(["http://example.com", "http://something.else"], function(err, data){
    console.log(data.length === 2); // true
    console.log(data); // => [["Example Domain"], ["Something else", ...]];
});

You can pass a hash of options as the second argument. To improve the example above, we could set merge, and receive a flattened array of data.

transformer.scrape(["example.com", "something.else"], {
    merge: true
}, function(err, data){
    // See this flattened array.
    console.log(data); // => ["Example Domain", "Something else", ...]
});

Scrape Options

  • interval - Integer. Time in milliseconds to wait between each request. Requests are processed sequentially, and the timeout is set after each request is complete. Defaults to none.
  • each(data, url, idx) - Function. Called after each page is scraped, with that page's data, its URL and its index. Defaults to null.
  • merge - Boolean. When multiple URLs are passed, the result is an array of per-page result arrays. Setting merge flattens that array one level so all results sit together, and re-applies the transformer's sort function if one is defined. Defaults to false.
  • limit - Integer. Number of URLs in the array to scrape, starting from the first. Useful when URLs are generated programmatically. Defaults to all.
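
Putting a few of these together, here's a sketch (reusing the 'Hacker' transformer from earlier) that scrapes two list pages five seconds apart, logging progress as each page completes:

Hacker.scrape([
    "https://news.ycombinator.com/news?p=1",
    "https://news.ycombinator.com/news?p=2"
], {
    interval: 5000,
    merge: true,
    each: function(data, url, idx){
        // Called once per page, as soon as it has been transformed.
        console.log("Scraped " + url + " - " + data.length + " posts");
    }
}, function(err, posts){
    if(err) throw err;
    console.log(posts.length + " posts in total");
});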

#parse(html)

Parse an HTML string, or an array of HTML strings, using the transformer.

var title = transformer.parse("<h1>Title</h1> Article body...");
// => ["Title"]
 
var titles = transformer.parse([
    "<h1>Article #1</h1>",
    "<h1>Article #2</h1>",
    "<h1>Article #3</h1>",
]);
// => [["Article #1"], ["Article #2"], ["Article #3"]]

#each(arr, options, callback)

Scrape multiple pages using a basic route and data from an array.

// Game is a transformer. It'll transform
// a Giant Bomb page into a hash of data.
var Game = headcrab({
    selector: "h1 a.wiki-title",
    transform: function(el){
        return {
            url: "http://giantbomb.com" + el.attr("href"),
            title: el.text()
        }
    }
});
 
// Visit each game page and transform.
Game.each([{id: 2980}, {id: 24079}, {id: 20238}], {
    route: "http://giantbomb.com/this-is-just-vanity/3030-{id}",
    interval: 30000,
    merge: true
}, function(err, games){
    // games is an array of game data.
    console.log(games);
});

For cases when your array doesn't contain objects, use a 'param' template key in your route to include the value.

UserTransformer.each([123, 124, 125], {
    // Simple routing template. Include id.
    route: "http://example.com/users?id={param}",
    interval: 30000,
    merge: true
}, function(err, users){
    console.log(users);
});

Each options

All of the options in #scrape, plus:

  • route - String. A string including keys from objects in the array. If array elements are not objects use 'param' as a template key. Required.

#walk(options, callback)

Walk through pages of a website using a simple route. The following (using the 'Hacker' transformer defined above) will scrape pages 1, 2 and 3 of Hacker News and return the results as a single array.

Hacker.walk({
    route: "https://news.ycombinator.com/news?p={param}",
    pages: 3,
    merge: true
}, function(err, data){
    // Data is an array of post objects.
});

Walk options

All of the options in #scrape and #each, plus:

  • start - Integer. Page number to start at. Defaults to 1.
  • pages - Integer. Number of pages to include. Defaults to 5.
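
As a further sketch, here's a walk that starts deeper in the listing (option names as documented above):

Hacker.walk({
    route: "https://news.ycombinator.com/news?p={param}",
    start: 2,  // begin at page 2...
    pages: 3,  // ...and walk three pages (2, 3 and 4)
    interval: 10000,
    merge: true
}, function(err, data){
    if(err) throw err;
    // data is a single flattened array of posts from pages 2-4.
});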

#extend(options)

Extend a transformer, replacing some of its options and adding new ones.

// Selects all headers, instead of just h1.
var extended = transformer.extend({
    selector: "h1, h2, h3, h4, h5, h6",
    transform: function(el){
        return {
            level: parseInt(el[0].name.replace("h", ""), 10),
            content: el.html()
        }
    }
});
 
var titles = extended.parse("<h1>headcrab</h1><h3>#extend(options)</h3>");
console.log(titles);
// => [{level: 1, content: "headcrab"}, {level: 3, content: "#extend(options)"}];

Transformer Options

A Headcrab transformer is nothing but its options. A basic example looks like this:

// Grabs titles
var Titles = headcrab({
    selector: "h1, h2, h3",
    transform: function(el, idx){
        return {
            priority: el[0].name.replace("h", ""),
            text: el.html()
        }
    }
});

  • selector - String. A CSS3/CSSSelect selector. Defaults to "body". See https://github.com/fb55/css-select for a list of supported selectors.
  • transform(el, idx, options, $) - Function. Applied to every selector match. Should return the transformed value, e.g. an object of extracted data. It is passed the cheerio element, its index, the transformer options hash and a cheerio object containing all selections. Defaults to a no-op; selections are returned as they are.
  • limit - Integer. Maximum number of selections. Defaults to all.
  • keepFalsy - Boolean. If the transform function returns a falsy value, should it be kept? Defaults to false.
  • sort(a, b) - Function. A sorting function, passed to Array#sort with two transformed results. Defaults to the order processed.

The following delegate to third party libraries.
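
To illustrate how these options combine, here's a sketch of a transformer that limits, filters and sorts its results (all option names are taken from the list above):

// Collects up to ten non-empty headings, sorted alphabetically.
var Headings = headcrab({
    selector: "h1, h2, h3",
    limit: 10,
    keepFalsy: false, // drop headings whose text is empty
    transform: function(el){
        return el.text().trim();
    },
    sort: function(a, b){
        return a.localeCompare(b);
    }
});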

May I suggest...

If your transform involves any unique helper functions that don't belong anywhere else, I think it's a good convention to add them straight to the options hash. It's a nice way to keep everything together, and means helpers can be replaced by extending the transformer.

// transforms/links.js
var headcrab = require("headcrab");
module.exports = headcrab({
    selector: "a",
    transform: function(el, idx, options){
        return options.coolLinks(el.attr("href"));
    },
    coolLinks: function(link){
        // return a cooler link
    }
});

For single use scraping

Although creating multi-use transformers is recommended, you can use headcrab for simple one-off operations with a different syntax: headcrab(url, options, callback). The options define the transformer. For example:

// Grab the title of example.com
headcrab("http://example.com", {
    selector: "h1",
    transform: function(el){
        return el.text();
    }
}, function(err, title){
    console.log(title[0]);
    // => "Example Domain"
});

CLI

Headcrab also comes bundled with a CLI.

We emphasise the benefits of creating multi-use transformers. If you export a transformer configuration from a file you can use it from the CLI, and stream out the results as JSON. For example a transformer like this...

// example-transformer.js
module.exports = {
    selector: "ul.articles li",
    transform: function(el){
        return {
            title: el.find("h3").html(),
            description: el.find("span.deck").html(),
            author: el.find(".byline").text().trim()
        }
    }
}

...can be used like this...

$ headcrab http://example.com/articles -u example-transformer.js -o data/articles.json

...to save some JSON like this to data/articles.json.

{
    "results": [
        {
            "title": "This is a title",
            "description": "This is the <em>description</em>",
            "author": "Richard Foster"
        }
    ]
}

Alternatively you can pass a selector string and use the CLI without a transformer. The following will extract the innerHTML of all p elements on http://example.com and pipe out some JSON.

$ headcrab http://example.com -s "p"
=> ["This domain is established to be used for illustrative examples in documents. You may use this domain in examples without prior coordination or asking for permission.","<a href=\"http://www.iana.org/domains/example\">More information...</a>"]

CLI options

You can pass one or more URLs to headcrab.

$ headcrab http://reddit.com/r/programming http://reddit.com/r/web_design -u redditor.js -i 10000

Either a selector -s {str} or a transformer -u {path} is required.

  • -s / --selector - String. A selector string to use. Will take the innerHTML of each match. Alternatively use a transformer.
  • -u / --use - String. Relative path to a transformer to use. If the file exports a plain object it will be wrapped in a headcrab transformer. Alternatively use a selector.
  • -o / --output - String. Relative path to output the JSON result to. Defaults to piping to stdout.
  • -p / --pretty - Boolean or Number. Use the -p flag to prettify the JSON result. When true, it uses 4 spaces; pass a number to change this, e.g. -p 2 will use 2 spaces. Defaults to false.

You can pass a subset of options to the scrape operation. See the #scrape options for more information.

  • -i / --interval - Number. Milliseconds to wait between scrapes. Defaults to 0.
  • -m / --merge - Boolean. Should the result arrays be merged? Defaults to false.
  • -l / --limit - Number. Limit the number of URLs passed which should be scraped. Useful when scraping programmatically. Defaults to all URLs.
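
Combining a few of these, here's a sketch that scrapes two subreddits ten seconds apart, merges the results and writes pretty-printed JSON to a file (assuming the boolean merge flag can be passed bare):

$ headcrab http://reddit.com/r/programming http://reddit.com/r/web_design -u redditor.js -i 10000 -m -p 2 -o data/reddit.json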

Credits + License

It's MIT licensed so you can do what you like with it. Public attribution is nice too. Don't use it for anything evil.

Thanks to request, cheerio, and Half-Life.
