Extract rich metadata from URLs.
```sh
npm install scrappy --save
```
Scrappy uses a simple two-step process to extract metadata from any URL or file. First, it runs the input through a pluggable `scrapeStream` middleware pipeline to extract metadata about the file itself. The result is then passed through a pluggable `extract` pipeline that formats the metadata for presentation and extracts additional metadata about related entities.
Scraping is layered: one function makes the HTTP request and passes the response on to the next, which accepts an HTTP response object and transforms it into a readable stream. The core `scrapeStream` function accepts a readable stream and an input scrape result (at a minimum it should have `url`, but it can carry other known metadata, e.g. from HTTP headers), and returns the scrape result after running it through the plugin pipeline. It also accepts an `abort` function, which can be used to close the stream early.
The default plugins live in the `plugins/` directory and are combined into a single pipeline using `compose` (based on `throwback`, but calling `next(stream)` to pass a stream forward).
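To make the middleware contract concrete, here is a minimal sketch of throwback-style composition where each plugin calls `next(stream)` to hand a (possibly replaced) stream to the following plugin. The plugins and the object standing in for a readable stream are illustrative assumptions, not Scrappy's actual plugins.

```javascript
// Minimal sketch of throwback-style middleware composition. Each middleware
// receives the current stream and a `next` function; calling `next(stream)`
// passes the stream forward to the next middleware in the pipeline.
function compose (middleware) {
  return function (stream, done) {
    let index = -1

    function dispatch (i, stream) {
      if (i <= index) throw new TypeError('`next()` called multiple times')
      index = i
      const fn = i === middleware.length ? done : middleware[i]
      return fn(stream, (nextStream) => dispatch(i + 1, nextStream))
    }

    return dispatch(0, stream)
  }
}

// Hypothetical plugins: a plain object stands in for a readable stream, and
// each plugin attaches a piece of metadata before passing the stream on.
const pipeline = compose([
  (stream, next) => next({ ...stream, type: 'html' }),
  (stream, next) => next({ ...stream, encoding: 'utf-8' })
])

const result = pipeline({ url: 'http://example.com' }, (stream) => stream)
// result → { url: 'http://example.com', type: 'html', encoding: 'utf-8' }
```

Because each plugin controls what it passes to `next`, a plugin can wrap or replace the stream entirely (e.g. to decompress it) without later plugins needing to know.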
Extraction is based on a single function, `extract`. It accepts the scrape result and an optional array of helpers. The default extraction maps the scrape result into a proprietary snippet format that applications can use for visualization. Once extraction is done, each helper function is applied in turn to transform the extracted snippet.
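The extract-then-helpers flow can be sketched as follows. The snippet shape and the helper signature here are illustrative assumptions for the sake of the example, not Scrappy's actual format.

```javascript
// Sketch of the extraction step: map the scrape result into a snippet,
// then run each helper over the snippet in turn.
function extract (result, helpers = []) {
  // Default extraction: map the raw scrape result into a display snippet.
  let snippet = {
    type: result.type,
    url: result.url,
    title: result.meta && result.meta.title
  }

  // Each helper transforms the snippet, with the raw result for context.
  for (const helper of helpers) {
    snippet = helper(snippet, result)
  }

  return snippet
}

// Hypothetical helper: fall back to the URL when no title was scraped.
const defaultTitle = (snippet) =>
  snippet.title ? snippet : { ...snippet, title: snippet.url }

const snippet = extract({ type: 'html', url: 'http://example.com' }, [defaultTitle])
// snippet.title === 'http://example.com'
```

Helpers compose naturally this way: each one receives the output of the previous, so order matters when two helpers touch the same field.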
Some built-in extraction helpers are available in the `helpers/` directory, including a default favicon selector and image dimension extraction.
For example, you can use `scrapeAndExtract` (a simple wrapper around `extract`) to retrieve metadata from a webpage. In your own application, you may want to write your own `makeRequest` function or override other parts of the pipeline (e.g. to enable caching or customize the user agent).
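As an illustration of how the pieces fit together, here is a self-contained sketch of a scrape-then-extract flow. Everything in it is mocked: `fakeRequest`, the scrape result shape, and the snippet shape are assumptions for illustration, not Scrappy's actual API.

```javascript
// Illustrative stand-in for a custom `makeRequest`: a real one would perform
// the HTTP request (and could add caching or a custom user agent).
async function fakeRequest (url) {
  return { url, type: 'html', meta: { title: 'Example Domain' } }
}

// Hypothetical wrapper mirroring the scrape-then-extract flow: request the
// URL, scrape a result, then extract a snippet from it.
async function scrapeAndExtract (url, makeRequest = fakeRequest) {
  const result = await makeRequest(url) // scrape step
  return { url: result.url, title: result.meta.title } // extract step
}

scrapeAndExtract('http://example.com').then((snippet) => {
  console.log(snippet.title) // "Example Domain"
})
```

Passing `makeRequest` as a parameter is what makes the pipeline easy to override: swapping in a caching or custom-header request function changes nothing downstream.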
```sh
# Build the fixtures directory with raw content.
node scripts/fixtures.js
# Scrape the metadata results from fixtures.
node scripts/scrape.js
# Extract the snippets from the previous results.
node scripts/extract.js
```