# Jason the Miner

Harvesting data at the `<html>` mine... Jason the Miner, a versatile Web scraper for Node.js.
## ⛏ Features
- Composable: via a modular architecture based on pluggable processors. The output of one processor feeds the input of the next one. There are 3 types of processors:
  - loaders: to fetch the data (via HTTP requests, by reading text files, etc.)
  - parsers: to parse the data (HTML by default) & extract the relevant parts according to a predefined schema
  - transformers: to transform and/or output the results (to a CSV file, via email, etc.)
- Configurable: each processor can be chosen & configured independently
- Extensible: you can register your own custom processors
- CLI-friendly: Jason the Miner works well with pipes & redirections
- Promise-based API
- MIT-licensed
## ⛏ Installing

```shell
$ npm install -g jason-the-miner
```
## ⛏ Demos

Clone the project...

```shell
$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run demos
```
...and have a look at the "demos" folder. Among them, you'll find demos scraping:
- Simple GitHub search (JSON, CSV, Markdown output)
- Extended GitHub search with issues (including following links & paginating issues)
- Goodreads books, following links to Amazon to grab their product IDs
- Google search results, following them to find mobile apps
- IMDb images gallery links (with pagination)
- Mixcloud stats, templating them & sending them by mail
- Mixcloud SPA scraping controlling a headless browser
- Avatars download
- Bulk insertions to Elasticsearch from a CSV file
- ...
## ⛏ Examples

### CLI
Scraping the most popular JavaScript scrapers from GitHub:
// github-config.json "load": "http": "url": "https://github.com/search" "params": "q": "scraper" "l": "JavaScript" "type": "Repositories" "s": "stars" "o": "desc" "parse": "html": "repos": ".repo-list .repo-list-item h3 > a" "transform": "json-file": "path": "./github-repos.json"
```shell
$ jason-the-miner -c github-config.json
```
Alternatively, with pipes & redirections:
```js
// github-config.json
{
  "parse": {
    "html": {
      "repos": [".repo-list .repo-list-item h3 > a"]
    }
  }
}
```
```shell
$ curl "https://github.com/search?q=scraper&l=JavaScript&type=Repositories&s=stars&o=desc" | jason-the-miner -c github-config.json > github-repos.json
```
### API
```js
const JasonTheMiner = require('jason-the-miner');

const jason = new JasonTheMiner();

const load = {
  http: {
    url: "https://github.com/search",
    params: { q: "scraper", l: "JavaScript", type: "Repositories", s: "stars", o: "desc" }
  }
};

const parse = {
  html: {
    "repos": [".repo-list .repo-list-item h3 > a"]
  }
};

jason.harvest({ load, parse }).then(results => console.log(results));
```
## ⛏ The config file
"load": "[loader name]": // loader options "parse": "[parser name]": // parser options "transform": "[transformer name]": // transformer options
### Loaders

Jason the Miner comes with 5 built-in loaders:

| Name | Description | Options |
|---|---|---|
| `http` | Uses axios as HTTP client | All axios request options + `[_concurrency=1]` (to limit the number of concurrent requests when following/paginating) & `[_cache]` (to cache responses on the file system) |
| `browser` | Uses puppeteer as browser | puppeteer `launch`, `goto`, `screenshot`, `pdf` and `evaluate` options |
| `file` | Reads the content of a file | `path`, `[stream=false]`, `[encoding="utf8"]` & `[_concurrency=1]` (to limit the number of concurrent requests when paginating) |
| `csv-file` | Uses csv-parse to read a CSV file | All csv-parse options in a `csv` object + `path` + `[encoding="utf8"]` |
| `stdin` | Reads the content from the standard input | `[encoding="utf8"]` |
For example, an HTTP load config whose responses will be cached in the "tests/http-cache" folder:
..."load": "http": "baseURL": "https://github.com" "url": "/search?l=JavaScript&o=desc&q=scraper&s=stars&type=Repositories" "_concurrency": 2 "_cache": "_folder": "tests/http-cache" ...
Check the demos folder for more examples.
### Parsers

Currently, Jason the Miner comes with 2 built-in parsers:

| Name | Description | Options |
|---|---|---|
| `html` | Parses HTML, built with Cheerio | A parse schema |
| `csv` | Parses CSV, built with csv-parse | All csv-parse options |
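For instance, a sketch (hypothetical file path) that reads a local CSV file with the file loader and parses it by passing csv-parse options directly to the csv parser:

```js
...
"load": {
  "file": {
    "path": "./books.csv" // hypothetical path
  }
},
"parse": {
  "csv": {
    "columns": true,
    "delimiter": ","
  }
}
...
```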
#### HTML schema definition

##### Examples
... "html": // Single value "repo": ".repo-list .repo-list-item h3 > a" // Collection of values "repos": ".repo-list .repo-list-item h3 > a" // Single object "repo": "name": ".repo-list .repo-list-item h3 > a" "description": ".repo-list .repo-list-item div:first-child" // Single object, providing a root selector _$ "repo": "_$": ".repo-list .repo-list-item" "name": "h3 > a" "description": "div:first-child" // Collection of objects "repos": "_$": ".repo-list .repo-list-item" "name": "h3 > a" "description": "div:first-child" // Following "repos": "_$": ".repo-list .repo-list-item" "name": "h3 > a" "description": "div:first-child" "_follow": "_link": "h3 > a" "stats": "_$": ".pagehead-actions" "watchers": "li:nth-child(1) a.social-count" "stars": "li:nth-child(2) a.social-count" "forks": "li:nth-child(3) a.social-count" // Paginating "repos": "_$": ".repo-list .repo-list-item" "name": "h3 > a" "description": "div:first-child" "_paginate": "_link": ".pagination > a[rel='next']" "_depth": 1 ...
##### Full flavour
... "html": "title": "title | trim" "metas": "lang": "html < attr(lang)" "content-type": "meta[http-equiv='Content-Type'] < attr(content)" "stylesheets": "link[rel='stylesheet'] < attr(href)" "repos": "_$": ".repo-list .repo-list-item ? text(crawler)" "_slice": "0,3" "name": "h3 > a" "last-update": "relative-time < attr(datetime)" "_follow": "_link": "h3 > a" "description": "meta[property='og:description'] < attr(content) | trim" "url": "link[rel='canonical'] < attr(href)" "stats": "_$": ".pagehead-actions" "watchers": "li:nth-child(1) a.social-count | trim" "stars": "li:nth-child(2) a.social-count | trim" "forks": "li:nth-child(3) a.social-count | trim" "_follow": "_link": ".js-repo-nav span[itemprop='itemListElement']:nth-child(2) > a" "open-issues": "_$": ".js-navigation-container li > div > div:nth-child(3)" "desc": "a:first-child | trim" "opened": "relative-time < attr(datetime)" "_paginate": "_link": "a[rel='next']" "_slice": "0,1" "_depth": 2 ...
As you can see, a schema is a plain object that recursively defines:
- the names of the values/collection of values that you want to extract: "title" (single value), "metas" (object), "stylesheets" (collection of values), "repos" (collection of objects)
- how to extract them: `[selector] ? [matcher] < [extractor] | [filter]` (check "Parse helpers" below)
Additional instructions can be passed to the parser:
- `_$` acts as a root selector: further parsing will happen in the context of the element identified by this selector
- `_slice` limits the number of elements to parse, like `String.prototype.slice(begin[, end])`
- `_follow` tells Jason to follow a single link (fetch new data) & to continue scraping after the new data is received
- `_paginate` tells Jason to paginate (fetch & scrape new data) & to merge the new values in the current context; here, multiple links can be selected to scrape multiple pages in parallel
#### Parse helpers
The following syntax specifies how to extract a value:
```
[property name]: [selector] ? [matcher] < [extractor] | [filter]
```
For instance:
..."repos": ".repo-list-item h3 > a ? text(crawler) < attr(title) | trim"...
This will extract a "repos" array of values from the links identified by the ".repo-list-item h3 > a" selector, matching only the ones containing the text "crawler". The values will be retrieved from the "title" attribute of each link and will be trimmed.
Jason has 4 built-in element matchers:
- `text(regexString)`
- `html(regexString)`
- `attr(attributeName,regexString)`
- `slice(begin,end)`
They are used to test an element in order to decide whether to include/discard it from parsing. If not specified, Jason includes every element.
7 built-in text extractors:

- `text([optionalStaticText])` (by default)
- `html()`
- `attr(attributeName)`
- `regex(regexString)`
- `date(inputFormat,outputFormat)` (parses a date with moment)
- `uuid()` (generates a uuid v1 with uuid)
- `count()` (counts the number of elements matching the selector, needs an array schema definition)
and 5 built-in text filters:
- `trim`
- `single-space`
- `lowercase`
- `uppercase`
- `json-parse` (to parse JSON, like JSON-LD)
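To illustrate how matchers, extractors and filters combine, here is a hypothetical schema snippet (the selectors are made up):

```js
...
"html": {
  // keep only the list items whose text contains "js",
  // extract their "title" attribute and trim it
  "names": [".projects li ? text(js) < attr(title) | trim"],
  // extract embedded JSON-LD metadata and parse it
  "metadata": "script[type='application/ld+json'] < html() | json-parse"
}
...
```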
### Transformers

| Name | Description | Options |
|---|---|---|
| `stdout` | Writes the results to stdout | `[encoding="utf8"]` |
| `json-file` | Writes the results to a JSON file | `path` & `[encoding="utf8"]` |
| `csv-file` | Writes the results to a CSV file using csv-stringify | `csv`: same as csv-stringify + `path`, `[encoding='utf8']` and `[append=false]` (whether to append the results to an existing file or not) |
| `download-file` | Downloads files to a given folder using axios | `[baseURL]`, `[parseKey]`, `[folder='.']`, `[namePattern='{name}']`, `[maxSizeInMb=1]` & `[concurrency=1]` |
| `email` | Sends the results by email using nodemailer | Same as nodemailer, split between the `smtp` and `message` options |
Jason supports a single transformer or an array of transformers:
... "transform": "json-file": "path": "./github-repos.json" "csv-file": "path": "./github-repos.csv"
## ⛏ Bulk processing
Parameters can be defined in a CSV file and applied to configure the processors:
"bulk": "csv-file": "path": "./github-search-queries.csv" "csv": "columns": true "delimiter": "," "load": "http": "baseURL": "https://github.com" "url": "/search?l={language}&o=desc&q={query}&s=stars&type=Repositories" "_concurrency": 2 "parse": "html": "title": "< text(Best {language} repos)" "repos": ".repo-list .repo-list-item h3 > a" "transform": "json-file": "path": "./github-repos-{language}.json"
`github-search-queries.csv`:

```csv
language,query
JavaScript,scraper
Python,scraper
```

Each row provides the values for the `{language}` and `{query}` placeholders in the config, so the whole pipeline runs once per row.
## ⛏ API

### `constructor({ fallbacks = {} } = {})`

`fallbacks` defines which processor to use when not explicitly configured (or missing in the config file):

- `load`: `'identity'`
- `parse`: `'identity'`
- `transform`: `'identity'`
- `bulk`: `null`

The fallbacks change when using the CLI (see `bin/jason-the-miner.js`):

- `load`: `'stdin'`
- `parse`: `'html'`
- `transform`: `'stdout'`
- `bulk`: `null`
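For example, a minimal sketch (assumed usage) that overrides the default fallbacks when instantiating Jason programmatically:

```js
const JasonTheMiner = require('jason-the-miner');

// Assumption for illustration: fall back to the "http" loader and the "stdout"
// transformer whenever a config does not specify them explicitly.
const jason = new JasonTheMiner({
  fallbacks: {
    load: 'http',
    parse: 'html',
    transform: 'stdout'
  }
});
```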
### `loadConfig(configFile)`

Loads a config from a JSON or JS file.

```js
jason.loadConfig('./github-config.json');
```
### `harvest({ bulk, load, parse, transform } = {})`

Launches the harvesting process:

```js
jason.harvest().then(results => console.log(results));
```

You can pass custom options to temporarily override the current config:

```js
jason
  .harvest({
    // illustrative values
    transform: { "csv-file": { "path": "./github-repos.csv" } }
  })
  .then(results => console.log(results));
```
To permanently override the current config, you can modify Jason's `config` property:

```js
// illustrative values: every subsequent harvest will use this transform config
jason.config.transform = {
  "json-file": { "path": "./github-repos.json" }
};

jason.harvest().then(allResults => console.log(allResults));
```
### `registerHelper({ category, name, helper })`

Registers a parse helper in one of the 3 categories: `match`, `extract` or `filter`. `helper` must be a function.

```js
// illustrative example: register a custom text filter
jason.registerHelper({
  category: 'filter',
  name: 'remove-protocol',
  helper: value => value.replace(/^https?:\/\//, '')
});
```
### `registerProcessor({ category, name, processor })`

Registers a new processor in one of the 3 categories: `load`, `parse` or `transform`. `processor` must be a class implementing the `run()` method:

```js
// illustrative sketch of a custom "template" transformer
class Templater {
  constructor(config) {
    // automatically receives its config
    this._config = config;
  }

  /**
   * @param {*} results
   * @return {Promise.<*>}
   */
  run(results) {
    // must be implemented & must return a promise
  }
}

jason.registerProcessor({
  category: 'transform',
  name: 'template',
  processor: Templater
});

jason.config.transform = {
  "template": {
    "templatePath": "my-template.tpl",
    "outputPath": "my-page.html"
  }
};
```
Be aware that loaders must also implement the `getConfig()` and `buildLoadOptions({ link })` methods, as in the sketch below.
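A rough illustration (assumed shapes, not the library's exact interfaces):

```js
// Illustrative sketch only: the exact signatures expected by Jason the Miner
// may differ, check the built-in loaders in the source code.
class StaticLoader {
  constructor(config) {
    this._config = config;
  }

  // must return a promise resolving to the data to be parsed
  run() {
    return Promise.resolve('<html><body><h1>Hello!</h1></body></html>');
  }

  // exposes the loader config
  getConfig() {
    return this._config;
  }

  // builds the load options needed to follow/paginate a link
  buildLoadOptions({ link }) {
    return { ...this._config, url: link };
  }
}

jason.registerProcessor({
  category: 'load',
  name: 'static',
  processor: StaticLoader
});
```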
Have a look at the source code for more info.
## ⛏ Testing

```shell
$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run test
```
## ⛏ Resources
- Web Scraping With Node.js: https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
- X-ray, The next web scraper. See through the noise: https://github.com/lapwinglabs/x-ray
- Simple, lightweight & expressive web scraping with Node.js: https://github.com/eeshi/node-scrapy
- Node.js Scraping Libraries: http://blog.webkid.io/nodejs-scraping-libraries/
- https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/
- http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/
- Web scraping o rastreo de webs y legalidad: https://www.youtube.com/watch?v=EJzugD0l0Bw
- Scraper API blog: https://www.scraperapi.com/blog/
## ⛏ A final note...

Please take these guidelines into consideration when scraping:
- The content being scraped is not copyright protected.
- The act of scraping does not burden the services of the site being scraped.
- The scraper does not violate the Terms of Use of the site being scraped.
- The scraper does not gather sensitive user information.
- The scraped content adheres to fair use standards.