reqscraper
Lightweight wrapper for Request and X-Ray JS.
Sample Usage
This module contains the requestJS
for making HTTP requests, and x-ray
for easily scraping websites, called req
and scrape
respectively.
Both return promise. req
has internal control structure to retry request up to 5 times for failsafe.
Brief API doc
-
req(options)
, whereoptions
is a request options object. SeerequestJS
for full detail. -
scrape(dyn, url, scope, selector)
, wheredyn
is the boolean to use dynamic scraping usingx-ray-phantom
;url
is the page url,scope
andselector
are some HTML selectors. Seex-ray
for full detail. -
scrapeCrawl(dyn, url, selector, tailArr, [limit])
, wheredyn
is true for dynamic scraping usingx-ray-phantom
;
req(options)
Convenient wrapper for request js
- HTTP request method that returns a promise.
param | desc |
---|---|
options |
A request options object. See requestJS for full detail. |
// importsvar scraper = require('reqscraper');var req = scraper.req; // the request module // sample use of reqvar options = { method: 'GET', url: 'https://www.google.com', headers: { 'Accept': 'application/json', 'Authorization': 'some_auth_details' } } // returns the request result in a promise, for chainingreturn req(options)// prints the result.then(console.log)// prints the error if thrown.catch(console.log)
scrape(dyn, url, scope, selector)
Scraper that returns a promise. Backed by x-ray
.
param | desc |
---|---|
dyn |
the boolean to use dynamic scraping using x-ray-phantom |
url |
the page url to scrape |
[scope] |
Optional scope to narrow now the target HTML for selector |
selector |
HTML selector. See x-ray for full detail. |
// importsvar scraper = require('reqscraper');var scrape = scraper.scrape; // the scraper // sample use of scrape, non-dynamicreturn scrape(false, 'https://www.google.com', 'body')// prints the HTML <body> tag.then(console.log) // You can also call it with scope in param #3, and selector in #4return scrape(false, 'https://www.google.com', 'body', ['li'])// prints the <li>'s inside the <body> tag.then(console.log)
scrapeCrawl(dyn, url, selector, tailArr)
An extension of scrape
above with crawling capability. Returns a promise with results in a tree-like JSON structure. Crawls by a breath-first tree structure, and does not crawl deeper if the root of a branch is not crawlable.
param | desc |
---|---|
dyn |
the boolean to use dynamic scraping using x-ray-phantom |
url |
the base page url to scrape and crawl from |
selector |
The selector for the base page (first level) |
tailArr |
An array of selectors for each level to crawl. Note that a preceeding selector must specify the urls to crawl via hrefs . |
[limit] |
An optional integer to limit the number of children crawled at every level. |
// importsvar scraper = require('reqscraper');var scrapeCrawl = scraper.scrapeCrawl; // the scrape-crawler // dynamic scrapervar dc = scrapeCrawl.bind(null, true)// static scrapervar sc = scrapeCrawl.bind(null, false) // sample use of scrape-crawl, static // base selector, level 0// has attribute `hrefs` for crawling nextvar selector0 = { img: ['.dribbble-img'], h1: ['h1'], hrefs: ['.next_page@href']} // has attribute `hrefs` for crawlingvar selector1 = { h1: ['h1'], hrefs: ['.next_page@href']}// the last selector where crawling ends; no need for `hrefs`var selector2 = { h1: ['h1']} // Sample call of the methodsc( 'https://dribbble.com', selector0, // crawl for 3 more times before stoppping at the 4th level [selector1, selector1, selector1, selector2] ).then(function(res){ // prints the result console.log(JSON.stringify(res, null, 2))}) // Same as above, but with a limit on how many children should be crawled (3 below)sc( 'https://dribbble.com', selector0, // crawl for 3 more times before stoppping at the 4th level [selector1, selector1, selector1, selector2], 3 ).then(function(res){ // prints the result console.log(JSON.stringify(res, null, 2))})
Changelog
Aug 18 2015
- Added scrapecrawl, basically a scraper extended from
scrape
that can also crawl. - Updated README for better API doc.