PuppetScraper is an opinionated wrapper library for utilizing Puppeteer to scrape pages easily, bootstrapped using Jared Palmer's tsdx.
Most people create a new scraping project by `require`-ing Puppeteer and writing their own logic to scrape pages, and that logic gets more complicated when working with multiple pages.
PuppetScraper allows you to just pass the URLs to scrape, the function to evaluate (the scraping logic), and how many pages (or tabs) to open at a time. Basically, PuppetScraper abstracts away creating multiple page instances and retrying the evaluation logic.
Version 0.1.0 note: PuppetScraper was initially made as a project template rather than a wrapper library, but the core logic is still the same, with various improvements and without extra libraries, so you can include PuppetScraper in your project easily using npm or yarn.
Brief example
Here's a basic example of scraping the entries on the first page of Hacker News:
```js
// examples/hn.js

const PuppetScraper = require('puppet-scraper');

const ps = await PuppetScraper.launch();

const data = await ps.scrapeFromUrl({
  url: 'https://news.ycombinator.com',
  evaluateFn: () => {
    // collect story titles and links
    // (selectors may need updating to match the current HN markup)
    const items = [];
    document.querySelectorAll('.titleline > a').forEach((node) => {
      items.push({ title: node.innerText, url: node.href });
    });
    return items;
  },
});

console.log({ data });

await ps.close();
```
View more examples in the examples directory.
Usage
Installing dependency
Install `puppet-scraper` via npm or yarn:
```sh
$ npm install puppet-scraper
# --- or ---
$ yarn add puppet-scraper
```
Install the peer dependency `puppeteer`, or a Puppeteer equivalent (`chrome-aws-lambda`, untested):
```sh
$ npm install puppeteer
# --- or ---
$ yarn add puppeteer
```
Instantiation
Create the PuppetScraper instance, either by launching a new browser instance, connecting to an existing browser instance, or using an existing browser instance:
```js
const PuppetScraper = require('puppet-scraper');
const Puppeteer = require('puppeteer');

// launches a new browser instance
const instance = await PuppetScraper.launch();

// connect to an existing browser instance
const external = await PuppetScraper.connect({
  browserWSEndpoint: 'ws://127.0.0.1:9222/devtools/browser/...',
});

// use an existing browser instance
const browser = await Puppeteer.launch();
const existing = await PuppetScraper.use({ browser });
```
Customize options
`launch` and `connect` accept the same props as `Puppeteer.launch` and `Puppeteer.connect`, but with two extra properties: `concurrentPages` and `maxEvaluationRetries`:
```js
const PuppetScraper = require('puppet-scraper');

const instance = await PuppetScraper.launch({
  concurrentPages: 2,
  maxEvaluationRetries: 10,
  headless: false, // regular Puppeteer.launch option
});
```
`concurrentPages` determines how many pages/tabs will be opened and used for scraping.
`maxEvaluationRetries` determines how many times a page will try to evaluate the given `evaluateFn` function (see below); if an evaluation throws an error, the page reloads and tries to evaluate again.
If `concurrentPages` and `maxEvaluationRetries` are not specified, default values are used.
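The retry behavior described above can be pictured roughly as the following loop. This is a hypothetical sketch, not the library's actual implementation; `evaluateWithRetries` and `reload` are illustrative names:

```javascript
// Hypothetical sketch of the retry behavior described above (not
// puppet-scraper's actual code): try evaluateFn up to
// maxEvaluationRetries times, calling reload() between failed attempts.
async function evaluateWithRetries(evaluateFn, maxEvaluationRetries, reload) {
  let lastError;
  for (let attempt = 0; attempt < maxEvaluationRetries; attempt += 1) {
    try {
      return await evaluateFn();
    } catch (error) {
      lastError = error;
      await reload(); // the real library reloads the page before retrying
    }
  }
  throw lastError;
}
```

If every attempt fails, the last error is rethrown to the caller.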
Scraping single page
As shown in the example above, use `.scrapeFromUrl` and pass an object with the following properties:
- `url: string`, page URL to be opened
- `evaluateFn: function`, function to evaluate (scraper method)
- `pageOptions: object`, `Puppeteer.DirectNavigationOptions` props to override page behaviors
```js
const data = await instance.scrapeFromUrl({
  url: 'https://news.ycombinator.com',
  evaluateFn: () => {
    // ...scraping logic, e.g. the Hacker News example above...
  },
});
```
`pageOptions` defaults the `waitUntil` property to `networkidle0`, which you can read more about in the Puppeteer API documentation.
Scraping multiple pages
Same as `.scrapeFromUrl`, but use `.scrapeFromUrls` and pass a `urls` property containing an array of URL strings:
- `urls: string[]`, page URLs to be opened
- `evaluateFn: function`, function to evaluate (scraper method)
- `pageOptions: object`, `Puppeteer.DirectNavigationOptions` props to override page behaviors
```js
const urls = Array.from({ length: 5 }).map(
  (_, i) => `https://news.ycombinator.com/news?p=${i + 1}`,
);

const data = await instance.scrapeFromUrls({
  urls,
  evaluateFn: () => {
    // ...same scraping logic as the single-page example...
  },
});
```
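To picture how the `concurrentPages` limit works when scraping multiple URLs, here is a rough, hypothetical pool-style sketch (illustrative names, not the library's internal code): up to `concurrentPages` workers each pull the next URL until the list is exhausted.

```javascript
// Hypothetical sketch of concurrency limiting (not puppet-scraper's
// internals): run at most `concurrentPages` workers, each pulling the
// next item until all items are processed.
async function mapConcurrent(items, concurrentPages, worker) {
  const results = new Array(items.length);
  let next = 0;
  const run = async () => {
    while (next < items.length) {
      const i = next;
      next += 1;
      results[i] = await worker(items[i]);
    }
  };
  const pool = Array.from(
    { length: Math.min(concurrentPages, items.length) },
    run,
  );
  await Promise.all(pool);
  return results;
}
```

Results keep the same order as the input list, so the data for `urls[i]` ends up at index `i`.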
Closing instance
When there's nothing left to do, don't forget to close the instance, which also closes the browser:
```js
await instance.close();
```
Access the browser instance
PuppetScraper also exposes the browser instance if you want to do things manually:
```js
const browser = instance.___internal.browser;
```
Contributing
Thanks goes to these wonderful people (emoji key):
Griko Nibras 💻 🚧
This project follows the all-contributors specification. Contributions of any kind are welcome!