scrape-pages

This package scrapes sites for text and files based on a single config file representing the crawler's flow.

⚠️ This project is under active development. Expect bugs and frequent API changes. To follow progress, check out the GitHub project boards.

Installation

npm install scrape-pages

Usage

Let's download the five most recent images from NASA's image of the day archive:

const { scrape } = require('scrape-pages')
// create a config object
const config = {
  scrape: {
    download: 'https://apod.nasa.gov/apod/archivepix.html',
    parse: {
      selector: 'body > b > a:nth-child(-n+10)',
      attribute: 'href'
    },
    scrapeEach: {
      download: 'https://apod.nasa.gov/apod/{value}',
      parse: {
        selector: 'a[href^="image"]',
        attribute: 'href'
      },
      scrapeEach: {
        name: 'image',
        download: 'https://apod.nasa.gov/apod/{value}'
      }
    }
  }
}
const options = {
  folder: './downloads',
  logLevel: 'info',
  logFile: './nasa-download.log'
}
 
// load the config into a new 'scraper'
// (scrape returns a promise, so call this inside an async function)
const scraper = await scrape(config, options)
const { on, emit, query } = scraper
on('image:complete', id => {
  console.log('COMPLETED image', id)
})
on('done', () => {
  console.log('finished.')
  const result = query({ scrapers: ['image'] })
  // result = [{
  //   image: [{ filename: 'img1.jpg' }, { filename: 'img2.jpg' }, ...]
  // }]
})

For more real-world examples, visit the examples directory.

Documentation

The scraper instance created from a config object is meant to be reusable and cached; it only knows about the config object. scraper.run can be called multiple times, and, as long as different folders are provided, each run will work independently. scraper.run returns an emitter.
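For example, one cached config can drive multiple independent runs as long as each points at its own folder (a minimal sketch reusing the scrape call and config from the Usage section; the folder paths are placeholders):

// reuse the same `config` object, one folder per run
const run1 = await scrape(config, { folder: './downloads/run-1' })
const run2 = await scrape(config, { folder: './downloads/run-2' })

run1.on('done', () => console.log('first run finished'))
run2.on('done', () => console.log('second run finished'))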

scrape

| param   | type           | required | type definition                | description                  |
| ------- | -------------- | -------- | ------------------------------ | ---------------------------- |
| config  | ConfigInit     | Yes      | src/settings/config/types.ts   | what is being downloaded     |
| options | RunOptionsInit | Yes      | src/settings/options/types.ts  | how something is downloaded  |

scraper

The scrape function returns a promise which yields these utilities (on, emit, and query).

on

Listen for events from the scraper

| event                | callback arguments    | description                                  |
| -------------------- | --------------------- | -------------------------------------------- |
| 'done'               | queryFor              | when the scraper has completed               |
| 'error'              | error                 | if the scraper encounters an error           |
| '<scraper>:progress' | queryFor, download id | emits progress of download until completed   |
| '<scraper>:queued'   | queryFor, download id | when a download is queued                    |
| '<scraper>:complete' | queryFor, download id | when a download is completed                 |
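For example, the 'image' scraper from the Usage config could report its lifecycle like this (a sketch assuming the callback arguments listed above; the log messages are illustrative):

// log each stage of the 'image' scraper's downloads
on('image:queued', (queryFor, downloadId) => {
  console.log('queued image', downloadId)
})
on('image:progress', (queryFor, downloadId) => {
  console.log('progress on image', downloadId)
})
on('error', error => {
  console.error('scraper error', error)
})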

emit

While the scraper is working, you can affect its behavior by emitting these events:

| event            | arguments | description                                                          |
| ---------------- | --------- | -------------------------------------------------------------------- |
| 'useRateLimiter' | boolean   | turn on or off the rate limit defined in the run options              |
| 'stop'           |           | stop the crawler (note that in progress requests will still complete) |

Each event returns the queryFor function as its first argument.
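For example, the rate limiter could be toggled off and the crawl stopped early like so (a sketch; the ten second timeout is illustrative):

// ignore the rate limit from the run options for this run
emit('useRateLimiter', false)

// stop the crawl after ten seconds; in-progress requests still complete
setTimeout(() => emit('stop'), 10000)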

query

This function is provided as an argument in emitter callbacks and is used to get data back out of the scraper whenever you need it. It takes the following arguments:

| name     | type     | required | description                                                            |
| -------- | -------- | -------- | ---------------------------------------------------------------------- |
| scrapers | string[] | Yes      | scrapers that will return their filenames and parsed values, in order   |
| groupBy  | string   | Yes      | name of a scraper which will delineate the values in scrapers           |
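For example, pulling every downloaded image out of the scraper from the Usage section might look like this (a sketch; the 'post' name passed to groupBy is hypothetical and assumes the intermediate step was given a name in the config):

// collect image filenames, grouped by a hypothetical 'post' scraper
const result = query({ scrapers: ['image'], groupBy: 'post' })
// result = [
//   { image: [{ filename: 'img1.jpg' }] },
//   { image: [{ filename: 'img2.jpg' }] },
//   ...
// ]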

Motivation

The pattern for downloading data from a website is largely the same across sites. It can be summed up like so:

  • get a page from a url
    • scrape the page for more urls
      • get a page
        • get some text or media from page

What varies is how much nested URL grabbing is required and at which steps data is saved. This project is an attempt to generalize that process into a single static config file.
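Each bullet in that outline maps onto one level of nesting in the config. A skeleton of that shape (with placeholder URL and selectors) could look like:

const config = {
  scrape: {
    download: 'https://example.com/archive',           // get a page from a url
    parse: { selector: 'a.post', attribute: 'href' },  // scrape the page for more urls
    scrapeEach: {
      download: 'https://example.com{value}',          // get a page
      parse: { selector: 'img', attribute: 'src' },
      scrapeEach: {
        name: 'media',
        download: '{value}'                            // get some text or media from page
      }
    }
  }
}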

Describing a site crawler with a single config enforces a structure and familiarity that is less common in other scraping libraries. Not only does this make your surface API more condensed and immediately recognizable, it also opens the door to sharing and collaboration, since passing JSON objects around the web is safer than passing executable code. Hopefully this means that users can agree on common configs for different sites and, in time, begin to contribute common scraping patterns.

Generally, if you can scrape a page without executing JavaScript in a headless browser, this package should be able to scrape what you wish. However, it is important to note that if you are doing high-volume, production-level scraping, it is always better to write your own scraper code.
