simple-webscraper

Web Scraper

  • CSS selectors
  • exporting function
  • pre-configured to insert results into SQLite database and generate CSV
  • stop conditions:
    • time
    • number of results
    • number of websites
  • filter function to check for results
  • post- and pre-processing functions
  • init with options or set them later with spider.setVal1(v).setVal2(v2)
  • builder (call chaining) design pattern
  • extensible

API

Docs in gh-pages.

const startURL = "https://stackoverflow.com/questions/...";
const crawler = new Spider(startURL);
crawler.setRespSecW8(20)
       .appendSelector('p.info')
       .appendSelector('p.more-info')
       .appendFollowSelector('.btn.next')
       .appendFollowSelector('.btn.next-page')
       .setPostProcessTextFunct(text => text.replace('mother', 'yes'))
       .setFilterFunct(txt => !!txt.match('sunflower'))
       .setTimeLimit(120) // sec
       .setThreadCount(8) // #workers
       .setSiteCount(100) // distinct URLs
       // run() returns void; provide an export function to capture each result (see below)
       // by default results go to the sqlite ./db file and are printed to the console
       .run(); 

Or pass an init options object to the constructor:

// DEFAULT init options
const spiderOpts = {
  // Function<String, String, String, Promise>
  exportFunct: exports.combine(exports.console(), exports.sqlite()),
  // predicate i.e. Function<String, Boolean>
  filterFunct: (txt) => true, 
  // Array<String>
  followSelectors: [], 
  // String
  logInfoFile: undefined, // logging goes to console
  // Integer
  redirFollowCount: 3,
  // Integer
  respSecW8: 10,
  // Array<String>
  selectors: [], 
  // Integer
  resultCount: 100,
  // Integer
  siteCount: 10, // #sites
  // Integer
  threadCount: 4,
  // Integer
  timeLimit: 60, // sec
};
 
const startURL = "https://stackoverflow.com/questions/...";
const crawler = new Spider(startURL, spiderOpts);
crawler.run();
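
You can also pass only the options you want to override; presumably the remaining options keep the defaults listed above (a sketch, assuming that merge behaviour):

const partialCrawler = new Spider(startURL, {
  selectors: ['p.info'],
  timeLimit: 30, // sec
});
partialCrawler.run();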

See export functions below to save results.

Export Function

Must be of type (uri: string, selector: string, text: string) => Promise<*>. There are a few configurable export functions that you can use:

Import the exporting module:

const { exporting, Spider }  = require('simple-webscraper');

Declare a spider:

const spider = new Spider(uri, { /* opts */ });

  • sqlite

    Generates a Result table with columns id INT, text TEXT, selector TEXT, uri TEXT (see the read-back sketch after this list).

    spider.setExportFunct(exporting.sqlite()) // auto-generated output db name
          .run();
    spider.setExportFunct(exporting.sqlite('my-database.sqlite'))
          .run();
  • console

    spider.setExportFunct(exporting.console()) // default formatter
          .run();
    spider.setExportFunct(exporting.console('%s :: %s => %s')) // string formatter for (uri, selector, text)
          .run();
    spider.setExportFunct(exporting.console((uri, selector, text) => `${uri} :: ${text.slice(0, 100)}`))
          .run();
  • file

    spider.setExportFunct(exporting.file()) // default file name, default formatter
          .run();
    spider.setExportFunct(exporting.file('results.csv')) // custom file name, default csv formatter
          .run();
    spider.setExportFunct(exporting.file('results.log', 'INFO %s, %s, %s')) // custom file name, string formatter
          .run();
    spider.setExportFunct(exporting.file('results.log', (uri, selector, text) => `${uri} :: ${text.slice(0, 100)}`))
          .run();
  • combine (used to broadcast results to many exports)

    spider.setExportFunct(exporting.combine(
        exporting.sqlite(), 
        exporting.console(), 
        exporting.file(),
      )).run();
  • db

    spider.setExportFunct(exporting.db(dbURI)) // dbURI is a Sequelize connection URI, see the Sequelize docs
          .run();
  • default (enabled by default; sends results to the console, a CSV file and an SQLite database)
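
To read scraped rows back out of the SQLite database, here is a minimal sketch. It assumes the better-sqlite3 package (not bundled with simple-webscraper) and the default ./db file produced by exporting.sqlite(), with the Result table described above.

const Database = require('better-sqlite3'); // assumed helper package, installed separately

// open the database produced by exporting.sqlite() in read-only mode
const db = new Database('./db', { readonly: true });

// print every scraped row: uri, selector and a preview of the text
for (const { uri, selector, text } of db.prepare('SELECT uri, selector, text FROM Result').all()) {
  console.log(`${uri} [${selector}]: ${text.slice(0, 80)}`);
}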

It's easy to define your own export function. For example, suppose you want to POST each result to a third-party API:

// axios is used here as an example HTTP client; myURI is a placeholder for your endpoint
const axios = require('axios');

const myExportFunction = async (uri, selector, text) => {
  await axios.post(myURI, { uri, selector, text });
};
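
A custom export function is registered like any built-in one:

spider.setExportFunct(myExportFunction)
      .run();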

Example

More examples in ./examples.

const { Spider, exporting } = require('simple-webscraper');
 
(async function() {
  const s = new Spider('https://www.jobsite.co.uk/jobs/javascript');
 
  const sqliteExport = await exporting.sqlite('./db', true /* force wipe if exists */);
 
  s.setExportFunct(sqliteExport)
   .appendSelector(".job > .row > .col-sm-12")
    // skip jobs in London and keep only graduate roles ('raduate' matches Graduate/graduate)
   .setFilterFunct(txt => !!txt.match('raduate') && !txt.match('London'))
    // next page 
   .appendFollowSelector(".results-footer-links-container ul.pagination li a[href*='page=']") 
    // stop after 3 websites (urls)
   .setSiteCount(3)
    // run for 30 sec
   .setTimeLimit(30)
   .run();
})();

Install

npm i simple-webscraper

License

MIT
