@coya/web-scraper

    0.2.2 • Public • Published

    Web Scraper

    Web scraper on top of PhantomJS or Chromium.
    If you choose PhantomJS, the module works as a client/server pair: a PhantomJS web scraper server, and a client that acts as a driver and sends scraping HTTP requests to that server.
    Chromium works differently: it is driven directly from Node.js.

    Installation

    npm install @coya/web-scraper
    

    Build (for dev)

    git clone https://github.com/Cooya/WebScraper
    cd WebScraper
    npm install # also installs the development dependencies
    npm install phantomjs -g # only if you need PhantomJS: install it globally
    npm run build
    npm run example # runs the example script in the "examples" folder
    

    Usage examples

    The package lets you inject a JS function into the requested page:

    const { ChromiumScraper } = require('@coya/web-scraper');
     
    // if you want to use PhantomJS instead of Chromium
    // const { PhantomScraper } = require('@coya/web-scraper');
     
    const scraper = ChromiumScraper.getInstance();
     
    const getLinks = function() { // return all links from the requested page
        return $('a').map(function(i, elt) {
            return $(elt).attr('href');
        }).get();
    };
     
    scraper.request({
        url: 'cooya.fr',
        fct: getLinks // function injected in the page environment
    })
    .then(function(result) {
        console.log(result); // returned value of the injected function
        scraper.close(); // end the client/server connection and kill the web scraper subprocess
    }, function(error) {
        console.error(error);
        scraper.close();
    });
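The example above prints the raw href values collected in the page, which may be relative, duplicated, or non-HTTP (for instance mailto: links). The sketch below shows one way to clean such a list on the Node.js side. normalizeLinks is a hypothetical helper, not part of @coya/web-scraper, and the base URL is assumed to be the full address of the scraped page, scheme included.

```javascript
// Hypothetical helper (not part of @coya/web-scraper): resolve and dedupe
// the href values returned by an injected function such as getLinks.
function normalizeLinks(hrefs, baseUrl) {
    const seen = new Set();
    for (const href of hrefs) {
        // skip empty or non-string entries
        if (typeof href !== 'string' || href.trim() === '') continue;
        try {
            const url = new URL(href, baseUrl); // resolves relative paths against baseUrl
            if (url.protocol === 'http:' || url.protocol === 'https:') {
                url.hash = ''; // treat links that differ only by fragment as identical
                seen.add(url.href);
            }
        } catch (e) {
            // not a parseable URL: ignore it
        }
    }
    return Array.from(seen);
}
```

It could be called inside the then() callback, for example normalizeLinks(result, 'https://cooya.fr/'), assuming result is an array of strings as in the example above.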

    Or inject a JS function from an external script:

    const { ChromiumScraper } = require('@coya/web-scraper');
     
    // if you want to use PhantomJS instead of Chromium
    // const { PhantomScraper } = require('@coya/web-scraper');
     
    const scraper = ChromiumScraper.getInstance();
     
    scraper.request({
        url: 'cooya.fr',
        fct: __dirname + '/externalScript.js', // external script exporting the function to be injected
    })
    .then(function(result) {
        console.log(result); // returned value of the injected function
        scraper.close(); // end the client/server connection and kill the web scraper subprocess
    }, function(error) {
        console.error(error);
        scraper.close();
    });

    externalScript.js:

    module.exports = function() { // return all links from the requested page
        return $('a').map(function(i, elt) {
            return $(elt).attr('href');
        }).get();
    };

    Methods

    ScraperClient.getInstance()

    The ScraperClient object is a singleton: only one client can exist, so use this method to get the client instance.
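For readers unfamiliar with the pattern, here is a minimal, generic sketch of a singleton exposed through a static getInstance() method. ExampleClient is a hypothetical class used only for illustration; it is not the package's implementation and says nothing about how the real client is constructed.

```javascript
// Illustrative only: a generic singleton with a static getInstance().
// ExampleClient is a hypothetical name, not a class exported by the package.
class ExampleClient {
    static getInstance() {
        // create the instance on first call, then reuse it
        if (!ExampleClient.instance) {
            ExampleClient.instance = new ExampleClient();
        }
        return ExampleClient.instance;
    }
}

const first = ExampleClient.getInstance();
const second = ExampleClient.getInstance();
console.log(first === second); // true: both variables refer to the same object
```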

    request(params)

    Sends a request to the given URL and injects JavaScript into the loaded page. Returns a promise that resolves with the value returned by the injected function.

    Parameter  Type    Description                                 Default value
    params     object  see "Request parameters spec" below         none

    close()

    Terminates the PhantomJS web scraper process so that the current Node.js script can exit properly.
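Since a forgotten close() can keep the script from exiting, it may help to wrap a request so that close() always runs. The following is a minimal sketch of that idea. requestAndClose is a hypothetical helper, and it only assumes an object exposing request(params), which returns a promise, and close(), as described above.

```javascript
// Hypothetical helper: run one request and always call close() afterwards.
// It only assumes the object exposes request(params) -> Promise and close().
async function requestAndClose(scraper, params) {
    try {
        return await scraper.request(params);
    } finally {
        scraper.close(); // runs whether the request succeeded or failed
    }
}
```

With the package, this would presumably be called as requestAndClose(ChromiumScraper.getInstance(), { url: 'cooya.fr', fct: getLinks }), with errors handled by the caller through catch() or try/catch.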

    Request parameters spec

    Parameter  Type      Required  Description
    url        string    yes       target URL
    fct        function  yes       JS function to inject into the page
    fct        string    yes       path to the script and name of the function to inject, separated by a hash sign (e.g. "path/to/script/script.js#functionToCall")
    referer    string    optional  referer header set on each request
    args       object    optional  object passed to the injected function
    debug      boolean   optional  enable debug mode (verbose output)
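The table above can be turned into a small pre-flight check that runs before calling request(). The sketch below is only an illustration: validateRequestParams and splitScriptTarget are hypothetical helpers, not functions exported by the package, and the package may validate or parse its parameters differently.

```javascript
// Hypothetical validator mirroring the parameter table; not part of the package.
// Returns a list of error messages (empty when the params look valid).
function validateRequestParams(params) {
    if (params === null || typeof params !== 'object') {
        return ['params must be an object'];
    }
    const errors = [];
    if (typeof params.url !== 'string' || params.url === '') {
        errors.push('url is required and must be a string');
    }
    if (typeof params.fct !== 'function' && typeof params.fct !== 'string') {
        errors.push('fct is required and must be a function or a string');
    }
    if (params.referer !== undefined && typeof params.referer !== 'string') {
        errors.push('referer must be a string');
    }
    if (params.args !== undefined && (params.args === null || typeof params.args !== 'object')) {
        errors.push('args must be an object');
    }
    if (params.debug !== undefined && typeof params.debug !== 'boolean') {
        errors.push('debug must be a boolean');
    }
    return errors;
}

// Split the string form of fct ("script.js#functionToCall") into its two parts.
// functionName is null when the string contains no hash sign.
function splitScriptTarget(fct) {
    const index = fct.lastIndexOf('#');
    if (index === -1) return { path: fct, functionName: null };
    return { path: fct.slice(0, index), functionName: fct.slice(index + 1) };
}
```

Returning a list of messages rather than throwing on the first problem makes it easier to report every invalid field at once.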

    License

    ISC
