    docparse-parse-scraped-worker

    Docparse Scraper API server

    Parse scraped documents

    Startup

    To start the parseScraped worker, execute

    node parseScrapedWorker.js --config test/config.json

    This creates a new Parser object that can receive remote requests to start parsing documents. See app.js for how a Parser object is constructed, and the parseScrapedWorker.js file in the project root for details.

    var ParseScrapedWorker = require('./index')
    var config = require('nconf').defaults({
      seaport: {
        host: 'localhost',
        port: 4598
      }
    })
    var parser = new ParseScrapedWorker(config)

    Parsing

    Each ParseScrapedWorker object has a parseScraped function. Call it with a scraped document as the first parameter and a callback function as the second.

    var ParseScrapedWorker = require('./index')
    var inspect = require('eyespect').inspector()
    var config = require('nconf').defaults({
      seaport: {
        host: 'localhost',
        port: 4598
      }
    })
    var parser = new ParseScrapedWorker(config)
    var scrapedDoc = {
      supplierCode: 'HES',
      payload: {
        billDate: '2011-02-23 00:00:00 +00:00',
        accountNumber: 'fooAccountNumber',
        billNumber: 'fooBillNumber',
        loginID: 'fooLoginID',
        supplierCode: 'HES',
        textPages: ['foo page 1', 'bar page 2'] // these are the extracted text pages from the bill pdf file
      }
    }
     
    parser.parseScraped(scrapedDoc, function (err, reply) {
      if (err) {
        inspect(err, 'error parsing scraped document')
        return
      }
      inspect(reply, 'parsed scraped document correctly')
    })
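Before calling parseScraped, it can help to check that a document has the shape used in the example above. The sketch below is a hypothetical helper, not part of this package; which fields the remote parsing servers actually require is an assumption based only on that example.

```javascript
// Hypothetical helper (not part of the package): checks the fields that the
// example scraped document above contains. The exact set of required fields
// is an assumption.
function validateScrapedDoc(doc) {
  if (!doc || typeof doc !== 'object') {
    return new Error('scraped document must be an object')
  }
  if (typeof doc.supplierCode !== 'string' || doc.supplierCode.length === 0) {
    return new Error('scraped document is missing supplierCode')
  }
  var payload = doc.payload
  if (!payload || !Array.isArray(payload.textPages)) {
    return new Error('payload.textPages must be an array of page strings')
  }
  // null means no problem was found
  return null
}
```

A caller could run this check first and pass any returned error straight to its own callback instead of sending the document to the remote parser.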

    Sockets

    A scraped document is parsed differently for each supplier. The DocParse system therefore needs a parsing server online for each supplier. When the ParseScrapedWorker is instantiated, it binds an Axon request socket for each supported supplier. See lib/getRemoteParseSockets.js for details. It also registers a service with seaport so that the supplier response sockets know where to connect.

    The system currently supports HES (Hess), NST (NStar), NGE (NGrid Electric), and NGA (NGrid Gas). For HES, the ParseScrapedWorker would bind to a socket as follows

    var seaport = require('seaport')
    var axon = require('axon')
    var config = require('nconf').defaults({
      seaport: {
        host: 'localhost',
        port: 4598
      }
    })
    var seaConfig = config.get('seaport')
    var seaHost = seaConfig.host
    var seaPort = seaConfig.port
    var ports = seaport.connect(seaHost, seaPort)
    var role = 'pushScrapedParseJobHES' // all push roles are in the format "pushScrapedParseJob<supplier code>"
    var port = ports.register(role) // register with seaport so the remote HES parsing server knows where to connect
    var socket = axon.socket('req');
    socket.format('json')
    socket.bind(port)

    This binding happens for each supported supplier. The ParseScrapedWorker keeps a reference to each supplier-specific socket in its self.sockets property, which is an object keyed by supplierCode. When the ParseScrapedWorker.parseScraped method is called, it gets the appropriate supplier-specific socket and sends a request to the remote parsing server. Each supplier-specific socket is an Axon req socket.
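The per-supplier binding and the sockets object keyed by supplierCode can be sketched as follows. This is a minimal illustration, not the contents of lib/getRemoteParseSockets.js: buildSockets, getSocket, and the createSocket callback are hypothetical names, and createSocket stands in for the seaport registration and Axon bind shown above.

```javascript
// Supplier codes listed in this README.
var SUPPLIER_CODES = ['HES', 'NST', 'NGE', 'NGA']

// Role format taken from the comment in the binding example above:
// "pushScrapedParseJob<supplier code>".
function roleForSupplier(supplierCode) {
  return 'pushScrapedParseJob' + supplierCode
}

// Hypothetical helper: createSocket(role) is expected to register the role
// with seaport, bind an Axon req socket, and return that socket.
function buildSockets(supplierCodes, createSocket) {
  var sockets = {}
  supplierCodes.forEach(function (code) {
    sockets[code] = createSocket(roleForSupplier(code))
  })
  return sockets
}

// Hypothetical helper: returns the socket for a supplier code, or null when
// the supplier is not supported.
function getSocket(sockets, supplierCode) {
  var known = Object.prototype.hasOwnProperty.call(sockets, supplierCode)
  return known ? sockets[supplierCode] : null
}
```

Passing the socket factory in as a parameter keeps the sketch independent of seaport and Axon, so the keying logic can be exercised with a stub factory.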

    Parsing Scraped Documents

    Each instantiated ParseScrapedWorker object has a parseScraped function. This function is called with an unparsed scraped document and a callback. The parseScraped function gets the appropriate supplier-specific request socket and sends a new parsing request out to the remote supplier-specific parsing server.

    Parser.prototype.parseScraped = function (doc, cb) {
      var self = this
      var supplierCode = doc.supplierCode
      // self.sockets is keyed by supplier code, e.g. 'HES'
      var socket = self.sockets[supplierCode]
      // send the scraped document to the remote parsing server
      socket.send(doc, cb)
    }

    In the example above, the parseScraped function actually calls the lib/parseRemote function. The functionality is the same, but parseRemote adds some timeout logic in case the remote parsing server goes down or the request fails for some reason.
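The timeout idea can be sketched as below. This is not the actual lib/parseRemote code: sendWithTimeout and its timeoutMs parameter are hypothetical, and the only assumption about the socket is that it exposes send(message, callback) as in the examples above.

```javascript
// Hypothetical sketch of timeout handling around a request socket. The real
// lib/parseRemote may work differently.
function sendWithTimeout(socket, doc, timeoutMs, cb) {
  var finished = false

  // fail the request if no reply arrives within timeoutMs
  var timer = setTimeout(function () {
    if (finished) { return }
    finished = true
    cb(new Error('no reply from parsing server within ' + timeoutMs + ' ms'))
  }, timeoutMs)

  socket.send(doc, function () {
    if (finished) { return } // reply arrived after the timeout, or twice: ignore it
    finished = true
    clearTimeout(timer)
    // forward whatever arguments the remote server replied with
    cb.apply(null, arguments)
  })
}
```

The finished flag ensures the callback runs at most once, whichever of the reply or the timer comes first.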

    install

    npm i docparse-parse-scraped-worker

    version

    1.0.9

    license

    none
