mreaper

Reaper – a Node.js crawler/extractor framework

Introduction

A crawler extractor crawls through web pages and saves the data within them in some format such as JSON or CSV.

A crawler extractor follows this sequence of steps:

  • Fetch a URL
  • Select some items (such as menu items) via XPath selectors
  • Select submenus
  • Click on submenus
  • Load the resultant page
  • Page through search results
    • In each page:
      • Select elements
      • Click on them
      • Go to the resulting page
      • Extract elements
      • Clean the extracted elements
  • Save the results in JSON or CSV
  • Before saving, apply rules to remove/add/alter any field in each record

How to use

Create an instructions file, say reaper.json, and reference it from config.js (via the reaperFile option). The Reaper can then be instantiated and run as follows:

let {Reaper} = require("mreaper");
let config = require("./config").config;

new Reaper(config).run().then(() => {
  console.log("finished");
  process.exit(0);
});
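
Here config.js exports the configuration object described under "Config file options" below. A minimal sketch, assuming unspecified options fall back to defaults (which is not documented), might be:

// config.js – minimal sketch; see "Config file options" for the full list
exports.config = {
  "reaperFile": "reaper.json",
  "resultsFile": "./results.json"
};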

Example instruction JSONs for some use cases are given in the examples folder.

Design

The framework consists of one engine, multiple utility classes, and a hierarchical JSON configuration to wire everything up.

In addition, a cleansers directory can hold a set of cleanser functions to clean the data of specific fields. Only one cleanser per field is supported at present.
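
As an illustration, a cleanser might look like the sketch below. The exact signature is not documented here, so treating a cleanser as a function that receives the extracted value and returns the cleaned value is an assumption:

// cleansers/cleanLevel.js – minimal sketch; the receive-value/return-value
// signature is an assumption based on the description above
module.exports = function cleanLevel(value) {
  // trim surrounding whitespace and collapse internal whitespace runs
  return String(value).trim().replace(/\s+/g, " ");
};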

A recordhandler directory can hold a set of recordhandler functions that apply simple rules to each row of data that is created. A recordhandler can add or remove fields, or perform transformations on a field.
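
A minimal sketch, assuming each record handler file exports a function that receives one flattened record (a plain object) and returns the possibly modified record:

// recordHandlers/addSource.js – hypothetical record handler; the
// record-in/record-out signature is an assumption
module.exports = function addSource(record) {
  // add a field to every row (hypothetical field name)
  record.source = "schoolbooks.ie";
  // remove a field that should not appear in the CSV (hypothetical name)
  delete record.internalId;
  return record;
};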

Instructions and results concepts

Instructions are objects which specify actions to execute. Results are stored in a Results object, which is hierarchical, in the form of a tree. Results can also be stored as a CSV file if the corresponding configuration option is specified.

The instructions file

The instructions file is designed to mimic the actions that a human user would perform if (theoretically) she were to navigate the site manually.

The steps that a user would follow are those listed in the introduction. This structure is hierarchical: first go to a site, then perform a sequence of clicks/hovers to reach a particular page, then extract items from that page. This process is repeated many times.

Some common patterns

  • Multiple links of the same type (such as menu items, or lists of items) are followed recursively
  • Multiple pages of search results with a paginator are followed

" Instructions" is a big json object, consisting of an array of instructions. Each instruction consists of some action such as select, or follow link or extract. For performing an action, an action object needs some parameters.

Each instruction node has:

  • A selector to decide which elements need to be selected for that action
  • An action function to perform the action(s)
  • A config JSON object to provide parameters to the action
  • Optionally, an "andThen" element consisting of one or more instructions to be executed within the context of the instruction. This is useful when inner links need to be followed recursively and the context must be preserved until the last instruction of the series has finished.
  • For extractor actions, optionally a cleanser for cleansing the data obtained after the action completes.

A skeletal instruction combining these fields is sketched after this list.
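
Putting these together, a skeletal instruction node has the following shape (the field names come from the example below; the placeholder values are illustrative only). For Extractor actions, a "cleanser" field can appear at the same level:

{
  "action": "Selector",
  "name": "someName",
  "addResultNode": true,
  "params": { "selector": "//some/xpath" },
  "andThen": []
}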

If an action creates objects, such as pages, that need to be destroyed after the action is complete, it can store them in the context. All disposable items are cleared by the engine after the action completes.

Each context lives within the scope of the instruction. If an instruction has "andThen" elements inside it, the scope of the instruction is unchanged, and the context is destroyed only after all instructions within the "andThen" elements have completed.

Illustrative flow

Consider the following extraction flow

{ "instructions" : [

  {
     "action" : "Fetcher" ,
     "name" : "FirstFetcher" ,
     "addResultNode":true,
     "params" : {
       "url" : **"http://www.schoolbooks.ie"
    ** }
  },
  {
     **"action" : **"Selector" ,
     **"name" : **"selectMenuItems" ,
     **"params" : {
       **"selector" : **"/html/body/div[6]/div[3]/div/div[1]/ul/li"
    ** },
     **"andThen" :[
      {
         **"action" : **"Extractor" ,
         **"name" : **"extractLevel" ,
         **"cleanser" : **"cleanLevel" ,
         **"addResultNode" : **true** ,
         **"params" : {
           **"fieldName" : **"schoolLevel" ,
           **"selector" : **"a" ,
           **"relative" : **true
        ** }

      }
    ]

  }
]

}

Note that addResultNode should always be set to true for the Fetcher instruction.

The crawler starts and executes the first instruction, a Fetcher, which fetches the URL http://www.schoolbooks.ie. When a page is fetched it is stored in the context, and it is closed once the scope of the instruction ends.

After the fetch completes, the crawler executes a Selector, which selects all the menu items of the page. For each of those items, the "andThen" option is executed. Note that whenever an inner instruction has to be executed for each element in the result set of another instruction, the "andThen" element must be specified.

In this case, the Extractor is executed for each selected element. The Extractor takes the content of the "a" element and stores it in the result. Because "addResultNode" is set to true, each element is stored in a separate node.

Adding nodes to Results tree

Traversing the tree is orthogonal to adding items to the result tree. Whenever an instruction needs to add a result to the result tree, this must be specified in the instruction itself.

Instruction options

Selector – an XPath expression addressing an element. Absolute or relative paths can be specified; for example, the illustrative flow above uses an absolute path for the menu items and the relative path "a" (with "relative": true) inside the "andThen" block.

Action – the name of the action. Currently only predefined actions can be used; later versions will have options to inject actions. Currently available actions are Selector, Extractor, ImageExtractor, and LinkOpener, along with Fetcher and PageIterator described under Actions below.

addResultNode – add a node to the results tree

popAfter – after this instruction is over, the engine makes the parent of the current element in the results tree the current node
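
For instance, a hypothetical extractor that fills the final field of a record might set popAfter so the engine returns to the parent node afterwards (the name, field, and selector values are placeholders):

{
  "action": "Extractor",
  "name": "extractTitle",
  "popAfter": true,
  "params": { "fieldName": "title", "selector": "h1", "relative": true }
}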

Config file options

A sample config file is given below:

exports.config = {
  "headless": "true",
  "testMode": "true",
  "reaperFile": "test.json",
  "resultsFile": "./results.json",
  "numberOfRecordsToSaveAfter": 25,
  "logFileName": "reaper.log",
  "logLevel": "info",
  "saveAsCSV": true,
  "csvFileName": "./results.csv",
  "recordHandlerDirName": "recordHandlers",
  "cleansersDirName": "cleansers"
}

headless – whether the Chrome UI is displayed or not

testMode – when set to true, only one element is returned by a selector or page iterator

reaperFile – the instructions file

resultsFile – the file in which results are stored, in JSON format

numberOfRecordsToSaveAfter – results are saved periodically after this many records

logFileName – name of the log file

logLevel – see the Winston logging package for the available log levels

saveAsCSV – whether a CSV file is required

csvFileName – name of the CSV file to write

cleansersDirName – where the cleanser functions are located. Each cleanser should be in a separate file; see the cleansers folder provided with the framework for reference.

recordHandlerDirName – where user-supplied recordhandler files are located. A test recordhandler and a test cleanser are supplied with the package.

Selector Format

Selectors use the XPath format. In Google Chrome, the XPath selector of any element can be obtained by opening a web page, right-clicking on an element, and choosing "Inspect element". In the inspector, right-click on the element and choose Copy > Copy XPath.

Engine

The Engine traverses the instructions file and executes the action within each instruction in sequence. Some instructions may process multiple DOM nodes.

While processing one DOM node, the action may need to follow a link and reach a page which in turn may contain other DOM nodes.

Result

The Result object is a place for actions to store the results of the search. This is an internal object, and the user of the library need not be aware of it.

The Result object is a hierarchical tree of nodes. When a node needs to be added to the result, the corresponding instruction should have an "addResultNode": true attribute.

The instruction at which a node becomes fully populated should contain a "popAfter" attribute so that the crawler changes context to the parent node. When the crawler descends again to process the next element in the list, it will add the next child node.

State

The State object is an internal object; it contains the information required for Actions to perform their work.

State is structured so that each action can precisely locate the position to store the results of its processing. As traversal proceeds, items are stacked onto the State; as actions complete, the stack is popped.

Asynchrony

Currently, the crawler processes items sequentially. Subsequent versions may implement asynchrony.

Actions

Extractor – given a field name and an HTML element node, it extracts the content from the element and stores it under that field name.

Fetcher – given a URL, it fetches the page at that URL.

Selector – it iterates over a set of items and performs actions on them.

PageIterator – used where there are multiple pages of search results. Given the selector which identifies the "next" element of the paginator, it takes a page of results, applies an action to it (probably an ElementIterator), then clicks the "next" item, performing this action repeatedly.
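
A hypothetical PageIterator instruction might look like the sketch below. Modelling the "next" element via a "selector" parameter follows the pattern of the other actions, but the exact parameter names here are assumptions, as are all the placeholder values:

{
  "action": "PageIterator",
  "name": "iterateResultPages",
  "params": { "selector": "//a[@class='next']" },
  "andThen": [
    {
      "action": "Selector",
      "name": "selectResultRows",
      "params": { "selector": "//div[@class='result']" }
    }
  ]
}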

Actions, Instructions, cleansers, recordHandlers

Actions are classes provided by the framework. Instructions are JSON objects which consist of a selector (the XPath expression which specifies the target element), the action to take (the actual Action), and action-specific parameters.

Cleansers are user-defined functions which are triggered after every extractor action. They can be specified in the instruction. They are passed the results of extraction and can change the results.

Recordhandlers are executed if "saveAsCSV" is specified in the configuration file. The tree is flattened, and the record handlers are executed, in no particular order, on all rows of the resulting array.

Actions are part of the framework, and there is no facility to plug in actions at this point. Cleansers and recordHandlers are defined by the user of the library, with their paths supplied in the config file; they are injected into actions.

Target of actions

Action objects perform actions. The target of the actions may vary.

For Fetcher, the target is the URL taken from the parameters object.

For Selector, the object on which the action happens is the "currentElement" – the element passed in by the parent instruction.

For Extractor, the selector parameter is applied to the current object and the innerHTML of the resulting object is collected.

Install

npm i mreaper
