reach-ad-analyser-batch-processor


REACH Batch Article URL Analyser


Local Development instructions:

The application uses the dotenv framework to inject environment variables, which are supplied via .env files in the root folder. The project currently contains the following env files:

| file | description |
| --- | --- |
| .env | local development |
| .env.demo | demo environment (currently not used) |
| .env.dev.Ad.Safety.analyser | ad safety analyser environment (currently not used) |
| .env.dev.appnexus | app nexus dev environment (currently not used) |
| .env.prod.appnexus | app nexus production environment (currently not used) |
| .env.dev.BERTHA | BERTHA environment |
| .env.dev.stable | stable environment |
| .env.prod.reach | production environment |
| .env.sc | (currently not used) |
| .env.telegraph | (currently not used) |
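
For reference, a minimal sketch of how a NODE_ENV-specific file can be loaded with dotenv (the suffixing convention is inferred from the file names above and the launch examples below, not necessarily this repo's exact loader):

    // Load .env.<NODE_ENV> if NODE_ENV is set (e.g. NODE_ENV=dev.BERTHA -> .env.dev.BERTHA),
    // otherwise fall back to the plain .env used for local development.
    const path = process.env.NODE_ENV ? `.env.${process.env.NODE_ENV}` : '.env';
    require('dotenv').config({ path });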

Environment service configuration

Within each environment the following services must be configured:

  • NLU: see the credential keys listed under "How to create a new environment file" below.

App Launch instructions:

    node app.js
    # or
    npm start

To launch with a specific environment, such as prod or BERTHA:

    NODE_ENV=prod.appnexus node app.js
    NODE_ENV=dev.BERTHA node app.js

This starts a server at http://localhost:6003 (see the server log for the actual port number).

How to create a new environment file:

To create a new environment file, the following credentials are required. Obtain them from the respective Watson service in the IBM Cloud console:

  • NLU:
    natural_language_understanding_apikey = << key from credentials in ibm console >>
    natural_language_understanding_url = https://gateway-lon.watsonplatform.net/natural-language-understanding/api
    natural_language_understanding_version = 2019-02-01
    Change the URL if your NLU instance is in a region other than London.
  • VR:
    visual_recognition_apikey = << key from credentials in ibm console >>  
    visual_recognition_url = https://gateway.watsonplatform.net/visual-recognition/api  
    visual_recognition_version = 2019-02-01  
    Change the URL if your VR instance is in a region other than London.
  • WDS:
    discovery_url = https://gateway-lon.watsonplatform.net/discovery/api  
    discovery_apikey = << key from credentials in ibm console >>  
    discovery_version = 2019-02-01  
    discovery_collectionid =  
    discovery_environmentid =
  • Cloud storage bucket:
    Create a set of credentials for IBM cloud storage, with HMAC credentials enabled, and convert the values to base64 strings (a sketch of the conversion follows this list):
    cloud_storage_enpoint_url = https://s3.eu-gb.cloud-object-storage.appdomain.cloud  
    cloud_storage_apikey = << key from credentials in ibm console >>
    cloud_storage_resource_instance_id = << key from credentials in ibm console >>  
    cloud_storage_access_key_id = << key from credentials in ibm console >>  
    cloud_storage_secret_access_key = << key from credentials in ibm console >>  
    cloud_storage_bucket = << input storage bucket >>  
    cloud_storage_reports = << output storage bucket >>
  • DB: Get the DB credentials from the IBM console; connections are configured as follows:
    postgreSQL_connectionString = postgres://user:password@0af45143-13f5-40ee-a847-2aea727b42fd.bmo1leol0d54tib7un7g.databases.appdomain.cloud:port/db?sslmode=verify-full
    postgreSQL_certificate_base64 = << pem ssl certificate string >>
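
The base64 conversion mentioned above can be done with Node's built-in Buffer; a minimal sketch (the file name is hypothetical):

    // Read a raw credential value (e.g. the PEM certificate for
    // postgreSQL_certificate_base64) and print it as a base64 string.
    const fs = require('fs');

    const raw = fs.readFileSync('ca-certificate.pem', 'utf8'); // hypothetical file name
    console.log(Buffer.from(raw, 'utf8').toString('base64')); // paste the output into the env file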

Other environment variables

| variable | description |
| --- | --- |
| write_to_db | enables writing NLU findings into the DB to be cached; the default value is set in the env file |
| read_from_db_cache | enables reading cached NLU findings from the DB; the default value is set in the env file |
| write_to_log | creates a CSV log file with a line for each analysed article, containing the article's rating |
| write_rules_to_log | also logs the rating of each rule; applies only if write_to_log = true |
| write_to_cache | stores result JSONs as files on the server, used as a cache for future requests |
| analyze_images | enables/disables analysing images identified in HTML articles |
| recalculation_rate | recalculation rate for new rulesets |
| sleep_interval | time interval in seconds that the processor waits between lookups for new input files; default values are in the env files |
| selected_process_mode | selects the batch file processing mode configured in config/default.json, which defines the input/output format and any filtering required; defaults to default, and more modes are available |
| max_small_file_size | file size threshold below which the whole file is processed in one go instead of as a stream; defaults to 20kb if not present |
| articles_parallel | number of articles to process in parallel; defaults to 30 if not present |
| NODE_ENV | used to change the psql config to use the test DB for e2e tests |
| LOCAL_DEV | used to set the DB host to localhost; useful for local development |
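
For illustration, an excerpt of how these variables might appear in an env file (the values below are made-up examples, not the shipped defaults):

    # Illustrative values only; the real defaults live in the env files.
    write_to_db=true
    read_from_db_cache=true
    write_to_log=false
    analyze_images=true
    sleep_interval=60
    articles_parallel=30
    max_small_file_size=20480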

Processing modes:

Processing Mode config:

All existing processing modes are currently defined in config/default.json under processMode.

The processing currently supports the following flags:

| flag | description |
| --- | --- |
| name | name of the process mode |
| inputFormat | expected file format for the input |
| outputFormat | expected file format for the output file |
| saveArticles | saves articles as a list in the format selected above |
| saveReport | saves a report for the processed file, including the processed file, how many articles failed, the total number processed, and the status; the report is currently always saved as JSON |
| outputArticleErrors | if the output format is JSON, outputs the error for each article that fails, along with the input used, allowing easier debugging |
| removeUnmatchedFilteredArticle | if true and the matchers in articleFilterOutput return undefined (without failing), the article is removed from the output; defaults to false |
| inputTransformation | uses the Jexl framework to apply transforms to the input |
| articleFilterOutput | uses the Jexl framework; see below for the format |
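
A hypothetical processMode entry, to show the shape of these flags (the values are made up; see config/default.json for the real modes):

    {
      "processMode": [
        {
          "name": "default",
          "inputFormat": "csv",
          "outputFormat": "json",
          "saveArticles": true,
          "saveReport": true,
          "outputArticleErrors": true,
          "removeUnmatchedFilteredArticle": false,
          "articleFilterOutput": []
        }
      ]
    }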

Article Filter output format:

This part of the config allows extracting parts of the article response and filtering it, and also applying transforms to the data in those fields. It works by defining an array of key elements that should appear in the output: each entry holds either a key/value pair to output a static value, or a matcher to select only parts of the original object.

| key | description |
| --- | --- |
| key | destination JSON key for the object (if the format is CSV, this value is omitted) |
| value | used to output a static value associated with the key |
| matcher | Jexl matcher used to filter objects |
| transforms | Jexl allows transforms to be attached to a matcher; a static list of transformer functions is defined in jexlTransforms.js and loaded into Jexl. To apply them in order, list the function names in this array |
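
A hypothetical articleFilterOutput array (the keys, matcher expressions, and values here are illustrative):

    "articleFilterOutput": [
      { "key": "url", "matcher": "article.url", "transforms": ["urlExtraneousRemoval"] },
      { "key": "source", "value": "batch-processor" }
    ]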

Please see default.json for transform and matcher examples, and the Jexl page for details of the expression syntax.

If we need to add more transformer functions to the config, we add a new function to jexlTransforms:

    // URLStringBuilder is a helper from this repo; require it at the top of jexlTransforms.js.
    const urlExtraneousRemoval = () => {
      return {
        name: 'urlExtraneousRemoval', // the name used to reference this transform in config
        method: urlString => URLStringBuilder.buildRemovingExtraneous(urlString)
      };
    };

The name returned is the name of the transform function, and the method is the function added to Jexl's list of transforms. Then add the new transformer as an export:

    module.exports = {
      urlExtraneousRemoval
    };

The function is then available for use in the transforms arrays of inputTransformation and articleFilterOutput.

Code structure:

The processor

The batch processor service is an express.js server app with no routing; its single goal is to poll IBM cloud storage for new files to process. It works by starting the services in ./app.js and calling processor.init(). The service identifies objects in IBM cloud storage by their name and etag, to detect changes (in case the same file is uploaded multiple times).

  • server/processor/processor.js: responsible for waiting for and polling new files to be analysed from IBM cloud storage.
  • server/processor/processorOrchestrator.js: responsible for reading a file with multiple URLs from IBM cloud storage, preparing a batch, and saving it, using either a stream or an object put.
  • server/processor/controllers/report/: contains controllers to create report or article streams.
  • server/processor/controllers/jsonFilter/: applies transforms and/or filtering to article objects as per the processing mode configuration.
  • server/processor/controllers/objectOutputBuilder.js: used to convert objects to their desired output format.
  • server/processor/controllers/storageCache.js: used to build a local cache of processed cloud storage objects, as a means to complement the db article_process table.

Once a batch ends up in the articleQueue file, each article in it is processed individually through the brand-safety-tools orchestrator file.
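
A simplified sketch of this polling-by-name-and-etag idea (the client setup, bucket handling, and processFile callback are illustrative assumptions, not the repo's actual code):

    // Illustrative only: list bucket objects, skip name/etag pairs already seen,
    // and hand new or re-uploaded files to a processing callback.
    const IBM = require('ibm-cos-sdk');

    const cos = new IBM.S3({ endpoint: process.env.cloud_storage_enpoint_url }); // credentials omitted for brevity
    const seen = new Map(); // object name -> etag of already-processed files

    async function pollOnce(bucket, processFile) {
      const { Contents = [] } = await cos.listObjects({ Bucket: bucket }).promise();
      for (const obj of Contents) {
        if (seen.get(obj.Key) !== obj.ETag) { // new file, or same name re-uploaded (etag changed)
          await processFile(obj.Key);
          seen.set(obj.Key, obj.ETag);
        }
      }
    }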

Unit tests

Unit tests are written using the Jest framework and can be run with npm test in the terminal. An HTML coverage report is also available.

Folder structure goes as follows:

| path | description |
| --- | --- |
| test | root test folder |
| test/data | sample files to run the batch processor locally, or to be used in the unit tests; these are purely for development/demo purposes |
| test/e2e | end-to-end tests; paths below this attempt to mirror the path of the tested file in the server folder |
| test/helpers | helper files used to set up the tests |
| test/mocks | reusable Jest mock files |
| test/unit | unit tests |

Tests structure:

A unit test should attempt to test one condition of the class/module it is testing. Test names should follow:

test('<method name>() <condition to test>, <expected return value>')

An example:

    test('filter() with config, articleFilterOutput and removeUnmatchedFilteredArticle set to false, throws an error', () => {})

describe() blocks should aggregate tests in the following order of preference (see the sketch after this list):

  • aggregate tests within a test file.
  • aggregate a complex/finicky logical scenario.
  • aggregate tests around a method.
  • a test file should never test, or contain describes that test, more than one file.
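
A sketch of this structure (the module and method names are hypothetical):

    describe('jsonFilter', () => {   // a test file tests one file only
      describe('filter()', () => {   // aggregate tests around a method
        test('filter() with an empty config, returns the article unchanged', () => {
          // ...
        });
      });
    });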


Install

    npm i reach-ad-analyser-batch-processor


License

ISC
