nutch-web-api

1.0.0 • Public • Published

What is it

nutch-web-api is a RESTful API implementation for the Apache Nutch crawling application. The project is written entirely in Node.js and CoffeeScript, with the goal of simplifying usage and improving flexibility. The REST API is not a replacement for the Apache Nutch application; it simply provides a web interface for the Nutch commands.

Installation

Prerequisites

Apache Nutch Application

nutch-web-api requires Apache Nutch to be installed and running on the same server. For information about downloading and getting started with Apache Nutch, please refer to http://nutch.apache.org.

Node.js

Node.js is required to get the web application up and running. For information about installing Node.js on your platform, please visit http://nodejs.org/download/.

Downloading Source And Installing Dependencies

Initial Project Setup

Environment Variables

By default, the project expects the following environment variables to be set:

  • NUTCH_HOME
  • JAVA_HOME

These environment variables can be overridden in the conf/env-.json file. For examples, please refer to the configurations for the test and dev environments. Additionally, the standard NUTCH_OPT environment variable is picked up as additional options for running the Nutch application. This variable can also be overridden by specifying it in conf/env-.json. The other variables used by nutch-web-api are as follows:

  • NUTCH_WEB_API_SERVER_HOST
  • NUTCH_WEB_API_SERVER_PORT
  • NUTCH_WEB_API_SOLR_URL
  • NUTCH_WEB_API_SEED_DIR (Directory where the seed file is persisted.)
  • NUTCH_WEB_API_DATA_DIR (Directory where the embedded NeDB database stores its data.)
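The override behaviour described above can be sketched roughly as follows. This is only an illustration, not the project's actual loader: the function name resolveConfig is hypothetical, and it assumes the override file uses the same key names as the environment variables, which the README does not confirm.

```javascript
// Hypothetical sketch: resolveConfig is NOT part of nutch-web-api.
// Assumption: overrides use the same key names as the environment variables.
const REQUIRED = ['NUTCH_HOME', 'JAVA_HOME'];
const OPTIONAL = [
  'NUTCH_OPT',
  'NUTCH_WEB_API_SERVER_HOST',
  'NUTCH_WEB_API_SERVER_PORT',
  'NUTCH_WEB_API_SOLR_URL',
  'NUTCH_WEB_API_SEED_DIR',
  'NUTCH_WEB_API_DATA_DIR',
];

// Pick the known settings from `env`, letting values in `overrides` win.
function resolveConfig(env, overrides = {}) {
  const config = {};
  for (const name of [...REQUIRED, ...OPTIONAL]) {
    const value = overrides[name] !== undefined ? overrides[name] : env[name];
    if (value !== undefined) {
      config[name] = value;
    }
  }
  // Fail early if a required setting is absent from both sources.
  const missing = REQUIRED.filter((name) => !config[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required settings: ${missing.join(', ')}`);
  }
  return config;
}

// Example call (overrides would come from a parsed JSON file):
//   const config = resolveConfig(process.env, { NUTCH_HOME: '/opt/nutch' });
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the sketch easy to exercise without touching the real environment.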

Starting And Stopping The Server

Start nutch-web-api

Execute the npm command to start the web application:

npm start

Stop nutch-web-api

npm stop

Supported HTTP Operations

nutch-web-api supports a crawler job that runs all the Nutch jobs in one call, as well as individual Nutch jobs for clients who want to invoke each job separately. For details about each API operation, please refer to the Swagger document hosted on the server and port of the web application, e.g. http://localhost:4000/api-docs

Invoke Nutch Crawler Job

This API executes all the individual nutch jobs in the following order:

  • inject, generate, fetch, parse, updatedb, solr index, solr delete duplicates

Any failure encountered during the processing of these jobs will result in the failure of the crawl job.

  • HTTP Method: POST

  • REST Endpoint: http://localhost:4000/nutch/crawl

  • Sample Request Payload:

{
  "identifier" : "sampleCrawl",
  "limit" : 5,
  "seeds" : [ "http://mysite1.com", "http://mysite2.com" ]
}
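As a rough illustration, the request above could be sent from Node.js as sketched below. The helper names buildCrawlPayload and submitCrawl are hypothetical, and the Content-Type header is an assumption; the endpoint, method, and payload fields are the ones documented above.

```javascript
const http = require('http');

// Build the JSON body for POST /nutch/crawl from the documented fields.
function buildCrawlPayload(identifier, limit, seeds) {
  return JSON.stringify({ identifier, limit, seeds });
}

// Send the payload; resolves with the HTTP status code and the raw response body.
// Assumption: the server accepts application/json request bodies.
function submitCrawl(payload, host = 'localhost', port = 4000) {
  return new Promise((resolve, reject) => {
    const req = http.request(
      {
        host,
        port,
        path: '/nutch/crawl',
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Content-Length': Buffer.byteLength(payload),
        },
      },
      (res) => {
        let body = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => resolve({ statusCode: res.statusCode, body }));
      }
    );
    req.on('error', reject);
    req.end(payload);
  });
}

// Example call (requires a running server):
//   submitCrawl(buildCrawlPayload('sampleCrawl', 5, ['http://mysite1.com']))
//     .then((res) => console.log(res.statusCode, res.body));
```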

Invoke Nutch Injector Job

{
  "identifier" : "sampleCrawl"
}

Invoke Nutch Generator

{
  "identifier" : "sampleCrawl",
  "batchId" : "12134343"
}

Invoke Nutch Fetcher

{
  "identifier" : "sampleCrawl",
  "batchId" : "12134343"
}

Invoke Nutch Parser

{
  "identifier" : "sampleCrawl",
  "batchId" : "12134343"
}

Invoke Nutch UpdateDb

{
  "identifier" : "sampleCrawl"
}

Invoke Nutch SolrIndex

{
  "identifier" : "sampleCrawl"
}

Invoke Nutch Solr Delete Duplicates

Checking Nutch Job Status

By default, upon submitting a Nutch job request, an HTTP status code of 202 is returned, indicating that the server has received the request. A typical response looks like the following:

{
    "message": "injector job submitted successfully",
    "status": 202,
    "identifier": "testInjector"
}

The nutch job is executed asynchronously while the server continues to serve other requests. To check the status of a particular job, do one of the following:

  • Use the API to request the current job status. The URL to get the up-to-date status of a job is:
    http://localhost:4000/nutch/status?identifier=&jobName=
    A sample response from the request looks like the following:

{
    "identifier": "testInjector",
    "jobName": "INJECTOR",
    "status": "SUCCESS",
    "date": 1415761722588
}
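A minimal sketch of querying the status endpoint from Node.js might look like this. The helper names buildStatusUrl and fetchStatus are hypothetical, and the sketch assumes the server replies with a JSON body shaped like the sample response.

```javascript
const http = require('http');

// Build the status URL with the identifier and jobName query parameters filled in.
function buildStatusUrl(identifier, jobName, base = 'http://localhost:4000') {
  const url = new URL('/nutch/status', base);
  url.searchParams.set('identifier', identifier);
  url.searchParams.set('jobName', jobName);
  return url.toString();
}

// Fetch the status once; resolves with the parsed JSON response.
// Assumption: the response body is JSON, as in the sample response.
function fetchStatus(identifier, jobName) {
  return new Promise((resolve, reject) => {
    http
      .get(buildStatusUrl(identifier, jobName), (res) => {
        let body = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => {
          try {
            resolve(JSON.parse(body));
          } catch (err) {
            reject(err);
          }
        });
      })
      .on('error', reject);
  });
}

// Example call (requires a running server):
//   fetchStatus('testInjector', 'INJECTOR').then((s) => console.log(s.status));
```

Because the job runs asynchronously, a client would typically call fetchStatus repeatedly with a delay between attempts until the reported status stops changing.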

Job Name and Status Reference

The following table describes the list of valid nutch job names.

Job Name          Job Description
INJECTOR          Nutch Injector
GENERATOR         Nutch Generator
FETCHER           Nutch Fetcher
PARSER            Nutch Parser
DBUPDATE          Nutch DB Updater
SOLRINDEX         Nutch Solr Index
SOLRDELETEDUPS    Nutch Solr Delete Duplicates
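On the client side, the job names in the table can be kept in a small lookup so that a jobName is checked before it is sent to the status endpoint. This is a hypothetical helper, not something the package is known to export.

```javascript
// Job names copied from the table above, mapped to their descriptions.
const JOB_NAMES = Object.freeze({
  INJECTOR: 'Nutch Injector',
  GENERATOR: 'Nutch Generator',
  FETCHER: 'Nutch Fetcher',
  PARSER: 'Nutch Parser',
  DBUPDATE: 'Nutch DB Updater',
  SOLRINDEX: 'Nutch Solr Index',
  SOLRDELETEDUPS: 'Nutch Solr Delete Duplicates',
});

// Return true only for names listed in the table (case-sensitive).
function isValidJobName(name) {
  return Object.prototype.hasOwnProperty.call(JOB_NAMES, name);
}
```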
