    nutch-web-api

    1.0.0 • Public • Published

    What is it

    nutch-web-api is a RESTful API for the Apache Nutch crawler. The project is written entirely in Node.js and CoffeeScript, with the goal of making Nutch simpler to use and more flexible to integrate. The REST API is not a replacement for Apache Nutch; it simply provides a web interface to the Nutch commands.

    Installation

    Prerequisites

    Apache Nutch Application

    nutch-web-api requires Apache Nutch to be installed and running on the same server. For more information about downloading and getting started with Apache Nutch, please refer to http://nutch.apache.org.

    Node.js

    Node.js is required to run the web application. For information about installing Node.js on your platform, please visit http://nodejs.org/download/.

    Downloading the Source and Installing Dependencies

    Initial Project Setup

    Environment Variables

    By default, the project expects the following environment variables to be set:

    • NUTCH_HOME
    • JAVA_HOME

    These environment variables can be overridden in the conf/env-.json file; for examples, refer to the configuration for the test and dev environments. Additionally, the standard NUTCH_OPT environment variable is picked up as additional options for running Nutch, and it can also be overridden in conf/env-.json. Other variables used by nutch-web-api are as follows:

    • NUTCH_WEB_API_SERVER_HOST
    • NUTCH_WEB_API_SERVER_PORT
    • NUTCH_WEB_API_SOLR_URL
    • NUTCH_WEB_API_SEED_DIR (directory where the seed file is persisted)
    • NUTCH_WEB_API_DATA_DIR (directory where the embedded NeDB database stores its data)
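
    As a rough illustration of how these variables fit together, the following sketch collects them into one object and fails early when a required one is missing. The function name resolveConfig and the fallback values (localhost, port 4000) are assumptions made for this example; they are not taken from the project's source.

```javascript
// Hypothetical pre-flight check: gathers the variables listed above.
// The fallback host and port are illustrative assumptions, not project defaults.
function resolveConfig(env) {
  // NUTCH_HOME and JAVA_HOME are the two variables the project expects by default.
  const missing = ['NUTCH_HOME', 'JAVA_HOME'].filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error('Missing required environment variables: ' + missing.join(', '));
  }
  return {
    nutchHome: env.NUTCH_HOME,
    javaHome: env.JAVA_HOME,
    nutchOpt: env.NUTCH_OPT || '',
    host: env.NUTCH_WEB_API_SERVER_HOST || 'localhost',
    port: Number(env.NUTCH_WEB_API_SERVER_PORT || 4000),
    solrUrl: env.NUTCH_WEB_API_SOLR_URL,
    seedDir: env.NUTCH_WEB_API_SEED_DIR,
    dataDir: env.NUTCH_WEB_API_DATA_DIR
  };
}
```

    Calling resolveConfig(process.env) before npm start is one way to surface a missing NUTCH_HOME or JAVA_HOME up front.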

    Starting And Stopping The Server

    Start nutch-web-api

    Execute the npm command to start the web application:

    npm start

    Stop nutch-web-api

    npm stop

    Supported HTTP Operations

    nutch-web-api supports a crawler job that runs all the Nutch jobs in one call, as well as individual Nutch jobs for clients who want to invoke each job separately. For details about each API operation, refer to the Swagger document hosted on the server and port of the web application, e.g. http://localhost:4000/api-docs

    Invoke Nutch Crawler Job

    This API executes all the individual nutch jobs in the following order:

    • inject, generate, fetch, parse, updatedb, solr index, solr delete duplicates

    A failure in any of these jobs causes the whole crawl job to fail.

    • HTTP Method: POST

    • REST Endpoint: http://localhost:4000/nutch/crawl

    • Sample Request Payload:

    {
      "identifier" : "sampleCrawl",
      "limit" : 5,
      "seeds" : [ "http://mysite1.com", "http://mysite2.com" ]
    }
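
    The request above could be sent from a Node.js client along these lines. This is a minimal sketch: the helper names buildCrawlRequest and submitCrawl are hypothetical, the Content-Type header is an assumption about what the server expects, and the global fetch function requires Node.js 18 or later.

```javascript
// Assumed base URL, taken from the sample endpoint in this README.
const BASE_URL = 'http://localhost:4000';

// Builds the URL and fetch options for a crawl request (hypothetical helper).
function buildCrawlRequest(identifier, limit, seeds) {
  if (!Array.isArray(seeds) || seeds.length === 0) {
    throw new Error('seeds must be a non-empty array of URLs');
  }
  return {
    url: BASE_URL + '/nutch/crawl',
    options: {
      method: 'POST',
      // Assumption: the server accepts a JSON request body.
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ identifier, limit, seeds })
    }
  };
}

// Sends the request and returns the HTTP status with the parsed response body.
async function submitCrawl(identifier, limit, seeds) {
  const { url, options } = buildCrawlRequest(identifier, limit, seeds);
  const response = await fetch(url, options); // global fetch: Node.js 18+
  return { httpStatus: response.status, body: await response.json() };
}
```

    For example, submitCrawl('sampleCrawl', 5, ['http://mysite1.com', 'http://mysite2.com']) sends the sample payload shown above.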

    Invoke Nutch Injector Job

    {
      "identifier" : "sampleCrawl"
    }

    Invoke Nutch Generator

    {
      "identifier" : "sampleCrawl",
      "batchId" : "12134343"
    }

    Invoke Nutch Fetcher

    {
      "identifier" : "sampleCrawl",
      "batchId" : "12134343"
    }

    Invoke Nutch Parser

    {
      "identifier" : "sampleCrawl",
      "batchId" : "12134343"
    }

    Invoke Nutch UpdateDb

    {
      "identifier" : "sampleCrawl"
    }

    Invoke Nutch SolrIndex

    {
      "identifier" : "sampleCrawl"
    }

    Invoke Nutch Solr Delete Duplicates

    Checking Nutch Job Status

    By default, submitting a Nutch job request returns an HTTP status code of 202, indicating that the server has received the request. A typical response looks like the following:

    {
        "message": "injector job submitted successfully",
        "status": 202,
        "identifier": "testInjector"
    }

    The nutch job is executed asynchronously while the server continues to serve other requests. To check the status of a particular job, do one of the following:

    • Use the API to request the current job status. The URL for the up-to-date status of a job is: http://localhost:4000/nutch/status?identifier=&jobName=

    A sample response looks like the following:

    {
        "identifier": "testInjector",
        "jobName": "INJECTOR",
        "status": "SUCCESS",
        "date": 1415761722588
    }
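
    A client might build and call the status URL roughly as follows. The helper names buildStatusUrl and fetchJobStatus are hypothetical, the default base URL is the one used in this README's examples, and the global fetch function requires Node.js 18 or later.

```javascript
// Builds the status URL with properly encoded query parameters (hypothetical helper).
function buildStatusUrl(identifier, jobName, baseUrl = 'http://localhost:4000') {
  const query = new URLSearchParams({ identifier, jobName });
  return baseUrl + '/nutch/status?' + query.toString();
}

// Fetches the current status of one job and returns the parsed JSON response.
async function fetchJobStatus(identifier, jobName) {
  const response = await fetch(buildStatusUrl(identifier, jobName)); // Node.js 18+
  if (!response.ok) {
    throw new Error('Status request failed with HTTP ' + response.status);
  }
  return response.json();
}
```

    Because the job runs asynchronously, a client would typically call fetchJobStatus repeatedly, with a delay between calls, until the returned status reaches the value it is waiting for.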

    Job Name and Status Reference

    The following table lists the valid Nutch job names.

    Job Name          Job Description
    --------          ---------------
    INJECTOR          Nutch Injector
    GENERATOR         Nutch Generator
    FETCHER           Nutch Fetcher
    PARSER            Nutch Parser
    DBUPDATE          Nutch DB Updater
    SOLRINDEX         Nutch Solr Index
    SOLRDELETEDUPS    Nutch Solr Delete Duplicates
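
    Since the jobName query parameter must be one of the names above, a client could validate it locally before calling the status API. The sketch below mirrors the table; the names JOB_NAMES and describeJob are hypothetical and exist only for this example.

```javascript
// Job names and descriptions copied from the table above.
const JOB_NAMES = Object.freeze({
  INJECTOR: 'Nutch Injector',
  GENERATOR: 'Nutch Generator',
  FETCHER: 'Nutch Fetcher',
  PARSER: 'Nutch Parser',
  DBUPDATE: 'Nutch DB Updater',
  SOLRINDEX: 'Nutch Solr Index',
  SOLRDELETEDUPS: 'Nutch Solr Delete Duplicates'
});

// Returns the description for a valid job name and throws for anything else.
function describeJob(jobName) {
  if (!Object.prototype.hasOwnProperty.call(JOB_NAMES, jobName)) {
    throw new Error('Unknown Nutch job name: ' + jobName);
  }
  return JOB_NAMES[jobName];
}
```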

    Install

    npm i nutch-web-api

    Version

    1.0.0

    License

    none

    Collaborators

    • jasonstark