@the-grid/nikita

1.2.3 • Public • Published

Nikita: Content extraction from documents

TODO

  • Fix hardcoded temporary directory, clean up after upload
  • Test some PDF files
  • Add a wrapper graph that takes an object in, enriches it, then sends it out again
  • Integrate with AMQP, add as worker in thegrid-apis
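
For the temporary-directory item above, one possible shape is sketched below. It is only an illustration: `withTempDir` and the `nikita-` prefix are hypothetical names, not part of the package. It creates a uniquely named directory under the OS temp location and removes it when the work is done, even if the work throws.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper (not part of nikita): runs `work` with a fresh,
// uniquely named temporary directory and removes that directory
// afterwards, even if `work` throws.
function withTempDir<T>(work: (dir: string) => T): T {
  // mkdtempSync appends random characters to the prefix, so concurrent
  // jobs do not share a hardcoded path.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "nikita-"));
  try {
    return work(dir);
  } finally {
    // recursive + force: delete the directory and everything inside it.
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

This version is synchronous to keep the sketch short. If the upload step is asynchronous, the same try/finally pattern would need an async variant that awaits the work before removing the directory.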

Later

  • Test how much faster the Tika Java API is at XHTML + image extraction than the CLI tools
  • Avoid temporary files for images+html output if/when passing to NoFlo

Setup

Configuration is passed as environment variables:

AMAZON_API_ID: Amazon S3 API identifier
AMAZON_API_TOKEN: Amazon S3 API token/secret
AMAZON_API_REGION: Amazon S3 region, e.g. 'us-west-2'
AMAZON_API_BUCKET: Amazon S3 bucket for uploaded files, e.g. 'thegrid-user-content'
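
As a rough illustration of how these variables might be read and checked at startup, here is a minimal sketch. `loadS3Config` and its types are hypothetical names, not part of the package; only the four variable names come from the list above.

```typescript
// Names of the environment variables listed above.
const REQUIRED_KEYS = [
  "AMAZON_API_ID",
  "AMAZON_API_TOKEN",
  "AMAZON_API_REGION",
  "AMAZON_API_BUCKET",
] as const;

type ConfigKey = (typeof REQUIRED_KEYS)[number];
type S3Config = Record<ConfigKey, string>;
type Env = Record<string, string | undefined>;

// Hypothetical helper (not part of nikita): collects the four variables
// and throws a single error naming every missing or empty one.
function loadS3Config(env: Env): S3Config {
  const missing = REQUIRED_KEYS.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as S3Config;
  for (const key of REQUIRED_KEYS) {
    config[key] = env[key] as string;
  }
  return config;
}
```

In a Node process this would typically be called once as `loadS3Config(process.env)`, so a misconfigured worker fails at startup instead of on its first upload.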

Design

Separate Heroku worker, integrated into TheGrid APIs.

Inputs:

  • URL to an S3-backed document (Word, PDF)

Outputs:

  • Extracted HTML with img src attributes referring to the S3 backend

Notes:

  • Tika provides a full XHTML document, whereas Embed.ly gives only (and we expect)
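
To make the output side of the design concrete, the sketch below shows one way extracted HTML could have its img src values pointed at uploaded S3 objects. It is an assumption-laden illustration, not the package's implementation: `s3Url` and `rewriteImageSources` are hypothetical names, the objects are assumed to be publicly readable, and the URL uses S3's virtual-hosted-style form, which assumes a DNS-compatible bucket name.

```typescript
// Hypothetical helper: public URL for an object, using S3's
// virtual-hosted-style addressing (bucket name in the hostname).
function s3Url(bucket: string, region: string, key: string): string {
  // Encode each path segment but keep the "/" separators.
  const encodedKey = key.split("/").map(encodeURIComponent).join("/");
  return `https://${bucket}.s3.${region}.amazonaws.com/${encodedKey}`;
}

// Rewrites each double-quoted <img src="..."> whose value appears in
// `uploaded`, a map from the name used in the extracted HTML to the S3
// object key. Other images are left untouched.
// Regex-based only to keep the sketch dependency-free; a real
// implementation would more likely use an HTML/XML parser.
function rewriteImageSources(
  html: string,
  uploaded: Map<string, string>,
  bucket: string,
  region: string,
): string {
  return html.replace(
    /(<img\b[^>]*?\ssrc=")([^"]*)(")/gi,
    (match: string, before: string, src: string, after: string) => {
      const key = uploaded.get(src);
      return key === undefined
        ? match
        : before + s3Url(bucket, region, key) + after;
    },
  );
}
```

Passing the upload results as a map keeps the rewrite step independent of how the files reached S3, so it can be exercised without network access.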

Install

npm i @the-grid/nikita

License

proprietary

Collaborators

  • d4tocchini
  • grid-bot-ios
  • grid-bot-android
  • gridbot-ds
  • gridbot-web
  • gridbot-apis