Nikita: Content extraction from documents

TODO

Fix hardcoded temporary directory, clean up after upload
Test some PDF files
Add a wrapper graph that takes an object in, enriches it, then sends it out again
Integrate with AMQP, add as worker in thegrid-apis

Later

Test how much faster the Tika Java API is at XHTML + image extraction than the CLI tools
Avoid temporary files for images+html output if/when passing to NoFlo

Setup

Configuration is passed as environment variables:

AMAZON_API_ID: Amazon S3 API identifier
AMAZON_API_TOKEN: Amazon S3 API token/secret
AMAZON_API_REGION: Amazon S3 region, e.g. 'us-west-2'
AMAZON_API_BUCKET: Amazon S3 bucket for uploaded files, e.g. 'thegrid-user-content'
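
As an illustration of how a worker might consume these variables, here is a minimal sketch that reads all four and fails early if any is missing. The `S3Config` interface and the `loadS3Config` function are hypothetical names, not part of Nikita itself.

```typescript
// Hypothetical container for the four settings listed above.
interface S3Config {
  apiId: string;
  apiToken: string;
  region: string;
  bucket: string;
}

// An environment-like object: variable name -> value (or undefined if unset).
type Env = Record<string, string | undefined>;

// Read the S3 settings from `env`, throwing one error that names
// every variable that is missing or empty.
function loadS3Config(env: Env): S3Config {
  const required = [
    "AMAZON_API_ID",
    "AMAZON_API_TOKEN",
    "AMAZON_API_REGION",
    "AMAZON_API_BUCKET",
  ];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return {
    apiId: env["AMAZON_API_ID"] as string,
    apiToken: env["AMAZON_API_TOKEN"] as string,
    region: env["AMAZON_API_REGION"] as string,
    bucket: env["AMAZON_API_BUCKET"] as string,
  };
}
```

In a Node.js process this would typically be called once at startup as `loadS3Config(process.env)`.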

Design

Separate Heroku worker, integrated into TheGrid APIs.

Inputs:

URL to an S3-backed document (Word, PDF)

Outputs:

Extracted HTML with img src referring to S3 backend
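
The output step, pointing each `img src` at S3, could look roughly like the sketch below. It is not Nikita's actual implementation. The function names, the `uploaded` map (local image name to S3 object key), and the virtual-hosted-style URL layout are all assumptions, and the regex only handles double-quoted `src` attributes, which a real HTML parser would cover more robustly.

```typescript
// Build a virtual-hosted-style S3 URL for an object key (assumed layout).
function s3Url(bucket: string, region: string, key: string): string {
  // Encode each path segment but keep the "/" separators.
  const encodedKey = key
    .split("/")
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return "https://" + bucket + ".s3." + region + ".amazonaws.com/" + encodedKey;
}

// Rewrite <img src="..."> values that have an entry in `uploaded`
// (local file name -> S3 key). Other sources are left untouched.
function rewriteImageSources(
  html: string,
  uploaded: Record<string, string>,
  bucket: string,
  region: string
): string {
  const imgSrc = /(<img\b[^>]*?\bsrc=")([^"]*)(")/g;
  return html.replace(
    imgSrc,
    (match: string, before: string, src: string, after: string) => {
      const known = Object.prototype.hasOwnProperty.call(uploaded, src);
      if (!known) {
        return match;
      }
      return before + s3Url(bucket, region, uploaded[src]) + after;
    }
  );
}
```

Called with the extracted HTML and a map such as `{ "image0.png": "docs/123/image0.png" }`, this returns the same markup with the matching `src` replaced by the S3 URL.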

Notes:

Tika provides a full XHTML document, whereas Embed.ly gives only (and we expect)

@the-grid/nikita
