@datagica/read-document

Extract plain text from any kind of document. Based on textract.

Current issues

read-document is not thread safe (it relies on textract, which apparently is not thread safe either), so you must wait for each promise to complete before converting another document, for instance by chaining promises like this:

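const read = require('@datagica/read-document');

// `files` is an array of file paths; `anotherAsyncPromise` is your own async handler.
// Each read starts only after the previous link in the chain has resolved.
const sequentialPromise = files.reduce((p, file) =>
  p.then(() =>
    read({ file }).then(doc => anotherAsyncPromise(doc))
  ),
  Promise.resolve()
);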
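
The same one-at-a-time constraint can also be expressed with async/await. The following is only a sketch: `readSequentially` is a hypothetical helper (not part of this package), and `read` and `handle` are passed in as parameters, assuming `read` accepts `{ file }` and returns a promise as in the example above.

```javascript
// Hypothetical helper, not part of @datagica/read-document.
// `read` is assumed to take { file } and return a promise for the document;
// `handle` is any function (sync or async) applied to each result.
async function readSequentially(files, read, handle) {
  const results = [];
  for (const file of files) {
    // Wait for this extraction to finish before starting the next one.
    const doc = await read({ file });
    results.push(await handle(doc));
  }
  return results;
}
```

Assuming the package is installed, it could be called as `readSequentially(files, require('@datagica/read-document'), doc => anotherAsyncPromise(doc))`. A rejected read stops the loop and rejects the returned promise, which matches the behaviour of the promise chain.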

Prerequisites

PDF extraction requires pdftotext to be installed.
DOC and RTF extraction requires catdoc to be installed, except on OS X, where textutil (installed by default) is used instead.
PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high-DPI and made up almost entirely of text for tesseract to extract the text accurately.
DXF extraction requires drawingtotext to be available.
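
Since these tools are external binaries, it can help to check for them before processing files. The sketch below is not part of this package: `findMissingTools` is a hypothetical helper that only looks for files with the given names in the directories listed in PATH. It assumes a POSIX-style setup, does not handle Windows executable extensions, and does not verify that a file is actually executable.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: returns the tool names for which no file of that name
// exists in any directory listed in `envPath` (defaults to the PATH variable).
function findMissingTools(tools, envPath = process.env.PATH || '') {
  const dirs = envPath.split(path.delimiter).filter(Boolean);
  return tools.filter(tool =>
    !dirs.some(dir => fs.existsSync(path.join(dir, tool)))
  );
}

// Tool names taken from the prerequisites listed above.
const missing = findMissingTools(['pdftotext', 'catdoc', 'tesseract', 'drawingtotext']);
if (missing.length > 0) {
  console.warn('Missing external tools: ' + missing.join(', '));
}
```

On OS X, catdoc may be reported as missing even though DOC and RTF extraction still works through textutil.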
