@datagica/read-document

Extract plain text from any kind of document. Based on textract.

Current issues

read-document is not thread-safe (it relies on textract, which apparently is not thread-safe either). Wait for each promise to complete before converting another document, for instance by chaining promises like this:

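const read = require('@datagica/read-document');

// `files` is assumed to be an array of input files, and `anotherAsyncPromise`
// stands for your own promise-returning handler for each extracted document.
const sequentialPromise = files.reduce((p, file) =>
  // Start the next read only after the previous one has resolved.
  p.then(() =>
    read({ file: file }).then(doc => anotherAsyncPromise(doc))
  ),
  Promise.resolve(0)
);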
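
The same one-at-a-time constraint can also be written with async/await. The sketch below is only an illustration: `runSequentially` is a hypothetical helper, not part of read-document, and it works with any promise-returning function.

```javascript
// Hypothetical helper (not part of read-document): runs `task` on each
// item strictly one at a time and collects the results in order.
async function runSequentially(items, task) {
  const results = [];
  for (const item of items) {
    // Awaiting inside the loop means the next call only starts
    // after the previous promise has resolved.
    results.push(await task(item));
  }
  return results;
}
```

Assuming the `read` function from the example above, it could be called as `runSequentially(files, file => read({ file: file }))`. If any call rejects, the loop stops and the returned promise rejects with that error.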

Prerequisites

  • PDF extraction requires pdftotext to be installed.
  • DOC and RTF extraction requires catdoc to be installed, except on OS X, where textutil (installed by default) is used instead.
  • PNG, JPG and GIF extraction requires tesseract to be available. Images need to be clear, high-DPI and made up almost entirely of text for tesseract to extract the text accurately.
  • DXF extraction requires drawingtotext to be available.
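
To see which tool a given file would need before calling read-document, something like the following sketch may help. It is not part of the package: `requiredTool` and `isOnPath` are hypothetical helpers, the tool names come from the list above, and the extension spellings and the PATH lookup are assumptions.

```javascript
const fs = require('fs');
const path = require('path');

// Tool names come from the prerequisites list; the exact extension
// spellings are an assumption made for this sketch.
const TOOL_BY_EXTENSION = {
  '.pdf': 'pdftotext',
  '.doc': 'catdoc',
  '.rtf': 'catdoc',
  '.png': 'tesseract',
  '.jpg': 'tesseract',
  '.gif': 'tesseract',
  '.dxf': 'drawingtotext',
};

// Returns the external tool needed for `fileName`, or null if the
// extension is not covered by the prerequisites list.
function requiredTool(fileName, platform = process.platform) {
  const ext = path.extname(fileName).toLowerCase();
  // On OS X ('darwin' in Node.js), textutil replaces catdoc.
  if ((ext === '.doc' || ext === '.rtf') && platform === 'darwin') {
    return 'textutil';
  }
  return TOOL_BY_EXTENSION[ext] || null;
}

// Simplified lookup: checks whether a file named `tool` exists in any
// PATH directory. It does not verify that the file is executable.
function isOnPath(tool, envPath = process.env.PATH || '') {
  const suffixes = process.platform === 'win32' ? ['', '.exe', '.cmd', '.bat'] : [''];
  return envPath
    .split(path.delimiter)
    .filter(Boolean)
    .some(dir => suffixes.some(s => fs.existsSync(path.join(dir, tool + s))));
}
```

For example, `isOnPath(requiredTool('report.pdf'))` reports whether a pdftotext binary was found on the current PATH.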

Install

npm i @datagica/read-document

Version

0.1.2

License

GPL-3.0

Collaborators

  • datagica