Extract text from PDF documents
pdftxt extracts text from PDF documents, hence the name. This software was built to extract coherent, indexable text and tries its best to assemble lines and paragraphs, dehyphenate words split across line breaks and polish the bonnet.
pdftxt requires a unixoid operating system with pdf3json installed.
npm install pdftxt
    var pdftxt = require("pdftxt");

    pdftxt("file.pdf", function(err, data){
        if (err) return console.error(err);
        console.log(data);
    });
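If promises are preferred over callbacks, the call above could be wrapped with Node's util.promisify. This is only a sketch: it assumes pdftxt takes a file name followed by an error-first callback, as the usage example suggests. The function fakePdftxt is a made-up stand-in so the snippet runs without pdftxt installed; in real use it would be replaced by require("pdftxt").

```javascript
var util = require("util");

// Hypothetical stand-in with the assumed (file, callback) shape of pdftxt.
// Replace it with require("pdftxt") in real use.
function fakePdftxt(file, callback) {
	if (typeof file !== "string") return callback(new Error("no file given"));
	// one empty page in the documented data shape; the values are invented
	callback(null, [{ "page": 1, "width": 595, "height": 842, "blocks": [] }]);
}

// promisify turns an error-first callback function into one returning a promise
var extract = util.promisify(fakePdftxt);

var result = extract("file.pdf");
result.then(function (data) {
	console.log(data.length + " page(s)");
}).catch(function (err) {
	console.error(err);
});
```

The same wrapper also works with await inside an async function.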
Warning: The data format has changed again since the last version.
The data object is quite simple. It's an array with one item per page.
Every page item has its page number, width and height as properties. The property blocks contains an array of text block objects. Every block object contains a bbox bounding box array with its left, bottom, width and height positions on the page, and an array of lines within that block.
The lines are trimmed and made of pure Unicode with decoded entities.

    [{
        "page": num,
        "width": width,
        "height": height,
        "blocks": [{
            "bbox": [left, bottom, width, height],
            "lines": [
                "text",
                "text",
                // ...
            ]
        }]
    }]
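As a minimal sketch of how this structure might be consumed, the snippet below flattens each page into one string: the lines of a block are joined with newlines and blocks are separated by a blank line. The helper pageToText and the sample values are made up for illustration and are not part of pdftxt; in real use the array would be the data passed to the callback.

```javascript
// Hypothetical helper, not part of pdftxt: joins a page's blocks into plain text.
function pageToText(page) {
	return page.blocks
		.map(function (block) {
			// lines are already trimmed, so they can be joined directly
			return block.lines.join("\n");
		})
		.join("\n\n"); // blank line between blocks
}

// Sample data in the documented shape; all values are invented.
var sample = [{
	"page": 1,
	"width": 595,
	"height": 842,
	"blocks": [
		{ "bbox": [72, 700, 451, 28], "lines": ["First line", "second line"] },
		{ "bbox": [72, 650, 451, 14], "lines": ["Another block"] }
	]
}];

// one string per page
var texts = sample.map(pageToText);
console.log(texts[0]);
```

The blocks are taken in the order they appear in the array. Sorting them by bbox is left out here, since the coordinate origin is not stated above.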