    pdftxt

    pdftxt extracts text from PDF documents, hence the name. This software was built to extract coherent, indexable text; it tries its best to assemble lines and paragraphs, dehyphenate divided words and polish the bonnet.

    Requirements

    pdftxt requires a Unix-like operating system with pdf3json installed.

    Install

    npm install pdftxt
    

    Usage

     
    var pdftxt = require("pdftxt");

    // Extract the text of file.pdf; the callback receives an error or the page data.
    pdftxt("file.pdf", function(err, data) {
        if (err) return console.error(err);
        console.log(data);
    });
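
If you prefer Promises over callbacks, the call can be wrapped. The following is only a sketch: `toPromise` is a hypothetical helper and not part of pdftxt, and it assumes the extractor is called as `extract(file, callback)` with an error-first callback, as in the usage shown above.

```javascript
// Hypothetical helper, not part of pdftxt: wraps a callback-style extractor
// so that it returns a Promise instead.
// Assumption: the extractor is called as extract(file, function (err, data) { ... }).
function toPromise(extract) {
    return function (file) {
        return new Promise(function (resolve, reject) {
            extract(file, function (err, data) {
                if (err) return reject(err); // pass the error on to the caller
                resolve(data);               // otherwise hand over the page data
            });
        });
    };
}

// Usage (needs pdftxt and its requirements installed):
// var pdftxt = require("pdftxt");
// toPromise(pdftxt)("file.pdf").then(console.log, console.error);
```

The wrapper takes the extractor as an argument, so it can also be tried out with a stub function when pdftxt is not installed.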

    Data

    Warning: This has changed again since the last version.

    The data object is quite simple. It's an array with one item per page.

    Every page item has its number (page), width and height as properties. The blocks property contains an array of text block objects.

    Every block object contains a bbox bounding box array with its left, bottom, width and height positions on the page, and an array of lines within that block.

    All lines are trimmed and consist of pure Unicode with entities decoded.

    [
        {
            "page": num,
            "width": width,
            "height": height,
            "blocks": [{
                "bbox": [left, bottom, width, height],
                "lines": [
                    "text",
                    "text"
                ]
            },{
                // ...
            }]
        }
    ]
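
To illustrate the structure, here is a minimal sketch that flattens such a data array into a single string. `pagesToText` is a hypothetical helper, not part of pdftxt, and the sample values are made up; the only assumption is the shape described above (pages contain blocks, blocks contain lines).

```javascript
// Hypothetical helper, not part of pdftxt: flattens the data array into one string.
// Assumes the documented shape: pages -> blocks -> lines (strings).
function pagesToText(pages) {
    return pages.map(function (page) {
        // Join the lines of each block, then separate blocks with a blank line.
        return page.blocks.map(function (block) {
            return block.lines.join("\n");
        }).join("\n\n");
    }).join("\n\n"); // pages are separated by a blank line as well
}

// Sample data in the documented shape (values invented for illustration).
var sample = [{
    page: 1,
    width: 595,
    height: 842,
    blocks: [
        { bbox: [72, 700, 400, 24], lines: ["Hello", "world"] },
        { bbox: [72, 650, 400, 12], lines: ["Second block"] }
    ]
}];

console.log(pagesToText(sample));
```

Separating blocks with a blank line is a design choice made here so that paragraphs stay recognisable; use a different separator if your indexer expects one.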

    License

    Public Domain

    Version

    0.1.0
