@haydenpierce/tika-http

tika-http

An Apache Tika client that talks to Tika over its HTTP interface. It is designed for the apache/tika Docker image, where Tika runs in its own container and other containers send it requests.


Example

const fs = require('fs').promises;
const path = require('path');
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);

const fn = async () => {
    try {
        const data = await fs.readFile(path.resolve('./var/10100001.pdf'));
        console.log(await tika.tika(data));
    } catch (e) {
        console.error(e.message);
    }
}

fn();

Output

{
  "pdf:unmappedUnicodeCharsPerPage": [
    "0",
    "..."
  ],
  "pdf:PDFVersion": "1.6",
  "xmp:CreatorTool": "Microsoft® Word 2013",
  "pdf:docinfo:title": "Debate Transcription Word Template File",
  "pdf:hasXFA": "false",
  "access_permission:modify_annotations": "false",
  "access_permission:can_print_degraded": "true",
  "dc:creator": "Errett, Morgan E.",
  "dcterms:created": "2019-10-24T21:17:53Z",
  "dcterms:modified": "2019-11-27T16:39:41Z",
  "dc:format": "application/pdf; version=1.6",
  "xmpMM:DocumentID": "uuid:ab0af5ea-f13f-4d5d-bf55-6b0f937d4f2f",
  "pdf:docinfo:creator_tool": "Microsoft® Word 2013",
  "access_permission:fill_in_form": "false",
  "pdf:docinfo:modified": "2019-11-27T16:39:41Z",
  "pdf:hasCollection": "false",
  "pdf:encrypted": "true",
  "dc:title": "Debate Transcription Word Template File",
  "xmp:CreateDate": "2019-10-24T16:17:53Z",
  "pdf:hasMarkedContent": "true",
  "Content-Type-Override": "application/pdf",
  "Content-Type": "application/pdf",
  "xmp:ModifyDate": "2019-11-27T10:39:41Z",
  "pdf:docinfo:creator": "Errett, Morgan E.",
  "xmp:MetadataDate": "2019-11-27T10:39:41Z",
  "dc:language": "en-US",
  "pdf:producer": "Microsoft® Word 2013",
  "access_permission:extract_for_accessibility": "true",
  "access_permission:assemble_document": "false",
  "xmpTPg:NPages": "67",
  "pdf:hasXMP": "true",
  "pdf:charsPerPage": [
    "1762",
    "..."
  ],
  "access_permission:extract_content": "false",
  "access_permission:can_print": "true",
  "X-TIKA:Parsed-By": [
    "org.apache.tika.parser.DefaultParser",
    "org.apache.tika.parser.pdf.PDFParser"
  ],
  "X-TIKA:content": "<html xmlns=\"http://www.w3.org/1999/xhtml\"> ... </html>",
  "access_permission:can_modify": "false",
  "pdf:docinfo:producer": "Microsoft® Word 2013",
  "pdf:docinfo:created": "2019-10-24T21:17:53Z"
}
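
The result is a flat object keyed by Tika metadata names, so the fields of interest can be picked out directly. Below is a minimal sketch of that; summarizeTikaResult is a hypothetical helper (not part of this library), the sample object reuses a few values from the output above, and the 'X-TIKA:content' value is a made-up placeholder.

```javascript
// Hypothetical helper (not part of @haydenpierce/tika-http): picks a few
// fields out of the object returned by tika.tika().
function summarizeTikaResult(result) {
    const html = result['X-TIKA:content'] || '';
    return {
        title: result['dc:title'] || null,
        contentType: result['Content-Type'] || null,
        // Tika reports numbers and booleans as strings, so convert them.
        pages: result['xmpTPg:NPages'] ? Number(result['xmpTPg:NPages']) : null,
        encrypted: result['pdf:encrypted'] === 'true',
        // Crude tag stripping for a quick preview only; a real HTML parser
        // is the safer choice for anything beyond that.
        text: html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim(),
    };
}

// Sample input: metadata values copied from the output above, with a
// placeholder body standing in for the elided X-TIKA:content.
const sample = {
    'dc:title': 'Debate Transcription Word Template File',
    'Content-Type': 'application/pdf',
    'xmpTPg:NPages': '67',
    'pdf:encrypted': 'true',
    'X-TIKA:content': '<html><body><p>Hello</p></body></html>',
};

console.log(summarizeTikaResult(sample));
```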

Classes

TikaConnection

Holds the connection settings used to reach the Tika server.

TikaConnection#construct(string tikaHost, string tikaPort) : TikaConnection

tikaHost: The name of the server running Tika. Common examples: localhost, 127.0.0.1, or the name of the service in docker-compose.yml.

tikaPort: The port that the Tika server is listening on. Typically, this is the default Tika port of 9998.

Example:

const { TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = new TikaConnection('tika', '9998');

TikaConnection^createFromEnv() : TikaConnection

Creates a TikaConnection from the system's environment variables, so no values need to be passed in code. This method reads the following variables:

TIKA_HOST: The name of the server running Tika. Common examples: localhost, 127.0.0.1, or the name of the service in docker-compose.yml.

TIKA_PORT: The port that the Tika server is listening on. Typically, this is the default Tika port of 9998.

Example:

const { TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = TikaConnection.createFromEnv();
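
This README does not say what createFromEnv() does when a variable is unset. If explicit fallbacks are wanted, one option is to read the variables yourself and pass them to the constructor. The sketch below assumes that approach; resolveTikaSettings and its default values are hypothetical and not part of this library.

```javascript
// Hypothetical helper (not part of @haydenpierce/tika-http): reads
// TIKA_HOST and TIKA_PORT from an environment-like object.
function resolveTikaSettings(env = process.env) {
    // The fallback values are assumptions made for this sketch.
    const host = env.TIKA_HOST || 'localhost';
    const port = env.TIKA_PORT || '9998';
    return { host, port };
}

// Host comes from the object passed in, port from the fallback.
const settings = resolveTikaSettings({ TIKA_HOST: 'tika' });
console.log(settings);
```

The resulting values can then be passed to new TikaConnection(settings.host, settings.port).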

Tika

Tika#construct(TikaConnection tikaConnection) : Tika

tikaConnection: A valid TikaConnection.

Example:

const { Tika, TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);

Tika#tika(Buffer data) : Promise<object>

Sends a file to Tika to be parsed.

data: The contents of a file to send to Tika for parsing.

Example:

const fs = require('fs').promises;
const path = require('path');
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);

const fn = async () => {
    try {
        const data = await fs.readFile(path.resolve('./var/10100001.pdf'));
        console.log(await tika.tika(data));
    } catch (e) {
        console.error(e.message);
    }
}

fn();
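
For readers curious about what such a call looks like on the wire, here is a rough sketch using only Node's built-in http module. It assumes the Tika server accepts PUT /tika with an Accept: application/json header, as Tika 2.x servers generally do. It is not this library's actual implementation, and buildTikaRequestOptions and putToTika are hypothetical names.

```javascript
const http = require('http');

// Hypothetical helper: builds http.request options for a PUT /tika call.
function buildTikaRequestOptions(host, port, byteLength) {
    return {
        host,
        port: Number(port),
        method: 'PUT',
        path: '/tika', // assumed endpoint; check your Tika server version
        headers: {
            Accept: 'application/json',
            'Content-Length': byteLength,
        },
    };
}

// Hypothetical helper: sends a Buffer to Tika and resolves with parsed JSON.
function putToTika(host, port, data) {
    return new Promise((resolve, reject) => {
        const options = buildTikaRequestOptions(host, port, data.length);
        const req = http.request(options, (res) => {
            const chunks = [];
            res.on('data', (chunk) => chunks.push(chunk));
            res.on('end', () => {
                if (res.statusCode !== 200) {
                    reject(new Error(`Tika responded with HTTP ${res.statusCode}`));
                    return;
                }
                try {
                    resolve(JSON.parse(Buffer.concat(chunks).toString('utf8')));
                } catch (e) {
                    reject(e);
                }
            });
        });
        req.on('error', reject);
        req.end(data); // writes the file contents and finishes the request
    });
}
```

Nothing is sent until putToTika is called, for example with putToTika('tika', '9998', data) inside an async function.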

Tika#version() : Promise<string>

Gets the version of the Tika server.

Example:

const { Tika, TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);

const fn = async () => {
    try {
        console.log(await tika.version());
    } catch (e) {
        console.error(e.message);
    }
}

fn();

Output

Apache Tika 2.3.0
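
Since version() needs no input file, it can double as a readiness check while the Tika container is still starting. The following is a minimal sketch of a retry loop under that assumption; waitForTika is a hypothetical helper that accepts any async function, and the retry count and delay are arbitrary defaults.

```javascript
// Hypothetical helper (not part of @haydenpierce/tika-http): calls an
// async function until it succeeds or the attempts run out.
async function waitForTika(getVersion, { retries = 5, delayMs = 1000 } = {}) {
    let lastError;
    for (let attempt = 1; attempt <= retries; attempt++) {
        try {
            return await getVersion();
        } catch (e) {
            lastError = e;
            // Pause before the next attempt, but not after the final one.
            if (attempt < retries) {
                await new Promise((resolve) => setTimeout(resolve, delayMs));
            }
        }
    }
    // Every attempt failed, so surface the most recent error.
    throw lastError;
}
```

With a Tika instance at hand, this could be called as await waitForTika(() => tika.version()) inside an async function.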

Tika#parsers() : Promise<string>

Enumerates the parsers available in Tika.

Example:

const { Tika, TikaConnection } = require('@haydenpierce/tika-http');

const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);

const fn = async () => {
    try {
        console.log(await tika.parsers());
    } catch (e) {
        console.error(e.message);
    }
}

fn();

Output

{
  "children" : [ 
      {
        "composite" : false,
        "name" : "org.apache.tika.parser.apple.AppleSingleFileParser",
        "decorated" : false
      },
      ...
  ]
}
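
The parser listing is a tree, so collecting the parser names takes a small recursive walk. A minimal sketch follows, assuming nested parsers always live under a "children" array as in the output above; collectParserNames is a hypothetical helper. Because the signature above declares Promise<string>, the sketch accepts either a JSON string or an already-parsed object.

```javascript
// Hypothetical helper (not part of @haydenpierce/tika-http): walks the
// parser tree and returns every parser name it finds.
function collectParserNames(result) {
    // Accept either a JSON string or an already-parsed object.
    const root = typeof result === 'string' ? JSON.parse(result) : result;
    const names = [];
    const visit = (node) => {
        if (node.name) names.push(node.name);
        for (const child of node.children || []) visit(child);
    };
    visit(root);
    return names;
}

// Sample input: only the single entry shown in the output above.
const sampleParsers = {
    children: [
        {
            composite: false,
            name: 'org.apache.tika.parser.apple.AppleSingleFileParser',
            decorated: false,
        },
    ],
};

console.log(collectParserNames(sampleParsers));
```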

CLI

tika-http also includes a CLI tool for checking whether Tika is running. It can be run via npx.

$ npx @haydenpierce/tika-http version
Apache Tika 2.3.0

$ npx @haydenpierce/tika-http parsers
{ 
   ...
}

$ npx @haydenpierce/tika-http tika var/10100001.pdf
{ 
   ...
}

By default, the CLI will create a TikaConnection instance from environment variables TIKA_HOST and TIKA_PORT, but you can also provide them as options.

$ npx @haydenpierce/tika-http version --tika-host=tika --tika-port=9998

Working on the library

This library uses Docker for testing. To get up and running, run the following steps in order:

# Start in the directory with docker-compose.yml and get the containers running.
$ docker-compose up
...

# We need the ID of the node container for the next step. In this example it is "ffef5007a12e".
$ docker container ls
CONTAINER ID   IMAGE         COMMAND                  CREATED      STATUS          PORTS                                       NAMES
ffef5007a12e   node          "docker-entrypoint.s…"   4 days ago   Up 13 minutes   0.0.0.0:9229->9229/tcp, :::9229->9229/tcp   tika-http_tika-http_1
35ed1a06c474   apache/tika   "/bin/sh -c 'exec ja…"   4 days ago   Up 13 minutes   0.0.0.0:9998->9998/tcp, :::9998->9998/tcp   tika-http_tika_1  

# Attach to the node container
$ docker exec -it ffef5007a12e /bin/bash

# We're now inside the node container, where the code is mounted at /home.
root@ffef5007a12e:/# cd /home/bin

# Run tika.js, but wait for an IDE or Google Chrome to attach its debugger before executing.
root@ffef5007a12e:/home/bin# node --inspect-brk=0.0.0.0 tika.js version
Apache Tika 2.3.0

Install

npm i @haydenpierce/tika-http

Version

0.1.1

License

MIT

Collaborators

  • haydenpierce