tika-http
An Apache Tika client that talks to Tika over its HTTP interface. It is designed to work with the apache/tika
Docker image, where apache/tika runs in its own container and other containers
send requests to it.
Example
const fs = require('fs').promises;
const path = require('path');
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);
const fn = async () => {
  try {
    const data = await fs.readFile(path.resolve('./var/10100001.pdf'));
    console.log(await tika.tika(data));
  } catch (e) {
    console.error(e.message);
  }
};
fn();
Output
{
"pdf:unmappedUnicodeCharsPerPage": [
"0",
"..."
],
"pdf:PDFVersion": "1.6",
"xmp:CreatorTool": "Microsoft® Word 2013",
"pdf:docinfo:title": "Debate Transcription Word Template File",
"pdf:hasXFA": "false",
"access_permission:modify_annotations": "false",
"access_permission:can_print_degraded": "true",
"dc:creator": "Errett, Morgan E.",
"dcterms:created": "2019-10-24T21:17:53Z",
"dcterms:modified": "2019-11-27T16:39:41Z",
"dc:format": "application/pdf; version=1.6",
"xmpMM:DocumentID": "uuid:ab0af5ea-f13f-4d5d-bf55-6b0f937d4f2f",
"pdf:docinfo:creator_tool": "Microsoft® Word 2013",
"access_permission:fill_in_form": "false",
"pdf:docinfo:modified": "2019-11-27T16:39:41Z",
"pdf:hasCollection": "false",
"pdf:encrypted": "true",
"dc:title": "Debate Transcription Word Template File",
"xmp:CreateDate": "2019-10-24T16:17:53Z",
"pdf:hasMarkedContent": "true",
"Content-Type-Override": "application/pdf",
"Content-Type": "application/pdf",
"xmp:ModifyDate": "2019-11-27T10:39:41Z",
"pdf:docinfo:creator": "Errett, Morgan E.",
"xmp:MetadataDate": "2019-11-27T10:39:41Z",
"dc:language": "en-US",
"pdf:producer": "Microsoft® Word 2013",
"access_permission:extract_for_accessibility": "true",
"access_permission:assemble_document": "false",
"xmpTPg:NPages": "67",
"pdf:hasXMP": "true",
"pdf:charsPerPage": [
"1762",
"..."
],
"access_permission:extract_content": "false",
"access_permission:can_print": "true",
"X-TIKA:Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"
],
"X-TIKA:content": "<html xmlns=\"http://www.w3.org/1999/xhtml\"> ... </html>",
"access_permission:can_modify": "false",
"pdf:docinfo:producer": "Microsoft® Word 2013",
"pdf:docinfo:created": "2019-10-24T21:17:53Z"
}
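The resolved object uses plain string keys, and numeric values such as the page count arrive as strings. Below is a minimal sketch of pulling a few fields out of it, assuming the key names shown in the output above. Other file types will likely expose different keys, and `summarizeMetadata` is a hypothetical helper, not part of tika-http.

```javascript
// Hypothetical helper, not part of tika-http: picks a few fields out of the
// metadata object shown above. Key names are taken from that sample output and
// may be absent for other documents, so every lookup has a fallback.
function summarizeMetadata(metadata) {
  const pages = Number.parseInt(metadata['xmpTPg:NPages'], 10);
  return {
    title: metadata['dc:title'] ?? null,
    contentType: metadata['Content-Type'] ?? null,
    // Tika reports the page count as a string such as "67".
    pages: Number.isNaN(pages) ? null : pages,
    // Boolean-like flags also arrive as the strings "true" / "false".
    encrypted: metadata['pdf:encrypted'] === 'true',
  };
}

// Trimmed copy of the sample output above.
const sample = {
  'dc:title': 'Debate Transcription Word Template File',
  'Content-Type': 'application/pdf',
  'xmpTPg:NPages': '67',
  'pdf:encrypted': 'true',
};

console.log(summarizeMetadata(sample));
```

In real use, the argument would be the value resolved by `tika.tika(data)` rather than a hand-written object.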
Classes
TikaConnection
A class that holds the connection settings used to reach the Tika server.
TikaConnection#construct(string tikaHost, string tikaPort) : TikaConnection
tikaHost
: The name of the server running Tika. Common examples: localhost, 127.0.0.1, or the name of the service in docker-compose.yml.
tikaPort
: The port that the Tika server is listening on. Typically, this is the default Tika port of 9998.
Example:
const { TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = new TikaConnection('tika', '9998');
TikaConnection^createFromEnv() : TikaConnection
Checks the system's environment variables to create a TikaConnection
without passing values in code. This method looks for
environment variables named:
TIKA_HOST
: The name of the server running Tika. Common examples: localhost, 127.0.0.1, or the name of the service in docker-compose.yml.
TIKA_PORT
: The port that the Tika server is listening on. Typically, this is the default Tika port of 9998.
Example:
const { TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = TikaConnection.createFromEnv();
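When the environment variables may be unset, for example during local development, one option is to resolve the settings yourself and pass them to the constructor. Below is a minimal sketch, assuming a fallback to localhost and 9998 is acceptable. `resolveTikaSettings` is a hypothetical helper, and how `createFromEnv()` itself behaves when the variables are missing is not covered here.

```javascript
// Hypothetical helper, not part of tika-http: reads TIKA_HOST and TIKA_PORT
// from an environment-like object and falls back to assumed defaults.
function resolveTikaSettings(env = process.env) {
  return {
    host: env.TIKA_HOST || 'localhost',
    // Kept as a string because the constructor takes the port as a string.
    port: env.TIKA_PORT || '9998',
  };
}

// Only TIKA_HOST is provided here, so the port falls back to the default.
const { host, port } = resolveTikaSettings({ TIKA_HOST: 'tika' });
console.log(host, port);
```

The resulting values can then be passed on as `new TikaConnection(host, port)`.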
Tika
Tika#construct(TikaConnection tikaConnection) : Tika
tikaConnection
: A valid TikaConnection.
Example:
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);
Tika#tika(Buffer data) : Promise<object>
Sends a file to Tika to be parsed.
data
: The contents of a file to send to Tika for parsing.
Example:
const fs = require('fs').promises;
const path = require('path');
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);
const fn = async () => {
  try {
    const data = await fs.readFile(path.resolve('./var/10100001.pdf'));
    console.log(await tika.tika(data));
  } catch (e) {
    console.error(e.message);
  }
};
fn();
Tika#version() : Promise<string>
Gets the version of the Tika server.
Example:
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);
const fn = async () => {
  try {
    console.log(await tika.version());
  } catch (e) {
    console.error(e.message);
  }
};
fn();
Output
Apache Tika 2.3.0
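If your code depends on a particular Tika release, the version string can be parsed before any documents are sent. Below is a minimal sketch, assuming the string always has the shape shown in the output above. `parseTikaVersion` is a hypothetical helper, not part of tika-http.

```javascript
// Hypothetical helper, not part of tika-http: extracts the numeric version
// from a string such as "Apache Tika 2.3.0".
function parseTikaVersion(versionString) {
  const match = /(\d+)\.(\d+)\.(\d+)/.exec(versionString);
  if (!match) {
    return null; // the string did not contain a three-part version
  }
  const [major, minor, patch] = match.slice(1).map(Number);
  return { major, minor, patch };
}

console.log(parseTikaVersion('Apache Tika 2.3.0'));
```

In real use, the argument would be the value resolved by `tika.version()`.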
Tika#parsers() : Promise<string>
Lists the parsers available on the Tika server.
Example:
const { Tika, TikaConnection } = require('@haydenpierce/tika-http');
const tikaConnection = new TikaConnection('tika', '9998');
const tika = new Tika(tikaConnection);
const fn = async () => {
  try {
    console.log(await tika.parsers());
  } catch (e) {
    console.error(e.message);
  }
};
fn();
Output
{
"children" : [
{
"composite" : false,
"name" : "org.apache.tika.parser.apple.AppleSingleFileParser",
"decorated" : false
},
...
]
}
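The parser listing is a tree, so a flat list of parser names takes a short recursive walk. Below is a minimal sketch, assuming nested composite parsers also carry a `children` array as the top level does in the output above. `listParserNames` is a hypothetical helper, not part of tika-http.

```javascript
// Hypothetical helper, not part of tika-http: walks the parser tree and
// collects every "name" it finds, depth first.
function listParserNames(node, names = []) {
  if (typeof node.name === 'string') {
    names.push(node.name);
  }
  // Leaf parsers have no "children" key, so default to an empty list.
  for (const child of node.children ?? []) {
    listParserNames(child, names);
  }
  return names;
}

// Trimmed copy of the sample output above.
const parsers = {
  children: [
    {
      composite: false,
      name: 'org.apache.tika.parser.apple.AppleSingleFileParser',
      decorated: false,
    },
  ],
};

console.log(listParserNames(parsers));
```

If `tika.parsers()` resolves to a JSON string, as its signature suggests, run `JSON.parse` on the result before passing it to the helper.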
CLI
tika-http also includes a CLI tool for testing whether Tika is running. It can be used via npx.
$ npx @haydenpierce/tika-http version
Apache Tika 2.3.0
$ npx @haydenpierce/tika-http parsers
{
...
}
$ npx @haydenpierce/tika-http tika var/10100001.pdf
{
...
}
By default, the CLI will create a TikaConnection instance from the environment variables TIKA_HOST and TIKA_PORT, but
you can also provide them as options.
$ npx @haydenpierce/tika-http version --tika-host=tika --tika-port=9998
Working on the library
This library uses Docker to perform testing. To get up and running, run the following commands in this order:
# Start in the directory with docker-compose.yml and get the containers running.
$ docker-compose up
...
# We need the id of the node container for the next step. In this example it's "ffef5007a12e".
$ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ffef5007a12e node "docker-entrypoint.s…" 4 days ago Up 13 minutes 0.0.0.0:9229->9229/tcp, :::9229->9229/tcp tika-http_tika-http_1
35ed1a06c474 apache/tika "/bin/sh -c 'exec ja…" 4 days ago Up 13 minutes 0.0.0.0:9998->9998/tcp, :::9998->9998/tcp tika-http_tika_1
# Attach to the node container
$ docker exec -it ffef5007a12e /bin/bash
# We're inside the node container now where the code is mounted to /home.
root@ffef5007a12e:/# cd /home/bin
# Run tika.js, but before it executes we'll wait for an IDE/Google Chrome to attach its debugger.
root@ffef5007a12e:/home/bin# node --inspect-brk=0.0.0.0 tika.js version
Apache Tika 2.3.0