tika

Apache Tika bridge. Text extraction, metadata extraction, mimetype detection and language detection.

node-tika

Provides text extraction, metadata extraction, mime-type detection, text-encoding detection and language detection, all via a native Java bridge to the Apache Tika content-analysis toolkit. Bundles Tika 1.7.

Depends on node-java, which itself requires the JDK and Python 2 (not 3) to compile.

Requires JDK 7. Run node version to check the version that node-java is using. If the wrong version is reported even though you installed JDK 1.7, make sure JAVA_HOME is set to the correct path, then delete node_modules/java and rerun npm install.

var tika = require('tika');
 
var options = {
 
    // Hint the content-type. This is optional but would help Tika choose a parser in some cases. 
    contentType: 'application/pdf'
};
 
tika.text('test/data/file.pdf', options, function(err, text) {
    console.log(text);
});

We can even extract directly from the Web. If the server returns a content-type header, it will be passed to Tika as a hint.

tika.text('http://www.ohchr.org/EN/UDHR/Documents/UDHR_Translations/eng.pdf', function(err, text, meta) {
    // ... 
});

Or extract text using OCR (requires Tesseract).

tika.text('test/data/ocr/simple.jpg', {
    ocrLanguage: 'eng'
}, function(err, text) {
    // ... 
});

All methods that accept a uri parameter accept relative or absolute file paths and http:, https: or ftp: URLs.

The available options are the following.

  • contentType to provide a hint to Tika on which parser to use.
  • outputEncoding to specify the text output encoding. Defaults to UTF-8.
  • password to set a password to be used for encrypted files.
  • ocrLanguage to set the language used by Tesseract. This option is required to enable OCR.
  • ocrPath to set the path to the Tesseract binaries.
  • ocrMaxFileSize to set maximum file size in bytes to submit to OCR.
  • ocrMinFileSize to set minimum file size in bytes to submit to OCR.
  • ocrPageSegmentationMode to set the Tesseract page segmentation mode.
  • ocrTimeout to set the maximum time in seconds to wait for the Tesseract process to terminate.
  • pdfAverageCharTolerance see PDFTextStripper.setAverageCharTolerance(float).
  • pdfEnableAutoSpace to set whether the parser should estimate where spaces should be inserted between words (true by default).
  • pdfExtractAcroFormContent to set whether content should be extracted from AcroForms at the end of the document (true by default).
  • pdfExtractAnnotationText to set whether to extract text from annotations (true by default).
  • pdfExtractInlineImages to set whether to extract inline embedded OBX images.
  • pdfExtractUniqueInlineImagesOnly to set whether each inline image should be extracted only once, as multiple pages within a PDF file might refer to the same underlying image.
  • pdfSortByPosition to set whether to sort text tokens by their x/y position before extracting text.
  • pdfSpacingTolerance see PDFTextStripper.setSpacingTolerance(float).
  • pdfSuppressDuplicateOverlappingText to set whether the parser should try to remove duplicated text over the same region.
  • pdfUseNonSequentialParser to set whether to use PDFBox's non-sequential parser.
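
As an illustration of how several of these options might be combined, here is a minimal sketch of an options object for a password-protected PDF that also contains scanned pages. The option names come from the list above; the values and their types (numbers for sizes and timeouts, booleans for the pdf switches) and the password string are assumptions made for this example.

```javascript
// Sketch only: option names are from the list above, values are illustrative.
var options = {
    contentType: 'application/pdf',      // hint for parser selection
    outputEncoding: 'UTF-8',             // the documented default, stated explicitly
    password: 'example-password',        // hypothetical password for an encrypted file
    ocrLanguage: 'eng',                  // required to enable OCR
    ocrTimeout: 120,                     // seconds to wait for Tesseract (assumed value)
    ocrMaxFileSize: 10 * 1024 * 1024,    // bytes, here 10 MiB (assumed value)
    pdfSortByPosition: true,             // sort tokens by x/y position first
    pdfExtractAnnotationText: false      // skip annotation text
};

console.log(Object.keys(options).length); // Logs 8.
```

An object like this would be passed as the second argument to a method such as tika.text, in the same way as the contentType example earlier.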

Extract both text and metadata from a file.

tika.extract('test/data/file.pdf', function(err, text, meta) {
    console.log(text); // Logs 'Just some text'. 
    console.log(meta.producer[0]); // Logs 'LibreOffice 4.1'. 
});

Extract text from a file.

tika.text('test/data/file.pdf', function(err, text) {
    console.log(text);
});

Get an XHTML representation of the text extracted from a file.

tika.xhtml('test/data/file.pdf', function(err, xhtml) {
    console.log(xhtml);
});

Extract metadata from a file. Returns an object with names as keys and arrays as values.

tika.meta('test/data/file.pdf', function(err, meta) {
    console.log(meta.producer[0]); // Logs 'LibreOffice 4.1'. 
});
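
Because every metadata value is an array, code that only needs one value per name may find a flattened copy more convenient. The helper below is a hypothetical convenience, not part of this module, and the sample object is made up for illustration (only the producer value is taken from the example above).

```javascript
// Hypothetical helper: keep only the first value of each metadata entry.
function firstValues(meta) {
    var flat = {};
    Object.keys(meta).forEach(function(name) {
        var values = meta[name];
        // Entries that are not non-empty arrays are mapped to undefined.
        flat[name] = Array.isArray(values) && values.length > 0 ? values[0] : undefined;
    });
    return flat;
}

// Made-up sample in the documented shape: names as keys, arrays as values.
var sampleMeta = {
    producer: ['LibreOffice 4.1'],
    keywords: ['one', 'two']
};

console.log(firstValues(sampleMeta).producer); // Logs 'LibreOffice 4.1'.
```

Note that this discards any additional values, so it only suits fields that are expected to hold a single value.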

Detect the content-type (MIME type) of a file.

tika.type('test/data/file.pdf', function(err, contentType) {
    console.log(contentType); // Logs 'application/pdf'. 
});

Detect the character set (text encoding) of a file.

tika.charset('test/data/file.txt', function(err, charset) {
    console.log(charset); // Logs 'ISO-8859-1'. 
});

Detect the content-type and character set of a file.

The character set will be appended to the mime-type if available.

tika.typeAndCharset('test/data/file.txt', function(err, typeAndCharset) {
    console.log(typeAndCharset); // Logs 'text/plain; charset=ISO-8859-1'. 
});
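
If the two parts are needed separately, the combined string can be split again. The function below is a minimal sketch of that, assuming the result always has the form shown above (a mime-type, optionally followed by a charset parameter); parseTypeAndCharset is a name invented for this example and is not provided by this module.

```javascript
// Hypothetical helper: split 'type; charset=X' into its two parts.
function parseTypeAndCharset(typeAndCharset) {
    var parts = typeAndCharset.split(';');
    var result = { contentType: parts[0].trim(), charset: null };

    // Look for a charset parameter among the remaining segments.
    parts.slice(1).forEach(function(param) {
        var match = /^\s*charset\s*=\s*"?([^";]+)"?\s*$/i.exec(param);
        if (match) {
            result.charset = match[1].trim();
        }
    });
    return result;
}

var parsed = parseTypeAndCharset('text/plain; charset=ISO-8859-1');
console.log(parsed.contentType); // Logs 'text/plain'.
console.log(parsed.charset); // Logs 'ISO-8859-1'.
```

When no character set was appended, charset is left as null.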

Detect the language a given string is written in.

tika.language('This is just some text in English.', function(err, language, reasonablyCertain) {
    console.log(language); // Logs 'en'. 
    console.log(reasonablyCertain); // Logs true or false. 
});
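
One way to use the reasonablyCertain flag is to ignore low-confidence guesses altogether. The wrapper below is a sketch of that idea, not part of this module: detectCertainLanguage is a hypothetical name, and the tika object is passed in as a parameter so that the function only assumes a language(text, callback) method with the callback signature shown above.

```javascript
// Hypothetical wrapper: report a language only when the detector is reasonably certain.
function detectCertainLanguage(tika, text, callback) {
    tika.language(text, function(err, language, reasonablyCertain) {
        if (err) {
            return callback(err);
        }
        // Pass null instead of a low-confidence guess.
        callback(null, reasonablyCertain ? language : null);
    });
}
```

With the real module this might be called as detectCertainLanguage(require('tika'), 'Some text', function(err, language) { ... }), where language is either a code such as 'en' or null.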

Developed by Matthew Caruana Galizia. Please feel free to submit an issue or pull request.

Copyright (c) 2013 Matthew Caruana Galizia. Licensed under an MIT-style license.

Apache Tika JAR distributed under the Apache License, Version 2.0.