@sugarcube/plugin-tika

0.42.1 • Public • Published

@sugarcube/plugin-tika

Use the Apache Tika toolkit to detect and extract metadata and text from over a thousand different file types.

Installation

npm install --save @sugarcube/plugin-tika

To use this plugin you need as well Java installed.

Plugins

tika_parse

Parse a list of file specified by the query type glob_pattern.

sugarcube -Q glob_pattern:files/**/*.pdf -p tika_parse

tika_links

This plugin iterates over all links in _sc_media and fetches the text and meta data for this link. This plugin ignores any errors that the fetch might throw.

tika_location

This plugin parses any location specified using the tika_location_field query type. This fetches the text and meta data of e.g. a url inside the unit.

sugarcube -Q google_search:Keith\ Johnstone \
          -Q tika_location_field:href \
          -p google_search,tika_location

The text and meta data are added into the _sc_media collection and placed directly on the unit as well, e.g. if the location field is href, the href_text and href_meta fields are added to the unit.

tika_export

Export the text and meta data that tika_location parses to a file.

sugarcube -Q google_search:Keith\ Johnstone \
          -p google_search,tika_location,tika_export \
          --tika.location_field href

Configuration Options:

  • tika.data_dir: Specify the target directory where to store all files. Defaults to ./data/tika_location.

License

GPL3 @ Christo

Dependents (0)

Package Sidebar

Install

npm i @sugarcube/plugin-tika

Weekly Downloads

50

Version

0.42.1

License

GPL-3.0

Unpacked Size

57.3 kB

Total Files

12

Last publish

Collaborators

  • critocrito