tesseract.js-node
A focused node-only version of tesseract.js.
Why?
tesseract.js is developed for both node and the browser, and includes functionality that is (in my opinion) bloat, such as automatically downloading traineddata files in the background.
At the time of writing, it also has no tests for the node environment (only for the browser). Example issue where this matters: https://github.com/naptha/tesseract.js/issues/339.
I just wanted a way to use Tesseract 4.0 in a node project without all this extra functionality and background downloads from third-party servers.
Usage
Download traineddata files from somewhere, e.g. from the official tessdata_fast repository:
mkdir tessdata
cd tessdata
curl -O -L https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
curl -O -L https://github.com/tesseract-ocr/tessdata_fast/raw/master/fin.traineddata
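Since nothing is downloaded in the background, it can help to confirm that the files are actually in place before starting a worker. The following is a minimal sketch, not part of this library: `missingTrainedData` is a hypothetical helper, and it only assumes that each language is stored as `<lang>.traineddata` in one directory, as with the downloads above.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of tesseract.js-node): returns the language
// codes whose <lang>.traineddata file is missing from tessdataDir.
function missingTrainedData(tessdataDir, languages) {
  return languages.filter(
    (lang) => !fs.existsSync(path.join(tessdataDir, `${lang}.traineddata`))
  );
}

// Demo against a throwaway directory that holds only an empty placeholder
// for eng.traineddata, so 'fin' is reported as missing.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'tessdata-'));
fs.writeFileSync(path.join(demoDir, 'eng.traineddata'), '');
console.log(missingTrainedData(demoDir, ['eng', 'fin'])); // [ 'fin' ]
```

In a real project you would point it at your own tessdata directory and fail early if the returned array is not empty.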
Then use the library in a node project:
const getWorker = ;
const worker = await ;
const text = await worker;
You can supply the input image in various ways:
// path to image
const text = await worker;
// Buffer
const text = await worker;
// Buffer (from node-canvas)
const text = await worker;
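If your code receives images from several sources, you may want to normalize them to a Buffer before handing them to the worker. The sketch below uses only Node's built-in `fs` module; `toImageBuffer` is a hypothetical helper and does not call this library. With node-canvas, a Buffer is typically obtained via `canvas.toBuffer()`, which is not shown here to keep the example dependency-free.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper (not part of tesseract.js-node): accepts either a file
// path or a Buffer and always returns a Buffer with the image bytes.
function toImageBuffer(input) {
  if (Buffer.isBuffer(input)) return input;
  if (typeof input === 'string') return fs.readFileSync(input);
  throw new TypeError('expected a file path or a Buffer');
}

// Demo: write the 8-byte PNG signature to a temporary file (a stand-in for a
// real image) and load it both by path and as an existing Buffer.
const pngSignature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const imagePath = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'img-')), 'demo.png');
fs.writeFileSync(imagePath, pngSignature);

const fromPath = toImageBuffer(imagePath);
const fromBuffer = toImageBuffer(pngSignature);
console.log(fromPath.equals(fromBuffer)); // true
```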
See tesseract.test.js for other examples.
Development
npm test
Useful resources:
- https://tesseract-ocr.github.io/4.0.0/a02186.html#a96899e8e5358d96752ab1cfc3bc09f3e
- https://github.com/naptha/tesseract.js-core/blob/v2.0.0-beta.11/examples/node/minimal/index.asm.js
- https://github.com/jeromewu/tesseract.js-utils/blob/b5fba24a8ffcdd88302b5709a1023330138a281e/src/readImage.js
Credits
Thanks to tesseract.js-core contributors for the groundwork!
License
Apache License 2.0