
bulk-files-ocr-search

0.1.5 • Public • Published

Bulk Files OCR Text Finder


🔍 Find files that contain some text with OCR.

Supported file formats:

  • Images: JPEG, PNG, WebP
  • Documents: PDF

Unsupported file formats:

Tesseract OCR is used internally (Tesseract Documentation). For PDF to PNG conversion, Poppler is used.

This package uses worker threads to spread the OCR work across your CPU cores, which speeds up large scans.
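As a rough illustration of how a worker count could be derived, here is a minimal sketch. The documented default is the total CPU core count minus 2; the helper name `defaultWorkerCount` and the floor of one worker are assumptions made for this example, not the package's actual code.

```typescript
import { cpus } from 'node:os'

// Hypothetical helper (not part of the package API): derive a default
// worker count from the number of logical CPU cores, leaving two cores free.
function defaultWorkerCount(totalCores: number = cpus().length): number {
  // Assumption: never go below one worker on machines with one or two cores
  return Math.max(1, totalCores - 2)
}

console.log(`Would start ${defaultWorkerCount()} worker thread(s)`)
```

Leaving a couple of cores free keeps the machine responsive while the scan runs.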

Notes:

  • The OCR gives poor results for rotated files or text that is not straight.
    • Rotations of 90 or 180 degrees seem to produce good results.
    • You may want to pre-process your files to straighten the text.
  • A file is matched if at least one of the words is found in the text it contains.
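The matching rule can be sketched as follows. This is only an illustration of "at least one word": the function name `fileMatches` and the case-insensitive comparison are assumptions, and the package may normalize text differently.

```typescript
// Hypothetical helper illustrating the "at least one word" rule.
// Assumption: comparison is case-insensitive and an empty word list
// matches every file (mass OCR extraction).
function fileMatches(ocrText: string, words: string[] = []): boolean {
  if (words.length === 0) return true
  const haystack = ocrText.toLowerCase()
  // One hit is enough: `some` stops at the first word found
  return words.some(word => haystack.includes(word.toLowerCase()))
}

console.log(fileMatches('Hello from the wiki', ['system', 'wiki'])) // true
console.log(fileMatches('Nothing relevant here', ['system', 'wiki'])) // false
```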

Install

No matter how you use this package, you need to install Tesseract OCR. To scan PDF files, you also need Poppler to convert their pages to images.

# OCR Package (non-linux, see https://github.com/tesseract-ocr/tesseract#installing-tesseract)
sudo apt install tesseract-ocr

# PDF to PNG conversion command-line (for Windows, see https://stackoverflow.com/a/53960829 - macOS `brew install poppler`)
# You can skip this if you don't plan to scan PDF files
sudo apt install poppler-utils

OCR Language

To use a language other than English, download the required language model from the Tesseract OCR Languages Models repository and install it.

# French language
wget https://github.com/tesseract-ocr/tessdata_fast/raw/main/fra.traineddata
sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
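To check that a language model was picked up, one option is to read the output of `tesseract --list-langs`. The sketch below is an assumption-laden example: `parseLangList` and `installedLangs` are hypothetical helpers, the output is assumed to be a header line followed by one language code per line, and some older Tesseract versions print this list to stderr instead of stdout.

```typescript
import { execFileSync } from 'node:child_process'

// Hypothetical helper: parse the output of `tesseract --list-langs`.
// Assumption: the first line is a header, every following non-empty line
// is a language code.
function parseLangList(output: string): string[] {
  return output
    .split(/\r?\n/)
    .slice(1)
    .map(line => line.trim())
    .filter(line => line.length > 0)
}

// Hypothetical helper: ask the local Tesseract install which languages it has.
// Throws if the `tesseract` binary is not on the PATH.
function installedLangs(): string[] {
  const output = execFileSync('tesseract', ['--list-langs'], { encoding: 'utf8' })
  return parseLangList(output)
}
```

Calling `installedLangs().includes('fra')` before a scan would then tell you whether the French model is available.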

Use with CLI

This will install the ocr-search CLI.

pnpm i -g bulk-files-ocr-search
$ ocr-search --help

  🔍 Find files that contain some text with OCR

  Usage
    $ ocr-search --words "<words_list>" <input_files>

  Required
    --words List of comma-separated words to search (if "MATCH_ALL", will match everything for mass OCR extraction)

  Options
    --ignoreExt         List of comma-separated file extensions to ignore
    --pdfExtractFirst   Range start of the pages to extract from PDF files (1-indexed)
    --pdfExtractLast    Range end of the pages to extract from PDF files, last page if overflow (1-indexed)
    --progressFile      File to save progress to, will start from where it
                        stopped last time by looking there (no file, use "none")  [default="progress.json"]
    --matchesLogFile    Log all matches to this file (no file, use "none") [default="matches.txt"]
    --no-console-logs   Silence all console logs
    --no-show-matches   Do not print matched files text content to the console [default="false"]
    --workers           Amount of worker threads to use (default is total CPU cores count - 2)

  OCR Options - See https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
    --lang  Tesseract OCR LANG configuration [default="eng"]
    --oem   Tesseract OCR OEM configuration [default="1"]
    --psm   Tesseract OCR PSM configuration [default="1"]

  Examples
    Scan the "scanned-dir" directory and match all the files containing "system", "wiki" and "hello"
      $ ocr-search --words "system,wiki,hello" scanned-dir

    Scan the glob-matched files "*" and match all files (mass OCR text extraction)
      $ ocr-search --words MATCH_ALL *

    Skip .pdf and .webp files
      $ ocr-search --words "wiki,hello" --ignoreExt "pdf,webp" scanned-dir

    Extract only page 3 to 6 in all PDF files (1-indexed)
      $ ocr-search --words "wiki,hello" --pdfExtractFirst 3 --pdfExtractLast 6 scanned-dir

    Use a specific Tesseract OCR configuration
      $ ocr-search --words "wiki,hello" --lang fra --oem 1 --psm 3 scanned-dir

  https://github.com/rigwild/bulk-files-ocr-search
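The comma-separated flags above map naturally onto the programmatic options. The following sketch shows one way such values could be parsed; `parseWords` and `parseIgnoreExt` are hypothetical helpers written for this example, and trimming whitespace or lowercasing extensions are assumptions rather than documented behavior.

```typescript
// Hypothetical helper mirroring the documented `--words` flag
function parseWords(arg: string): string[] | undefined {
  // "MATCH_ALL" means no filtering: every file is matched
  if (arg === 'MATCH_ALL') return undefined
  return arg
    .split(',')
    .map(word => word.trim())
    .filter(word => word.length > 0)
}

// Hypothetical helper mirroring the documented `--ignoreExt` flag
function parseIgnoreExt(arg: string): Set<string> {
  return new Set(
    arg
      .split(',')
      .map(ext => ext.trim().toLowerCase())
      .filter(ext => ext.length > 0)
  )
}

console.log(parseWords('system,wiki,hello')) // [ 'system', 'wiki', 'hello' ]
console.log(parseIgnoreExt('pdf,webp').has('webp')) // true
```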

Another CLI is provided to easily remove all extracted PDF pages images.

$ ocr-search-clean-extracted --help

  🗑️ Find and remove all images from PDF pages extractions

  Usage
    $ ocr-search-clean-extracted <input_files>

  https://github.com/rigwild/bulk-files-ocr-search

Use with provided runner

git clone https://github.com/rigwild/bulk-files-ocr-search
cd bulk-files-ocr-search
pnpm install # or npm install -D
pnpm build

Put all your files/directories in the data directory. They can be in subfolders.

The progress will be printed to the console and saved in the progress.json file.

The list of files that match at least one of the provided words and their content will be saved to the matches.txt file.

node run.js

See run.js.

Use Programmatically

Install

pnpm i bulk-files-ocr-search

Directory scan

import path from 'path'
import { scanDir, TesseractConfig } from 'bulk-files-ocr-search'

// The list of options
export type ScanOptions = {
  /**
   * List of words to search (if one is matched, the file is matched)
   *
   * If not provided, every file will be matched (useful to do mass OCR and save the result)
   */
  words?: string[]

  /** Should the logs be printed to the console? (default = false) */
  shouldConsoleLog?: boolean

  /** Should the text content of matched files be printed to the console? (default = true) */
  shouldConsoleLogMatches?: boolean

  /**
   * If provided, the progress will be saved to a file
   *
   * When stopped, the process will start from where it stopped last time by looking there
   */
  progressFile?: string

  /** If provided, every file path and their text content that were matched are logged to this file */
  matchesLogFile?: string

  /** File extensions to ignore when looking for files */
  ignoreExt?: Set<string>

  /** Extract PDF files starting at this page, first page is 1 (1-indexed) (default = 1) */
  pdfExtractFirst?: number

  /** Extract PDF files until this page, last page if overflow (1-indexed) (default = last page of PDF file) */
  pdfExtractLast?: number

  /**
   * Amount of worker threads to use (default = your total CPU cores - 2)
   *
   * Note: Using all your available cores may slow down the process!
   */
  workerPoolSize?: number

  /**
   * Tesseract OCR config, defaults to `{ lang: 'eng', oem: 1, psm: 1 }`
   *
   * @see https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc
   */
  tesseractConfig?: TesseractConfig
}

const scannedDir = path.resolve(__dirname, 'data')
const words = ['hello', 'match this', '<<<<<']
const tesseractConfig: TesseractConfig = { lang: 'fra', oem: 1, psm: 1 }

console.time('scan')

await scanDir(scannedDir, {
  words,
  shouldConsoleLog: true,
  tesseractConfig
})

console.log('Scan finished!')
console.timeEnd('scan')

Perform OCR on a single file

import path from 'path'
import { ocr, TesseractConfig } from 'bulk-files-ocr-search'

const file = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.jpg')

// Tesseract configuration
const tesseractConfig: TesseractConfig = { lang: 'eng', oem: 1, psm: 1 }

// Should the string be normalized? (lowercase, accents removed, whitespace removed)
const shouldCleanStr: boolean | undefined = true

const text = await ocr(file, tesseractConfig, shouldCleanStr)
console.log(text)
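To make the normalization concrete, here is a minimal sketch of what "lowercase, accents removed, whitespace removed" could look like. The function `cleanStr` is a hypothetical stand-in, not the package's implementation; it assumes accents are stripped via Unicode NFD decomposition and that all whitespace is deleted rather than collapsed.

```typescript
// Hypothetical normalization helper (not the package's own code)
function cleanStr(text: string): string {
  return text
    .normalize('NFD') // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, '') // drop the combining diacritical marks
    .toLowerCase()
    .replace(/\s+/g, '') // remove all whitespace, including newlines
}

console.log(cleanStr('Café  Crème\nBrûlée')) // "cafecremebrulee"
```

If whitespace is removed from the OCR text like this, search words containing spaces would need the same normalization before comparison.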

PDF to images conversion

Convert PDF pages to PNG. Files are generated on the file system, 1 file per PDF page.

import path from 'path'
import { pdfToImages } from 'bulk-files-ocr-search'

const filePdf = path.resolve(__dirname, '..', 'test', '_testFiles', 'sample.pdf')

// Extract from page 1 to page 3 (1-indexed)
const res = await pdfToImages(filePdf, 1, 3)
console.log(res) // Paths to generated PNG files
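The page range rules (1-indexed, first page defaults to 1, last page falls back to the final page on overflow) can be sketched like this. `resolvePageRange` is a hypothetical helper for illustration; in particular, throwing on an empty range is an assumption, as the package's behavior in that case is not documented here.

```typescript
// Hypothetical helper: resolve a 1-indexed page range against a PDF's page count
function resolvePageRange(
  totalPages: number,
  first: number = 1,
  last: number = totalPages
): { first: number; last: number } {
  const clampedFirst = Math.max(1, first)
  // "last page if overflow": never go past the end of the document
  const clampedLast = Math.min(last, totalPages)
  // Assumption: an empty range is treated as an error
  if (clampedFirst > clampedLast) {
    throw new RangeError(`Empty page range: ${clampedFirst}-${clampedLast}`)
  }
  return { first: clampedFirst, last: clampedLast }
}

console.log(resolvePageRange(4, 3, 6)) // { first: 3, last: 4 }
console.log(resolvePageRange(10)) // { first: 1, last: 10 }
```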

License

The MIT License
