charabia-js

0.2.0 • Public • Published

charabia-js is a WebAssembly binding for the charabia multilingual text tokenizer used by Meilisearch.

Supported scripts / languages

  • Latin
  • Latin - German
  • Greek
  • Cyrillic - Georgian
  • Chinese CMN 🇨🇳
  • Hebrew 🇮🇱
  • Arabic
  • Japanese 🇯🇵
  • Korean 🇰🇷
  • Thai 🇹🇭
  • Khmer 🇰🇭

More information about the supported scripts and languages can be found in the charabia project documentation.

Installation

npm install charabia-js

Usage

Segmentation

import { segment } from "charabia-js";

console.log(segment("Hello, world!")); // [ 'Hello', ', ', 'world', '!' ]
console.log(segment("你好,世界!")); // [ '你好', ',', '世界', '!' ]
console.log(segment("Hello, 世界!")); // [ 'Hello', ', ', '世界', '!' ]
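
The arrays returned by `segment` include separators such as `', '` and `'!'`. When only the words are needed, they can be filtered out afterwards. The sketch below is an illustration, not part of charabia-js: `wordsOnly` is a hypothetical helper, and its input is the first sample output above pasted in as a literal so that the snippet runs without the package installed.

```typescript
// Hypothetical helper (not part of charabia-js): keeps only the segments
// that contain at least one Unicode letter or digit.
const HAS_LETTER_OR_DIGIT = new RegExp("[\\p{L}\\p{N}]", "u");

function wordsOnly(segments: string[]): string[] {
  return segments.filter((s) => HAS_LETTER_OR_DIGIT.test(s));
}

// Sample output of segment("Hello, world!") copied from the example above.
const segments = ["Hello", ", ", "world", "!"];
console.log(wordsOnly(segments)); // [ 'Hello', 'world' ]
```

In real use, the literal array would presumably be replaced by a call such as `wordsOnly(segment(text))`.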

Tokenization

import { tokenize, TokenKind } from "charabia-js";
import assert from "node:assert";

const tokens = tokenize(
  "The quick (\"brown\") fox can't jump 32.3 feet, right? Brr, it's 29.3°F"
);

let token = tokens[0];
assert.equal(token.lemma, "the");
assert.equal(token.kind, TokenKind.Word);

token = tokens[1];
assert.equal(token.lemma, " ");
assert.equal(token.kind, TokenKind.SoftSeparator);

token = tokens[2];
assert.equal(token.lemma, "quick");
assert.equal(token.kind, TokenKind.Word);
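
The per-index assertions above extend naturally to a loop over the whole token list. The following sketch collects the lemmas of all tokens of one kind. It assumes only what the example shows, namely that each token exposes `lemma` and `kind`. `TokenLike`, `lemmasOfKind`, and the stand-in `Kind` object are hypothetical names used so that the snippet runs without the package. With charabia-js installed, one would presumably pass the result of `tokenize(...)` and `TokenKind.Word` instead.

```typescript
// Minimal structural type for the two token fields used in the example above.
interface TokenLike<K> {
  lemma: string;
  kind: K;
}

// Hypothetical helper: returns the lemmas of all tokens of the given kind.
function lemmasOfKind<K>(tokens: TokenLike<K>[], kind: K): string[] {
  return tokens.filter((t) => t.kind === kind).map((t) => t.lemma);
}

// Stand-in for TokenKind; the real enum's values are not assumed here.
const Kind = { Word: "word", SoftSeparator: "soft-separator" } as const;

// Stand-in data mirroring the first three tokens asserted above.
const sampleTokens: TokenLike<string>[] = [
  { lemma: "the", kind: Kind.Word },
  { lemma: " ", kind: Kind.SoftSeparator },
  { lemma: "quick", kind: Kind.Word },
];

console.log(lemmasOfKind(sampleTokens, Kind.Word)); // [ 'the', 'quick' ]
```

Passing the kind as a parameter keeps the helper independent of the package's enum, so the same function can pick out separators or any other kind.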

License

This project is licensed under the MIT License - see the LICENSE file for details.

Collaborators

  • codytseng