@nahanil/zh-tokenizer


Tokenizes Chinese text into words using CC-CEDICT.

Extended from https://github.com/takumif/cedict-lookup

Installation

Use npm to install:

npm install @nahanil/zh-tokenizer --save

Usage

Make sure to provide the CC-CEDICT data file. By default the tokenizer expects simplified-character text; pass 'traditional' as the second argument to tokenize traditional-character text (in that mode it will not match simplified characters).

const tokenizer = require('@nahanil/zh-tokenizer')('./cedict.txt')
console.log(tokenizer.tokenize('我是中国人。'))

const tokenizer = require('@nahanil/zh-tokenizer')('./cedict.txt', 'traditional')
console.log(tokenizer.tokenize('我是中國人。'))

Output:

[ { traditional: '我',
    simplified: '我',
    pinyin: 'wo3',
    pinyinPretty: 'wǒ',
    english: 'I/me/my' },
  { traditional: '是',
    simplified: '是',
    pinyin: 'shi4',
    pinyinPretty: 'shì',
    english: 'is/are/am/yes/to be\nvariant of 是[shi4]/(used in given names)' },
  { traditional: '中國人',
    simplified: '中国人',
    pinyin: 'zhong1 guo2 ren2',
    pinyinPretty: 'zhōng guó rén',
    english: 'Chinese person' },
  { traditional: '。',
    simplified: '。',
    pinyin: null,
    pinyinPretty: null,
    english: null } ]
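Each token carries both scripts, numbered and pretty pinyin, and the CC-CEDICT gloss; characters with no dictionary match (such as punctuation) come back with null fields. As a sketch of consuming that shape, the helper below (a hypothetical function, not part of the package) renders a pinyin gloss line from a token array, using the sample output above so it runs without the package or dictionary file:

```javascript
// Build a readable pinyin line from tokenizer output.
// Falls back to the character itself when there is no
// dictionary match (pinyinPretty === null, e.g. punctuation).
function glossLine(tokens) {
  return tokens
    .map((t) => t.pinyinPretty ?? t.simplified)
    .join(' ');
}

// Sample token array, copied from the output above.
const tokens = [
  { traditional: '我', simplified: '我', pinyin: 'wo3', pinyinPretty: 'wǒ', english: 'I/me/my' },
  { traditional: '是', simplified: '是', pinyin: 'shi4', pinyinPretty: 'shì', english: 'is/are/am/yes/to be' },
  { traditional: '中國人', simplified: '中国人', pinyin: 'zhong1 guo2 ren2', pinyinPretty: 'zhōng guó rén', english: 'Chinese person' },
  { traditional: '。', simplified: '。', pinyin: null, pinyinPretty: null, english: null },
];

console.log(glossLine(tokens)); // → wǒ shì zhōng guó rén 。
```

The same pattern works for the english field, e.g. to build word-by-word annotations for learners.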

Version

0.1.3

License

MIT