is-han

Unicode-aware detection of Han characters (hanzi, kanji, hanja)

npm i @scriptin/is-han

Usage

Note: To process all Han characters, iterate over strings with Unicode-aware constructs such as Array.from(str) or for/of loops. JavaScript strings are UTF-16, and some Han characters have code points above U+FFFF, which do not fit into a single 16-bit code unit.

Examples of correct usage:

import { isHan } from "@scriptin/is-han";

for (const char of "漢字") {
  console.log(isHan(char));
}

// or

Array.from("漢字").filter(isHan)

Incorrect usage:

'𠀋'.split('').filter(isHan); // -> empty array
// because the code point of '𠀋' is U+2000B, which does not fit into 16 bits,
// so split('') breaks it into the two halves of a surrogate pair
console.log('𠀋'.split('')); // -> ['\uD840', '\uDC0B']

// Compare to:
console.log(Array.from('𠀋')); // -> ['𠀋']
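
The two code units shown above follow from the UTF-16 encoding rules. As a minimal sketch of that arithmetic (the helper name toSurrogatePair is made up for this illustration and is not part of this library):

```typescript
// Hypothetical helper for illustration only; not exported by @scriptin/is-han.
// Encodes a code point above U+FFFF as a UTF-16 surrogate pair.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("Only U+10000..U+10FFFF are encoded as surrogate pairs");
  }
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x2000b);
console.log(high.toString(16), low.toString(16)); // -> d840 dc0b
```

Code points up to U+FFFF need no such pair, which is why the problem only shows up with the less common Han characters.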

API

- isHan(char: string): boolean - Checks whether a character is a Han script character: hanzi, kanji, hanja.
- isHanExt(char: string): boolean - Checks whether a character is an "extended" Han script character. Useful when you're looking for obscure characters which contain Han script, e.g. symbols like 🈲, 🈯, 🈳, 🉐, 🉑, ㊄, ㋋, ㏾, ㍰, etc. "Extended" covers Unicode characters which:
  - contain Han characters with additional wrappers, such as characters inside brackets, circles, etc.
  - contain multiple "compacted" Han characters, such as Japanese "square era names", etc.
  - contain parts of Han characters, such as CJK strokes

  It also covers two special marks:
  - 々 IDEOGRAPHIC ITERATION MARK (see below)
  - 〆 IDEOGRAPHIC CLOSING MARK (see below)
- isIterationMark(char: string): boolean - Checks whether a character is 々 IDEOGRAPHIC ITERATION MARK. This mark means "repeat the previous character". Can be useful if you want to replace this mark with the character it repeats/represents. See Wiktionary article about 々
- isClosingMark(char: string): boolean - Checks whether a character is 〆 IDEOGRAPHIC CLOSING MARK. This mark is used in place of another Han character. See Wiktionary article about 〆

Some constants are also exported in case you need to extend the functionality.
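
As a rough sketch of the replacement use case mentioned for isIterationMark: the helper below, expandIterationMarks, is hypothetical and not part of this library. It takes the predicate as a parameter so the snippet runs on its own; in real code you would presumably pass the library's isIterationMark instead of the local stand-in. It only handles the simple case where each mark repeats the single preceding character.

```typescript
// Hypothetical helper; not part of @scriptin/is-han.
// `isMark` is expected to behave like the library's isIterationMark.
function expandIterationMarks(
  text: string,
  isMark: (char: string) => boolean,
): string {
  const result: string[] = [];
  // for/of iterates by code point, so characters above U+FFFF stay intact.
  for (const char of text) {
    const previous = result[result.length - 1];
    // A mark with nothing before it is kept as-is.
    result.push(isMark(char) && previous !== undefined ? previous : char);
  }
  return result.join("");
}

// Local stand-in (an assumption) so the snippet runs without the package.
const isIterationMarkStandIn = (char: string): boolean => char === "\u3005"; // 々

console.log(expandIterationMarks("時々", isIterationMarkStandIn)); // -> 時時
```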

FAQ

❓ Why do I have to use `Array.from(str)` and `for/of`?

Because JavaScript (and TypeScript) strings are UTF-16, and some of the more recent additions to Unicode do not fit into 16 bits. Such characters are represented by surrogate pairs, i.e. two 16-bit code units. Array.from() and for/of, both added in ECMAScript 2015, iterate over a string by code point rather than by code unit.

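This library cannot change how JavaScript strings work, so use these two constructs and avoid APIs that operate on UTF-16 code units, such as String.prototype.split(''), charAt(), charCodeAt(), or indexing with str[i]. String.prototype.codePointAt() also takes a code-unit index, so calling it at every index of a string yields lone surrogates as well.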
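
To make the difference concrete, here is a small sketch that uses only built-in string APIs (no library calls) and compares code-unit iteration with code-point iteration. The variable names are arbitrary.

```typescript
const sample = "漢\u{2000B}字"; // the middle character is 𠀋 (U+2000B)

// Index-based access walks UTF-16 code units and splits the surrogate pair.
const codeUnits: string[] = [];
for (let i = 0; i < sample.length; i++) {
  codeUnits.push(sample.charAt(i));
}
console.log(codeUnits.length); // -> 4

// Array.from (like for/of) walks code points instead.
const codePoints = Array.from(sample);
console.log(codePoints.length); // -> 3
console.log(
  codePoints.map((c) => c.codePointAt(0)!.toString(16)).join(", "),
); // -> 6f22, 2000b, 5b57
```

Only the elements of codePoints are whole characters that can be passed to a per-character check such as isHan.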

❓ Can I detect language (Chinese/Japanese/Korean) for a given Han character?

No. Because of Han unification, most CJK characters are represented with shared code points. Each code point can be associated with multiple versions/variants of the same character, including regional, stylistic, and other variations. To determine a language, you need some context. For example, the language can be set as an attribute of a web page or a PDF document, or as a setting in an operating system.

This library doesn't provide methods to distinguish between languages.

❓ Can I distinguish between Traditional and Simplified Chinese characters?

In some cases, yes. In others, traditional and simplified variants share the same code points. See this article. For a sufficiently long text, you can determine whether it's traditional or simplified by looking for specific code points.

This library doesn't provide methods to distinguish between traditional and simplified scripts.
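
If you need such a check yourself, the "look for specific code points" idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the function name guessChineseScript, the result type, and the two tiny character sets, which are hand-picked variant pairs. A usable detector would need a far larger table.

```typescript
// Tiny, hand-picked sample of variant pairs (汉/漢, 语/語, ...).
// These sets are illustrative assumptions, not data from @scriptin/is-han.
const SIMPLIFIED_FORMS = new Set(Array.from("汉语书门马鸟龙"));
const TRADITIONAL_FORMS = new Set(Array.from("漢語書門馬鳥龍"));

type ScriptGuess = "simplified" | "traditional" | "unknown";

// Hypothetical helper; not part of @scriptin/is-han.
function guessChineseScript(text: string): ScriptGuess {
  let simplified = 0;
  let traditional = 0;
  // for/of iterates by code point, matching the advice in this README.
  for (const char of text) {
    if (SIMPLIFIED_FORMS.has(char)) simplified++;
    else if (TRADITIONAL_FORMS.has(char)) traditional++;
  }
  if (simplified === traditional) return "unknown";
  return simplified > traditional ? "simplified" : "traditional";
}

console.log(guessChineseScript("我学汉语")); // -> simplified
console.log(guessChineseScript("我學漢語")); // -> traditional
console.log(guessChineseScript("你好")); // -> unknown
```

Characters whose traditional and simplified forms share a code point, like those in the last call, contribute nothing, which is why short texts often come back as "unknown".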
