is-han
Unicode-aware detection of Han characters (hanzi, kanji, hanja)
npm i @scriptin/is-han
Usage
Note: You need to use Unicode-aware methods/operators in JavaScript, such as Array.from(str) and for/of loops, in order to process all Han characters. Some of them have code points which don't fit into 16 bits, and JavaScript strings are sequences of UTF-16 code units.
Examples of correct usage:
import { isHan } from "@scriptin/is-han";
for (const char of "漢字") {
  console.log(isHan(char));
}
// or
Array.from("漢字").filter(isHan)
Incorrect usage:
'𠀋'.split('').filter(isHan); // -> empty array
// because the code point of '𠀋' is U+2000B, which does not fit into 16 bits,
// so split('') breaks it into the two halves of a surrogate pair
console.log('𠀋'.split('')); // -> ['\uD840', '\uDC0B']
// Compare to:
console.log(Array.from('𠀋')); // -> ['𠀋']
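The surrogate pair shown above follows from the standard UTF-16 encoding arithmetic. The sketch below is only an illustration of that arithmetic; the helper name toSurrogatePair is hypothetical and is not part of this library.

```typescript
// Hypothetical helper (not part of @scriptin/is-han): split a code point
// above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint: number): [number, number] {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point is outside the supplementary planes");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const [highUnit, lowUnit] = toSurrogatePair(0x2000b);
console.log(highUnit.toString(16), lowUnit.toString(16)); // d840 dc0b
```

Neither half is a character on its own, which is why checking the halves separately cannot succeed.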
API
- isHan(char: string): boolean
  Checks if a character is a Han script character: hanzi, kanji, hanja.
- isHanExt(char: string): boolean
  Checks if a character is an "extended" Han script character. Useful when you're looking for obscure characters which contain Han script, e.g. symbols like 🈲, 🈯, 🈳, 🉐, 🉑, ㊄, ㋋, ㏾, ㍰, etc. "Extended" covers all Unicode characters which:
  - contain Han characters with additional wrappers, such as characters inside brackets, circles, etc.
  - contain multiple "compacted" Han characters, such as Japanese "square era names", etc.
  - contain parts of Han characters, such as CJK strokes
  as well as these two marks:
  - 々 IDEOGRAPHIC ITERATION MARK (see below)
  - 〆 IDEOGRAPHIC CLOSING MARK (see below)
- isIterationMark(char: string): boolean
  Checks if a character is 々 IDEOGRAPHIC ITERATION MARK. This mark means "repeat previous character". Can be useful if you want to replace this mark with the character it repeats/represents. See the Wiktionary article about 々.
- isClosingMark(char: string): boolean
  Checks if a character is 〆 IDEOGRAPHIC CLOSING MARK. This mark is used in place of another Han character. See the Wiktionary article about 〆.
Some constants are also exported in case you need to extend the functionality.
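As an illustration of the iteration mark use case, here is a minimal sketch that replaces each 々 with the character before it. To keep the example self-contained, isIterationMarkSketch is a local stand-in for the library's isIterationMark, and expandIterationMarks is a hypothetical helper; neither name is exported by this package. The replacement is naive: it always repeats exactly one preceding character.

```typescript
// Local stand-in for the library's isIterationMark (assumption: it matches U+3005).
const ITERATION_MARK = "\u3005"; // 々

function isIterationMarkSketch(char: string): boolean {
  return char === ITERATION_MARK;
}

// Hypothetical helper: replace every iteration mark with the previous character.
function expandIterationMarks(text: string): string {
  const result: string[] = [];
  // for/of iterates by code point, so surrogate pairs stay intact.
  for (const char of text) {
    const previous = result[result.length - 1];
    if (isIterationMarkSketch(char) && previous !== undefined) {
      result.push(previous);
    } else {
      result.push(char); // a leading mark has nothing to repeat, keep it
    }
  }
  return result.join("");
}

console.log(expandIterationMarks("人々")); // 人人
```

With the real package installed, the stand-in could presumably be swapped for the imported isIterationMark without other changes.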
FAQ
❓ Why do I have to use Array.from(str) and for/of?
Because JavaScript (and TypeScript) strings are sequences of UTF-16 code units, and some of the
more recent additions to Unicode don't fit into 16 bits. Such characters are represented
with surrogate pairs, i.e. two code units each.
Array.from() and for/of were added in ECMAScript 2015 and iterate over strings by code point,
so they are Unicode-aware.
This library cannot change this JavaScript feature, so you have to use these two approaches
and avoid APIs that work on individual code units, such as String.prototype.split(''),
String.prototype.charCodeAt(), index-based String.prototype.codePointAt(), etc.
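The difference between code units and code points can be seen with plain string methods. This is a small sketch that does not use the library; sampleText is just an example value.

```typescript
// One BMP character (漢, U+6F22) followed by one supplementary character (U+2000B).
const sampleText = "漢\u{2000B}";

console.log(sampleText.length);             // 3: UTF-16 code units
console.log(Array.from(sampleText).length); // 2: code points

// Index-based access can land in the middle of a surrogate pair.
console.log(sampleText.charCodeAt(1).toString(16));   // d840: high surrogate only
console.log(sampleText.codePointAt(1)?.toString(16)); // 2000b: index 1 starts the pair
console.log(sampleText.codePointAt(2)?.toString(16)); // dc0b: index 2 is the low surrogate
```

codePointAt() returns the full code point only when the index points at the start of a pair, so it is safe only if you already know where characters begin.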
❓ Can I detect language (Chinese/Japanese/Korean) for a given Han character?
No. Because of Han unification, most CJK characters are represented with shared code points. Each code point can be associated with multiple versions/variants of the same character, including regional, stylistic, and other variations. In order to determine a language, you need to know some context. For example, the language can be set as an attribute of a web page or a PDF document, or as a setting in an operating system.
This library doesn't provide methods to distinguish between languages.
❓ Can I distinguish between Traditional and Simplified Chinese characters?
In some cases, yes. In others, traditional and simplified variants share the same code points. See this article. For a sufficiently long text, you can determine whether it's traditional or simplified by looking for code points that are specific to one of the two.
This library doesn't provide methods to distinguish between traditional and simplified scripts.
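For illustration only, the code point approach mentioned above could be sketched as follows. The two character sets are tiny, hand-picked samples chosen for this example (a real check would need a much larger table from an external source), and guessChineseVariant is a hypothetical name, not something this library exports.

```typescript
// Assumption: small illustrative samples, nowhere near complete.
const SIMPLIFIED_ONLY = new Set(Array.from("这们汉语"));
const TRADITIONAL_ONLY = new Set(Array.from("這們漢語"));

type VariantGuess = "simplified" | "traditional" | "unknown";

// Hypothetical helper: count characters that appear in only one of the sets.
function guessChineseVariant(text: string): VariantGuess {
  let simplifiedHits = 0;
  let traditionalHits = 0;
  for (const char of text) {
    if (SIMPLIFIED_ONLY.has(char)) simplifiedHits++;
    else if (TRADITIONAL_ONLY.has(char)) traditionalHits++;
  }
  if (simplifiedHits === traditionalHits) return "unknown";
  return simplifiedHits > traditionalHits ? "simplified" : "traditional";
}

console.log(guessChineseVariant("这是汉语")); // simplified
console.log(guessChineseVariant("這是漢語")); // traditional
```

Characters shared by both scripts contribute nothing to the count, which is why short texts often come back as "unknown".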