# UTF32Char
A minimalist, dependency-free implementation of immutable 4-byte-width (UTF-32) characters for easy manipulation of characters and glyphs, including simple emoji.
Also includes an immutable unsigned 4-byte-width integer data type, `UInt32`, and easy conversions to and from `UTF32Char`.
## Motivation
If you want to allow a single "character" of input, but consider emoji to be single characters, you'll have some difficulty using basic JavaScript `string`s, which use UTF-16 encoding by default. While ASCII characters all have length 1...

```ts
console.log("?".length) // 1
```
...many emoji have length greater than 1...

```ts
console.log("💩".length) // 2
```
...and with modifiers and accents, that number can get much larger:

```ts
console.log("!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞".length) // 17
```
As all Unicode characters can be expressed in a fixed-length UTF-32 encoding, this package mitigates the problem, though it doesn't completely solve it: it accepts any group of one to four bytes as a single "UTF-32 character", whether or not those bytes are rendered as a single grapheme. See this package if you want to split text into graphemes, regardless of the number of bytes required to render each grapheme.
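To see the mismatch concretely, here's a quick look (using only built-in JavaScript string methods) at how one visible emoji spans two UTF-16 code units but only a single code point:

```typescript
// "💩" is a single Unicode code point (U+1F4A9), but JavaScript
// strings count UTF-16 code units, so its .length is 2
const poo: string = "💩"

console.log(poo.length)                       // 2 (UTF-16 code units)
console.log([...poo].length)                  // 1 (code points)
console.log(poo.codePointAt(0)!.toString(16)) // "1f4a9"
```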
If you just want a simple, dependency-free API to deal with 4-byte strings, then this package is for you.
This package provides an implementation of 4-byte, UTF-32 "characters" (`UTF32Char`) and corresponding unsigned integers (`UInt32`). The unsigned integers have the added benefit of being usable as safe array indices.
## Installation
Install from npm with:

```shell
$ npm i utf32char
```
Or try it online at npm.runkit.com
## Use
Create new `UTF32Char`s and `UInt32`s like so:

```ts
import { UInt32, UTF32Char } from 'utf32char'

const index: UInt32 = UInt32.fromNumber(42)
const char: UTF32Char = UTF32Char.fromString("😮")
```

You can convert to basic JavaScript types:

```ts
console.log(index.toNumber()) // 42
console.log(char.toString())  // 😮
```
Easily convert between characters and integers:

```ts
const indexAsChar: UTF32Char = index.toUTF32Char()
const charAsUInt: UInt32 = char.toUInt32()

console.log(indexAsChar.toString()) // *
console.log(charAsUInt.toNumber())  // 3627933230
```
...or skip the middleman and convert integers directly to strings, or strings directly to integers:

```ts
console.log(index.toString()) // *
console.log(char.toNumber())  // 3627933230
```
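The integer values above come from packing a character's UTF-16 code units into a single 4-byte value, high code unit first. This can be sketched with built-in string methods (a rough illustration consistent with the values shown above, not the package's actual source):

```typescript
// Pack one or two UTF-16 code units into an unsigned 32-bit value,
// high code unit first -- a sketch, not the package's implementation
function packUTF16 (s: string): number {
  if (s.length === 1) return s.charCodeAt(0)
  return s.charCodeAt(0) * 0x10000 + s.charCodeAt(1)
}

console.log(packUTF16("*"))  // 42
console.log(packUTF16("😮")) // 3627933230
```

Note that `*` is ASCII 42, which is why `UInt32.fromNumber(42)` converts to the character `*`.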
## Edge Cases
`UInt32` and `UTF32Char` ranges are enforced upon object creation, so you never have to worry about bounds checking:

```ts
UInt32.fromNumber(-1)         // range error: UInt32 has MIN_VALUE 0, received -1
UInt32.fromNumber(4294967296) // range error: UInt32 has MAX_VALUE 4294967295 (2^32 - 1), received 4294967296
UTF32Char.fromString("")      // invalid argument: cannot convert empty string to UTF32Char
UTF32Char.fromString("abc")   // invalid argument: lossy compression of length-3+ string to UTF32Char
```
Because the implementation accepts any 4-byte `string` as a "character", the following are allowed:

```ts
const char: UTF32Char = UTF32Char.fromString("hi")
const num: number = char.toNumber()

console.log(num)             // 6815849
console.log(char.toString()) // hi
console.log(UTF32Char.fromNumber(num).toString()) // hi
```
Floating-point values are truncated to integers when creating `UInt32`s, like in many other languages:

```ts
const pi: UInt32 = UInt32.fromNumber(3.14159)
const squeeze: UInt32 = UInt32.fromNumber(4294967295.999)

console.log(pi.toNumber())      // 3
console.log(squeeze.toNumber()) // 4294967295
```
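The truncation semantics can be mimicked with the built-in `Math.trunc`, which drops the fractional part without rounding (a sketch of the behavior, not the package's code):

```typescript
// Math.trunc drops the fractional part without rounding
console.log(Math.trunc(3.14159))        // 3
console.log(Math.trunc(4294967295.999)) // 4294967295
console.log(Math.trunc(9.99))           // 9, not 10
```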
Compound emoji -- created using variation selectors and joiners -- are often larger than 4 bytes wide and will therefore throw errors when used to construct `UTF32Char`s:

```ts
UTF32Char.fromString("👩‍❤️‍💋‍👩") // invalid argument: lossy compression of length-3+ string to UTF32Char

console.log("👩‍❤️‍💋‍👩".length) // 11
```
...but many basic emoji are fine:
```ts
// emojiTest.ts
import { UTF32Char } from 'utf32char'

const emoji: string[] = [
  "😂", "😭", "🥺", "🤣", "❤️", "✨", "😍",
  "🙏", "😊", "🥰", "👍", "💕", "🤔", "👩‍❤️‍💋‍👩"
]

for (const each of emoji) {
  try {
    UTF32Char.fromString(each)
    console.log(`✅: ${each}`)
  } catch (error) {
    console.log(`❌: ${each}`)
  }
}
```

```shell
$ npx ts-node emojiTest.ts
✅: 😂
✅: 😭
✅: 🥺
✅: 🤣
✅: ❤️
✅: ✨
✅: 😍
✅: 🙏
✅: 😊
✅: 🥰
✅: 👍
✅: 💕
✅: 🤔
❌: 👩‍❤️‍💋‍👩
```
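A quick way to predict which emoji will work: a 4-byte character holds at most two UTF-16 code units, so a string's built-in `.length` tells you whether it can fit (a hypothetical helper for illustration, not part of the package API):

```typescript
// A 4-byte "character" holds at most two UTF-16 code units
const fitsInFourBytes = (s: string): boolean => s.length >= 1 && s.length <= 2

console.log(fitsInFourBytes("❤️"))        // true  (2 code units: U+2764 U+FE0F)
console.log(fitsInFourBytes("👩‍❤️‍💋‍👩")) // false (compound emoji)
console.log(fitsInFourBytes(""))          // false (empty string)
```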
## Arithmetic, Comparison, and Immutability
`UInt32` provides basic arithmetic and comparison operators:

```ts
const a: UInt32 = UInt32.fromNumber(42)
const b: UInt32 = UInt32.fromNumber(19)

const increased = a.plus(b)
console.log(increased.toNumber()) // 61

const comp = a.gt(b)
console.log(comp) // true
```
Verbose versions and shortened aliases of comparison functions are available:

- `lt` and `lessThan`
- `gt` and `greaterThan`
- `le` and `lessThanOrEqualTo`
- `ge` and `greaterThanOrEqualTo`
Since `UInt32`s are immutable, `plus()` and `minus()` return new objects, which are of course bounds-checked upon creation:

```ts
UInt32.fromNumber(3).minus(UInt32.fromNumber(42)) // range error: UInt32 has MIN_VALUE 0, received -39
```
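The immutability pattern is straightforward to sketch in plain TypeScript: every operation validates its result and returns a fresh object, leaving the receiver untouched (an illustrative sketch, not the package's source):

```typescript
// Sketch of an immutable, bounds-checked unsigned 32-bit integer
class U32 {
  private constructor (readonly value: number) {}

  static from (n: number): U32 {
    const t = Math.trunc(n) // truncate floats, as described above
    if (t < 0 || t > 0xFFFFFFFF) throw new RangeError(`out of range: ${n}`)
    return new U32(t)
  }

  // arithmetic returns new, re-validated objects
  plus (other: U32): U32 { return U32.from(this.value + other.value) }
  minus (other: U32): U32 { return U32.from(this.value - other.value) }
}

const a = U32.from(3)
console.log(a.plus(U32.from(4)).value) // 7
console.log(a.value)                   // still 3 -- plus() returned a new object
```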
## Contact
Feel free to open an issue with any bug fixes, or a PR with any performance improvements.
Support me @ Ko-fi!
Check out my DEV.to blog!