multibyte

multibyte provides common string functions that respect multibyte Unicode characters.

npm install multibyte

The problem and the solution

JavaScript strings are encoded as UTF-16, yet most string methods treat a string as an array of 16-bit code units rather than code points. Unicode characters outside the Basic Multilingual Plane (like newer emoji) need two code units, called a surrogate pair, and get split in half in many situations.
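To make the problem concrete, the following sketch inspects how the rocket emoji (U+1F680) is stored. It uses only built-in string methods; the variable names are illustrative.

```typescript
// A character outside the Basic Multilingual Plane occupies two UTF-16 code units.
const rocket = '🚀'; // U+1F680

const codeUnits: number = rocket.length; // counts UTF-16 code units, not characters
const high: string = rocket.charCodeAt(0).toString(16); // first half of the surrogate pair
const low: string = rocket.charCodeAt(1).toString(16); // second half of the surrogate pair
const codePoint: string = rocket.codePointAt(0)!.toString(16); // the full code point

console.log(codeUnits, high, low, codePoint); // 2 d83d de80 1f680
```

Any method that indexes by code unit can land between the two halves, which is where the lone "\ud83d" and "\ude80" values come from.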

If you display Unicode text, for example text from a UTF-8 source, these multibyte functions help: they take advantage of the fact that Array.from() iterates a string by code point, so surrogate pairs stay intact.
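As a rough illustration of the Array.from() approach, here is a minimal sketch of code-point-aware helpers. The names `toCodePoints`, `lengthSketch`, `charAtSketch`, and `sliceSketch` are hypothetical and are not the package's actual source; they only show the technique.

```typescript
// Hypothetical helpers for illustration only, not the multibyte implementation.

// Array.from() walks the string iterator, which yields whole code points.
function toCodePoints(text: string): string[] {
  return Array.from(text);
}

function lengthSketch(text: string): number {
  return toCodePoints(text).length;
}

function charAtSketch(text: string, index: number): string {
  // Mirror String.prototype.charAt(), which returns '' when out of range.
  return toCodePoints(text)[index] ?? '';
}

function sliceSketch(text: string, start?: number, end?: number): string {
  return toCodePoints(text).slice(start, end).join('');
}

console.log(lengthSketch('a🚀c')); // 3
console.log(charAtSketch('a🚀c', 1)); // "🚀"
console.log(sliceSketch('a🚀cdef', 2, 3)); // "c"
```

Note that Array.from() splits by code point, not by grapheme cluster, so sequences built from several code points (such as flags or emoji joined with a zero-width joiner) would still be separated by this sketch.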

import {
  charAt,
  codePointAt,
  length,
  slice,
  split,
  truncateBytes,
} from 'multibyte';

// JavaScript String.prototype.charAt() is not Unicode aware
'a🚀c'.charAt(1); // "\ud83d" ❌
charAt('a🚀c', 1); // "🚀" ✅

// JavaScript String.prototype.codePointAt() does not skip a leading BOM (U+FEFF)
'\uFEFFa🚀c'.codePointAt(1); // 97 ❌
codePointAt('\uFEFFa🚀c', 1); // 128640 ✅

// JavaScript returns length in UTF-16 code units, not Unicode characters
'a🚀c'.length; // 4 ❌
length('a🚀c'); // 3 ✅

// JavaScript slices along UTF-16 code units, not Unicode characters
'a🚀cdef'.slice(2, 3); // "\ude80" ❌
slice('a🚀cdef', 2, 3); // "c" ✅

// JavaScript splits along UTF-16 code units, not Unicode characters
'a🚀c'.split(''); // ["a", "\ud83d", "\ude80", "c"] ❌
split('a🚀c', ''); // ["a", "🚀", "c"] ✅

// JavaScript string indexes count UTF-16 code units, not UTF-8 bytes
'a🚀cdef'.slice(0, 2); // "a\ud83d" ❌
truncateBytes('a🚀cdef', 2); // "a" ✅
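The truncateBytes() result above follows from UTF-8 sizes: "a" takes 1 byte and "🚀" takes 4, so only "a" fits in 2 bytes. The sketch below shows one way such a truncation could work, assuming the goal is to keep whole characters within a UTF-8 byte budget. `utf8ByteLength` and `truncateBytesSketch` are hypothetical names, not the package's implementation, and this version does not handle a leading BOM.

```typescript
// Hypothetical sketch, not the multibyte source.

// Number of bytes UTF-8 needs to encode a single code point.
function utf8ByteLength(codePoint: number): number {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Keep whole code points until the next one would exceed maxBytes.
function truncateBytesSketch(text: string, maxBytes: number): string {
  let used = 0;
  let result = '';
  for (const char of Array.from(text)) {
    const size = utf8ByteLength(char.codePointAt(0)!);
    if (used + size > maxBytes) break;
    used += size;
    result += char;
  }
  return result;
}

console.log(truncateBytesSketch('a🚀cdef', 2)); // "a"
console.log(truncateBytesSketch('a🚀cdef', 5)); // "a🚀"
```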

BOM (Byte order mark) - U+FEFF

Under the hood, all these functions strip a leading BOM if present.
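A minimal sketch of what stripping a leading BOM might look like, and how it shifts indexes. `stripBomSketch` and `codePointAtSketch` are hypothetical names used for illustration; the package's internals may differ.

```typescript
// Hypothetical sketch, not the multibyte source.

// Remove a single leading U+FEFF if the string starts with one.
function stripBomSketch(text: string): string {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Code point lookup that ignores a leading BOM and indexes by code point.
function codePointAtSketch(text: string, index: number): number | undefined {
  return Array.from(stripBomSketch(text))[index]?.codePointAt(0);
}

console.log(stripBomSketch('\uFEFFabc')); // "abc"
console.log(codePointAtSketch('\uFEFFa🚀c', 1)); // 128640
```

With the BOM removed, index 1 refers to "🚀" rather than "a", which matches the codePointAt() example above.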

Version

1.0.0-beta.1

License

ISC

Collaborators

  • kensnyder