Unicode Default Word Boundary
Implements the Unicode UAX #29 §4.1 default word boundary specification, for finding word breaks in multilingual text.
Use this to split words in text! Using UAX #29 is a lot smarter than relying on
regular expression character classes: in JavaScript, classes like
\d only work on ASCII characters.
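The ASCII-only behaviour of these character classes can be seen with built-in JavaScript regular expressions alone. This is a minimal sketch that uses nothing from this package; the sample strings and variable names are illustrative.

```typescript
// \w only matches [A-Za-z0-9_], so Cyrillic text yields no matches at all.
const cyrillicWords = "чащах юга".match(/\w+/g);

// The right single quotation mark (U+2019) is not a \w character,
// so the contraction is cut in two.
const contraction = "can’t".match(/\w+/g);

// \d only matches the ASCII digits 0-9, not ARABIC-INDIC DIGIT THREE (U+0663).
const matchesArabicIndicDigit = /\d/.test("٣");

console.log(cyrillicWords);           // null
console.log(contraction);             // [ 'can', 't' ]
console.log(matchesArabicIndicDigit); // false
```

Compare the second result with the 'can’t' entry in the split() output shown in this README, which stays in one piece.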
Import the module and use the split() function:
const split = require(…).split;
console.log(split('The quick (“brown”) fox can’t jump 32.3 feet, right?'));
[ 'The', 'quick', '(', '“', 'brown', '”', ')', 'fox', 'can’t', 'jump', '32.3', 'feet', ',', 'right', '?' ]
But that's not all! Try it with non-English text, like Russian, Hebrew, or text written in Canadian Aboriginal syllabics:
[ 'В', 'чащах', 'юга', 'жил', 'бы', 'цитрус', '?', 'Да', ',', 'но', 'фальшивый', 'экземпляр', '!' ]
[ 'איך', 'בלש', 'תפס', 'גמד', 'רוצח', 'עז', 'קטנה', '?' ]
[ 'ᑕᐻ', 'ᒥᔪ ᑭᓯᑲᐤ', 'ᐊᓄᐦᐨ', '᙮' ]
...and many more!
More advanced use cases will want to use findSpans(), described below.
What doesn't work
Languages that do not have obvious word breaks, such as Chinese, Japanese, Thai, Lao, and Khmer. You'll need to use statistical or dictionary-based approaches to split words in these languages.
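As one illustration of a dictionary-based alternative, modern JavaScript runtimes ship Intl.Segmenter, which is not part of this package. In runtimes backed by ICU it typically falls back to dictionary or model-based segmentation for these scripts. The sketch below assumes a runtime that provides Intl.Segmenter (for example, a recent Node.js); segmentWords is a hypothetical helper name, and the exact segments depend on the runtime's data, so no output is shown.

```typescript
// Intl.Segmenter is a runtime built-in, NOT part of this package.
// The cast avoids depending on a particular TypeScript "lib" setting.
const SegmenterCtor = (Intl as any).Segmenter;

// Hypothetical helper: returns the word-like segments of `text`.
function segmentWords(text: string, locale: string): string[] {
  const segmenter = new SegmenterCtor(locale, { granularity: "word" });
  const parts: any[] = Array.from(segmenter.segment(text));
  return parts
    .filter((part) => part.isWordLike) // drop punctuation and whitespace
    .map((part) => part.segment as string);
}

const japaneseWords = segmentWords("今日は良い天気です", "ja");
console.log(japaneseWords);
```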
There are two exported functions:
split(text: string): string[]
split() splits the text at word boundaries, returning an array of all
"words" from the text that contain characters other than whitespace.
See above for examples.
findSpans(text: string): Iterable<BasicSpan>
findSpans() is a generator that yields successive basic spans from
the text. A basic span is a chunk of text that is guaranteed to
start at a word boundary and end at the next word boundary. In other
words, basic spans are indivisible in that there are no word
boundaries contained within a basic span.
A basic span has the following properties: start, end, length, and text.
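As a rough TypeScript sketch, the shape of a span might look like the interface below. The field meanings are inferred from the example output in this README (offsets appear to count UTF-16 code units, since the emoji span has length 2); makeSpan is a hypothetical helper for illustration, not something this package exports.

```typescript
// Sketch of the span shape, inferred from the example output.
interface BasicSpan {
  start: number;  // offset where the span begins
  end: number;    // offset just past the last code unit of the span
  length: number; // end - start, in UTF-16 code units
  text: string;   // the slice of the original text from start to end
}

// Hypothetical helper that builds a span from a pair of offsets.
function makeSpan(source: string, start: number, end: number): BasicSpan {
  return { start, end, length: end - start, text: source.slice(start, end) };
}

const emojiSpan = makeSpan("Hello, world🌎!", 12, 14);
console.log(emojiSpan); // { start: 12, end: 14, length: 2, text: '🌎' }
```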
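Note that, unlike split(),
findSpans() does yield spans that contain only whitespace.
For example, iterating over the spans of the text 'Hello, world🌎!' will yield spans with the following properties: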
start: 0   end: 5   length: 5  text: 'Hello'
start: 5   end: 6   length: 1  text: ','
start: 6   end: 7   length: 1  text: ' '
start: 7   end: 12  length: 5  text: 'world'
start: 12  end: 14  length: 2  text: '🌎'
start: 14  end: 15  length: 1  text: '!'
Note that findSpans() does not necessarily yield plain objects like
those shown above. The objects that findSpans() yields will adhere to the
BasicSpan interface; however, their concrete type may
differ from simple objects.
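To make that distinction concrete, here is a sketch of an object that satisfies a BasicSpan-like interface without being a plain object: it computes length and text on demand through getters. LazySpan, nonWhitespace, and the hard-coded boundaries (taken from the example output) are all hypothetical and say nothing about how this package is actually implemented.

```typescript
// Assumed shape of a span, based on the properties listed in this README.
interface BasicSpan {
  start: number;
  end: number;
  length: number;
  text: string;
}

// Hypothetical span class: text and length are derived lazily.
class LazySpan implements BasicSpan {
  readonly start: number;
  readonly end: number;
  private readonly source: string;

  constructor(source: string, start: number, end: number) {
    this.source = source;
    this.start = start;
    this.end = end;
  }
  get length(): number {
    return this.end - this.start;
  }
  get text(): string {
    return this.source.substring(this.start, this.end);
  }
}

// Keep only spans containing something other than whitespace,
// mirroring how split() is described (not its actual implementation).
function nonWhitespace(spans: BasicSpan[]): string[] {
  return spans.filter((span) => span.text.trim() !== "").map((span) => span.text);
}

const source = "Hello, world🌎!";
const boundaries = [0, 5, 6, 7, 12, 14, 15]; // offsets from the example output
const spans = boundaries
  .slice(0, -1)
  .map((start, i) => new LazySpan(source, start, boundaries[i + 1]));

console.log(nonWhitespace(spans)); // [ 'Hello', ',', 'world', '🌎', '!' ]
```

Code that reads only start, end, length, and text works the same way with either kind of object.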
Contributing and Maintaining
When maintaining this package, you might notice something strange.
index.ts depends on
./src/gen/WordBreakProperty.ts, but this file
does not exist! It is a generated file, created by reading Unicode
property data files, downloaded from Unicode's website.
These data files have been compressed and committed to this repository:
libexec/
├── WordBreakProperty-12.0.0.txt.gz
├── compile-word-break.js
└── emoji-data-12.0.0.txt.gz
compile-word-break.js actually creates ./src/gen/WordBreakProperty.ts.
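For orientation, the Unicode property data files use a simple line format: a code point or a range, a semicolon, the property value, and an optional comment introduced by #. The sketch below parses one such line. It is not the code in compile-word-break.js; PropertyRange and parsePropertyLine are hypothetical names, and decompressing the .gz files is left out.

```typescript
// Hypothetical record for one line of a Unicode property data file.
interface PropertyRange {
  start: number;    // first code point in the range
  end: number;      // last code point in the range (inclusive)
  property: string; // e.g. "ALetter"
}

// Parses a line such as "0041..005A    ; ALetter # L& [26] LATIN ...".
// Returns null for blank lines and comment-only lines.
function parsePropertyLine(line: string): PropertyRange | null {
  const content = line.split("#")[0].trim(); // drop the trailing comment
  if (content === "") return null;

  const fields = content.split(";").map((field) => field.trim());
  if (fields.length < 2) return null;

  const codePoints = fields[0].split("..");
  const first = codePoints[0];
  const last = codePoints.length > 1 ? codePoints[1] : first; // single code point

  return {
    start: parseInt(first, 16),
    end: parseInt(last, 16),
    property: fields[1],
  };
}

console.log(parsePropertyLine("0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"));
// { start: 65, end: 90, property: 'ALetter' }
```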
How to generate
When you have just cloned the repository, this file will be generated
when you run
If you want to regenerate it afterwards, you can run the build script:
npm run build
TypeScript implementation © 2019 Eddie Antonio Santos. MIT Licensed.
The algorithm comes from UAX #29: Unicode Text Segmentation, an integral part of the Unicode Standard, version 12.0.