Count words, with Unicode! Uses Unicode 9.0.0 character classes for improved clarity of implementation.
const wordCount =console // 5
Specifically, we consider a word to be a run of one or more characters in these sets:
The tests make it pretty clear what it's doing:
Many more naive implementations match just \w, but that only gets you
(some) English, and even then things like possessives and, depending on how
you look at it, contractions get overcounted.
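
To make the difference concrete, here is a minimal sketch comparing a \w-based counter with one built on Unicode property escapes. The function names and the exact character classes (letters, marks, numbers, connector punctuation, apostrophes) are assumptions chosen for illustration, not necessarily the sets this module uses, and the sketch needs a runtime that supports `\p{…}` with the `u` flag.

```javascript
// Naive approach: a word is any run of \w (ASCII letters, digits, underscore).
function naiveCount (text) {
  const matches = text.match(/\w+/g)
  return matches ? matches.length : 0
}

// Sketch of a Unicode-aware approach. The class list is an assumption made
// for illustration: letters, combining marks, numbers, connector punctuation,
// and ASCII or typographic apostrophes.
const WORD = /[\p{L}\p{M}\p{N}\p{Pc}'\u2019]+/gu

function unicodeCount (text) {
  const matches = text.match(WORD)
  return matches ? matches.length : 0
}

console.log(naiveCount("the dog's bone"))   // 4: "dog" and "s" are counted separately
console.log(unicodeCount("the dog's bone")) // 3
console.log(naiveCount('Привет мир'))       // 0: \w does not match Cyrillic
console.log(unicodeCount('Привет мир'))     // 2
```

Keeping the apostrophe inside the word class is what stops possessives and contractions from being split in two.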
To the best of my knowledge this should successfully count words in any language that uses word separators. Counting words in languages without word separators is rather harder, and the heuristics are language-specific.
If you happen to give this a run of, say, Chinese characters, it will consider each group outside of punctuation to be a word, massively undercounting. So yeah, use a language-specific counter:
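
As one possible illustration of language-specific counting (not something this module does), the built-in `Intl.Segmenter` can split text into word-like segments using locale-aware rules. This is a sketch under assumptions: `segmenterCount` is a hypothetical helper name, it requires a runtime that ships `Intl.Segmenter`, and the counts it produces for languages without word separators depend on the segmentation data bundled with that runtime.

```javascript
// Count word-like segments using the built-in Intl.Segmenter.
// `locale` is a BCP 47 language tag such as 'zh' or 'ja'.
// `segmenterCount` is a hypothetical helper, not part of this module.
function segmenterCount (text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  let count = 0
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) count++
  }
  return count
}

console.log(segmenterCount('hello world', 'en'))
console.log(segmenterCount('你好世界', 'zh'))
```

The result for the Chinese string comes from dictionary-based segmentation, so it may differ between runtimes and versions.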
\w plus some ranges of CJK characters. CJK characters are each counted as one-word-per-character.
word-count, it adds ranges for Cyrillic.
\S while allowing double-quoted strings.
\s. Has a run-time dependency on coffee-script.