trigram-utils
Trigram language statistics utility functions, in their own repository to make
sure trigrams (trigram info for the Universal Declaration of Human Rights) and
franc (language detection) use the same cleaning and classification methods.
Install
This package is ESM only: Node 12+ is needed to use it and it must be imported
instead of required.
npm:
npm install trigram-utils
Use
import {clean, trigrams, asDictionary, asTuples, tuplesAsDictionary} from 'trigram-utils'
clean(' t@rololol ') // => 't rololol'
trigrams(' t@rololol ')
// => [' t ', 't r', ' ro', 'rol', 'olo', 'lol', 'olo', 'lol', 'ol ']
asDictionary(' t@rololol ')
// => {'ol ': 1, lol: 2, olo: 2, rol: 1, ' ro': 1, 't r': 1, ' t ': 1}
var tuples = asTuples(' t@rololol ')
// => [
// ['ol ', 1],
// ['rol', 1],
// [' ro', 1],
// ['t r', 1],
// [' t ', 1],
// ['lol', 2],
// ['olo', 2]
// ]
tuplesAsDictionary(tuples)
// => {olo: 2, lol: 2, ' t ': 1, 't r': 1, ' ro': 1, rol: 1, 'ol ': 1}
API
This package exports the following identifiers: clean, trigrams, asDictionary,
asTuples, tuplesAsDictionary.
There is no default export.
clean(value)
Clean a given string: strips punctuation, symbols, and numbers that are not useful for language detection, then collapses white space, trims, and lowercases.
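To make the cleaning steps concrete, here is a minimal sketch of a function that behaves like the description above. The name cleanSketch is hypothetical, and the character range it strips (ASCII `!` through `@`) is an assumption chosen to reproduce the example in Use; the package's real rules may differ.

```javascript
// Hypothetical sketch; not the package's actual source.
function cleanSketch(value) {
  if (value === null || value === undefined) return ''

  return (
    String(value)
      // Assumption: treat ASCII `!` through `@` (punctuation, symbols, digits) as noise.
      .replace(/[\u0021-\u0040]+/g, ' ')
      // Collapse runs of white space into a single space.
      .replace(/\s+/g, ' ')
      .trim()
      .toLowerCase()
  )
}

console.log(cleanSketch(' t@rololol ')) // => 't rololol'
console.log(cleanSketch('Hello, World! 123')) // => 'hello world'
```

Replacing noise with a space (rather than deleting it) keeps word boundaries intact, which matters once the text is split into trigrams.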
trigrams(value)
Get clean, padded trigrams (see n-gram).
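As an illustration of what "padded" means here, the sketch below pads an already-cleaned string with one space on each side and slides a three-character window over it. The function name paddedTrigramsSketch is hypothetical, and the single-space padding is inferred from the output shown in Use rather than taken from the package source.

```javascript
// Hypothetical sketch; expects input that has already been cleaned.
function paddedTrigramsSketch(cleaned) {
  // Assumption: one space of padding on each side, as the README output suggests.
  const padded = ' ' + cleaned + ' '
  const result = []

  // Slide a window of three characters across the padded string.
  for (let index = 0; index + 3 <= padded.length; index++) {
    result.push(padded.slice(index, index + 3))
  }

  return result
}

console.log(paddedTrigramsSketch('t rololol'))
// => [' t ', 't r', ' ro', 'rol', 'olo', 'lol', 'olo', 'lol', 'ol ']
```

The padding lets the first and last characters of the text appear in trigrams that mark a word boundary, such as ' t ' and 'ol '.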
asDictionary(value)
Get clean trigrams as a dictionary (Object<string, number>): keys are
trigrams, values are occurrence counts.
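The counting step can be pictured as a simple tally over a list of trigrams. The helper below, countTrigramsSketch, is a hypothetical name used only for illustration; it takes the trigram list as input and is not the package's implementation.

```javascript
// Hypothetical sketch: tally how often each trigram occurs in a list.
function countTrigramsSketch(list) {
  const counts = {}

  for (const trigram of list) {
    counts[trigram] = (counts[trigram] || 0) + 1
  }

  return counts
}

const list = [' t ', 't r', ' ro', 'rol', 'olo', 'lol', 'olo', 'lol', 'ol ']
const counts = countTrigramsSketch(list)

console.log(counts.lol) // => 2
console.log(counts['ol ']) // => 1
```

The key order of the resulting object depends on insertion order, so it may differ from the order shown in Use.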
asTuples(value)
Get clean trigrams with occurrence counts as a list of tuples
([string, number][]): in each tuple, the first index (0) is the trigram and the
second (1) is the occurrence count.
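One way to work with such tuples is to pick the most frequent trigrams. The sketch below uses the tuple list from Use as literal data and a hypothetical helper, mostFrequentSketch; it sorts explicitly rather than relying on the order in which the tuples arrive, since that order is not stated in this document.

```javascript
// Tuple list copied from the example in Use.
const tuples = [
  ['ol ', 1],
  ['rol', 1],
  [' ro', 1],
  ['t r', 1],
  [' t ', 1],
  ['lol', 2],
  ['olo', 2]
]

// Hypothetical helper: return the `limit` tuples with the highest counts.
function mostFrequentSketch(list, limit) {
  // Copy before sorting so the caller's array is left untouched.
  return [...list].sort((a, b) => b[1] - a[1]).slice(0, limit)
}

const top = mostFrequentSketch(tuples, 2)
console.log(top.map((tuple) => tuple[1])) // => [2, 2]
```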
tuplesAsDictionary(tuples)
Transform an Array of trigram–occurrence tuples (as returned by asTuples()) to
a dictionary (as returned by asDictionary()).
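Conceptually, this conversion is close to what the built-in Object.fromEntries does with a list of [key, value] pairs. The sketch below shows that equivalence with a hypothetical function, tuplesToDictionarySketch; it is not the package's implementation, and the key order of its result may differ from the output shown in Use.

```javascript
// Hypothetical sketch: build a dictionary from [trigram, count] tuples.
function tuplesToDictionarySketch(list) {
  return Object.fromEntries(list)
}

const dictionary = tuplesToDictionarySketch([
  ['ol ', 1],
  ['rol', 1],
  ['lol', 2],
  ['olo', 2]
])

console.log(dictionary.olo) // => 2
console.log(dictionary['ol ']) // => 1
```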