For example, in tech, you might see 'node js' or 'NodeJS' or 'node.js' and want them understood as the same term. That’s lemmatization.
npm install "@clipperhouse/jargon@latest"
Then create a file, preferably TypeScript.
// demo.ts
import jargon from '@clipperhouse/jargon';
import stackexchange from '@clipperhouse/jargon/stackexchange'; // a dictionary

const text = 'I ❤️ Ruby on Rails and vue';
const lemmas = jargon.Lemmatize(text, stackexchange);
console.log(lemmas.toString());
// I ❤️ ruby-on-rails and vue.js
// demo.js
const jargon = require('@clipperhouse/jargon');
const stackexchange = require('@clipperhouse/jargon/stackexchange');

const text = 'I ❤️ Ruby on Rails and vue';
const lemmas = jargon.Lemmatize(text, stackexchange);
console.log(lemmas.toString());
// I ❤️ ruby-on-rails and vue.js
What’s it doing?
jargon tokenizes the incoming text, identifying punctuation and spaces. It understands tech-ish terms as single words, such as asp.net and TCP/IP, as well as #hashtags and @handles (other tokenizers would split each of these into two words).
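To make the tokenizing idea concrete, here is a minimal sketch of a tokenizer that keeps dotted, slashed, and prefixed terms whole. It is not jargon's actual tokenizer: the `tokenize` function and its regular expression are hypothetical, ASCII-oriented, and much simpler than a real implementation would need to be.

```typescript
// Hypothetical sketch, NOT jargon's real tokenizer.
// Three alternatives, tried in order:
//   1. a word: optional leading # or @, then alphanumeric runs that may be
//      joined by a single dot or slash (asp.net, TCP/IP, #hashtags, @handles)
//   2. a run of whitespace
//   3. any other single character (punctuation and the like)
const tokenPattern = /[#@]?[A-Za-z0-9]+(?:[./][A-Za-z0-9]+)*|\s+|[^\sA-Za-z0-9]/g;

function tokenize(text: string): string[] {
  // match() with the g flag returns every match, or null when there are none
  return text.match(tokenPattern) ?? [];
}

// Show only the non-space tokens
const words = tokenize('asp.net over TCP/IP #hashtags').filter(t => t.trim() !== '');
console.log(words.join(' | '));
// asp.net | over | TCP/IP | #hashtags
```

A trailing dot or comma is not absorbed into the word, because the pattern only accepts a dot or slash when more letters or digits follow it.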
Those tokens go to the lemmatizer, along with a dictionary. The lemmatizer passes over the tokens and asks the dictionary whether it recognizes them. It handles multi-token phrases like 'Ruby on Rails', converting them to a single token (here, ruby-on-rails).
It is insensitive to spaces, hyphens, dots, slashes and case -- so it handles a lot of variation that would be difficult to get right with simple search-and-replace or regex.
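The pass described above can be sketched roughly as follows. This is an illustration under assumptions, not jargon's implementation: `normalizeKey`, `lemmatizeTokens`, the `canonicalTerms` table, and the `maxRun` limit are all hypothetical names and values chosen for the example.

```typescript
// Hypothetical sketch, NOT jargon's real lemmatizer.
// Made-up table from normalized key to canonical term.
const canonicalTerms = new Map<string, string>([
  ['rubyonrails', 'ruby-on-rails'],
  ['nodejs', 'node.js'],
]);

// Drop spaces, hyphens, dots and slashes, then lowercase, so that
// 'Node JS', 'node.js' and 'NodeJS' all produce the key 'nodejs'.
function normalizeKey(s: string): string {
  return s.replace(/[\s\-./]/g, '').toLowerCase();
}

function lemmatizeTokens(tokens: string[], maxRun: number = 5): string[] {
  const isSeparator = (t: string): boolean => normalizeKey(t) === '';
  const out: string[] = [];
  let i = 0;
  while (i < tokens.length) {
    let consumed = 0;
    // A phrase never starts on a space or punctuation-only token
    if (!isSeparator(tokens[i])) {
      // Try the longest run of tokens first, then shorter runs
      for (let n = Math.min(maxRun, tokens.length - i); n >= 1; n--) {
        if (isSeparator(tokens[i + n - 1])) continue; // nor does it end on one
        const hit = canonicalTerms.get(normalizeKey(tokens.slice(i, i + n).join('')));
        if (hit !== undefined) {
          out.push(hit);
          consumed = n;
          break;
        }
      }
    }
    if (consumed === 0) {
      out.push(tokens[i]); // unrecognized: pass the token through unchanged
      consumed = 1;
    }
    i += consumed;
  }
  return out;
}

// split() with a capture group keeps the whitespace as tokens
const input = 'Ruby on Rails and NodeJS'.split(/(\s+)/);
console.log(lemmatizeTokens(input).join(''));
// ruby-on-rails and node.js
```

Trying the longest run first is what lets 'Ruby on Rails' win over any shorter match that might start at 'Ruby'.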
These rules are defined in a Dictionary. In the above examples, stackexchange is the dictionary, and it knows about react vs react.js. It also understands synonyms, such as ecmascript.
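As a rough picture of what a dictionary holds, the sketch below builds one from a list of canonical tags plus a synonym table. The `SketchDictionary` interface, `buildDictionary`, and every entry in it are assumptions made for illustration; jargon's real Dictionary type and the contents of the stackexchange dictionary may look quite different.

```typescript
// Hypothetical sketch, NOT jargon's real Dictionary type.
interface SketchDictionary {
  // Returns the canonical term, or undefined when the term is unknown
  lookup(term: string): string | undefined;
}

function buildDictionary(
  tags: string[],
  synonyms: Array<[string, string]>, // [variant, canonical] pairs
): SketchDictionary {
  // Same loosening as described above: ignore spaces, hyphens, dots, slashes, case
  const toKey = (s: string): string => s.replace(/[\s\-./]/g, '').toLowerCase();
  const byKey = new Map<string, string>();
  for (const tag of tags) {
    byKey.set(toKey(tag), tag); // each tag is its own canonical form
  }
  for (const [variant, canonical] of synonyms) {
    byKey.set(toKey(variant), canonical); // a synonym points at another term
  }
  return { lookup: (term: string) => byKey.get(toKey(term)) };
}

// Made-up entries, for illustration only
const techTerms = buildDictionary(['ruby-on-rails', 'node.js'], [['golang', 'go']]);

console.log(techTerms.lookup('Node JS')); // node.js
console.log(techTerms.lookup('GoLang'));  // go
console.log(techTerms.lookup('cobol'));   // undefined
```

Keeping the variations in the lookup key, rather than listing every spelling, is what keeps such a dictionary small.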
Another example is the contractions dictionary. It'll split tokens like it'll into two tokens.
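A one-to-many mapping like this can be sketched as below. The `contractionTable` entries and the `expandContractions` function are hypothetical, and the sketch lowercases its output for simplicity; jargon's own contractions dictionary may handle case and spacing differently.

```typescript
// Hypothetical sketch, NOT jargon's real contractions dictionary.
// Each contraction maps to the tokens that replace it.
const contractionTable = new Map<string, string[]>([
  ["it'll", ['it', 'will']],
  ["don't", ['do', 'not']],
  ["we're", ['we', 'are']],
]);

function expandContractions(tokens: string[]): string[] {
  const out: string[] = [];
  for (const token of tokens) {
    // Treat the typographic apostrophe (U+2019) like a plain one, ignore case
    const key = token.toLowerCase().replace(/\u2019/g, "'");
    const expansion = contractionTable.get(key);
    if (expansion !== undefined) {
      out.push(...expansion); // one token in, two tokens out
    } else {
      out.push(token); // anything else passes through unchanged
    }
  }
  return out;
}

console.log(expandContractions(["it'll", ' ', 'work']).length);
// 4  (it, will, the space, work)
```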