Morpheme Splitter - Nepali
Script adapted from this code in an IPython notebook.
Nepali words are composed of morphemes, which can be broadly divided into two categories: vowels and consonants. A given word can be resolved into its morphemes with a few elementary rules. While these rules are relatively straightforward, the Unicode representation makes them non-trivial to apply. Consider these scenarios:
- क is a single character in Unicode, but in Nepali it is two morphemes: क् + अ.
- क + ् are two characters in Unicode, but together they form क्, a single morpheme in Nepali.
- क + ि are two characters in Unicode and correspond to the two morphemes क् + इ in Nepali.
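The mismatch between characters and morphemes is easy to see by listing the code points behind each string. The snippet below is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical and not part of the original script.

```python
import unicodedata


def describe(text):
    """Return a (code point, Unicode name) pair for every character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# क alone: one code point, although it stands for क् + अ
print(describe("\u0915"))
# क + ्: two code points, one morpheme (क्)
print(describe("\u0915\u094D"))
# क + ि: two code points, consonant followed by a dependent vowel sign
print(describe("\u0915\u093F"))
```

Note that Unicode calls the halanta `DEVANAGARI SIGN VIRAMA`, and that ि is encoded as a dependent vowel sign, distinct from the independent vowel इ.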
In this script, we define rules for separating morphemes in the Unicode representation of Nepali. These rules serve as a building block for later systems that split multi-syllable Nepali words into syllables. The rules are as follows:
- If a character is a vowel, leave it as it is.
- If a character is a single Unicode consonant (क - ह):
  - If it is the last letter of the word, it yields two morphemes: the consonant with a halanta, followed by the independent vowel अ.
  - If the next character is a halanta (्), the consonant and the halanta together form a single morpheme.
  - If the next character is a vowel sign, the consonant and the vowel form two morphemes (क् + ि).
  - If the next character is a consonant, the current consonant yields two morphemes: the consonant with a halanta, followed by the independent vowel अ.
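The rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the original script: the function and constant names are hypothetical, the vowel-sign range U+093E to U+094C is an assumption about which dependent vowels to recognise, and a consonant followed by anything other than a halanta or a vowel sign is assumed to carry the inherent अ. Vowel signs are emitted as they appear in the input (ि rather than इ).

```python
HALANTA = "\u094D"     # ्
INHERENT_A = "\u0905"  # अ


def is_consonant(ch):
    # Single Unicode consonants क (U+0915) through ह (U+0939).
    return "\u0915" <= ch <= "\u0939"


def is_vowel_sign(ch):
    # Assumed range of dependent vowel signs: ा (U+093E) through ौ (U+094C).
    return "\u093E" <= ch <= "\u094C"


def split_morphemes(word):
    """Split a Nepali word into a list of morphemes using the rules above."""
    morphemes = []
    i = 0
    while i < len(word):
        ch = word[i]
        if not is_consonant(ch):
            # Vowels (and any other character) are kept as they are.
            morphemes.append(ch)
            i += 1
            continue
        nxt = word[i + 1] if i + 1 < len(word) else None
        if nxt == HALANTA:
            # Consonant + halanta is a single morpheme.
            morphemes.append(ch + HALANTA)
            i += 2
        elif nxt is not None and is_vowel_sign(nxt):
            # Consonant + vowel sign gives two morphemes.
            morphemes.extend([ch + HALANTA, nxt])
            i += 2
        else:
            # Last letter, or followed by a consonant (assumption: also any
            # other character): the consonant carries the inherent vowel अ.
            morphemes.extend([ch + HALANTA, INHERENT_A])
            i += 1
    return morphemes


print(split_morphemes("कलम"))     # ['क्', 'अ', 'ल्', 'अ', 'म्', 'अ']
print(split_morphemes("नेपाली"))  # ['न्', 'े', 'प्', 'ा', 'ल्', 'ी']
```

Scanning with an explicit index rather than a `for` loop makes it easy to consume two code points at once when a consonant is followed by a halanta or a vowel sign.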