I received some PDFs that use Myanmar Unicode characters mixed with empty codepoints standing in for various Myanmar diacritics and other combining characters. Converting this text to standard Unicode order by hand is painstaking, and we need it in several applications, so I am putting the conversion into a module.
We received a PDF where the name "Mohnyin Township" appears like this:
but when you copy and paste the actual characters, you get this:
Here are its issues:
The first syllable, မိုး, is missing its ု because an empty codepoint is used in its place. This also separates the following diacritic, း, from the syllable.
မြို့န is written မိုန့: the ြ diacritic is an empty codepoint that is placed before the character that it modifies, and the ့ diacritic is placed after the character န instead of after the character that it modifies.
In other text samples, there are multiple diacritics in a nonstandard order.
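To illustrate the reordering part of the problem, here is a minimal sketch that sorts each run of consecutive Myanmar diacritics into a fixed order. It is not the code from my-diacritic.js: the `ORDER` table and the `sortMarks` name are assumptions made for this example, and the order is only loosely based on the syllable structure described for Myanmar in the Unicode Standard, so check it against your own data. It does not handle empty codepoints or diacritics that sit before the consonant they modify.

```javascript
// Hypothetical ordering table for illustration only; not taken from my-diacritic.js.
var ORDER = [
  '\u103B', // medial ya
  '\u103C', // medial ra
  '\u103D', // medial wa
  '\u103E', // medial ha
  '\u1031', // vowel sign e
  '\u102D', // vowel sign i
  '\u102E', // vowel sign ii
  '\u1032', // vowel sign ai
  '\u102F', // vowel sign u
  '\u1030', // vowel sign uu
  '\u102B', // vowel sign tall aa
  '\u102C', // vowel sign aa
  '\u1036', // anusvara
  '\u103A', // asat
  '\u1037', // dot below
  '\u1038'  // visarga
];

// Map each diacritic to its position in the table.
var RANK = {};
ORDER.forEach(function (ch, i) { RANK[ch] = i; });

// Sort every run of two or more diacritics from the table into ORDER.
// The character class covers exactly the sixteen code points listed above.
function sortMarks(text) {
  return text.replace(/[\u102B-\u1032\u1036-\u1038\u103A-\u103E]{2,}/g, function (run) {
    return run.split('').sort(function (a, b) {
      return RANK[a] - RANK[b];
    }).join('');
  });
}

// Usage: the diacritics of မြို့ given in a scrambled order.
var scrambled = '\u1019\u1037\u102F\u102D\u103C';
var fixed = sortMarks(scrambled); // U+1019 U+103C U+102D U+102F U+1037
```

Sorting by a lookup table keeps the rule set in one place, so the order can be adjusted without touching the function.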
On the web
Include the my-diacritic.js file. Then pass it some text:
The conversion is one-way: it doesn't convert text back to the original encoding, and there are no plans to add that (wontfix).
npm install my-diacritic-sort
var sortDiacritics = require('my-diacritic-sort');
Open source under the MIT license.