Eats invalid Unicode for breakfast. Replaces unmatched surrogates with U+FFFD.
There's a bug in V8 which means that in rare circumstances Node strings can contain invalid Unicode.
As this is an abhorrent situation, you should use unicode-dragon to ensure that any untrusted user input is actually valid Unicode before saving it.
You can install unicode-dragon with npm. If you're really desperate, you can also copy-paste the function out of index.js into wherever you need it.
npm install unicode-dragon
var unicodeDragon = require('unicode-dragon');
var validString = unicodeDragon(buffer.toString('utf-8'));
V8 strings are internally UCS-2, which is a legacy Unicode encoding that only works with Unicode 1.0. In Unicode 2.0, 2048 code-points (U+D800 - U+DFFF) were removed from Unicode and are used by UTF-16 in surrogate pairs to represent code-points higher than U+FFFF (which is the highest UCS-2 can support).
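To make the surrogate-pair arithmetic concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into two 16-bit units. The function name `toSurrogatePair` is made up for this illustration and is not part of unicode-dragon.

```javascript
// Illustrative only: split a code point above U+FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('code point must be in U+10000..U+10FFFF');
  }
  var offset = codePoint - 0x10000;       // leaves a 20-bit value
  var high = 0xD800 + (offset >> 10);     // top 10 bits -> U+D800..U+DBFF
  var low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits -> U+DC00..U+DFFF
  return [high, low];
}

// U+1F409 (DRAGON) becomes the pair 0xD83D, 0xDC09.
var pair = toSurrogatePair(0x1F409);
var dragon = String.fromCharCode(pair[0], pair[1]);
```

A high surrogate is only meaningful when it is immediately followed by a low surrogate; either half on its own does not represent a character.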
That said, the UTF-8 codecs in V8 can support characters greater than U+FFFF as though the internal strings were UTF-16. This is awesome, but there's a small mistake, and V8 will accidentally parse invalid UTF-8 that happens to include one of the removed surrogate code-points.
In the case that there is a valid surrogate pair, this is fine: internally the UCS-2 buffer is now valid UTF-16, so the encoder will re-encode it to valid UTF-8. (UTF-8 with surrogate pairs is more properly known as CESU-8, and is a frowned-upon hack.) This is definitely not fine, however, in the case where the surrogates appear individually or incorrectly paired, as the UTF-8 encoder will output another invalid surrogate.
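The kind of check described above can be sketched as follows. This is an assumption about how such a guard might look, not the code in index.js, and `replaceLoneSurrogates` is a hypothetical name. It keeps correctly paired surrogates and swaps every unpaired one for U+FFFD.

```javascript
// Hypothetical sketch: replace unpaired surrogates with U+FFFD.
// Not taken from unicode-dragon's index.js.
function replaceLoneSurrogates(str) {
  var out = '';
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: valid only if a low surrogate follows directly.
      var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += str.charAt(i) + str.charAt(i + 1);
        i++; // skip the low surrogate we just consumed
      } else {
        out += '\uFFFD';
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      out += '\uFFFD';
    } else {
      out += str.charAt(i);
    }
  }
  return out;
}

var cleaned = replaceLoneSurrogates('a\uD83Db'); // lone high surrogate between 'a' and 'b'
```

A string that contains only well-formed pairs passes through unchanged, so running the guard on input that is already valid should be harmless.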
And if that makes your brain hurt, stay safe, and use unicodeDragon to guard yourself from invalid user input.
Unicode-dragon is licensed under the MIT licence. Bug reports and contributions are welcome.