LZ-UTF8
A high-performance string compression library and stream format:
- Fast, especially decompression (rates are for a low-end desktop PC processing 1MB files):
  - Javascript: 3-14MB/s compression, 20-80MB/s decompression (detailed benchmarks and a comparison to other Javascript libraries can be found in the technical paper).
  - C++: 30-40MB/s compression, 300-500MB/s decompression (currently unreleased, figures may improve in the future).
- Reasonable compression ratio - very good for shorter strings (<32k), but less efficient for longer ones.
- Conceived with web and mobile use cases in mind. Designed for and implemented in Javascript from the very beginning.
- Simple and easy-to-use API that's consistent across all platforms, both in the browser and in Node.js.
- 100% patent-free.
Technical objectives and properties:
- Based on LZ77. An efficient decompressor implementation should run virtually in realtime as the decompression process only involves the copying of raw memory blocks.
- Compresses UTF-8 and 7-bit ASCII strings only. Doesn't support arbitrary binary content or other string encodings.
- Byte aligned, meaning individually compressed blocks can be freely concatenated and intermixed with each other and yield a valid compressed stream that decompresses to the equivalent concatenated strings.
- Fully compatible with UTF-8. Any valid UTF-8 bytestream is also a valid LZ-UTF8 stream (but not vice versa). This special property allows both compressed and plain UTF-8 streams to be freely concatenated and decompressed as a single unit (or with any arbitrary partitioning). Some possible applications:
  - Sending static pre-compressed data followed by dynamically generated uncompressed data from a server (and possibly appending a compressed static "footer", or repeating the process several times).
  - Appending both uncompressed and compressed data to a compressed log file or journal without needing to rewrite it.
  - Joining multiple source files, where some are possibly pre-compressed, and serving them as a single concatenated file without additional processing.
- Compression always results in a byte count smaller or equal to the source material size (a consequence of not applying an entropy coder).
Javascript implementation:
- Tested on most popular browsers and platforms - Chrome, Firefox, IE10+, Android 4+, Safari 5+ and Node.js 0.10+ (IE8 and IE9 may work with a typed array polyfill).
- Allows compressed data to be efficiently packed in plain UTF-16 strings (see the BinaryString encoding) when binary storage is not available or desired (e.g. when using LocalStorage or older IndexedDB).
- Can operate asynchronously, both in Node.js and in the browser. Uses web workers when available (and takes full advantage of transferable objects if supported) and falls back to async iterations when not.
- Supports Node.js streams.
- Well structured code written in TypeScript.
Quick start
- Try the online demo to test and benchmark different inputs.
- Download the latest build (or the minified version).
- Run the automated tests.
- Run the core benchmarks.
- Read the technical paper.
Table of Contents
- API Reference
- Release history
- License
API Reference
Getting started
Browser:
<script id="lzutf8" src="path/to/lzutf8.js"></script>
note: the id attribute and its exact value are necessary for the library to make use of web workers.
Node.js:
npm install lzutf8
var LZUTF8 = require('lzutf8');
Type Identifier Strings
"ByteArray" - An array of bytes. As of 0.3.2, always a Uint8Array. In versions up to 0.2.3 the type was determined by the platform (Array for browsers that don't support typed arrays, Uint8Array for supporting browsers and Buffer for Node.js). IE8/9 support was dropped at 0.3.0, though these browsers can still be used with a typed array polyfill.
"Buffer" - A Node.js Buffer object.
"BinaryString" - A string containing binary data encoded to only use the lowest 15 bits of each character.
"Base64" - A Base64 string.
Core Methods
LZUTF8.compress(..)
var output = LZUTF8.compress(input, [options]);
Compresses the given input data.
- input: can be either a String, or UTF-8 bytes stored in a Uint8Array or Buffer.
- options (optional): an object that may have any of the properties:
  - outputEncoding: "ByteArray" (default), "Buffer", "BinaryString" or "Base64".
returns: compressed data encoded as specified by outputEncoding, or as ByteArray if not specified.
LZUTF8.decompress(..)
var output = LZUTF8.decompress(input, [options]);
Decompresses the given compressed data.
- input: can be either a Uint8Array, Buffer or String (where the encoding scheme is then specified in inputEncoding).
- options (optional): an object that may have the properties:
  - inputEncoding: "ByteArray" (default), "BinaryString" or "Base64".
  - outputEncoding: "String" (default), "ByteArray" or "Buffer" to return UTF-8 bytes.
returns: decompressed data encoded as specified by outputEncoding, or as String if not specified.
Asynchronous Methods
LZUTF8.compressAsync(..)
LZUTF8.compressAsync(input, [options], callback);
Asynchronously compresses the given input data.
- input: can be either a String, or UTF-8 bytes stored in a Uint8Array or Buffer.
- options (optional): an object that may have any of the properties:
  - outputEncoding: "ByteArray" (default), "Buffer", "BinaryString" or "Base64".
  - useWebWorker: true (default) would use a web worker if available. false would use iterated yielding instead.
- callback: a user-defined callback function accepting a first argument containing the resulting compressed data as specified by outputEncoding (or ByteArray if not specified) and a possible second parameter containing an Error object.
On error: invokes the callback with a first argument of undefined and a second one containing the Error object.
Example:
LZUTF8.compressAsync(input, {outputEncoding: "BinaryString"}, function (result, error) {
    if (result === undefined)
        console.log("Compression failed: " + error.message);
    else
        console.log("Data successfully compressed and encoded to " + result.length + " characters");
});
LZUTF8.decompressAsync(..)
LZUTF8.decompressAsync(input, [options], callback);
Asynchronously decompresses the given compressed input.
- input: can be either a Uint8Array, Buffer or String (where the encoding is set with inputEncoding).
- options (optional): an object that may have the properties:
  - inputEncoding: "ByteArray" (default), "BinaryString" or "Base64".
  - outputEncoding: "String" (default), "ByteArray" or "Buffer" to return UTF-8 bytes.
  - useWebWorker: true (default) would use a web worker if available. false would use incremental yielding instead.
- callback: a user-defined callback function accepting a first argument containing the resulting decompressed data as specified by outputEncoding and a possible second parameter containing an Error object.
On error: invokes the callback with a first argument of undefined and a second one containing the Error object.
Example:
LZUTF8.decompressAsync(input, {inputEncoding: "BinaryString", outputEncoding: "ByteArray"}, function (result, error) {
    if (result === undefined)
        console.log("Decompression failed: " + error.message);
    else
        console.log("Data successfully decompressed to " + result.length + " UTF-8 bytes");
});
General notes on async operations
Web workers are available if supported by the browser and the library's script source is referenced in the document with a script tag having an id of "lzutf8" (its src attribute is then used as the source URI for the web worker). In cases where a script tag is not available (such as when the script is dynamically loaded or bundled with other scripts), the value of LZUTF8.WebWorker.scriptURI may alternatively be set before the first async method call.
Workers are optimized for various input and output encoding schemes, so only the minimal amount of work is done in the main Javascript thread. Internally, conversion to or from various encodings is performed within the worker itself, reducing delays and allowing greater parallelization. Additionally, if transferable objects are supported by the browser, binary arrays will be transferred virtually instantly to and from the worker.
Only one worker instance is spawned per page - multiple operations are processed sequentially.
In case a worker is not available (such as in Node.js, IE8, IE9, Android browser < 4.4) or not desired, the library iteratively processes 64KB blocks, yielding to the event loop whenever a 20ms interval has elapsed. Note: with this execution method, parallel operations are not guaranteed to complete in the order they were started.
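The fallback described above (fixed-size blocks, yielding once a time budget is used up) can be sketched roughly as follows. This is only an illustration of the general technique, not the library's actual code: `processInSlices`, `processBlock` and `onDone` are hypothetical names, and the 64KB and 20ms defaults are simply taken from the text.

```javascript
// Hypothetical helper illustrating time-sliced processing; not part of LZUTF8.
// data: a Uint8Array; processBlock: called once per block; onDone: receives all results.
function processInSlices(data, processBlock, onDone, blockSize, sliceMs) {
  blockSize = blockSize || 65536; // 64KB blocks, as described in the text
  sliceMs = sliceMs || 20;        // yield after roughly 20ms of work
  var offset = 0;
  var results = [];

  function runSlice() {
    var sliceStart = Date.now();
    while (offset < data.length) {
      results.push(processBlock(data.subarray(offset, offset + blockSize)));
      offset += blockSize;
      // Time budget used up and work remains: let other events run, then resume.
      if (offset < data.length && Date.now() - sliceStart >= sliceMs) {
        setTimeout(runSlice, 0);
        return;
      }
    }
    onDone(results);
  }

  runSlice();
}
```

In this sketch a small input may finish within the first slice, in which case `onDone` is called synchronously. When two such operations run at once, their slices interleave on the event loop, which is one way a later-started operation can finish before an earlier one.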
Lower-level Methods
LZUTF8.Compressor
var compressor = new LZUTF8.Compressor();
Creates a compressor object. Can be used to incrementally compress a multi-part stream of data.
returns: a new LZUTF8.Compressor object
LZUTF8.Compressor.compressBlock(..)
var compressor = new LZUTF8.Compressor();
var compressedBlock = compressor.compressBlock(input);
Compresses the given input UTF-8 block.
- input: can be either a String, or UTF-8 bytes stored in a Uint8Array or Buffer.
returns: compressed bytes as ByteArray
This can be used to incrementally create a single compressed stream. For example:
var compressor = new LZUTF8.Compressor();
var compressedBlock1 = compressor.compressBlock(block1);
var compressedBlock2 = compressor.compressBlock(block2);
var compressedBlock3 = compressor.compressBlock(block3);
LZUTF8.Decompressor
var decompressor = new LZUTF8.Decompressor();
Creates a decompressor object. Can be used to incrementally decompress a multi-part stream of data.
returns: a new LZUTF8.Decompressor object
LZUTF8.Decompressor.decompressBlock(..)
var decompressor = new LZUTF8.Decompressor();
var decompressedBlock = decompressor.decompressBlock(input);
Decompresses the given block of compressed bytes.
- input: can be either a Uint8Array or Buffer.
returns: decompressed UTF-8 bytes as ByteArray
Remarks: will always return the longest valid UTF-8 stream of bytes possible from the given input block. Incomplete input or output byte sequences will be prepended to the next block.
Note: This can be used to incrementally decompress a single compressed stream. For example:
var decompressor = new LZUTF8.Decompressor();
var decompressedBlock1 = decompressor.decompressBlock(block1);
var decompressedBlock2 = decompressor.decompressBlock(block2);
var decompressedBlock3 = decompressor.decompressBlock(block3);
LZUTF8.Decompressor.decompressBlockToString(..)
var decompressor = new LZUTF8.Decompressor();
var decompressedBlockAsString = decompressor.decompressBlockToString(input);
Decompresses the given block of compressed bytes and converts the result to a String.
- input: can be either a Uint8Array or Buffer.
returns: decompressed String
Remarks: will always return the longest valid string possible from the given input block. Incomplete input or output byte sequences will be prepended to the next block.
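The remarks about incomplete byte sequences can be illustrated with a small sketch that finds where the last complete UTF-8 sequence in a block ends. This is an illustration of the idea only, assuming otherwise valid UTF-8 input. `completeUtf8Length` is a hypothetical helper, not a function of the library, and the library's own handling may differ.

```javascript
// Hypothetical helper; assumes the bytes are valid UTF-8 apart from a
// possibly truncated sequence at the very end.
// Returns how many leading bytes form complete UTF-8 sequences.
function completeUtf8Length(bytes) {
  // The lead byte of a truncated sequence is at most 3 bytes from the end.
  for (var back = 1; back <= 3 && back <= bytes.length; back++) {
    var b = bytes[bytes.length - back];
    if ((b & 0xC0) === 0x80) continue; // continuation byte (10xxxxxx), keep looking back
    // Found an ASCII or lead byte: how many bytes does its sequence need?
    var needed = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
    return needed > back ? bytes.length - back : bytes.length;
  }
  return bytes.length; // empty input, or the last 3 bytes complete a 4-byte sequence
}
```

A block-based decoder built this way would emit only the first `completeUtf8Length(bytes)` bytes and keep the remainder to prepend to the next block, which matches the behaviour described in the remarks.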
Node.js only methods
LZUTF8.createCompressionStream()
var compressionStream = LZUTF8.createCompressionStream();
Creates a compression stream. The stream will accept both Buffers and Strings in any encoding supported by Node.js (e.g. utf8, utf16, ucs2, base64, hex, binary etc.) and return Buffers.
Example:
var sourceReadStream = fs.createReadStream("content.txt");
var destWriteStream = fs.createWriteStream("content.txt.lzutf8");
var compressionStream = LZUTF8.createCompressionStream();
sourceReadStream.pipe(compressionStream).pipe(destWriteStream);
On error: emits an error event with the Error object as parameter.
LZUTF8.createDecompressionStream()
var decompressionStream = LZUTF8.createDecompressionStream();
Creates a decompression stream. The stream will accept and return Buffers.
On error: emits an error event with the Error object as parameter.
Character encoding methods
LZUTF8.encodeUTF8(..)
var output = LZUTF8.encodeUTF8(input);
Encodes a string to UTF-8.
- input: as String
returns: encoded bytes as ByteArray
LZUTF8.decodeUTF8(..)
var outputString = LZUTF8.decodeUTF8(input);
Decodes UTF-8 bytes to a String.
- input: as either a Uint8Array or Buffer
returns: decoded bytes as String
LZUTF8.encodeBase64(..)
var outputString = LZUTF8.encodeBase64(bytes);
Encodes bytes to a Base64 string.
- input: as either a Uint8Array or Buffer
returns: resulting Base64 string.
remarks: Maps every 3 consecutive input bytes to 4 output characters of the set A-Z, a-z, 0-9, +, / (a total of 64 characters). Increases stored byte size to 133.33% of original (when stored as ASCII or UTF-8) or 266.67% (when stored as UCS-2/UTF-16).
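The size figures in the remarks follow directly from the 3-bytes-to-4-characters mapping. The sketch below works through that arithmetic. `base64Length` and `base64Overhead` are hypothetical helper names used only for illustration, and standard "=" padding to a multiple of 4 characters is assumed.

```javascript
// Number of Base64 characters produced for byteCount input bytes,
// assuming standard "=" padding to a multiple of 4 characters.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Stored size relative to the original size, given how many bytes each
// stored character occupies (1 for ASCII/UTF-8, 2 for UCS-2/UTF-16).
function base64Overhead(byteCount, bytesPerChar) {
  return (base64Length(byteCount) * bytesPerChar) / byteCount;
}

console.log(base64Overhead(3000, 1)); // about 1.333 (133.33%)
console.log(base64Overhead(3000, 2)); // about 2.667 (266.67%)
```

For inputs whose length is not a multiple of 3, padding pushes the ratio slightly above these values, which matters mostly for very short inputs.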
LZUTF8.decodeBase64(..)
var output = LZUTF8.decodeBase64(input);
Decodes a Base64 string to bytes.
- input: as String
returns: decoded bytes as ByteArray
remarks: the decoder cannot decode concatenated Base64 strings. Although it is possible to add this capability to the JS version, compatibility with other decoders (such as the Node.js decoder) prevents this feature from being added.
LZUTF8.encodeBinaryString(..)
var outputString = LZUTF8.encodeBinaryString(input);
Encodes binary bytes to a valid UTF-16 string.
- input: as either a Uint8Array or Buffer
returns: String
remarks: To comply with the UTF-16 standard, it only uses the bottom 15 bits of each character, effectively mapping every 15 input bits to a single 16-bit output character. This increases the stored byte size to 106.66% of original.
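The 15-bits-per-character idea can be sketched as below. This is a simplified illustration and does not reproduce the library's actual BinaryString format (in particular, it omits any end-of-sequence marker), so its output should not be expected to be compatible with LZUTF8.decodeBinaryString. `packTo15BitChars` is a hypothetical name.

```javascript
// Simplified illustration only: packs input bytes into characters whose
// codes use just the lowest 15 bits. NOT the library's exact format.
function packTo15BitChars(bytes) {
  var result = "";
  var acc = 0;     // bit accumulator
  var accBits = 0; // number of valid bits currently held in acc
  for (var i = 0; i < bytes.length; i++) {
    acc = (acc << 8) | bytes[i];
    accBits += 8;
    if (accBits >= 15) {
      // Emit the top 15 accumulated bits as one character.
      accBits -= 15;
      result += String.fromCharCode((acc >>> accBits) & 0x7fff);
      acc &= (1 << accBits) - 1; // keep only the leftover low bits
    }
  }
  if (accBits > 0) {
    // Left-align the leftover bits in a final character, zero-filled.
    result += String.fromCharCode((acc << (15 - accBits)) & 0x7fff);
  }
  return result;
}

// 15 input bytes (120 bits) fit exactly into 8 characters (16 stored bytes).
console.log(packTo15BitChars(new Uint8Array(15)).length); // 8
```

Since every 15 input bits occupy 16 stored bits, the stored size approaches 16/15 of the original, roughly 106.67%, in line with the figure given in the remarks.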
LZUTF8.decodeBinaryString(..)
var output = LZUTF8.decodeBinaryString(input);
Decodes a binary string.
- input: as String
returns: decoded bytes as ByteArray
remarks: Multiple binary strings may be freely concatenated and decoded as a single string. This is made possible by ending every sequence with a special marker (char code 32768 for an even-length sequence and 32769 for an odd-length sequence).
Release history
- 0.3.x: Removed support for IE8/9. Removed support for Array inputs. All "ByteArray" outputs are now Uint8Array objects. A separate "Buffer" encoding setting can be used to return Buffer objects.
- 0.2.x: Added async error handling. Added support for TextEncoder and TextDecoder when available.
- 0.1.x: Initial release.
License
Copyright (c) 2014-2016, Rotem Dan <rotemdan@gmail.com>.
Source code and documentation are available under the MIT license.