unicode-tokenizer

Unicode Tokenizer

This is a tokenizer that tokenizes text according to the line breaking classes defined by the Unicode Line Breaking algorithm (tr14). It also annotates each token with its line breaking action. This is useful when performing Natural Language Processing or doing manual line breaking.

Usage:

var ut = require('unicode-tokenizer'),
    tokenizer = ut.createTokenizerStream();

tokenizer.on('token', function(token, type, action) {
    ...
});

tokenizer.write('Hello World!');
tokenizer.end();

Note that in order to receive the token type and break action, you'll need to listen to the token event. The token parameter is a string containing the token, the type is a number representing the token type, and the action is also a number representing the line break action. Both the token types and line breaking actions are available as enumerations on the object returned by require('unicode-tokenizer'). If, for example, you would like to do something special for tokens with class AL that are also an explicit break you can implement the above callback as shown below:

tokenizer.on('token', function(token, type, action) {
    if (type === ut.Token.AL && action = ut.Break.EXPLICIT) {
        // Do something special
    }
});

The Tokenizer returned by createTokenizerStream is also a valid Node.js Stream so it can be used with other streams:

process.stdin.pipe(tokenizer);
tokenizer.pipe(process.stdout);
process.stdin.resume();

Unicode support

The full range of Unicode code points are supported by this tokenizer. If you however only want to tokenize selected portions of the Unicode standard, such as the Basic Multilingual Plane, you can subset the supported Unicode range. To generate a subsetted tokenizer, modify the included-ranges.txt and excluded-classes.txt files, and use the --include-ranges and --exclude-classes command line options on the generate-tokens script.