node package manager


A library for helping you build recursive descent parsers and regex-based lexers

Canto 34 is a library for building recursive-descent parsers. It can be used both on node.js, and in the browser as an AMD module.

When it comes to writing a parser, you get two main choices. Write your own, or use a parser generator like PEGJS or ANTLR.

I've never really had much success with parser generators, and it's always seemed pretty easy to write a recursive-descent parser yourself, if you have some basic tools like a regex-based lexer, and some basic functions for matching tokens and reporting errors. Canto34.js gives you the functions you need to write a regex-based lexer and a recursive descent parser yourself.

Here's a simple example of a language which just defines name-value pairs;

foo 1, bar 5, baz 8.

We'll see how to parse this language. First, you write a lexer which identifies the different kinds of token -- names, integers, commas, and periods;

var lexer = new canto34.Lexer({ languageName: "nameValuePairs" });

// add a token for whitespace
    name: "ws",       // give it a name
    regexp: /[ \t]+/, // match spaces and tabs
    ignore: true      // don't return this token in the result

// add a token type for names, defines as strings of lower-case characters
lexer.addTokenType({ name: "name", regexp: /^[a-z]+/  });

// bring in some predefined types for commas, period, and integers.
var types = canto34.StandardTokenTypes;

And here's how you use it;

var tokens = lexer.tokenize("foo 1, bar 2.");

which returns a set of tokens;

    { content: "foo", type: "name",    line: 1, character: 1 },
    { content: 1,     type: "integer", line: 1, character: 5 },
    { content: ",",   type: "comma",   line: 1, character: 6 },
    { content: "bar", type: "name",    line: 1, character: 8 },
    { content: 2,     type: "integer", line: 1, character: 12 },
    { content: ".",   type: "period",  line: 1, character: 13 }

Now you feed these tokens into a parser. Here's a parser for our language;

var parser = new canto34.Parser();
parser.listOfNameValuePairs = function() {
    this.result = [];
    while (!this.eof() && this.la1("comma")) {

parser.nameValuePair = function() {
    var name = this.match("name").content;
    var value = this.match("integer").content;
    this.result.push({ name:name, value: value });

And it's used like this;

var tokens = lexer.tokenize("foo 1, bar 2, baz 3.");

parser.result now contains the value;

    {name:"foo", value:1},
    {name:"bar", value:2},
    {name:"baz", value:3},

And we're done! That's a basic lexer and parser, written in thirty lines of code.

Canto 34 is designed for quickly writing parsers for straightforward domain- specific languages (DSLs). Please don't use it for writing parsers for general-purpose programming languages -- it's not designed to be that fast or powerful.

Starting in v0.0.5, I've added a function to let you generate a .tmLanguage file. This is the syntax highlighting system of TextMate, Sublime Text, Atom, and Visual Studio code.

This is actually pretty sweet; it makes your language feel a bit more 'first class' when you get the richer IDE experience.

You first generate the file content, then save it wherever your text editor demands. For example, To use it with Sublime Text 3 on windows, call code like so, swapping {myusername} with the name of your user account;

var content = canto34.tmLanguage.generateTmLanguageDefinition(lexer);
require('fs').writeFileSync("C:\\Users\\{myusername}\\AppData\\Roaming\\Sublime Text 3\\Packages\\User\\myparser.tmLanguage", content);

This will write out a syntax highlighting file into your User folder, which is a personal 'dumping ground' for configuration bits and pieces on your machine for Sublime Text. When you restart ST, you'll see your language in the list of language options, displayed in the bottom-right.

You'll need to configure the tokens in your lexer a bit more. For example, if you have a variable name token, you'll need to tell the lexer that this should be highlighted one way, and if you have a keyword like 'for', you'll use another. Do this with the roles option on a token type;

lexer.addTokenType({ name: "comment", ignore: true, regexp: /^#.*/, role: ["comment", "line"] });
lexer.addTokenType(types.constant("for","for", ["keyword"]));
lexer.addTokenType(types.constant("[", "openSquare", ["punctuation"]));
lexer.addTokenType(types.constant("]", "closeSquare", ["punctuation"]));

The list of roles isn't clear to me - I've just figured out some key ones like keyword, comment, and punctuation from reverse engineering existing tmLanguage files.