Universal Lexer
A lexer that can parse any text input into tokens, according to the regular expressions you provide.
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
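For illustration, here is a hypothetical input and the token sequence a lexer could produce for it (the token shape mirrors the "Possible results" section later in this README):

```js
// A flat sequence of characters...
const input = '     some'

// ...becomes a sequence of tokens with assigned meaning.
// (Token shape mirrors this library's results, shown later in this README.)
const tokens = [
  { type: 'Whitespace', data: { value: '     ' }, start: 0, end: 5 },
  { type: 'Word', data: { value: 'some' }, start: 5, end: 9 }
]
```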
Features
- Allows named groups in regular expressions, so you don't have to dissect matched values yourself
- Allows post-processing tokens, to extract any additional information you require
How to install
The package is available as `universal-lexer` on NPM, so you can add it to your project with `npm install universal-lexer` or `yarn add universal-lexer`.
What are the requirements?
The code itself is written in ES6 and should work in any Node.js 6+ environment.
If you would like to use it in a browser or an older environment, a transpiled and bundled (UMD) version is included as well.
You can use `universal-lexer/browser` in your requires, or the `UniversalLexer` global in a browser:
```js
// Load library
const UniversalLexer = require('universal-lexer/browser')

// Create lexer (with any of the compile functions described below)
const lexer = UniversalLexer.compile(definitions)

// ...
```
How it works
You've got two sets of functions:
```js
// Load library
const UniversalLexer = require('universal-lexer')

// Build code for this lexer
// (note: exact function names below are assumed)
const code1 = UniversalLexer.build(definitions)
const code2 = UniversalLexer.buildFromFile('semantics.yaml')

// Compile dynamically a function which can be used
const func1 = UniversalLexer.compile(definitions)
const func2 = UniversalLexer.compileFromFile('semantics.yaml')
```
There are two ways of passing rules to this lexer: from a YAML file or as an array of definitions.
Pass an array of definitions
Simply pass the definitions to the lexer:
```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create token definition
const Colon = { type: 'Colon', value: ':' }

// Build array of definitions
const definitions = [ Colon ]

// Create lexer
const lexer = UniversalLexer.compile(definitions)
```
A definition is a more complex object:
```js
{
  // Required fields: 'type' and either 'regex' or 'value'

  // Token name
  type: 'String',

  // String value which should be searched for at the beginning of the input
  value: 'abc', // or e.g.: value: '('

  // Regular expression to validate
  // whether the current match should really be parsed as this token.
  // Useful e.g. when you require a separator after a sentence,
  // but you don't want to include it in the token itself.
  valid: '"',

  // Regular expression flags for the 'valid' field
  validFlags: 'i',

  // Regular expression to find the current token.
  // You can use named groups as well: (?<name>expression)
  // Then this information will be attached to the token.
  regex: '"(?<value>([^"]|\\.)+)"',

  // Regular expression flags for the 'regex' field
  regexFlags: 'i'
}
```
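For example, a definition that relies on a named group attaches the captured text to the token's data. A sketch of what that should look like (the exact call shape is shown in the following sections):

```js
// A definition using a named group to capture the string contents
const StringToken = {
  type: 'String',
  regex: '"(?<value>([^"]|\\.)+)"'
}

// Running a lexer built from this definition over the input '"abc"'
// should produce a token with the named group attached under `data`:
// { type: 'String', data: { value: 'abc' }, start: 0, end: 5 }
```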
Pass a YAML file
```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer straight from a semantics file
// (assumed name for the file-based compile variant)
const lexer = UniversalLexer.compileFromFile('semantics.yaml')
```
For now, the YAML file should contain only a `Tokens` property with the definitions.
Later it may get more advanced features, such as macros (for simpler syntax).
Example:
```yaml
Tokens:
  # Whitespaces
  - type: NewLine
    value: "\n"
  - type: Space
    regex: '[ \t]+'

  # Math
  - type: Operator
    regex: '[-+*/]'

  # Color
  # It has a 'valid' field, to be sure that it's not e.g. "blacker".
  # Now, it will check that there is no text after.
  - type: Color
    regex: '(?<value>black|white)'
    valid: '(black|white)[^\w]'
```
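The `valid` field of the Color rule above can be understood with plain regular expressions. This standalone sketch (independent of the library) shows why `blacker` would be rejected while `black ` passes:

```js
// Standalone illustration of the Color rule above (no library involved)
const regex = /black|white/         // finds the token
const valid = /(black|white)[^\w]/  // requires a non-word character after it

const accepts = (input) => regex.test(input) && valid.test(input)

console.log(accepts('black '))  // true  -> tokenized as Color
console.log(accepts('blacker')) // false -> 'valid' check fails, not a Color
```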
Processing data
Processing input data after you have created a lexer is pretty straightforward with the `for` method:
```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer
const tokenize = UniversalLexer.compileFromFile('semantics.yaml')

// Build processor and read the tokens
// (assumption: `for` runs the lexer over the input and returns a result)
const tokens = tokenize.for('some text').tokens
```
Post-processing tokens
If you would like to do more advanced processing of parsed tokens, you can do it with the `addProcessor` method:
```js
// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer
const tokenize = UniversalLexer.compileFromFile('semantics.yaml')

// That's the 'Literal' definition:
const Literal = {
  type: 'Literal',
  regex: '(?<value>([^\t \n;"\',{}()\[\]#=:~&\\]|(\\.))+)'
}

// Create processor which will replace all '\X' with 'X' in value
// (assumption: processors are registered on the compiled lexer)
tokenize.addProcessor(function (token) {
  if (token.type === 'Literal') {
    token.data.value = token.data.value.replace(/\\(.)/g, '$1')
  }

  return token
})

// Also, you can return a new token instead
tokenize.addProcessor(function (token) {
  if (token.type !== 'Literal') {
    return token
  }

  return {
    type: 'Literal',
    data: { value: token.data.value.replace(/\\(.)/g, '$1') },
    start: token.start,
    end: token.end
  }
})

// Get all tokens...
const tokens = tokenize.for('some text').tokens
```
Beautified code
If you would like to get the beautified code of a lexer, you can use the second argument of the compile functions:
```js
// Pass `true` as the second argument to get beautified code
UniversalLexer.compile(definitions, true)
UniversalLexer.compileFromFile('semantics.yaml', true)
```
Possible results
On success, you will retrieve a simple object with an array of tokens:
```js
{
  tokens: [
    { type: 'Whitespace', data: { value: '     ' }, start: 0, end: 5 },
    { type: 'Word', data: { value: 'some' }, start: 5, end: 9 }
  ]
}
```
When something is wrong, you will get error information instead:
```js
{
  error: 'Unrecognized token',
  index: 1,
  line: 1,
  column: 2
}
```
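Both shapes come back from the same call, so a minimal handling sketch (assuming `tokenize` was created as in the earlier sections) could look like:

```js
// Minimal result handling; `tokenize` is assumed to be the compiled lexer
const result = tokenize.for('some text')

if (result.error) {
  // Failure: `error`, `index`, `line` and `column` describe the problem
  throw new Error(result.error + ' at line ' + result.line + ', column ' + result.column)
}

// Success: iterate over the tokens
for (const token of result.tokens) {
  console.log(token.type, token.data)
}
```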
Examples
For now, you can see an example of JSON semantics in the `examples/json.yaml` file.
CLI
After installing it globally (or inside NPM scripts), the `universal-lexer` command is available:
```
Usage: universal-lexer [options] output.js

Options:
  --version       Show version number                   [boolean]
  -s, --source    Semantics file                        [required]
  -b, --beautify  Should beautify code?                 [boolean] [default: true]
  -h, --help      Show help                             [boolean]

Examples:
  universal-lexer -s json.yaml lexer.js    build lexer from semantics file
```
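The command writes a standalone lexer module to `output.js`. Assuming the generated file exports the same tokenize function that the compile functions return (an assumption this README does not spell out), it could be used like:

```js
// Assumption: the generated lexer.js exports the compiled tokenize function
const tokenize = require('./lexer.js')

const result = tokenize.for('{ "a": 1 }')
console.log(result.tokens || result.error)
```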
Changelog
Version 2
- 2.0.6 - bugfix for single characters
- 2.0.5 - fix mistake in README file (post-processing code)
- 2.0.4 - remove unneeded
benchmark
dependency - 2.0.3 - add unit and E2E tests, fix small bugs
- 2.0.2 - added CLI command
- 2.0.1 - fix typo in README file
- 2.0.0 - optimize it (up to 10x faster) via expression analysis and other improvements
Version 1
- 1.0.8 - change syntax error positions to always start from 1
- 1.0.7 - optimize definitions with "value", make syntax errors developer-friendly
- 1.0.6 - optimized Lexer performance (20% faster on average)
- 1.0.5 - fix browser version to be put into NPM package properly
- 1.0.4 - bugfix for debugging
- 1.0.3 - add proper sanitization for debug HTML
- 1.0.2 - small fixes for README file
- 1.0.1 - added Rollup.js support to build version for browser