Bloody Roots
A JavaScript engine for recursive descent parsing of context-free grammars, written in CoffeeScript.
Usage
{ Parser } = require('bloodyroots')
- Create a new class extending from
Parser
, defining one or more productions. TheDocument
production is required, as it will be the parse engine's entry point into the grammar. - Instantiate the derived parser class.
- Execute the parse method on this instance. It will return a DOM tree on successful parse, or undefined on failure.
Terminology
Context free grammars use productions V of the form α→β, where α is a single non-terminal and β is a sequence of terminals and non-terminals; throughout the rest of this document, β refers to some tree of grammar operations, defined below. Please see http://en.wikipedia.org/wiki/Context-free_grammar for more detailed information on context-free grammars.
vdata
refers to an object that tracks parametric information for parsing according to the β tree of a production V, typically containing the capture group information for previous regular expression matches.
Grammar operations
It is useful to understand regular expressions, as the grammar operations in this parser correspond to regular expression operations.
-
@alternation(β_1, β_2, ..., β_n)
or@alternation( [β_1, β_2, ..., β_n], β_suffix)
: Equivalent to the...|...|...
from regular expressions: given a sequence of βs, it produces a parse tree from the firstβ_i
that parses. Unlike regular expressions, however, the default behavior is to commit to a particularβ_i
from the given sequence and not backtrack if what follows does not parse. So, for example, while(aaa|aa)ab
will match the stringaaab
, the β tree@seq(@alternation(@re('aaa'), @re('aa')), @re('ab'))
will not parse
aaab
because@re('aaa')
will have matched and the alternation considered committed before checking whether@re('ab')
matches. The regular expression-like behavior can be achieved by specifying the alternation's β sequence as an array in the first argument and the suffix β tree as the second argument:@alternation([@re('aaa'), @re('aa')], @re('ab'))
Note, however, that this will result in backtracking and so is less efficient than writing a different set of productions for the same grammar that does not require backtracking.
When
β_suffix
is specified, returns a DOM tree node oftype: 'seq'
with the output from parsing according to the matchingβ_i
as the first element and the output of parsing according toβ_suffix
as the second element of an array under keyseq
; whensuffix
is not specified, returns precisely what parsingβ_i
returned. -
@range(β, min=0, max, greedy=true, β_suffix)
: Equivalent to(...){n,m}
from regular expressions: it matches at leastn
and at mostm
repetitions of the givenβ
. As with@alternation
, one needs to specify aβ_suffix
to force backtracking if the range match should not commit after finding the minimum>= min
(non-greedy) or the maximum<= max
(greedy) number of consecutive matches. There are several shorthands based on@range
:@at_least_one(β, β_suffix)
is equivalent to@range(β, 1, undefined, true, β_suffix)
: think+
from regular expressions.@zero_or_more(β, β_suffix)
is equivalent to@range(β, 0, undefined, true, β_suffix)
: think*
from regular expressions.@zero_or_one(β, β_suffix)
is equivalent to@range(β, 0, 1, true, β_suffix)
: think?
from regular expressions (as applied to a terminal, not as applied to force non-greediness)
Note that using non-greediness without a suffix always returns either
min
matches or fails to parse.Returns a DOM tree node of
type: 'seq'
with an array of the output from eachβ
parsing iteration (plus the matchedβ_suffix
, if any) under keyseq
. -
@re(re_str, match_name)
: Matches the given regular expression. Ifmatch_name
is specified, then the resulting match array (index0
being the full matched string, indexn>=1
being then
th capture group) is assigned the given name. See@var_re
and@backref
for the use of this named match array.Returns a DOM tree node of
type: 're'
with the entire string under keymatch
and the named groups (including the entire string again under index 0) as an array under keygroups
. -
@seq(β_1, β_2, ..., β_n)
: Parses according to the givenβ_i
s in order.Returns a DOM tree node of
type: 'seq'
with the array of matching βs under keyseq
. -
@transform(f, β)
: Transforms the DOM tree resulting from parsing according toβ
using the functionf
, which takes the DOM tree,vdata
, and the string index of the parsed substring as arguments.this
is set to the instance of the parser.The transform should not modify the input parse tree, as the original may be part of a cached entry (see Optimizations for more info) and so cause unintended behavior elsewhere in the parser.
Returns whatever the transform returns;
undefined
is considered a failure to parse. -
@v(α_name, argf)
: Parses the production namedα_name
, as defined by@define_production(α_name, ...)
.If
argf
is specified, callsargf.call(this, vdata)
and assigns the return value tovdata.arg
(which is passed to the grammar operations within the β tree for this production). The purpose of this second argument is to make the receiving productions variable functions of this parameter, specifically to pass backreferences from an earlier@re
match to another production.If, for instance, one wishes to parse XML without needing to know all schema-valid tags in advance, one needs to match a close tag to the tag that opened it. This can be performed with a non-greedy alternation, but a more efficient way to accomplish this is to use greediness combined with a forward-looking negative regular expression match on the open tag, such as in the BBCode parser from the test code:
@define_production('Element', @seq( @re('\\[([A-Za-z]*)(?:=([^\\]]*))?\\]', 'opentag'), @zero_or_more( @alternation( @v('Element'), @v('NotSpecificCloseTagOrText', @backref('opentag[1]'))))), @var_re('\\[/\\=opentag[1]\\]')))) @define_production('NotSpecificCloseTagOrText', @transform(text, @var_re('\\[/(?!\\=arg[0]\\])[^\\]]*\\]|\\[(?:[^/][^\\]]*)?\\]|[^\\[]+')))
In the first step of the
@seq
,Element
searches for an open tag and assigns the name of the tag tovdata.opentag
. In the second step it then parses zero or moreElement
s orNotSpecificCloseTagOrText
s: the first case parses a full subelement (such as[b]...[/b]
); the second case, however, parses any text or close tag except the close tag that matchesvdata.opentag[1]
(asvdata.arg[0]
fromNotSpecificCloseTag
'svdata
). In that case, the@zero_or_more
is done and the@var_re
matches the specific close tag in the final step of the@seq
.Clearly, this could all be done within the
Element
production itself, but one can imagine another use ofNotSpecificCloseTagOrText
that might motivate breaking it out into its own production.See
@var_re
for information on how the argument is inserted into the regular expression.Returns precisely what the production for the given
α_name
would return. -
@var_re(re_str, match_name)
: Same as@re
, but replaces all instances of=arg[idx]
withvdata.arg[idx]
. Note that this regular expression cannot be precompiled and so@var_re
is less efficient than@re
.
Other methods
-
@backref(match_name[idx])
: Returns the value ofvdata.match_name[idx]
, i.e., either:- The capture group with index
idx
within the match array namedmatch_name
by an@re
or@var_re
call within the β tree for this production; or - When
match_name == 'arg'
, the valuevdata.arg[idx]
wherevdata.arg
was set by the second argument to@v
.
Currently, the only use for this operation is as the second argument to
@v
, to pass a backreference into another production. - The capture group with index
The DOM tree
An object tree representing the parser output. In each node, standard keys always present are pos
and length
, indicating respectively the starting index of and the length of the input string whose parsed output is represented by this subtree; and type
, indicating the type of element represented by this node. Other keys for a node depend on the output of the grammar operation used to parse it.
Suggestions
Specify @debug = true
in your class definition to turn on debug logging for instances of that parser: this should help with development of tricky grammars.
See test/parser.coffee
for a good example of how to implement a grammar: BBCodeParser
is an early version of what I wrote this code for. It parses a variant of BBCode that I use in my old Perl-based forum software that is now long in the tooth and probably full of security holes. (Sssh! Don't tell anyone.)
Optimizations
If, as a result of backtracking, a particular production is parsed multiple times at the same string position with the same vdata
, the second and subsequent attempts will return the previously cached result instead of re-parsing the string. This caching is evident in the full debug log.
To do
- Clean up the code and structure as I learn Node.js better. Please provide feedback if something is not working as it should within the ecosystem: it has been difficult to piece together the "right" way to do things through the results of Google searches.
License
The MIT License
Copyright (c) 2013 Kyle Rose
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.