Mazard Scanner
This scanner converts a Markdown document into an array of tokens, which a parser can then assemble into an expression tree. It draws heavily on Robert Nystrom's Crafting Interpreters and on The Theory of Parsing, Translation, and Compiling by Alfred Aho and Jeffrey Ullman.
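The example tables in this document print each token with a type, lexeme, literal, line, and column. As a rough sketch of what such a token could look like in TypeScript (the `Token` interface and the `formatToken` helper are hypothetical names inferred from those columns, not the scanner's actual exports):

```typescript
// Hypothetical token shape, inferred from the columns that printTokens shows.
// The real scanner's types may differ.
type TokenType = "HHASH" | "SPACE" | "SYMBOL" | "RUNE" | "BR" | "EOF"; // subset only

interface Token {
  type: TokenType;
  lexeme: string;                     // raw text taken from the source
  literal: string | number | boolean; // interpreted value, e.g. a space count
  line: number;                       // zero-based, as in the examples
  column: number;                     // zero-based, as in the examples
}

// Format one token as a table row in the same column order as the examples.
function formatToken(index: number, token: Token): string {
  return [
    index,
    token.type,
    JSON.stringify(token.lexeme),
    JSON.stringify(token.literal),
    token.line,
    token.column,
  ].join(" | ");
}

const heading: Token = { type: "HHASH", lexeme: "#", literal: 1, line: 0, column: 0 };
console.log(formatToken(0, heading)); // 0 | HHASH | "#" | 1 | 0 | 0
```

Keeping the lexeme and the literal separate lets a parser work with the interpreted value while the original text stays available for error messages.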
Token types
Type | Description | Example |
---|---|---|
SYMBOL | An alphanumeric string that closely resembles a variable name in other languages | Foo, foo, foo-bar, foo_bar |
RUNE | Similar to a symbol, but these strings contain non-alphanumeric content | Foo#, -foo, _foo, 1foo, fo>o |
NUMBER | An integer, decimal, or a number in exponential notation | 1, 1.0, +1, -1, 1.0e1 |
SPACE | One or more space characters. The literal value is the number of spaces encountered. | |
TAB | A "\t" or " " at the start of a line. | |
BR | One or more line break characters. | |
COLON | A, well, colon | : |
COLON_COLON | Two colons in sequence, likely indicating an Obsidian metadata value | :: |
FRONTMATTER_START | The triple-dash at the start of a frontmatter section | --- |
FRONTMATTER_END | The triple-dash at the end of a frontmatter section | --- |
FRONTMATTER_KEY | A frontmatter key | The foo in foo: bar |
FRONTMATTER_VALUE | A frontmatter value | The bar in foo: bar |
FRONTMATTER_BULLET | A dash at the beginning of a line | The - in - bar |
CODE_START | The triple-backtick at the start of a code section | ``` |
CODE_LANGUAGE | The language specified after the triple backticks of a CODE_START | The typescript in ```typescript |
CODE_KEY | Similar to frontmatter, code blocks can have keys and values after the CODE_START | The foo in foo: bar |
CODE_VALUE | A metadata code value | The bar in foo: bar |
CODE_SOURCE | The source code inside of a code block | |
CODE_END | The triple-backtick at the end of a code section | ``` |
HHASH | A run of one to six hash characters at the beginning of a line | The ### in ### Foo |
HGTHAN | A > at the beginning of a line | The > in > Foo |
L_BRACKET | A single left bracket | [ |
LL_BRACKET | Two left brackets | [[ |
R_BRACKET | A single right bracket | ] |
RR_BRACKET | Two right brackets | ]] |
LL_BRACE | Two left braces | {{ |
RR_BRACE | Two right braces | }} |
ASTERISK | A single asterisk | * |
ASTERISK_ASTERISK | Two asterisks | ** |
EQUALS_EQUALS | Two equals signs | == |
ORDINAL | A number with an ordinal suffix | 1st, 2nd, 3rd, 4th |
PIPE | A vertical bar (pipe) | \| |
TAG | A symbol prefixed with a hashtag | #tag, #tag-foo, #tag1 |
TILDE_TILDE | Two tildes | ~~ |
ESCAPE | A backslash followed by any character | | |
L_PAREN | A left parenthesis | ( |
R_PAREN | A right parenthesis | ) |
BACKTICK | A single backtick | ` |
DOLLAR | A dollar sign | $ |
DOLLAR_DOLLAR | Two dollar signs | $$ |
PERCENT_PERCENT | Two percent signs | %% |
COMMENT | The content of a comment | A comment in %% A comment |
HTML_TAG | An HTML tag | <div>, </div>, <p /> |
HR | A horizontal rule | ---, ***, ___ |
BULLET | A dash or asterisk at the beginning of a line | The - in - foo |
N_BULLET | A numbered bullet at the beginning of a line | The 1. in 1. foo |
CHECKBOX | A checkbox at the beginning of a line | The - [ ] in - [ ] foo |
URL | A url | https://www.google.com |
EOF | The very end of the string or file | |
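The word-like types in the table (SYMBOL, RUNE, NUMBER, ORDINAL, TAG) differ only in which characters they allow. The sketch below shows one way to tell them apart with regular expressions. The patterns and the `classifyWord` helper are assumptions inferred from the table's examples, not taken from the scanner's source, and the real scanner also splits on markers such as `*` and `~~` rather than only on whitespace.

```typescript
// Word-like token types covered by this sketch (a subset of the table).
type WordType = "NUMBER" | "ORDINAL" | "TAG" | "SYMBOL" | "RUNE";

// Patterns inferred from the table's examples; the real scanner may differ.
const NUMBER = /^[+-]?\d+(\.\d+)?(e[+-]?\d+)?$/i; // 1, 1.0, +1, -1, 1.0e1
const ORDINAL = /^\d+(st|nd|rd|th)$/;             // 1st, 2nd, 3rd, 4th
const TAG = /^#[A-Za-z][A-Za-z0-9_-]*$/;          // #tag, #tag-foo, #tag1
const SYMBOL = /^[A-Za-z][A-Za-z0-9_-]*$/;        // Foo, foo, foo-bar, foo_bar

// Classify one word. Anything that matches none of the stricter patterns
// falls through to RUNE, the catch-all for non-alphanumeric content.
function classifyWord(word: string): WordType {
  if (NUMBER.test(word)) return "NUMBER";
  if (ORDINAL.test(word)) return "ORDINAL";
  if (TAG.test(word)) return "TAG";
  if (SYMBOL.test(word)) return "SYMBOL";
  return "RUNE";
}

for (const word of ["foo-bar", "1foo", "1.0e1", "3rd", "#tag1", "here's"]) {
  console.log(word, classifyWord(word));
}
```

Checking the stricter patterns first and treating RUNE as the fallback matches how the table describes a rune: like a symbol, but with characters a symbol does not allow.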
Some examples
const tokens = scanTokens([
"# Mazard Scanner",
"",
"This scanner converts a Markdown document into an array of tokens.",
]);
printTokens(tokens);
No | Type | Lexeme | Literal | Line | Column |
---|---|---|---|---|---|
0 | HHASH | "#" | 1 | 0 | 0 |
1 | SPACE | " " | 1 | 0 | 1 |
2 | SYMBOL | "Mazard" | "Mazard" | 0 | 2 |
3 | SPACE | " " | 1 | 0 | 8 |
4 | SYMBOL | "Scanner" | "Scanner" | 0 | 9 |
5 | BR | "\n\n" | 2 | 0 | 16 |
6 | SYMBOL | "This" | "This" | 2 | 0 |
7 | SPACE | " " | 1 | 2 | 4 |
8 | SYMBOL | "scanner" | "scanner" | 2 | 5 |
9 | SPACE | " " | 1 | 2 | 12 |
10 | SYMBOL | "converts" | "converts" | 2 | 13 |
11 | SPACE | " " | 1 | 2 | 21 |
12 | SYMBOL | "a" | "a" | 2 | 22 |
13 | SPACE | " " | 1 | 2 | 23 |
14 | SYMBOL | "Markdown" | "Markdown" | 2 | 24 |
15 | SPACE | " " | 1 | 2 | 32 |
16 | SYMBOL | "document" | "document" | 2 | 33 |
17 | SPACE | " " | 1 | 2 | 41 |
18 | SYMBOL | "into" | "into" | 2 | 42 |
19 | SPACE | " " | 1 | 2 | 46 |
20 | SYMBOL | "an" | "an" | 2 | 47 |
21 | SPACE | " " | 1 | 2 | 49 |
22 | SYMBOL | "array" | "array" | 2 | 50 |
23 | SPACE | " " | 1 | 2 | 55 |
24 | SYMBOL | "of" | "of" | 2 | 56 |
25 | SPACE | " " | 1 | 2 | 58 |
26 | RUNE | "tokens." | "tokens." | 2 | 59 |
27 | EOF | "" | "" | 2 | 66 |
const tokens = scanTokens("here's a *line* with some ~~formatting~~.");
printTokens(tokens);
No | Type | Lexeme | Literal | Line | Column |
---|---|---|---|---|---|
0 | RUNE | "here's" | "here's" | 0 | 0 |
1 | SPACE | " " | 1 | 0 | 6 |
2 | SYMBOL | "a" | "a" | 0 | 7 |
3 | SPACE | " " | 1 | 0 | 8 |
4 | ASTERISK | "*" | "*" | 0 | 9 |
5 | SYMBOL | "line" | "line" | 0 | 10 |
6 | ASTERISK | "*" | "*" | 0 | 14 |
7 | SPACE | " " | 1 | 0 | 15 |
8 | SYMBOL | "with" | "with" | 0 | 16 |
9 | SPACE | " " | 1 | 0 | 20 |
10 | SYMBOL | "some" | "some" | 0 | 21 |
11 | SPACE | " " | 1 | 0 | 25 |
12 | TILDE_TILDE | "~~" | "~~" | 0 | 26 |
13 | SYMBOL | "formatting" | "formatting" | 0 | 28 |
14 | TILDE_TILDE | "~~" | "~~" | 0 | 38 |
15 | RUNE | "." | "." | 0 | 40 |
16 | EOF | "" | "" | 0 | 41 |
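In the table above, joining the lexemes in order reproduces the original input, which makes for a cheap sanity check on scanner output. The snippet below hard-codes the lexemes from that table instead of calling the scanner, so it runs on its own; `joinLexemes` is a hypothetical helper, and whether the round trip holds for every input is an assumption that the examples here support but do not prove.

```typescript
// Lexemes copied from the table above, in token order (EOF has an empty lexeme).
const lexemes: string[] = [
  "here's", " ", "a", " ", "*", "line", "*", " ", "with", " ",
  "some", " ", "~~", "formatting", "~~", ".", "",
];

// Hypothetical helper: rebuild the source text from the lexemes alone.
function joinLexemes(parts: string[]): string {
  return parts.join("");
}

const source = "here's a *line* with some ~~formatting~~.";
console.log(joinLexemes(lexemes) === source); // true
```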
const tokens = scanTokens([
"- [x] Finish the scanner.",
"- [ ] Write some reasonable documentation",
]);
printTokens(tokens);
No | Type | Lexeme | Literal | Line | Column |
---|---|---|---|---|---|
0 | CHECKBOX | "- [x]" | true | 0 | 0 |
1 | SPACE | " " | 1 | 0 | 5 |
2 | SYMBOL | "Finish" | "Finish" | 0 | 6 |
3 | SPACE | " " | 1 | 0 | 12 |
4 | SYMBOL | "the" | "the" | 0 | 13 |
5 | SPACE | " " | 1 | 0 | 16 |
6 | RUNE | "scanner." | "scanner." | 0 | 17 |
7 | BR | "\n" | 1 | 0 | 25 |
8 | CHECKBOX | "- [ ]" | false | 1 | 0 |
9 | SPACE | " " | 1 | 1 | 5 |
10 | SYMBOL | "Write" | "Write" | 1 | 6 |
11 | SPACE | " " | 1 | 1 | 11 |
12 | SYMBOL | "some" | "some" | 1 | 12 |
13 | SPACE | " " | 1 | 1 | 16 |
14 | SYMBOL | "reasonable" | "reasonable" | 1 | 17 |
15 | SPACE | " " | 1 | 1 | 27 |
16 | SYMBOL | "documentation" | "documentation" | 1 | 28 |
17 | EOF | "" | "" | 1 | 41 |
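To show how a consumer might use these tokens, the sketch below groups them into to-do items: a CHECKBOX opens an item, a BR or EOF closes it, and the lexemes in between form its text. The token list is copied from the table above so the snippet runs without the scanner. The trimmed-down `Token` shape, the `Task` type, and `collectTasks` are hypothetical names for illustration only.

```typescript
// Hypothetical minimal token shape; line and column are omitted for brevity.
interface Token {
  type: string;
  lexeme: string;
  literal: string | number | boolean;
}

interface Task {
  done: boolean;
  text: string;
}

// Group tokens into tasks: a CHECKBOX opens a task, BR or EOF closes it,
// and every lexeme in between becomes part of the task text.
function collectTasks(tokens: Token[]): Task[] {
  const tasks: Task[] = [];
  let current: Task | null = null;
  for (const token of tokens) {
    if (token.type === "CHECKBOX") {
      current = { done: token.literal === true, text: "" };
    } else if (token.type === "BR" || token.type === "EOF") {
      if (current) tasks.push({ done: current.done, text: current.text.trim() });
      current = null;
    } else if (current) {
      current.text += token.lexeme;
    }
  }
  return tasks;
}

// Tokens copied from the table above.
const tokens: Token[] = [
  { type: "CHECKBOX", lexeme: "- [x]", literal: true },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "Finish", literal: "Finish" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "the", literal: "the" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "RUNE", lexeme: "scanner.", literal: "scanner." },
  { type: "BR", lexeme: "\n", literal: 1 },
  { type: "CHECKBOX", lexeme: "- [ ]", literal: false },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "Write", literal: "Write" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "some", literal: "some" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "reasonable", literal: "reasonable" },
  { type: "SPACE", lexeme: " ", literal: 1 },
  { type: "SYMBOL", lexeme: "documentation", literal: "documentation" },
  { type: "EOF", lexeme: "", literal: "" },
];

console.log(collectTasks(tokens));
```

Because the CHECKBOX literal already carries the checked state as a boolean, the consumer never has to inspect the `- [x]` lexeme itself.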