PyLex
Implements a Parser class for modeling the high-level control functions of Python programs.
Parser
The high-level structure of a Python file can be represented as a parse tree. Consider the following snippet of Python code:
class Bot(object):
    def __init__(self, id):
        self.name = id
    def work(self):
        print("Beep, Boop .-.")
b = Bot(1)
while True:
    b.work()
This can be imagined as the following parse tree, where each node is enclosed in a box:
Calling the parse() function of an initialized Parser returns such a syntax tree of the control flow of a Python program. A Parser can be initialized at instantiation, or when calling the parse() method. The Parser accepts two optional arguments when initializing:
- text?: string: The text string to parse.
- tabFmt?: TabInfo: A tab information descriptor (see Data Types).
When an argument is omitted, the previously passed value for that field will be reused. By preserving the state this way, parse() can be called repeatedly without arguments to get the same tree more than once, and any new text passed without a tabFmt will be assumed to use the same format.
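The argument reuse described above can be sketched roughly as follows. This is a stand-in for illustration only, not PyLex's implementation: SketchParser, SketchTabFormat, and the returned object are hypothetical, and the sketch splits text into lines instead of building a real tree.

```typescript
// Hypothetical stand-in for TabInfo; the field names follow the Data Types section.
interface SketchTabFormat {
  size: number;
  hard: boolean;
}

// Hypothetical stand-in that only models how arguments are remembered
// between calls. It does not build a real parse tree.
class SketchParser {
  private text: string | undefined;
  private tabFmt: SketchTabFormat | undefined;

  constructor(text?: string, tabFmt?: SketchTabFormat) {
    this.text = text;
    this.tabFmt = tabFmt;
  }

  // Omitted arguments fall back to the values stored by an earlier call.
  parse(text?: string, tabFmt?: SketchTabFormat): { lines: string[]; tabFmt: SketchTabFormat } {
    if (text !== undefined) { this.text = text; }
    if (tabFmt !== undefined) { this.tabFmt = tabFmt; }
    if (this.text === undefined) {
      // Assumption: this sketch treats "no text yet" as an error.
      throw new Error("no text has been provided yet");
    }
    // Assumed fallback, mirroring the TabInfo defaults (size 4, soft tabs).
    const fmt = this.tabFmt ?? { size: 4, hard: false };
    return { lines: this.text.split("\n"), tabFmt: fmt };
  }
}

const sketch = new SketchParser("while True:\n  pass", { size: 2, hard: false });
const first = sketch.parse();         // reuses the text and tabFmt given to the constructor
const second = sketch.parse("x = 1"); // new text, same tabFmt as before
```

Passing only new text keeps the earlier format, which is convenient when re-parsing a document whose indentation settings have not changed.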
Once a Parser object has been initialized, calling its context(lineNumber: number) method will return a path of nodes from the leaf node containing the specified line number to the root. Take, for example, the print() statement inside of work(). The returned context would be the path from the function work node, through the class Bot node, to the root, and can be read as:
The line print("Beep, Boop .-.") is inside the function work() inside the class Bot inside the root of the document
NOTE: It is important to recognize that the print statement is not an actual node in practice, but nonetheless it is helpful to think of it being "inside" the leaf.
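One way to picture the lookup is the sketch below, which builds the tree for the Bot snippet by hand and walks it. Everything here is an assumption made for illustration: SketchNode, makeNode, sketchContext, the explicit line ranges, and the "root" label are hypothetical and are not part of PyLex, whose nodes are LexNode objects.

```typescript
// Hypothetical node type for illustration only.
interface SketchNode {
  label: string;
  startLine: number; // first line covered by the node (0-indexed, assumed)
  endLine: number;   // last line covered by the node (inclusive, assumed)
  parent: SketchNode | undefined;
  children: SketchNode[];
}

function makeNode(label: string, startLine: number, endLine: number, parent?: SketchNode): SketchNode {
  const node: SketchNode = { label, startLine, endLine, parent, children: [] };
  if (parent !== undefined) { parent.children.push(node); }
  return node;
}

// Descend to the deepest node whose range contains the line,
// then follow parent links back up to the root.
function sketchContext(root: SketchNode, lineNumber: number): SketchNode[] {
  let current: SketchNode = root;
  let next: SketchNode | undefined = root;
  while (next !== undefined) {
    current = next;
    next = current.children.find(c => c.startLine <= lineNumber && lineNumber <= c.endLine);
  }
  const path: SketchNode[] = [];
  for (let n: SketchNode | undefined = current; n !== undefined; n = n.parent) {
    path.push(n);
  }
  return path;
}

// Tree for the Bot snippet, counting its lines from 0.
const root = makeNode("root", 0, 7);
const bot = makeNode("class Bot", 0, 4, root);
makeNode("function __init__", 1, 2, bot);
makeNode("function work", 3, 4, bot);
makeNode("while True", 6, 7, root);

// Line 4 is the print() call inside work().
const labels = sketchContext(root, 4).map(n => n.label);
// labels is ["function work", "class Bot", "root"]
```

Note that the print statement itself never appears in the result; the path starts at the innermost node that contains it.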
Data Types
Node
Each parser node is of type LexNode, which extends vscode.TreeItem and has the following fields:
class LexNode {
    readonly label: string // Text label for node e.g., "function foo", "while True", "class Bot"
    readonly collapsibleState: vscode.TreeItemCollapsibleState // None (0), Collapsed (1), or Expanded (2)
    readonly token?: LineToken // Token associated with this node.
    private _children?: LexNode[] // Child nodes. Accessed with children()
    private _parent?: LexNode // Parent. Accessed with parent()
}
Additionally:
- Use the hasChildren() method to check for children.
- Use the rootPath() method to return a path of nodes starting from the current node and ending at the root. Internally, the context() method of Parser uses the root path of the leaf node "containing" the specified line number.
LineToken
A line token represents a single line of a Python file:
class LineToken {
    readonly type: Symbol // Type of token
    readonly linenr: number // Line number of this token (0-indexed)
    readonly indentLevel: number // Indent level of this line
    readonly attr?: any // Any additional things a token might need (class name, control condition)
}
TabInfo
A descriptor class to specify a type of tab for the Lexer.
class TabInfo {
    public size?: number = 4; // width of one tab
    public hard?: boolean = false; // whether to use tab characters
}
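To illustrate how a tab descriptor could feed into the indentLevel of a line token, here is a small sketch. The rule it applies is an assumption, not taken from the PyLex source: with hard tabs each leading tab counts as one level, and with soft tabs every size leading spaces count as one level, rounding partial indents up. IndentFormat and sketchIndentLevel are hypothetical names.

```typescript
// Hypothetical mirror of the TabInfo descriptor above.
interface IndentFormat {
  size: number;  // width of one tab, in spaces
  hard: boolean; // whether indentation uses tab characters
}

// Assumed rule (not confirmed by the source):
//   hard tabs -> one level per leading tab character
//   soft tabs -> one level per `size` leading spaces, partial indents rounded up
function sketchIndentLevel(line: string, fmt: IndentFormat = { size: 4, hard: false }): number {
  const match = /^[ \t]*/.exec(line);
  const leading = match ? match[0] : "";
  if (fmt.hard) {
    // Count tab characters in the leading whitespace.
    return leading.split("\t").length - 1;
  }
  // Count space characters in the leading whitespace.
  const spaces = leading.split(" ").length - 1;
  return Math.ceil(spaces / fmt.size);
}

const softLevel = sketchIndentLevel("        print(1)");                      // 8 spaces, size 4
const hardLevel = sketchIndentLevel("\t\tprint(1)", { size: 4, hard: true }); // 2 tabs
```

Under these assumptions both calls report an indent level of 2, since eight spaces at size 4 and two tab characters describe the same depth.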
Symbol
Each symbol type represents either a Python construct, an indentation symbol, or EOF. Indentation symbols are used to track indentation inside of blocks.
enum Symbol {
    function,
    class,
    if,
    else,
    elif,
    for,
    while,
    try,
    except,
    finally,
    with,
    indent,
    eof
}
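As a rough illustration of how a line might be assigned one of these symbol types, the sketch below looks at the first keyword on a line. It is hypothetical throughout: SketchSymbol, KEYWORD_TO_SYMBOL, and sketchClassify are made-up names, the symbols are modeled as strings rather than the enum above, and treating every other line as an indentation symbol is an assumption rather than documented behavior.

```typescript
// Hypothetical string names mirroring the construct members of Symbol, plus "indent".
type SketchSymbol =
  | "function" | "class" | "if" | "else" | "elif" | "for"
  | "while" | "try" | "except" | "finally" | "with" | "indent";

// Python keyword at the start of a line -> symbol name.
const KEYWORD_TO_SYMBOL = new Map<string, SketchSymbol>([
  ["def", "function"],
  ["class", "class"],
  ["if", "if"],
  ["else", "else"],
  ["elif", "elif"],
  ["for", "for"],
  ["while", "while"],
  ["try", "try"],
  ["except", "except"],
  ["finally", "finally"],
  ["with", "with"],
]);

// Assumption: any line that does not open one of the constructs above
// is treated as an indentation symbol.
function sketchClassify(line: string): SketchSymbol {
  const match = /^\s*([A-Za-z_]+)\b/.exec(line);
  const word = match ? match[1] : undefined;
  const symbol = word !== undefined ? KEYWORD_TO_SYMBOL.get(word) : undefined;
  return symbol ?? "indent";
}

const defKind = sketchClassify("    def work(self):"); // "function"
const loopKind = sketchClassify("while True:");        // "while"
const otherKind = sketchClassify("b = Bot(1)");        // "indent"
```

A keyword prefix alone is not enough to match: a line such as `iffy = 1` starts with the identifier `iffy`, which is not in the map, so it falls through to "indent".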
Find example programs in the examples sub-directory.