Lexigram
About
Lexigram is a lexer/parser generator. It is written in Rust and generates Rust code.
It currently generates a predictive, non-recursive LL(1) parser. However, you may write a non-LL(1) grammar, as long as Lexigram can transform it into LL(1). Left-recursive and ambiguous rules are supported, and left factorization is performed automatically. These transformations are made as transparent as possible in the generated code.
The listener pattern allows your code to interact with the parser by executing user-implemented methods when rules are processed, as in ANTLR. It is also somewhat similar to the code written next to each rule in tools like Yacc and Bison, but it’s cleaner and more flexible.
The project is hosted on GitHub.
This Documentation
This documentation consists of several chapters.
The introduction gives a general overview of parsers, their components, how they are designed, how they can be used, and the main parser types. If you’re not familiar with the design of parsers and the limitations of LL(1) grammars, it’s recommended reading; otherwise, it can be skipped.
The tutorial shows how to build a small parser step by step. It helps you familiarize yourself with Lexigram using a hands-on approach.
The quick overview of the crates presents the main crates of this project.
The lexer generator reference and the parser generator reference describe more formally the lexicon and grammar language, the grammar transformations, and the generated code.
The Lexigram Book covers version 0.9.1 and has been last updated on 30 March 2026.
Installation and Dependencies
The lexigram crate contains the executable. You’ll find information on how to install it and its main command-line options in the Lexigram Crates chapter.
The lexigram-core crate is the only dependency you need with the generated code. There are more details about it in the relevant Lexigram Crates chapter.
Generating the Code
lexigram -h shows the available options.
For example, to create a parser from the combined lexicon/grammar in ./grammar/calc.lg, and to write the generated code to ./src/lexer.rs and ./src/parser.rs, use this command:
lexigram -c grammar/calc.lg -l src/lexer.rs -p src/parser.rs --log
You can also generate a parser from Rust code instead of using the command-line tool. In that case, you need the full library; see the Lexigram Crates chapter and the tutorial for more information.
Don’t miss the examples in the repository, under the examples directory (they’re briefly listed at the end of the tutorial).
Introduction
This chapter gives a gentle introduction to parsers and how the tool can help you build one. Its only purpose is to make sure you’re aware of how lexers and parsers work, and what the limitations of the grammar are.
If you’re already familiar with those notions, you can skip this introduction.
Components of a Parser
A parser is split into two main components: the lexical analyzer and the syntax analyzer, also called the lexer (or scanner) and the parser, respectively. The term “parser” is sometimes used more loosely to refer to the lexer and the parser together, as we did above.
To give an intuitive view, the lexer reads the input and transforms it into a sequence of tokens, the basic words of the language. These tokens can be variable, like a number, or fixed, like a keyword or a punctuation symbol. The parser receives the sequence of tokens from the lexer and verifies that they fit the grammatical rules.
Here is a simple example to illustrate the difference:
Id: [A-Za-z][A-Za-z0-9]*;
Num: [0-9]+;
Space: [ \n\r\t]+ -> skip;
assign -> "let" Id "=" value;
value -> Id | Num;
A note on vocabulary: grammar rules are called production rules or, more simply, productions. The part on the left side of the arrow is the head, and the part on the right is the body. Tokens are also called terminals, especially when they appear in a production. Production names, like “assign” and “value”, are referred to as nonterminals, as opposed to terminals.
The tokens recognized by the lexer are Id, Num, "let", and "=". The first two are a little more complicated since their regular expressions can match many different strings. The others are given literally.
The Space terminal is a little special. We must often use something to separate the tokens, because they could otherwise be merged; for example, if the input starts with “letvelocity” instead of “let velocity”, it’ll match Id instead of matching the sequence "let", Id. By introducing Space, we allow the other tokens to be separated by space characters, even allowing for tabs and new lines. The -> skip action makes the lexer discard those tokens, so that the parser gets "let", Id instead of "let", Space, Id, which would be tedious to handle.
The productions used by the parser are assign -> … and value -> …. They specify how the tokens coming from the lexer may be arranged syntactically in the source language. When a production matches a sequence of tokens, we say that it derives the input from the nonterminal on the left-hand side. Some of those terms are somewhat abstract and formal, but we mention them because they’re often used in the context of parsers.
We use the convention that the top nonterminal is the first to be defined. In this case, assign is the top nonterminal.
Let’s see with an example how an input gets parsed:
let velocity = 125
The lexer outputs the following sequence of tokens:
"let", Id("velocity"), "=", Num("125")
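To make the lexer’s job concrete, here’s a minimal hand-rolled sketch (not Lexigram’s generated code, just an illustration of longest-match tokenization with skipped whitespace) that produces that token sequence:

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Let,
    Eq,
    Id(String),
    Num(String),
}

// Tokenize with longest match; whitespace is skipped like the Space rule above.
fn lex(input: &str) -> Result<Vec<Tok>, String> {
    let mut toks = Vec::new();
    let mut rest = input;
    while let Some(c) = rest.chars().next() {
        if c.is_whitespace() {
            rest = rest.trim_start(); // Space: [ \n\r\t]+ -> skip
        } else if c == '=' {
            toks.push(Tok::Eq);
            rest = &rest[1..];
        } else if c.is_ascii_alphabetic() {
            // Id: [A-Za-z][A-Za-z0-9]* ; the keyword "let" takes priority
            let end = rest.find(|ch: char| !ch.is_ascii_alphanumeric()).unwrap_or(rest.len());
            let (word, r) = rest.split_at(end);
            toks.push(if word == "let" { Tok::Let } else { Tok::Id(word.to_string()) });
            rest = r;
        } else if c.is_ascii_digit() {
            // Num: [0-9]+
            let end = rest.find(|ch: char| !ch.is_ascii_digit()).unwrap_or(rest.len());
            let (num, r) = rest.split_at(end);
            toks.push(Tok::Num(num.to_string()));
            rest = r;
        } else {
            return Err(format!("lexical error at {:?}", c));
        }
    }
    Ok(toks)
}
```

Running `lex("let velocity = 125")` yields the sequence above: `Let`, `Id("velocity")`, `Eq`, `Num("125")`.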
The parser determines the parse tree that makes the productions correspond to the input:
- The top nonterminal is assign. The three leftmost terminals match the input tokens "let", Id, and "=".
- The fourth element is the nonterminal value, which has two alternative productions: value -> Id or value -> Num. The derivation value -> Num matches the input.
The parse tree, also called production graph, is only a representation of the order in which the productions are arranged to match the input (to derive the input). It’s not necessarily an actual object built by the parser, but rather a representation of the parser flow through the productions and the received tokens. Quite commonly, the parser helps build an abstract syntax tree, which better corresponds to the needs of the application. We’ll see how in the next section.
When the parser can’t match the input with the grammar, it generates a syntax error, ideally with enough information to tell the user where it fails and why. Similarly, when the lexer can’t extract any token from the input, it generates a lexical error.
How Parsers Are Used
Parsers are used in any application that needs to read an input in a specific language. It could be a programming language, a configuration file, or an HTML page. Sometimes they’re written by hand, but for more complex parsers, or for types of parsers that are hard to write by hand, a parser generator is usually preferred.
Parsers are usually known as the first part of a compiler, which reads the source of a program and produces an executable. In a nutshell, a compiler has the following parts:
- a lexer and a parser
- a semantic analyzer, which uses the productions of the parser to build an abstract syntax tree (AST), a representation that is easier to process
- an intermediate code generator, which, based on the AST, creates a sequence of basic instructions called the intermediate representation (IR)
- an optional optimizer, which produces a better IR sequence
- a code generator, which transforms the IR into machine code for the target architecture (possibly followed by an optimizer for that particular architecture)
So far, we have carefully avoided saying exactly what the parser is supposed to output. In truth, unlike the lexer, the parser doesn’t produce an output in the form of data. To build the abstract syntax tree or, more generally, the object that represents the parsed input, the parser interacts with your code each time it has recognized which production applies to the next piece of input. Your code can then produce the desired output.
This interaction can take several forms, depending on the parser, but it generally comes in one of two common patterns: the listener and the visitor.
In the listener pattern, each time the parser has correctly matched a production, it calls a user-defined action routine. In parser generators like Yacc and Bison, those routines are written in the grammar itself, which can sometimes reduce readability. In other generators, like ANTLR, the generated code provides a base class that can be specialized (inherited) by the user and where each method corresponds to a nonterminal: here, something like exit_assign() and exit_value(). Other methods corresponding to other phases of the parsing may be available, too, like when a nonterminal is about to be parsed, which can be used to initialize some data.
In the visitor pattern, the generated code also provides a base class with methods that correspond to the nonterminals. The top nonterminal visitor is called at some point, and the user code can then inspect the data matched by the parser in that production, and call the visitor of the nonterminals to go deeper. In our example above, the user would call visit_value() from visit_assign(), so that the whole tree is visited by the user code.
The listener approach is more automatic and natural, while the visitor approach allows the user to skip entire parts of the tree that are not relevant. For example, if there are several parsing passes, the first visitor implementation will collect all the declarations, but it will disregard the fine details of the definitions (e.g. not calling visit_value()). In the second pass, it will read everything, already knowing all the names that have been declared, even if that declaration comes later in the input.
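As a rough Rust sketch of the visitor shape (the node and trait names here are invented for illustration; they are not Lexigram output, which currently provides listeners only):

```rust
// Illustrative parse-tree nodes for the assign/value example.
enum Value {
    Id(String),
    Num(String),
}
struct Assign {
    id: String,
    value: Value,
}

// A visitor with default methods; implementations override what they need
// and decide whether to descend into children.
trait Visitor {
    fn visit_assign(&mut self, a: &Assign) {
        self.visit_value(&a.value); // default: descend
    }
    fn visit_value(&mut self, _v: &Value) {}
}

// A first pass that collects declared names and skips the value subtrees.
#[derive(Default)]
struct DeclPass {
    names: Vec<String>,
}
impl Visitor for DeclPass {
    fn visit_assign(&mut self, a: &Assign) {
        self.names.push(a.id.clone());
        // deliberately not calling visit_value(): this pass ignores the details
    }
}
```

By overriding `visit_assign` without calling `visit_value`, the pass prunes entire subtrees, which is exactly the advantage of the visitor approach.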
At this time, Lexigram generates the code for a listener only, creating a custom trait that you can implement for a listener object. This allows you to intercept the parser flow with the trait methods and to build an abstract syntax tree very efficiently.
Lexicon and Grammar
We saw earlier the specification of a simple language:
Id: [A-Za-z][A-Za-z0-9]*;
Num: [0-9]+;
assign -> "let" Id "=" value;
value -> Id | Num;
You may have already seen the specification of a programming language in one form or another. One that we often see is the Backus–Naur form (BNF), which originated from the collaborative design of ALGOL 58 and evolved to the syntax we know today. Several variants of Extended Backus-Naur form (EBNF) exist, which add regular-expression-like niceties such as repetitions and optional items.
We’ll be using our own language to describe the tokens the lexer must extract, which is the lexicon, and the productions the parser must use, which is the grammar. The syntax we’re using is inspired by ANTLR; it tries to be flexible and easy to read.
Here is what the lexicon of the above example would be; note the names of the terminals are upper camel case:
Let: "let";
Equal: "=";
Id: [A-Za-z][A-Za-z0-9]*;
Num: [0-9]+;
And here is the grammar, in which the names of the nonterminals are snake case, to distinguish them from the terminals:
assign: Let Id Equal value;
value: Id | Num;
We like to separate both when we generate a parser with Lexigram, for several reasons:
- Using string literals for keywords like “let” and punctuation like “,” or “;” introduces a risk of discrepancy if there’s a typo somewhere. If every statement ends with a semicolon, you may have to repeat that string literal in several productions, and if you decide to change it later, you risk missing some occurrences.
- Using string literals doesn’t give them a name, which is annoying when you have to process terminals in the code, as you’ll see later.
- Using string literals doesn’t specify in which order they’re defined. This can be limiting when the order is used to resolve ambiguities. Above, we see that Let and Id overlap, since “let” matches [A-Za-z][A-Za-z0-9]*; but because Let is defined first, the lexer knows that “let” must be extracted as the Let token. Having a full lexicon improves clarity and precision.
- The tool allows you to generate only a lexer; later, it will let you use another, external lexer, so having separate lexicons and grammars is necessary.
But for the sake of conciseness, we’ll often use the simplified form with string literals in small examples, where the concerns above don’t really apply. When we do that, we’re using -> in the grammar instead of : (you may have spotted that clue earlier already).
Where to Draw the Line?
Knowing where the lexer’s job ends and where the parser’s job is taking over isn’t always easy when you’re not used to it. For instance, if you declare a variable, it’s possible to rely on the lexer to scan the whole declaration:
Decl: 'let' [ \t]+ [_a-z][_0-9a-z]* [ \t]* ':' [ \t]* [_a-z][_0-9a-z]*;
All there is to do, once the parser (if you still need one) receives a Decl token, is to read the input attributes that come with it, remove the “let”, and do a split on space and colon characters—there are efficient methods for that in the standard library, after all.
Of course, you’ll have to forbid any comment inside that section of source code.
Another way is to slice it into smaller chunks and rely on the parser to assemble them:
// lexicon
Colon: ':';
Let: 'let';
Id: [_a-z][_0-9a-z]*;
SkipSpaces: [ \t]+ -> skip;
// grammar
declaration: Let Id Colon Id;
At first glance, it’s more work, but you won’t have to split anything in the parser. In the listener’s exit_declaration() method, you’ll receive the strings for the variable and the type, and you don’t have to worry about the rest. If you decide your statements can be split across several lines, it’s fine: just add “\n” once to SkipSpaces. And if you want to support comments everywhere, just add this to the lexicon:
SkipComment: '/*' .*? '*/' -> skip;
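To sketch the listener side (the trait name and method signature below are hypothetical; the actual generated trait and context types come from Lexigram and will differ):

```rust
// Hypothetical shape of a generated listener trait for the declaration grammar.
trait DeclListener {
    // called when `declaration: Let Id Colon Id;` has been matched
    fn exit_declaration(&mut self, var: &str, ty: &str);
}

// User implementation: collect (variable, type) pairs while parsing.
#[derive(Default)]
struct SymbolCollector {
    decls: Vec<(String, String)>,
}

impl DeclListener for SymbolCollector {
    fn exit_declaration(&mut self, var: &str, ty: &str) {
        // the lexer already discarded "let", ":" and the spaces
        self.decls.push((var.to_string(), ty.to_string()));
    }
}
```

The parser would call `exit_declaration` once per matched declaration, so the user code never touches the raw input.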
There are no formal rules that tell how to separate what should be in the lexicon and what should be in the grammar, but you can use a few guidelines:
- As a general rule of thumb, split the source into tokens that represent simple words of the language, words that are easy to manipulate for the parser.
- If a token contains characters that must be discarded later by the parser, you should consider reworking that token’s definition and let the lexer discard what’s possible.
- If a variable pattern is repeated in a token, is there a way to split it, like we did with [_a-z][_0-9a-z]* above? If not, we’ll see a way to define those patterns so they can be reused in the lexicon when we discuss that part in more detail.
- If a token is not useful in a grammar, maybe it can be skipped entirely by the lexer. You have to make sure it’s not used to distinguish two grammar productions, though.
- If several tokens are processed identically by the parser, like integers and floating points if there’s only one numerical type, or like strings between quotes or double quotes (e.g. in Python), it’s better to regroup them in the lexer and issue a single token to reduce the work in the parser.
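For instance, that last guideline could translate into a lexicon like this (illustrative only; the exact regex syntax, such as the optional ? group, should be checked against the lexer generator reference):

```
// one token for both integers and floating points
Num: [0-9]+ ('.' [0-9]+)?;
// one token for both quoting styles
Str: '"' .*? '"' | '\'' .*? '\'';
```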
Don’t take those rules too literally; always consider the work on both sides and in every situation. For example, if you allow escape codes and Unicode values in your string literals, how are you going to handle them? As single characters, they may look like this:
- '\n', '\r', '\\'
- '\u{20}', '\u{00A9}'
But they could also be in the middle of a string:
- "Value:\n\u{25ba} "
When they’re single characters, you could decide to return different tokens, depending on whether they’re regular characters, escape codes, or Unicode hex values. You could strip what’s unnecessary and just return a Unicode token with the hex value as an attribute, in a string that’s ready to be converted to an integer by the parser.
But that would complicate the parser. Instead of handing the character as a token, you’ll need several alternatives:
char_value: Char | Escape | Unicode;
Then, you’ll need to implement its listener method:
let chr = match context {
    CtxCharValue::V1 { char } => { ... }
    CtxCharValue::V2 { escape } => { ... }
    CtxCharValue::V3 { unicode } => { ... }
};
In fact, this isn’t too bad since each branch knows how to handle the attribute: take as-is, unescape, or convert the hex code to a Unicode character. If you process only single characters, it’s a good way to do it, but what if you must handle string literals, too?
Another way to deal with it, since this processing goes beyond the job of the lexer, is to leave those codes untouched in string and character literals and have a Rust method that unescapes them. That method can be used for both characters and strings. If you later decide to add more escape codes, you’ll just have to add them on one line in the lexicon and update that method; the parser won’t require an additional alternative in the rule or a new branch in the listener implementation.
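A sketch of such a shared unescape helper is below (the escape set and error handling are illustrative assumptions, not Lexigram’s actual implementation):

```rust
// Unescape \n, \r, \t, \\ and \u{...} sequences in a raw literal.
// Used identically for character and string literals.
fn unescape(s: &str) -> Result<String, String> {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('r') => out.push('\r'),
            Some('t') => out.push('\t'),
            Some('\\') => out.push('\\'),
            Some('u') => {
                // expect a {hex} block, e.g. \u{25ba}
                if chars.next() != Some('{') {
                    return Err("expected '{' after \\u".to_string());
                }
                let hex: String = chars.by_ref().take_while(|&ch| ch != '}').collect();
                let code = u32::from_str_radix(&hex, 16).map_err(|e| e.to_string())?;
                out.push(char::from_u32(code).ok_or("invalid code point")?);
            }
            other => return Err(format!("unknown escape: {:?}", other)),
        }
    }
    Ok(out)
}
```

Adding a new escape code then means one more arm in this match and one more alternative on the lexicon line, with no change to the grammar or the listener.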
Note
At this point, you may have been wondering how Lexigram parses the lexicon and the grammar. That’s right, it has a lexer and a parser of its own! Two, actually, since the lexicon and the grammar are two different languages.
Processing the escape codes and Unicode values is actually done as described above; you can see how it’s done by checking Lexi’s lexicon, grammar, and listener implementation (Lexi is the lexer generator; Gram is the parser generator).
And, before you ask, yes: Lexigram generates its own lexer and parser.
There’s Parser and Parser
We have another important matter to tackle: there are different parser types, which support different classes of grammars, and so different subsets of possible languages. It’s worth knowing the basics in order to choose the right parser for the desired language, or, conversely, to design the language so that its grammar is supported by a type of parser.
There are several ways to classify parsers, but if we consider a practical perspective, the derivation order, there are three classes of parsers for context-free grammars: top-down, bottom-up, and universal. The most common parser types are top-down and bottom-up because they’re more efficient.
Currently, Lexigram generates a top-down parser. More precisely, a table-driven LL(1) parser, which belongs to the category of top-down parsers. As we’ll see, those parsers can’t process left recursion (a -> a B | C) nor alternative productions beginning with the same symbols (a -> B C | B D), but it’s possible to systematically transform the grammar to remove them. Thankfully, Lexigram performs these transformations automatically and as transparently as possible, so that you don’t have to bother about those limitations.
In practice, top-down parsers are widely used in compilers because they’re relatively simple to make and to maintain. They’re often written manually using a recursive implementation, in which case they’re called recursive descent parsers. Some generators, like ANTLR, also write a recursive descent parser instead of a non-recursive, table-driven one.
The next two sections give a general idea of each parser type.
Top-Down Parsers
A top-down parser starts from the top production and deduces the items from the grammar, looking at the next symbol(s) to predict the production when there are several alternatives.
Let’s illustrate that with an example:
Id: [A-Za-z][A-Za-z0-9]*;
Num: [0-9]+;
statement -> assign ";" | print ";";
assign -> "let" Id "=" value;
print -> "print" value;
value -> Id | Num;
which we can rewrite with each production alternative on its own line:
Id: [A-Za-z][A-Za-z0-9]*;
Num: [0-9]+;
statement -> assign ";";
statement -> print ";";
assign -> "let" Id "=" value;
print -> "print" value;
value -> Id;
value -> Num;
A top-down parser processes the input “let a = 10;” as follows:
- The top nonterminal is statement, but there are two alternatives. The first token is "let", so statement -> assign ";" is predicted, since assign starts with that token. The symbols assign and ";" of the production are queued to check that the incoming tokens match them.
- The next symbol in the queue is the nonterminal assign, and the pending token is still "let", so the parser predicts the production assign -> "let" Id "=" value. assign is removed from the queue and replaced with the symbols of the production body, so the queue is now "let", Id, "=", value, ";".
- The next 3 tokens match the terminals in the queue: "let", Id("a"), and "=". They’re removed from the queue, leaving value and ";".
- Since the next symbol in the queue is a nonterminal, value, the parser looks at the incoming token, Num("10"), and predicts that the next production is value -> Num. value is replaced by Num in the queue.
- The Num symbol of the queue matches the incoming token Num("10"), so it’s removed. The queue now contains ";".
- The next token from the lexer is ";". It matches the last item in the queue, so it’s removed.
- The queue is empty, and there are no more tokens from the lexer, so the parser has finished and the result is successful.
The equivalent parse tree is shown below:
You can see that it’s indeed built from the top down, by incrementally attaching nonterminal branches and terminal leaves following a preorder, depth-first search.
Another way to see it is that the input is derived from left to right, matching the tokens in that order with the productions.
The “LL” in “LL(k) parser” stands for “scanning Left-to-right, Leftmost derivation”, and k is the number of tokens the parser can look ahead to predict the next derivation. Similarly, an LL(k) grammar needs at most k tokens of lookahead to predict a unique derivation. So an LL(k) parser can parse any LL(k) grammar (and also any LL(i) grammar, for i < k). When k isn’t mentioned, it normally means LL(1).
Note
A more restrictive subclass of grammars is sometimes mentioned: SLL, for Simple LL. Essentially, an SLL grammar requires no empty productions (a -> A | ε), and the production alternatives of a given nonterminal shouldn’t start with the same k terminals (a -> A b | A c). SLL parsers are somewhat easier to build, but the restrictions are significant, so we won’t spend more time on them.
The grammar above can be parsed by looking only at the next token, so it’s an LL(1) grammar, which can be parsed by an LL(1) parser. Easy enough.
Implementations
LL(k) parsers can be implemented in different ways, the most common being either recursive-descent or table-driven parsers.
- Recursive-descent parsers
This implementation has one method for each nonterminal. Each method begins by inspecting as many lookahead tokens as it requires to predict the production when there are several alternatives. Then, for each symbol of the production:
- if it’s a terminal, the method consumes the next token and verifies that it matches;
- if it’s a nonterminal, it calls the corresponding method.
Here is an example of pseudocode for the recursive-descent implementation of the first two nonterminals:
#![allow(unused)]
fn main() {
impl Parser {
    fn statement(&mut self) -> Statement {
        let stmt = match self.peek_token() {
            Terminal::Let => self.assign(),
            Terminal::Print => self.print(),
            _ => panic!("syntax error"),
        };
        self.match_token(Terminal::Semicolon);
        stmt
    }

    fn assign(&mut self) -> Statement {
        self.match_token(Terminal::Let);
        let id = self.match_token(Terminal::Id);
        self.match_token(Terminal::Equal);
        let value = self.value();
        // process id, value
        Statement::Assign { .. }
    }

    // ...
}
}
Many compiler projects start with a recursive-descent parser because it’s intuitive to write, and it’s easy to increase the lookahead capability without changing much. The Rust compiler is written that way. Another advantage is that you don’t depend on a tool.
The drawbacks are the lack of visibility of the grammar, the difficulty to change it, and the risk of overflowing the stack. It also requires a mechanism to exit gracefully if you meet syntax errors or if you want to abort the parsing.
- Table-driven parsers
Table-driven parsers try to address those issues by using a stack-based finite state machine (also called a pushdown automaton) and a single loop. The grammar is transformed into a parsing table, which predicts the next production as a function of the nonterminal on the stack and the next token. A local stack stores the pending terminals that must match the incoming tokens and the nonterminals that must be expanded by predicting their production.
- If a terminal is on the stack, it’s popped and must match the next token.
- If a nonterminal is on the stack, it’s popped, and the table is used to predict the production, whose symbols are then pushed on the stack.
Here is the parsing table for the grammar above:
|           | Id | Num | “;” | “let” | “=” | “print” | $ |
|-----------|----|-----|-----|-------|-----|---------|---|
| statement | .  | .   | .   | 0     | .   | 1       | . |
| assign    | .  | .   | .   | 2     | .   | .       | . |
| print     | .  | .   | .   | .     | .   | 3       | . |
| value     | 4  | 5   | .   | .     | .   | .       | . |
where the numbers are the indices of the production alternatives:
0: statement -> assign ";";
1: statement -> print ";";
2: assign -> "let" Id "=" value;
3: print -> "print" value;
4: value -> Id;
5: value -> Num;
Here’s the pseudocode of the simplified parser loop, supposing the lexer sends a Terminal::End at the end of the input (often represented as $):
#![allow(unused)]
fn main() {
impl Parser {
    fn parse(&mut self) {
        let mut stack = vec![Terminal::End, self.top_nonterminal];
        while let Some(stack_sym) = stack.pop() {
            match stack_sym {
                Terminal(t) => {
                    // the incoming token must match the terminal popped from the stack
                    self.match_terminal(t);
                }
                Nonterminal(nt) => {
                    // predict the production from the nonterminal and the next token
                    let predicted = self.get_table(nt, self.peek_token());
                    // push the body in reverse so its leftmost symbol ends up on top
                    stack.extend(predicted.iter().rev());
                }
            }
        }
    }
}
}
In reality, Lexigram’s loop is a little more complex because it handles the re-synchronization after encountering syntax errors and other types of symbols pushed on the stack to call the action routines.
A website that calculates the table of an LL(1) parser can be found at this URL. It also shows step by step how the input is parsed.
Limitations
Because the parser relies on the next token to decide what to do, top-down parsers suffer from a few limitations.
- Elimination of left recursion
Left recursion is not supported. In list -> list "," Id | Id, the parser can’t predict an alternative when list is the current nonterminal: in order to determine it, it must first know what to predict for its first symbol, which is list itself. This generates an infinite, recursive loop that it can’t solve.
The grammar must be changed to work around the problem using right recursion, which is fine for LL parsers:
list -> Id list_1;
list_1 -> "," Id list_1 | ε;
where ε is the symbol for “nothing”, meaning here the end of the “loop”.
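For example, the rewritten grammar derives the input “a, b, c” by expanding the leftmost nonterminal at each step:

```
list
=> Id list_1                      // matches "a"
=> Id "," Id list_1               // matches "a, b"
=> Id "," Id "," Id list_1        // matches "a, b, c"
=> Id "," Id "," Id               // list_1 -> ε ends the repetition
```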
- Left factorization
An LL(1) parser can only look at the next token to predict the next production. If several alternatives begin with the same terminal, like a -> A B | A C, the parser can check that the next token is A but it can’t see beyond; it would require an LL(2) parser.
It’s also possible to transform the grammar to work around the limitation:
a -> A a_1;
a_1 -> B | C;
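With the factored grammar, prediction works again with a single token of lookahead:

```
a
=> A a_1        // the next token is A: only a -> A a_1 is possible
=> A B          // after consuming A, the next token (B or C) selects the alternative
```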
- Ambiguity
When there is both left and right recursion, the grammar is ambiguous for any class of parser. The classic example is the expression grammar:
e -> e * e | e - e | Id;
Firstly, there is no information about the associativity of -. For example, with the input “a - b - c”, it is unclear whether the correct derivation is

- e -> (e -> Id("a")) - (e -> (e -> Id("b")) - (e -> Id("c"))), or more simply: a - (b - c)
- e -> (e -> (e -> Id("a")) - (e -> Id("b"))) - (e -> Id("c")), or more simply: (a - b) - c
Semantically, they’re not the same! 4 - (3 - 2) = 3, but (4 - 3) - 2 = -1.
Secondly, there is no information about the relative precedence of * and -. The input “a - b * c” can be either

- a - (b * c)
- (a - b) * c
The problem of ambiguity is more delicate to solve in the case of top-down parsers. Several techniques exist, depending on the implementation:
- Handwritten parsers usually apply a technique called precedence climbing, or Pratt parsing, which is equivalent but more general. Both methods use bottom-up parsing for the ambiguous productions, inside a top-down parser.
- Table-driven parsers use Clarke’s technique, which modifies the grammar. That’s how Lexigram deals with ambiguous rules, since it better fits the table implementation than precedence climbing/Pratt.
This is only a general rule, and there are exceptions: ANTLR, for instance, generates a recursive-descent parser where the ambiguities are solved using Clarke’s technique.
As we said earlier, Lexigram modifies the grammar automatically in those three situations. Other parser generators, like ANTLR, do it too, but some generators will expect you to provide an LL(k) grammar. Either way, it’s worth keeping the grammar limitations in mind, because those transformations, even if they’re automatic, create extra nonterminals.
We’ll come back to those transformations and to the performance aspects of a grammar when we discuss how Lexigram works.
Bottom-Up Parsers
A bottom-up parser starts from the tokens and deduces the production from them. It accumulates the incoming tokens until it recognizes a production and reduces that part of the input to a nonterminal. Then, it checks if further reductions can be made, consuming more tokens if necessary.
Let’s take the same grammar again:
Id: [A-Za-z][A-Za-z0-9]*;
Num: [0-9]+;
statement -> assign ";";
statement -> print ";";
assign -> "let" Id "=" value;
print -> "print" value;
value -> Id;
value -> Num;
A bottom-up parser processes the input “let a = 10;” as follows:
- The parser needs to consume Let, Id("a"), "=", and Num("10") before it can make its first reduction: value -> Num("10").
- The stack is Let, Id("a"), "=", value. The parser recognizes the production assign -> "let" Id "=" value.
- The stack is assign. There are no further reductions, so the parser takes the next token.
- The stack is assign, ";". The parser recognizes statement -> assign ";".
- The stack is statement, and the end of the input has been reached. statement is an acceptable result, so the parser has finished and the result is successful.
This time, you can see that the syntax tree is built from the bottom up, by discovering the terminal leaves first and incrementally creating nonterminal branches to attach them.
We can also see it as the reduction of the input, where symbols matching the right-hand side of a production are replaced by the left-hand side.
The main classes of bottom-up parsers are
- LR(k), where the first L stands for scanning Left to right, and the R for Rightmost derivation. k is the number of look-ahead tokens, which is in practice 0 or 1. So, in comparison to the LL parser which starts at the top of the syntax tree and always derives the leftmost nonterminal first, the LR parser starts at the bottom and always derives (in reverse) the rightmost nonterminal first.
- SLR(k) is a Simple LR. This class of parsers has fewer states than the equivalent LR for the same grammar, but they can handle a smaller set of grammars.
- LALR stands for Look Ahead LR. Without going into details, this class of parsers improves on SLR by using local look-ahead, allowing them to handle a larger set of grammars with the same number of states as the equivalent SLR parsers. In practice, LALR parsers are often preferred to the other classes.
Bottom-up parsers are table-driven, following a mechanism very similar to that of table-driven LL parsers. Computing the table of an LL parser is a little tedious, but it’s a lot of work for bottom-up parsers, even LALR ones, so this task is almost always done by a tool.
A website that calculates the table can be found at this URL for an LR(1) parser, and at this URL for an LALR(1) parser. You can see the difference in the number of states for the default grammar shown as an example.
Since Lexigram currently generates top-down parsers only, we won’t talk any further about bottom-up parsers. We’ll add more when Lexigram supports them.
Tutorial
This tutorial explains step by step how to implement a parser with Lexigram.
In the overview, we briefly see what’s generated and how it should be used in a broad outline. It’s not a reference, but it contains a number of concepts and methods that may seem too abstract at the beginning, so feel free to skip some parts and come back later if you prefer a more hands-on approach.
In the next part, a practical example of a parser is developed. It doesn’t cover all the available features, but it gives you a good starting point if you want to develop your own parser.
Overview
Developing a parser is generally done in several steps:
- writing the grammar
- generating the code
- implementing the listener
- integrating it into your top-level parser
Designing the Grammar
Writing the grammar is an important step, and one that shouldn’t be underestimated. If the language you need to parse already exists, there’s not a great deal of latitude, but there are still an infinite number of grammars that can describe that language. If you’re using an LL(1) parser, you’ll need to design an LL(1) grammar, and if the language can’t be described with an LL(1) grammar at all, you must use another parser. But many programming languages can be, so hopefully that won’t be the case.
If you’re designing the language, you’ll find that designing the grammar may sometimes push you to adapt the language in order to make it easier (or possible) to parse.
We’ll start with this simplified grammar to show how the parts come together:
s -> Id "=" val | "exit" | "return" val;
val -> Id | Num;
Note
Notation
Like in the introduction, we’re often going to use a simplified notation for the grammar which doesn’t represent the full lexicon, for the sake of clarity.
In this simplified notation, the terminals are either fixed string literals or capitalized identifiers that represent a variable terminal, like a numeric value. The lowercase identifiers are the nonterminals. In the full notation required by Lexigram, string literals aren’t allowed in the grammar; terminal names must be used instead.
To distinguish that simplified notation from the full notation, we use -> in the production rules instead of :.
Here’s the full notation that Lexigram would expect, either in a combined form as below or with the lexicon and the grammar in separate locations:
lexicon Example;
Equal: '=';
Exit: 'exit';
Return: 'return';
Id: [a-z][_0-9a-z]*;
Num: [0-9]+;
grammar Example;
s: Id Equal val | Exit | Return val;
val: Id | Num;
Generating the Code
Lexigram generates the code of a lexer/parser based on their “specifications”: a lexicon for the lexer, and a grammar for the parser. There are two main options to generate the code: with the command-line executable or programmatically.
If you prefer the command-line option, you should install the lexigram executable.
Alternatively, you could generate the code programmatically in a unit test or a binary crate that you can launch directly from the IDE, without having to retype the same command each time. The build method is described briefly in the lexi-gram crate overview, and you can see several examples in the repository, like examples/gen_microcalc. There, you’ll find one version with the lexicon and the grammar in separate files, and one version with a combined lexicon/grammar.
Or you could create a build script that generates the code. Again, there’s a choice: the script may either launch the command-line executable, call its public method lexigram::execute, or use the build method as described in the previous paragraph.
Finally, there’s the more advanced possibility of creating the lexicon and/or the grammar dynamically in a Rust program by using the lower-level API of the lexigram-lib crate, but that’s completely outside the scope of this documentation.
Tip
If you generate the parser programmatically without a build.rs script, it’s recommended to proceed with two crates: one for the generator and one for the final application that includes the generated code. While it’s possible to put both in the same crate, it would have two drawbacks:
- Once the generated code is written to its destination file, it will likely prevent the crate from compiling until you have completed the listener implementation or updated it with the latest changes in the grammar. Until then, if you need to modify the grammar again, the code that generates the parser won’t be able to compile.
- The dependencies of the generator are bigger, since you need all the crates (through lexi-gram). You could put the generator in a test config, as we did for the examples of the repository, so only the test dependencies ([dev-dependencies]) would require lexi-gram, and the production dependency ([dependencies]) would just be lexigram-core. This, however, might create conflicts since lexi-gram has lexigram-core in its own dependencies. Theoretically, there’s no such problem between test code and production code, but you’ll have to be careful not to bind lexigram-core directly in your test code; instead, you must bind lexi_gram::lexigram_lib::lexigram_core.
That’s why we split all the examples in two crates: gen_microcalc and microcalc, gen_watcher and watcher, and so on.
Let’s assume that the combined lexicon/grammar are in a file, example.lg:
lexicon Example;
Equal: '=';
Exit: 'exit';
Return: 'return';
Id: [a-z][_0-9a-z]*;
Num: [0-9]+;
grammar Example;
s: Id Equal val | Exit | Return val;
val: Id | Num;
The parser can be generated with this command:
$ lexigram -c example.lg -l lexer.rs -p parser.rs \
--listener listener.txt --types types.txt
55 note(s)
23 info(s)
0 warning(s)
0 error(s)
OK
It creates 4 files:
- lexer.rs: the lexer
- parser.rs: the parser and the wrapper, which is the interface between the parser and the user-defined listener
- listener.txt: a template of the listener implementation, which you can use as a starting point
- types.txt: a template of the nonterminal types, which you can also copy and modify
Tip
Adding those two templates to your repository can prove helpful. Each time the grammar is modified and the code regenerated, you can more easily spot the differences and copy the new bits into your code.
Implementing the Listener
The centerpiece of the code generated by Lexigram is a listener trait that you must implement.
The generated code includes, among other things, this listener trait and the custom parameter types:
#![allow(unused)]
fn main() {
pub enum CtxS {
/// `s -> Id "=" val`
V1 { id: String, val: SynVal },
/// `s -> "exit"`
V2,
/// `s -> "return" val`
V3 { val: SynVal },
}
pub enum CtxVal {
/// `val -> Id`
V1 { id: String },
/// `val -> Num`
V2 { num: String },
}
enum EnumSynValue { S(SynS), Val(SynVal) }
pub trait ExampleListener {
// ...
fn init_s(&mut self) {}
fn exit_s(&mut self, ctx: CtxS) -> SynS;
fn init_val(&mut self) {}
fn exit_val(&mut self, ctx: CtxVal) -> SynVal;
}
}
The principle is simple:
- Each time a production is predicted to come next, the init_<nonterminal>(…) method is called by the parser.
- Each time the production has been fully parsed, exit_<nonterminal>(…) is called, giving you the contextual information about the parsed elements in ctx.
Implementing these trait methods allows you to process the input in a structured way.
Here, we assumed that both s and val have a value, represented by the types SynS and SynVal. It’s up to you to define those types, and to decide which values they take in the exit_*(…) methods. You can also declare that some nonterminals don’t have a value. Here, if we declared that val has no value, exit_val(…) wouldn’t return any type, and val wouldn’t appear in the CtxS::V1 and CtxS::V3 variants. Similarly, terminals that don’t hold a value, like "=", are not present in the context variants.
Lexigram optionally produces a template to help with the implementation, which you can copy and complete (the attributes and some details have been omitted):
#![allow(unused)]
fn main() {
/// User-defined type for `s`
pub struct SynS();
/// User-defined type for `val`
pub struct SynVal();
struct Listener {
log: BufLog,
}
impl ExampleListener for Listener {
// ...
fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS {
match ctx {
// s -> Id "=" val
CtxS::V1 { id, val } => {}
// s -> "exit"
CtxS::V2 => {}
// s -> "return" val
CtxS::V3 { val } => {}
}
SynS()
}
fn exit_val(&mut self, ctx: CtxVal, spans: Vec<PosSpan>) -> SynVal {
match ctx {
// val -> Id
CtxVal::V1 { id } => {}
// val -> Num
CtxVal::V2 { num } => {}
}
SynVal()
}
}
}
If you want to store the Id variables locally, you can add a field to Listener that maps each variable name to its variable ID. Here’s the implementation of a very simplified AST generator:
#![allow(unused)]
fn main() {
pub enum SynS {
Assign(usize, SynVal),
Exit,
Return(SynVal),
}
pub enum SynVal {
Id(usize),
Val(u32),
Error,
}
struct Listener {
log: BufLog,
vars: HashMap<String, usize>,
}
#[allow(unused)]
impl ExampleListener for Listener {
// ...
fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS {
match ctx {
// s -> Id "=" val
CtxS::V1 { id, val } => {
let var_id = self.vars.len();
self.vars.insert(id, var_id);
SynS::Assign(var_id, val)
}
// s -> "exit"
CtxS::V2 => SynS::Exit,
// s -> "return" val
CtxS::V3 { val } => SynS::Return(val),
}
}
fn exit_val(&mut self, ctx: CtxVal, spans: Vec<PosSpan>) -> SynVal {
match ctx {
// val -> Id
CtxVal::V1 { id } =>
if let Some(var_id) = self.vars.get(&id) {
SynVal::Id(*var_id)
} else {
self.get_log_mut().add_error(
format!("{id} is undefined"));
SynVal::Error
}
// val -> Num
CtxVal::V2 { num } =>
match u32::from_str(&num) {
Ok(value) => SynVal::Val(value),
Err(e) => {
self.get_log_mut().add_error(
format!("cannot parse {num:?} as an integer: {e:?}"));
SynVal::Error
}
}
}
}
}
}
Top Integration
In the generated code, you’ll find the following components:
- a build_lexer() method that returns a lexer of type Lexer. That object is generic, but the tables are specific to the lexicon.
- a build_parser() method that returns the table-driven parser of type Parser. Like the lexer, that object is generic, but the tables are specific to the lexicon and the grammar.
- a Wrapper<T> type, a wrapper for your listener of type T. It’s very specific, and its code is entirely based on the grammar.
Note
We use the term “Parser” to design either the low-level parser, so the table-driven LL(1) parser of type
Parserdefined inlexigram-core, or the “top-level” parser which is your own, complete application. The context should normally indicate which is which, but we’ll use “low-level” and “top-level” when it needs clarifying.
The typical way to implement the top-level parser is to create an object that keeps the lexer, the parser, the wrapper and the listener (the listener being itself a field of the wrapper):
#![allow(unused)]
fn main() {
pub struct ExampleParser<'l, 'p> {
lexer: Lexer<'l, Cursor<String>>,
parser: Parser<'p>,
wrapper: Wrapper<Listener>,
}
impl ExampleParser<'_, '_> {
pub fn parse_text(text: String) -> Result<BufLog, BufLog> {
let mut parser = ExampleParser::new();
parser.parse(text)
}
pub fn new() -> Self {
let lexer = build_lexer();
let parser = build_parser();
let wrapper = Wrapper::new(Listener::new(), VERBOSE_WRAPPER);
ExampleParser { lexer, parser, wrapper }
}
pub fn parse(&mut self, text: String) -> Result<BufLog, BufLog> {
let stream = CharReader::new(Cursor::new(text));
self.lexer.attach_stream(stream);
let tokens = self.lexer.tokens().keep_channel0();
if let Err(e) = self.parser.parse_stream(&mut self.wrapper, tokens) {
self.wrapper.get_listener_mut().get_log_mut().add_error(e.to_string());
}
let log = std::mem::take(&mut self.wrapper.get_listener_mut().log);
if log.has_no_errors() {
Ok(log)
} else {
Err(log)
}
}
}
}
Before Going Any Further…
The following subsections give an overview of the top-level components, including the listener. They are useful if you’d like to get a better idea of how the components are integrated.
If you’d rather see more practical examples before tackling those aspects, you may just take a quick glance or skip them entirely for now.
Lifetimes
The lexer type has two generic parameters:
pub struct Lexer<'a, R: Read>
- 'a is simply the lifetime of the table references, which are static if you use build_lexer(), since the tables are static in the code.
- R: Read is the input. If you read from a string, it’s usually Cursor<String>. If the input is itself a reference, like &str, you may have a Cursor<&str> and so another lifetime in your top parser type. You’ll find an example of that in examples/typedef.
The parser has only a lifetime generic parameter:
pub struct Parser<'a>
Like the lexer, this lifetime is for the table references, which are static with build_parser().
Interaction Between Components
The lexer extracts the tokens, like "return" and Id("x"), and sends them to the parser on demand. Lexer::tokens(…) returns an iterator, so you can intercept the tokens if you like. In the code example, we used the keep_channel0(…) method before giving the tokens to the parser:
#![allow(unused)]
fn main() {
let tokens = self.lexer.tokens().keep_channel0();
}
The lexer indeed allows you to output tokens on several channels by using the -> channel(<channel ID>) action after defining a token. This can be used to keep comments or other ancillary information apart from the grammar. Here, we don’t use that feature, so we simply filter out everything that doesn’t belong to the default channel (ID = 0).
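The filtering itself is straightforward. Here is a toy stand-in (not Lexigram’s actual Token type or keep_channel0() signature, just the idea of dropping everything outside channel 0):

```rust
// Toy token type for illustration only; Lexigram's real Token type differs.
#[derive(Debug, PartialEq)]
pub struct Tok {
    pub text: &'static str,
    pub channel: u8, // 0 = default channel; others carry e.g. comments
}

/// Keep only the tokens of the default channel (ID = 0).
pub fn keep_channel0(tokens: &[Tok]) -> Vec<&Tok> {
    tokens.iter().filter(|t| t.channel == 0).collect()
}

fn main() {
    let tokens = [
        Tok { text: "return", channel: 0 },
        Tok { text: "// note", channel: 1 }, // e.g. a comment routed to channel 1
        Tok { text: "x", channel: 0 },
    ];
    let kept: Vec<&str> = keep_channel0(&tokens).iter().map(|t| t.text).collect();
    assert_eq!(kept, ["return", "x"]);
}
```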
The (low-level) generic table-driven parser consumes the tokens, predicts the production rules, and calls the wrapper at key moments, like when a production has been predicted and is about to be parsed, or when it has been fully parsed.
The wrapper’s code is generated according to the grammar. It calls the trait methods of the listener, so that you can get the values of all the symbols in a production once it’s fully parsed. For example, the values of Id and val once s -> Id "=" val is parsed.
Those values are either the content of variable terminals like Id or the value of nonterminals like val and s, which you determine in exit_val(…) and exit_s(…). They are kept on a value stack between those calls. Symbols that have no value, like fixed terminals ("=") and valueless nonterminals are not managed by the wrapper.
Let’s see what happens with a short example of input, using the same grammar as before:
return x
Here is how the components interact with one another:
- The parser first expects a nonterminal s, which is the starting nonterminal of the grammar. The first token is "return", so the parser predicts the production s -> "return" val, and queues "return", val.
  - The parser calls the wrapper with this information: we enter that production.
  - The wrapper calls your listener method init_s(…).
- The next item in the queue is "return", which matches the token (normal, since it was used to predict the production!), so it’s consumed.
- The next item is val. Since the next token is Id("x"), the production val -> Id is predicted. Id is queued.
  - The parser calls the wrapper.
  - The wrapper calls your init_val(…).
- The next item is Id, which matches the next token.
- At this point, the production val is complete.
  - The parser calls the wrapper with the information: we exit that production, and we got an Id("x").
  - The wrapper calls exit_val(CtxVal::V1 { id: "x" }), which returns a value for val of type SynVal, and the wrapper keeps it on the value stack.
- The production s is complete, too.
  - The parser calls the wrapper: we exit that production.
  - The wrapper calls exit_s(CtxS::V3 { val }), where val is the value popped from the value stack, previously returned by exit_val(…).
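That call sequence can be re-enacted with a small self-contained simulation. The types below mimic the generated CtxVal/CtxS/Syn* shapes from the overview (signatures simplified, no spans), and the wrapper’s role is played by a hand-written driver:

```rust
// Toy re-enactment of the documented call sequence for the input `return x`.
#[derive(Debug, PartialEq)]
pub enum SynVal { Id(String) }
#[derive(Debug, PartialEq)]
pub enum SynS { Return(SynVal) }

pub enum CtxVal { V1 { id: String } }  // val -> Id
pub enum CtxS { V3 { val: SynVal } }   // s -> "return" val

pub struct Listener;
impl Listener {
    fn init_s(&mut self) {}
    fn init_val(&mut self) {}
    fn exit_val(&mut self, ctx: CtxVal) -> SynVal {
        let CtxVal::V1 { id } = ctx;
        SynVal::Id(id)
    }
    fn exit_s(&mut self, ctx: CtxS) -> SynS {
        let CtxS::V3 { val } = ctx;
        SynS::Return(val)
    }
}

/// The wrapper would drive these calls; here we replay them by hand.
pub fn parse_return_x(l: &mut Listener) -> SynS {
    l.init_s();                                          // enter s -> "return" val
    l.init_val();                                        // enter val -> Id
    let val = l.exit_val(CtxVal::V1 { id: "x".to_string() }); // exit val; value stacked
    l.exit_s(CtxS::V3 { val })                           // exit s; value popped
}

fn main() {
    let result = parse_return_x(&mut Listener);
    assert_eq!(result, SynS::Return(SynVal::Id("x".to_string())));
}
```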
The Wrapper
The wrapper implements a few public methods, some of which you’ll recognize from the top-level code excerpt we gave earlier (T is the listener type):
- new(listener: T, verbose: bool) -> Self
  Creates a new wrapper around the listener. The listener is moved inside the wrapper before parsing any input, and it may be moved back (by destroying the wrapper) later.
- get_listener(&self) -> &T
- get_listener_mut(&mut self) -> &mut T
  Give a reference to the listener. You can use these after parsing the input to get access to the log, or to any method of your listener.
- give_listener(self) -> T
  Gives the listener back and destroys the wrapper.
- set_verbose(&mut self, verbose: bool)
  Changes the verbosity of the wrapper; this is only a debug tool.
Since the listener is owned by the wrapper while the input is parsed, it’s easier to keep all the data you need during that phase in the listener itself. Then, you can take the result back in the top parser when the parsing is complete.
If you want to parse several inputs, you have the choice of how long the wrapper and the listener live: you can create them each time, or you can keep the same listener and move it back and forth each time, or you can keep the same wrapper/listener.
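As a sketch of that ownership dance, here is a minimal stand-in for the wrapper (not the real Wrapper type, whose constructor also takes a verbosity flag): the listener is moved in, borrowed during parsing, then moved back out with give_listener():

```rust
// Minimal stand-in showing the wrapper's ownership pattern, not the real type.
pub struct Wrapper<T> { listener: T }

impl<T> Wrapper<T> {
    pub fn new(listener: T) -> Self { Wrapper { listener } }
    pub fn get_listener_mut(&mut self) -> &mut T { &mut self.listener }
    pub fn give_listener(self) -> T { self.listener } // destroys the wrapper
}

#[derive(Default)]
pub struct Listener { pub parsed_inputs: usize }

fn main() {
    let mut listener = Listener::default();
    for _ in 0..2 {
        // Recreate the wrapper for each input, reusing the same listener.
        let mut wrapper = Wrapper::new(listener);
        wrapper.get_listener_mut().parsed_inputs += 1; // stand-in for parsing
        listener = wrapper.give_listener(); // take the listener back
    }
    assert_eq!(listener.parsed_inputs, 2);
}
```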
The Listener
Besides the nonterminal-specific init/exit methods already discussed, the listener trait provides a few general methods:
- get_log_mut(&mut self) -> &mut impl Logger
  Gives the parser and the wrapper access to the log. The log object itself, or at least a mutable reference to it, must be owned by your listener. The log must implement the Logger trait; the BufLog type of the lexigram-core crate can be used, or you can create one of your own.
  We could have instantiated the log inside the wrapper, but putting it into the listener allows you to use it to add your own messages. It also allows you to destroy the wrapper and keep only the listener after the input is parsed, since it contains all the relevant data.
- exit(&mut self, program: SynProgram) {}
  Is called after the exit method of the starting nonterminal, once everything is parsed. You can implement this method if you want to separate the post-processing from what is done in exit_<starting nonterminal>(…).
  The method isn’t called if the parsing is aborted (see below), or if a lexical/syntax error is detected.
- check_abort_request(&self) -> Terminate { Terminate::None }
  Checks if an abort is requested by the listener. The method is called after each step of the low-level parser; typically after calling each nonterminal’s init/exit method. If you detect an irrecoverable error in the listener, you can set an abort flag and return the appropriate value with that method:
#![allow(unused)]
fn main() {
struct Listener {
    log: BufLog,
    vars: HashMap<String, usize>,
    abort: bool,
}
// ...
impl ExampleListener for Listener {
    fn check_abort_request(&self) -> Terminate {
        if self.abort { Terminate::Abort } else { Terminate::None }
    }
    fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS {
        match ctx {
            // s -> Id "=" val
            CtxS::V1 { id, val } => {
                self.abort = val == SynVal::Error;
                let var_id = self.vars.len();
                self.vars.insert(id, var_id);
                SynS::Assign(var_id, val)
            }
            // ...
        }
    }
    // ...
}
}
  Another return value, Terminate::Conclude, makes the parser stop as if everything had been successfully parsed. It can be used if you intend to parse an input that may be longer than what you need and if it’s too difficult to cut that input before giving it to the parser. There’s an example of that in examples/watcher.
  Keep in mind that aborting the parsing, even with Terminate::Conclude, will not call the exit methods of the pending nonterminals, nor the final exit() method. Instead, it will call abort(). For example, if you aborted in exit_val(…), the exit_s(…) method wouldn’t be called.
- abort(&mut self, terminate: Terminate) {}
  Is called when check_abort_request(…) has returned something other than the default Terminate::None. It lets you do any post-processing required in that eventuality.
  Note that, as we said before, if you conclude the parsing with Terminate::Conclude, abort(…) is called instead of exit(…), even though the parsing may be considered successful. However, the terminate code is given as an argument, so you’re able to decide what to do, and even call exit(…) yourself if you want to.
- intercept_token(&mut self, token: TokenId, text: &str) -> TokenId { token }
  Is called each time a token is taken from the lexer. By default, this is optimized away, but you can use it if you need to change the value of tokens dynamically. This is an advanced feature, so we won’t elaborate on it in this tutorial.
- hook(&mut self, token: TokenId, text: &str) -> TokenId { token }
  Is an optional method, generated if hooks were declared in the lexicon. Like intercept_token(…), it allows you to intercept tokens, but only at selected spots in the grammar, which is more efficient. This is also an advanced feature, so we won’t elaborate on it in this tutorial either.
A Practical Example
We’re going to make a parser for a configuration file. That file contains the options to generate a parser with Lexigram.
There are currently two ways of setting the options when you generate a parser with Lexigram: the command-line arguments, and the fluent option methods used when generating the parser programmatically.
Here is an example of options to generate the little example discussed in the tutorial overview:
lexigram --indent 4 -c example.lg -l parser.rs tag example_lexer \
    -p parser.rs tag example_parser \
    --lib "super::user_types::*"
And here is the equivalent if the parser is generated programmatically:
#![allow(unused)]
fn main() {
static GRAMMAR_FILENAME: &str = "example.lg";
static PARSER_FILENAME: &str = "parser.rs";
let options = OptionsBuilder::new()
.indent(4)
.combined_spec(genspec!(filename: GRAMMAR_FILENAME))
.lexer_code(gencode!(filename: PARSER_FILENAME, tag: "example_lexer"))
// grammar is combined with lexicon, no need to define parser_spec
.parser_code(gencode!(filename: PARSER_FILENAME, tag: "example_parser"))
.libs(["super::user_types::*"])
.build()
.expect("should have no error");
match try_gen_parser(action, options) {
Ok(log) => {
if action == Action::Generate {
println!("Code generated");
}
assert!(
log.has_no_warnings(),
"unexpected warning(s):\n{}",
log.get_warnings().join("\n"));
}
Err(build_error) => panic!("{build_error}"),
}
}
What we would like is a text-based input that specifies the same options, and which could be stored in a configuration file, or maybe combined with the lexicon and the grammar:
def GRAMMAR_FILENAME = "example.lg";
def PARSER_FILENAME = "parser.rs";
lexer {
combined: GRAMMAR_FILENAME,
output: PARSER_FILENAME ["example_lexer"]
}
parser {
output: PARSER_FILENAME ["example_parser"]
}
options {
libs: { "super::user_types::*" },
indent: 4
}
Grammar
Let’s start with the top structure of the grammar. In this section, we’ll build the grammar and get a peek at the generated code. It’s a good opportunity to show off some features and to provide a little guidance. We’ll implement the listener for the generated code in the next section, once the grammar is fully defined.
The top level is a succession of four groups:
- definitions ("def")
- lexer specifications ("lexer")
- parser specifications ("parser")
- general options ("options")
The definitions contain zero or more items that all have the same format, so it’s pretty simple. We want to be able to use those “constants” for any value, so there are different types, like a string literal (filename or tag name), an integer literal (indent level), and other possibilities we’ll discover as we come across all the options.
The lexer, parser, and general options contain, between braces, a list of comma-separated options. We’ll see the inner options when we tackle those nonterminals.
How are the top-level items organized? They could either follow one another in the order shown above, or there could just be a succession of those items in no particular order and with possible repetitions. The latter would offer more flexibility, but we’ll choose the former because it’s cleaner.
Optional Items
The starting nonterminal, that we’re naming config, simply lists the four parts:
config:
definitions
lexer
parser
options
;
lexer is mandatory, but parser and options are optional; one may want either just a lexer, or the whole lexer + parser + wrapper code. Since Lexigram offers a few elements of syntactic sugar for optional and repetitive items, such as α?, α*, and α+, we can use them in the definition of parser and options (we have omitted the actual content of the grammatical rules for now):
definitions: …;
lexer: …;
parser: (…)?;
options: (…)?;
The first task of Lexigram is to normalize the rules, transforming
- α? to α | ε (where ε means “empty”, nothing),
- α* to a new nonterminal x, where x -> α x | ε
- α+ to a new nonterminal x, where x -> α x | α
This is transformed into the final productions below:
config -> definitions lexer parser options
definitions -> …
lexer -> …
parser -> …
parser -> ε
options -> …
options -> ε
parser and options have an empty alternative, which is fine. It’s reflected directly in the listener trait methods as two variants of an enum context type, so the user knows immediately whether the option is used or not.
There’s another way to write the grammar: it could be tempting to put the ? directly into config:
config:
definitions
lexer
parser?
options?
;
definitions: …;
lexer: …;
parser: …;
options: …;
And it would work, but it wouldn’t be optimal. Why?
Each of those options would split the original production when it’s normalized, which would end up as a list of all the possible combinations:
config -> definitions lexer parser options
config -> definitions lexer parser
config -> definitions lexer options
config -> definitions lexer
It’s not too long—though we’d have eight alternatives if the lexer specs were also optional—but it’s cumbersome. And that’s not all; since those alternatives need to be left-factorized, the rule would be transformed again later:
config -> definitions lexer config_1
config_1 -> parser config_2
config_1 -> options
config_1 -> ε
config_2 -> options
config_2 -> ε
definitions -> …
lexer -> …
parser -> …
options -> …
The user implementation of the trait method would reflect the same pattern, as shown below. It’s possible to work with it, but it’s annoying to handle the repetitions, especially if we need to handle nonterminal values for definitions, lexer, parser, and options.
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn exit_config(&mut self, ctx: CtxConfig) -> SynConfig {
match ctx {
// config -> definitions lexer parser options
CtxConfig::V1 { definitions, lexer, parser, options } => {}
// config -> definitions lexer parser
CtxConfig::V2 { definitions, lexer, parser } => {}
// config -> definitions lexer options
CtxConfig::V3 { definitions, lexer, options } => {}
// config -> definitions lexer
CtxConfig::V4 { definitions, lexer } => {}
}
SynConfig()
}
// ...
}
}
In this case, we probably don’t need a value for those nonterminals. In each option category, we’ll store the parsed result for each of them in the listener instead, so it’s not a big problem, but it’s not great from the perspective of the final LL(1) grammar.
Compared to the first solution, which yields 1 production in the first and the “non-?” nonterminals and 2 productions in each of the n “?” nonterminals, we get 2^n productions in the first nonterminal (plus left factorizations!) and 1 production in the others. So we see why it’s better to use the ? in the secondary nonterminals:
- For 2 optional items out of 4, we get 7 productions instead of 10.
- For 3 optional items out of 4, we get 8 productions instead of 16.
The take-away is to avoid using this particular syntactic sugar several times in the same alternative, because the desugaring makes it grow exponentially when it’s distributed:
x: α? β? γ?;
becomes
x -> (α | ε) (β | ε) (γ | ε) -> α β γ | α β | α γ | α | β γ | β | γ | ε
If it’s used in separate alternatives, it’s fine because it only adds one ε, which is shared between them:
x: α? | β? | γ?;
becomes
x -> α | β | γ | ε
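A tiny program can make the count explicit by enumerating the distributed alternatives (the Greek letters stand in for arbitrary symbols):

```rust
/// Enumerate the alternatives produced by distributing n optional items,
/// as in x -> (α | ε)(β | ε)(γ | ε): one alternative per subset of items.
pub fn distributed_alts(items: &[&str]) -> Vec<String> {
    let n = items.len();
    (0..(1u32 << n))
        .map(|mask| {
            let picked: Vec<&str> = (0..n)
                .filter(|&i| mask & (1u32 << i) != 0)
                .map(|i| items[i])
                .collect();
            // The empty subset is the ε alternative.
            if picked.is_empty() { "ε".to_string() } else { picked.join(" ") }
        })
        .collect()
}

fn main() {
    let alts = distributed_alts(&["α", "β", "γ"]);
    assert_eq!(alts.len(), 8); // 2^3, matching α β γ | α β | … | ε above
    assert!(alts.contains(&"ε".to_string()));
    // With the ? in separate alternatives instead, we'd get only 3 + 1 = 4.
}
```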
Loops
Now that we have config, let’s see the secondary nonterminals, starting with definitions.
Each definition has the syntax below, where we give a temporary definition for value:
lexer Config;
Def : 'def';
Equal : '=';
Semicolon : ';';
Id : [a-zA-Z][a-zA-Z_0-9]*;
NumLiteral : [0-9]+;
StrLiteral : '"' Char+ '"';
parser Config;
definition:
Def Id Equal value Semicolon
;
value:
NumLiteral
| StrLiteral
| Id
;
Or, if we’re using the simple grammar notation to make terminals like "def" and "=" more visible, instead of using Def, Equal, and so on:
definition -> "def" Id "=" value ";"
value -> NumLiteral | StrLiteral | Id;
But how can we express 0 or more repetitions of definition? It’s time to see the different ways of doing grammar loops.
Right Recursion
Right recursion is the only valid form of loop in an LL(1) grammar.
If we use this rule for definitions:
definitions -> "def" Id "=" value ";" definitions | ;
The absence of symbols on the right of | simply means that alternative is empty. Since the syntax is recursive, we need an alternative to end the recursion: here, we use an empty production. In practice, the parser will check the next token after ";" to predict which alternative follows.
- If the token is "def", it predicts the left, recursive alternative.
- If it’s "lexer", it predicts the right, empty alternative. Indeed, when the alternative is “nothing”, the parser looks at the next possibilities in the config nonterminal. Here, config expects lexer after definitions, and the lexer syntax begins with the terminal "lexer" (even though we haven’t written that explicitly yet).
There is no transformation for definitions:
definitions -> "def" Id "=" value ";" definitions
definitions -> ε
We get to implement this listener method:
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn init_definitions(&mut self) {}
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
match ctx {
// definitions -> "def" Id "=" value ";" definitions
CtxDefinitions::V1 { id, value, definitions } => {}
// definitions -> ε
CtxDefinitions::V2 => {}
}
SynDefinitions()
}
// ...
}
}
This appears to be fine, but there is a catch: since it’s a right-recursive rule, we’ll get the last item first!
Let’s consider the following input, which has three definitions:
def a = 0;
def b = 1;
def c = 2;
The reductions are performed as follows:
definitions
─────┬─────
│
─────────────────────┴──────────────────────
"def" Id("a") "=" value("0") ";" definitions
─────┬─────
│
─────────────────────┴──────────────────────
"def" Id("b") "=" value("1") ";" definitions
─────┬─────
│
─────────────────────┴──────────────────────
"def" Id("c") "=" value("2") ";" definitions
─────┬─────
│
ε
Since the exit_definitions(…) method is called after all the items have been parsed in the production, we must wait until all the productions have been parsed. The successive calls are, if we simplify the notation somewhat:
- init_definitions()
- exit_definitions(CtxDefinitions::V2), returns w
- exit_definitions(CtxDefinitions::V1 { id: "c", value: "2", definitions: w }) -> x
- exit_definitions(CtxDefinitions::V1 { id: "b", value: "1", definitions: x }) -> y
- exit_definitions(CtxDefinitions::V1 { id: "a", value: "0", definitions: y }) -> z
And the value of definitions in exit_config(…), once it’s called, will be z.
If we accumulated the values in a vector, we would ultimately get the items in reverse order, unless we used a VecDeque to push the items in front. It could also be annoying if you wanted to check that an identifier hadn’t already been declared before. And if a definition referred to another identifier (def c = a), you’d need another pass to resolve it.
Since we don’t have any way of knowing when exit_definitions(…) is called for the last time, any post-processing to fix those issues must be done wherever definitions is used. In our case, it’s used only once in exit_config(…), but if it was used in several productions, we’d have to call a post-processing method from each location (and not forget about it if new instances are made!). Or we would have to add an intermediate nonterminal to keep things tidier.
Another consequence, which isn’t obvious at first, is that all the SynDefinitions values are stored on the wrapper’s value stack until the end of the list, and all the production symbols are stored on the parser’s stack, too. Not only do you get the last item first, after the whole list has been parsed, but it also consumes more memory to store the intermediate data and steps.
Right-recursive loops are simple and useful when you need that reverse order, but otherwise they’re not something you’ll want to use in a grammar very often.
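As a small illustration of the VecDeque workaround mentioned above, this sketch replays the exit order of the three definitions and rebuilds the source order by pushing at the front:

```rust
use std::collections::VecDeque;

/// The exit_definitions(…) calls arrive innermost-first ("c", "b", "a").
/// Pushing each item at the front recovers the source order.
pub fn collect_in_source_order<'a>(exit_order: &[&'a str]) -> Vec<&'a str> {
    let mut defs: VecDeque<&'a str> = VecDeque::new();
    for &id in exit_order {
        defs.push_front(id); // a plain Vec::push would leave them reversed
    }
    defs.into_iter().collect()
}

fn main() {
    // Call order observed for `def a = 0; def b = 1; def c = 2;`
    assert_eq!(collect_in_source_order(&["c", "b", "a"]), ["a", "b", "c"]);
}
```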
Star
We’ve already seen the syntactic sugar for repetitions of zero or more items: definition*. We can even put it directly into the definitions nonterminal:
definitions -> ("def" Id "=" value ";")*;
Lexigram transforms it as follows:
definitions -> definitions_1
. definitions_1 -> "def" Id "=" value ";" definitions_1
. definitions_1 -> ε
The generated code hides that transformation by giving the end result once definitions is entirely parsed. There are no listener calls corresponding to the nonterminal definitions_1.
#![allow(unused)]
fn main() {
pub enum CtxDefinitions {
/// `definitions -> ("def" Id "=" value ";")*`
V1 { star: SynDefinitions1 },
}
pub struct SynDefinitions1(pub Vec<SynDefinitions1Item>);
pub struct SynDefinitions1Item { pub id: String, pub value: SynValue }
// user-code template:
impl ConfigListener for Listener {
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
// definitions -> ("def" Id "=" value ";")*
let CtxDefinitions::V1 { star } = ctx;
SynDefinitions()
}
// ...
}
}
The wrapper implements the glue logic to collect each item of the repetition and push it into a Vec. When we implement exit_definitions(…), we get a vector star containing all the (id, value) pairs. It’s often the most convenient way for small lists of items.
As we’ve already seen in the overview, valueless symbols like "def", "=", and ";" are not given in the context variants, so they’re not stored in repetition vectors either.
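As an illustration of consuming the collected vector, here is a self-contained sketch; SynValue, SynDefinitions1Item, and SynDefinitions1 are simplified stand-ins for the generated types shown above, and the duplicate check is just one possible policy:

```rust
use std::collections::HashMap;

// Simplified stand-ins for the generated types (illustrative only).
struct SynValue(String);
struct SynDefinitions1Item { id: String, value: SynValue }
struct SynDefinitions1(Vec<SynDefinitions1Item>);

// One way to consume the collected repetition: turn the (id, value) pairs
// into a map, detecting duplicate definitions in a single pass.
fn into_map(star: SynDefinitions1) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for item in star.0 {
        if map.insert(item.id.clone(), item.value.0).is_some() {
            return Err(format!("`{}` is defined twice", item.id));
        }
    }
    Ok(map)
}

fn main() {
    let star = SynDefinitions1(vec![
        SynDefinitions1Item { id: "a".into(), value: SynValue("0".into()) },
        SynDefinitions1Item { id: "b".into(), value: SynValue("1".into()) },
    ]);
    let map = into_map(star).unwrap();
    assert_eq!(map["a"], "0");
    assert_eq!(map["b"], "1");
}
```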
If there are alternatives in the repetition, like x -> (α | β)*, the type inside the vector is an enum with the variants, so that each item can be either α or β. It can be very handy in simple situations, but it’s not to be abused; when the repeated item becomes too complex, another loop type may be preferable.
At this point, you might be wondering: isn’t definitions_1 right-recursive? And didn’t we say that a right recursion produced the results in reverse order and had a few other downsides? Are we getting the items in reverse order here too?
While it’s indeed right-recursive, which is the only way to loop in LL(1), the original rule wasn’t meant as a right recursion. Therefore, Lexigram knows that the trailing definitions_1 of production 1 below doesn’t need to be known before using the items Id and value.
0: definitions -> definitions_1
1: . definitions_1 -> "def" Id "=" value ";" definitions_1
2: . definitions_1 -> ε
Let’s see what happens in more detail:
- When production 1 is predicted, the wrapper pushes an empty “loop” vector on the value stack.
- The production symbols are then parsed, and those that have a value are pushed on the wrapper stack; here, Id and value.
- When the definitions_1 symbol at the end of production 1 is met, it isn’t processed as a normal nonterminal to be derived (which would put another empty vector on the stack), but as a loop nonterminal. When the wrapper receives a loop call for that nonterminal, it pops the values from their respective stacks and pushes them as a new item into the loop vector, which is now at the top of the value stack.
- This process iterates until all the items have been parsed.
- When the loop is finished, production 0 is fully parsed, and the wrapper knows that the value of definitions_1, which stands for the original * repetition, is at the top of the value stack.
Note
For the curious minds who know how table-driven parsers work in general.
When a production like definitions_1 -> "def" Id "=" value ";" definitions_1 is predicted, the usual algorithm description tells us that the symbols on the right-hand side are put in reverse order on a stack. Here, it would push definitions_1, ";", value, "=", Id, "def". The parser would then pop the next symbol and, depending on its nature:
- if it’s a terminal, like "def", it would check if the next token matches,
- if it’s a nonterminal, like value, it would predict, from the parsing table, which alternative of that nonterminal’s productions must be added onto the stack.
Lexigram adds other symbols on the stack to call the wrapper:
- exit symbols for the nonterminals, so that the exit_*(…) methods are called at the right time,
- loop symbols to manage the iterations, as we saw above,
- hook symbols to let the user intercept and modify specific tokens.
These symbols are shown in the log (“opcodes”).
In conclusion, the iterations of the parsing loop insert the items in the correct order, and nothing accumulates on the stacks. It’s more straightforward and not as resource-consuming.
The same mechanism is used in other types of loops, like the left recursion and the low-latency loops.
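The steps above can be sketched as a toy predictive loop. This is an illustration of the stack discipline only, not Lexigram’s actual implementation; the Lit/Id/Value/LoopHead/LoopIter symbols are invented for the sketch:

```rust
// Toy predictive parser for:
//   definitions_1 -> "def" Id "=" value ";" definitions_1 | ε
// The trailing recursive nonterminal is followed by a LoopIter opcode, so
// items are delivered in order and the stack depth stays bounded.
#[derive(Clone, Copy)]
enum Sym {
    Lit(&'static str), // fixed terminal: must match the next token
    Id,                // variable terminal: captures an identifier
    Value,             // variable terminal: captures a value
    LoopHead,          // nonterminal standing for the repetition
    LoopIter,          // "loop" opcode: one iteration is finished
}

fn parse(tokens: &[&str]) -> Vec<(String, String)> {
    let mut items: Vec<(String, String)> = Vec::new();
    let (mut id, mut value) = (String::new(), String::new());
    let mut pos = 0;
    let mut stack = vec![Sym::LoopHead];
    while let Some(sym) = stack.pop() {
        match sym {
            Sym::Lit(t) => { assert_eq!(tokens[pos], t); pos += 1; }
            Sym::Id => { id = tokens[pos].to_string(); pos += 1; }
            Sym::Value => { value = tokens[pos].to_string(); pos += 1; }
            Sym::LoopHead => {
                // Predict: on "def", push one iteration (in reverse order,
                // so the first symbol ends up on top); otherwise predict ε.
                if tokens.get(pos) == Some(&"def") {
                    let rhs = [Sym::Lit("def"), Sym::Id, Sym::Lit("="),
                               Sym::Value, Sym::Lit(";"), Sym::LoopIter, Sym::LoopHead];
                    for s in rhs.iter().rev() { stack.push(*s); }
                }
            }
            // Each iteration hands its item over immediately, in order.
            Sym::LoopIter => items.push((std::mem::take(&mut id), std::mem::take(&mut value))),
        }
    }
    assert_eq!(pos, tokens.len());
    items
}

fn main() {
    let items = parse(&["def", "a", "=", "0", ";", "def", "b", "=", "1", ";"]);
    assert_eq!(items[0].0, "a"); // items arrive in source order
    assert_eq!(items[1].0, "b");
}
```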
Plus
For repetitions of one or more items, + can be used; the result is identical to * from the point of view of the listener implementation, except the name plus is used instead of star in the context variant.
definitions -> ("def" Id "=" value ";")+;
What we don’t see in the listener trait is that the final LL(1) grammar is slightly more complicated “internally”. The + is first normalized:
definitions -> definitions_1
. definitions_1 -> "def" Id "=" value ";" definitions_1
. definitions_1 -> "def" Id "=" value ";"
Since the productions begin with identical symbols, "def", Id, "=", and so on, this must be left-factorized, too:
definitions -> definitions_1
. definitions_1 -> "def" Id "=" value ";" definitions_2
. definitions_2 -> definitions_1
. definitions_2 -> ε
Left Recursion
As we saw in the LL(1) grammar limitations, left recursion is transformed into a right recursion, too.
definitions ->
definitions "def" Id "=" value ";"
|
;
The transformed rules look identical to what we saw with the * repetition:
0: definitions -> definitions_1
1: . definitions_1 -> "def" Id "=" value ";" definitions_1
2: . definitions_1 -> ε
Warning
Don’t confuse the ε in production alternative 2 above, which corresponds to the end of the loop, with the ε we had in the right recursion, which corresponded to the “empty” alternative of the original rule. That “empty” is invisible in the case of the left recursion because it’s simplified during the transformation:

0: definitions -> ε definitions_1
                  ^^^
0: definitions ->   definitions_1

Remember that a -> a α | β is transformed as a -> β a_1; a_1 -> α a_1 | ε. Here, β is ε.
The only difference from a * loop is in the wrapper. The context variants are shown as left-recursive, and, unlike the * repetition, the wrapper calls the listener trait method at each iteration, leaving you the choice of the loop value type and how to use it.
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
match ctx {
// definitions -> definitions "def" Id "=" value ";"
CtxDefinitions::V1 { definitions, id, value } => {}
// definitions -> ε
CtxDefinitions::V2 => {}
}
SynDefinitions()
}
fn exitloop_definitions(&mut self, definitions: &mut SynDefinitions) {}
// ...
}
}
Another important difference is the availability of another method called after the last iteration, exitloop_definitions(…). It’s easy to do thanks to the transformation in the grammar: when the alternative 2 is predicted on the last iteration, the wrapper calls the method. In comparison, this wasn’t possible in the case of a right-recursive rule because, when the loop finally exits, the parser must still make all the pending exit calls for the items on the stack. And since there is no separate nonterminal for the loop, it would make calling such an exitloop_*(…) method more complicated.
When Lexigram sees a left recursion, it knows the rightmost nonterminal can be treated differently, and since the original recursion is on the left term, the user expects to see the reductions on the left.
For the same input, the reduction is done as follows:
definitions
─────┬─────
│
─────────────────────┴──────────────────────
definitions "def" Id("c") "=" value("2") ";"
─────┬─────
│
─────────────────────┴──────────────────────
definitions "def" Id("b") "=" value("1") ";"
─────┬─────
│
─────────────────────┴──────────────────────
definitions "def" Id("a") "=" value("0") ";"
─────┬─────
│
ε
The successive calls are, in this case:
- init_definitions()
- exit_definitions(CtxDefinitions::V2), returns w
- exit_definitions(CtxDefinitions::V1 { id: "a", value: "0", definitions: w }) -> x
- exit_definitions(CtxDefinitions::V1 { id: "b", value: "1", definitions: x }) -> y
- exit_definitions(CtxDefinitions::V1 { id: "c", value: "2", definitions: y }) -> z
- exitloop_definitions(&mut z)
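With simplified stand-in types (not the generated ones), that call sequence could be implemented as follows; note how the items arrive in source order and exitloop_definitions(…) offers a single post-processing point:

```rust
use std::collections::HashSet;

// Simplified stand-in for the generated type.
#[derive(Default)]
struct SynDefinitions(Vec<(String, String)>);

// Left recursion delivers the items in source order, so each call can
// simply push onto the accumulated value.
fn exit_definitions(mut defs: SynDefinitions, id: &str, value: &str) -> SynDefinitions {
    defs.0.push((id.to_string(), value.to_string()));
    defs
}

// Called once after the last iteration: a natural place for post-processing,
// here a duplicate check over the whole list.
fn exitloop_definitions(defs: &mut SynDefinitions) {
    let mut seen = HashSet::new();
    for (id, _) in &defs.0 {
        assert!(seen.insert(id.clone()), "`{id}` defined twice");
    }
}

fn main() {
    let w = SynDefinitions::default(); // CtxDefinitions::V2
    let x = exit_definitions(w, "a", "0");
    let y = exit_definitions(x, "b", "1");
    let mut z = exit_definitions(y, "c", "2");
    exitloop_definitions(&mut z);
    assert_eq!(z.0.first().map(|p| p.0.as_str()), Some("a")); // source order
}
```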
Left vs Right
The definitions loop we’ve encoded so far doesn’t quite illustrate a subtlety of left versus right recursion, because its other alternative is empty. To remove any confusion and stress the difference between left and right, let’s introduce a non-empty alternative in each case.
Here is a right-recursive version that ends with an “end”:
defs -> "def" Id "=" value ";" defs | "end" Id;
It corresponds to those productions (no transformation):
0: defs -> "def" Id "=" value ";" defs
1: defs -> "end" Id
It can parse an input that ends with “end”, since that’s the only admissible rightmost token in the original rule:
def a = 0;
def b = 1;
end loop
This would generate those successive calls:
- init_defs()
- exit_defs(CtxDefs::V2 { id: "loop" }) -> x
- exit_defs(CtxDefs::V1 { id: "b", value: "1", defs: x }) -> y
- exit_defs(CtxDefs::V1 { id: "a", value: "0", defs: y }) -> z
And here is a left-recursive rule that begins with “begin”:
defs -> defs "def" Id "=" value ";" | "begin" Id; // defs α | β
or, if that’s easier to visualize,
defs -> "begin" Id | defs "def" Id "=" value ";"; // β | defs α
Lexigram transforms it into those productions:
0: defs -> "begin" Id defs_1                    // β defs_1
1: . defs_1 -> "def" Id "=" value ";" defs_1    // α defs_1
2: . defs_1 -> ε                                // (loop end)
It can parse an input that begins with “begin”, since that’s the only admissible leftmost token in the original rule:
begin loop
def a = 0;
def b = 1;
This would generate those successive calls:
- init_defs()
- exit_defs(CtxDefs::V2 { id: "loop" }) -> x
- exit_defs(CtxDefs::V1 { id: "a", value: "0", defs: x }) -> y
- exit_defs(CtxDefs::V1 { id: "b", value: "1", defs: y }) -> z
- exitloop_defs(&mut z)
Latency
To close the loop, we’ll have a quick look at a variant of right recursion and */+ repetitions. We saw that both those loop constructs had a higher latency. The right recursion puts everything in the stack until the last iteration, and only then starts calling the listener methods. And the repetitions deliver the vector of collected items once the loop has ended.
Yet they provide useful alternatives to left recursion, by allowing independent terms at the end for the right recursion and none at all for the repetition. Wouldn’t it be worth having a way to get the loop items earlier?
That’s why the “low-latency” attribute was introduced.
In a low-latency right recursion, the rule has to be written as follows (the <L> can be placed anywhere in the recursive alternative):
definitions -> "def" Id "=" value ";" <L> definitions | ;
Lexigram doesn’t change the productions:
1: definitions -> "def" Id "=" value ";" definitions
2: definitions -> ε
But the listener methods do change:
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn init_definitions(&mut self) -> SynDefinitions {
SynDefinitions(vec![])
}
fn exit_definitions(&mut self, acc: &mut SynDefinitions, ctx: CtxDefinitions) {
match ctx {
// definitions -> "def" Id "=" value ";" <L> definitions
CtxDefinitions::V1 { id, value } => {
acc.0.push((id, value));
}
// definitions -> ε
CtxDefinitions::V2 => {}
}
}
// ...
}
}
Now,
- self.init_definitions() returns the initial value of the loop variable; for example, an empty vector.
- self.exit_definitions(acc, ctx) is called after parsing the independent term and each iteration.
At first, calling init_definitions(…), then exit_definitions(…) on the independent term may sound redundant, since both could provide the initial value, but separating them lets you initialize the nonterminal before any part of it is parsed, then adjust the value with the independent term if necessary. The other reason is the mutable reference to the loop variable, acc, which is usually more efficient than moving its value at each iteration.
Since it’s a right recursion, there is still no way to know when it’s the last iteration, but now the data can be processed directly in the incoming order. The example of code above will provide a vector containing the items in the correct order. It would also allow you to check that Id hasn’t already been defined before and, if so, to return the appropriate error message. And it would allow you to use an already defined Id as value, without using a second pass.
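Here is a minimal sketch of such checks, with a stand-in accumulator type instead of the generated one, and error handling simplified to a Result:

```rust
use std::collections::HashMap;

// Simplified stand-in: a low-latency accumulator that detects redefinitions
// and resolves identifier references as the items come in.
#[derive(Default)]
struct SynDefinitions(HashMap<String, String>);

fn exit_definitions(acc: &mut SynDefinitions, id: &str, value: &str) -> Result<(), String> {
    if acc.0.contains_key(id) {
        return Err(format!("`{id}` is already defined"));
    }
    // Resolve `def c = a` immediately: look the referenced id up in what
    // has already been parsed, so no second pass is needed.
    let resolved = acc.0.get(value).cloned().unwrap_or_else(|| value.to_string());
    acc.0.insert(id.to_string(), resolved);
    Ok(())
}

fn main() {
    let mut acc = SynDefinitions::default();
    exit_definitions(&mut acc, "a", "0").unwrap();
    exit_definitions(&mut acc, "b", "a").unwrap();      // resolves to "0"
    assert_eq!(acc.0["b"], "0");
    assert!(exit_definitions(&mut acc, "a", "1").is_err()); // redefinition
}
```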
The low-latency repetition is written a little differently, since the name of the loop nonterminal is given explicitly this time:
definitions -> (<L=i_def> "def" Id "=" value ";")*;
The resulting productions are the same, except maybe for the nonterminal name:
1: definitions -> i_def
2: . i_def -> "def" Id "=" value ";" i_def
3: . i_def -> ε
But the listener methods change, too:
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn init_definitions(&mut self) {}
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
// definitions -> (<L> "def" Id "=" value ";")*
let CtxDefinitions::V1 { star } = ctx;
SynDefinitions()
}
fn init_i_def(&mut self) -> SynIDef {
SynIDef(vec![])
}
fn exit_i_def(&mut self, acc: &mut SynIDef, ctx: CtxIDef) {
// `<L> "def" Id "=" value ";"` iteration in
// `definitions -> ( ►► <L> "def" Id "=" value ";" ◄◄ )*`
let CtxIDef::V1 { id, value } = ctx;
acc.0.push((id, value));
}
// ...
}
}
This variant allows you to define the type of the loop variable and to handle it directly.
- self.init_i_def() is called after any symbol preceding the repetition in the original production is parsed, and before the first loop production is parsed.
- self.exit_i_def(acc, ctx) is called after each iteration is parsed.
There is no indication of the last iteration, but since there is an exit_definitions(…), it’s not really necessary; you can do any post-processing on star in that method if necessary.
Note
In the case of the + repetition, there is an additional flag in the context that indicates the last iteration. Unlike the * repetition, it’s very easy to provide.

#![allow(unused)]
fn main() {
CtxIDef::V1 { id: String, value: SynValue, last_iteration: bool }
                                           ^^^^^^^^^^^^^^
}
This concludes the quick tour of the art of loops in LL(1) grammars using Lexigram.
Considering definitions has no independent term, the * repetition, with or without <L>, seems a good way to handle that language element. The low-latency version offers the advantage of directly doing operations that would otherwise require post-processing the final vector or dictionary, so it may be the better choice here.
value & Recap
The values that our constants can represent are varied:
- string literals, for file names, tag names
- unsigned integer literals, for indentation levels
- boolean literals, to enable optional features like spans generation
- identifiers, if one constant equals another
- keywords like stdout, which is a possible output code location
The grammar elements so far are as follows, where some details are not yet defined (the …). This time, we show the full lexicon definitions instead of a simplified grammar:
lexicon Config;
fragment EscChar : '\\' [nrt"\\];
fragment Char : EscChar | ~[\n\r\t"\\];
// punctuation and operators
Equal : '=';
Semicolon : ';';
// keywords
Def : 'def';
Lexer : 'lexer';
Options : 'options';
Parser : 'parser';
Stdout : 'stdout';
// variable terminals
BoolLiteral : 'false' | 'true';
Id : [a-zA-Z][a-zA-Z_0-9]*;
NumLiteral : [0-9]+;
StrLiteral : '"' Char+ '"';
SComment : '/*' .*? '*/' -> skip;
SLineComment : '//' ~[\r\n]* -> skip;
SWhiteSpace : [ \n\r\t]+ -> skip;
grammar Config;
config:
definitions
lexer
parser
options
;
definitions:
(<L=i_def> Def Id Equal value Semicolon)* // any type
;
lexer:
Lexer …
;
parser:
(Parser …)?
;
options:
(Options …)?
;
value:
BoolLiteral
| NumLiteral
| StrLiteral
| Id
| Stdout
;
A few comments on the lexicon:
- The fragment acts as an alias for lexical rules, or parts of lexical rules, that you intend to use multiple times or that you’d rather write separately for the sake of clarity.
- The rules are expressed like simple regular expressions:
  - Matching string literals are between single quotes: 'def'.
  - Sets of characters can be defined between square brackets ([a-z]) or as ellipsis-separated characters ('a'..'z'). The dot . means any character at all.
  - An item can be negated with the prefix ~ (normally used with character sets).
  - Items can be optional (?) or repeated (* or +); non-greedy repetitions are noted *? and +?.
  - Escape codes can be used for usual characters: \n, \r, \t, \\, \', and for Unicode code points: \u{3b5}.
- The fixed keywords are defined before variable tokens like Id, which could conflict with them. For example, “def” matches both the definition of Def and Id. In that case, the rule defined first prevails, which is what we want here. It means we can’t use “def” as an Id, but that’s a usual limitation in programming languages.
- The last three tokens have a -> skip action (everything after -> is called an action). skip discards the token, so the parser will never receive it. In this case, we use that to discard all the comments and the white space between the other tokens. Those rules are placed after everything else, so we’re sure they have the lowest priority.
The value productions are straightforward. The first four tokens are variable, so the parsed input is available in the context variants:
#![allow(unused)]
fn main() {
pub enum CtxValue {
V1 { boolliteral: String }, // value -> BoolLiteral
V2 { numliteral: String }, // value -> NumLiteral
V3 { strliteral: String }, // value -> StrLiteral
V4 { id: String }, // value -> Id
V5, // value -> "stdout"
}
pub trait ConfigListener {
// ...
fn init_value(&mut self) {}
fn exit_value(&mut self, ctx: CtxValue) -> SynValue;
}
}
lexer and parser
The lexer and the parser specifications are similar: they include, between brackets, one or more comma-separated options of the type:
- combined and input: define the location of the lexicon; in the case of combined, it also contains the grammar.
- output: defines the location of the generated code.
- headers: defines header(s) put in front of the generated code, e.g. inner attributes or comments (headers: { "#![allow(unused)]", "#![cfg(tests)]" }).
- indent: defines the indentation (indent: 4).
An input location is either
- a filename (input: "config.l").
- a filename and the name of a tag between square brackets (input: "config.lg" ["lexicon"]).
An output location can be either
- a filename or stdout.
- a filename or stdout, and the name of a tag between square brackets.
Since both lexer and parser use the same types of options, we’ll use a common io_options rule to parse them.
It would be possible to force the grammar to accept either combined or input, and even to accept combined only for the lexer and not the parser, but it would complicate it significantly. It’s much easier to accept a list of any option and to verify them in the listener.
Similarly, we’ll use value for each argument, even if "indent" only accepts a number (or a constant Id holding a number) and "input" only accepts a string literal. Again, it’s easier to check the correct type in the listener than to overload the grammar.
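For example, the listener can reject a non-numeric "indent" argument. The SynValue enum below is a hypothetical user type chosen for this sketch, not something Lexigram imposes:

```rust
#![allow(dead_code)]

// Hypothetical user-defined value type for the sketch: one variant per
// `value` alternative in the grammar.
#[derive(Debug, PartialEq)]
enum SynValue {
    Bool(bool),
    Num(u32),
    Str(String),
    Id(String),
    Stdout,
}

// Verifying in the listener that "indent" received a number keeps the
// grammar simple while still reporting a precise error.
fn check_indent(v: &SynValue) -> Result<u32, String> {
    match v {
        SynValue::Num(n) => Ok(*n),
        _ => Err("`indent` expects an unsigned integer".to_string()),
    }
}

fn main() {
    assert_eq!(check_indent(&SynValue::Num(4)), Ok(4));
    assert!(check_indent(&SynValue::Str("x".into())).is_err());
}
```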
Here are a few small additions to the lexicon:
Colon : ':';
Comma : ',';
Lbracket : '{';
LSbracket : '[';
Rbracket : '}';
RSbracket : ']';
and to the grammar:
lexer:
Lexer Lbracket io_options Rbracket
;
parser:
(Parser Lbracket io_options Rbracket)?
;
io_options:
io_option (<L=i_io_opt> Comma io_option)*
;
io_option:
Combined Colon value tag_opt
| Input Colon value tag_opt
| Output Colon value tag_opt
| Indent Colon value
| Headers Colon Lbracket value (Comma value)* Rbracket
;
tag_opt:
(LSbracket value RSbracket)?
;
The lexer part is simple:
#![allow(unused)]
fn main() {
pub enum CtxLexer {
/// `lexer -> "lexer" "{" io_options "}"`
V1 { io_options: SynIoOptions },
}
pub trait ConfigListener {
// ...
fn init_lexer(&mut self) {}
fn exit_lexer(&mut self, ctx: CtxLexer) -> SynLexer;
}
}
The parser part adds an alternative used when there is no parser (the listener methods are similar to the lexer’s):
#![allow(unused)]
fn main() {
pub enum CtxParser {
/// `parser -> "parser" "{" io_options "}"`
V1 { io_options: SynIoOptions },
/// `parser -> ε`
V2,
}
}
The io_options is a low-latency repetition and has a little surprise in reserve for us:
#![allow(unused)]
fn main() {
pub enum CtxIoOptions {
/// `io_options -> io_option (<L> "," io_option)*`
V1 { star: SynIIoOpt },
}
pub enum InitCtxIIoOpt {
/// value of `io_option` before `<L> "," io_option` iteration in
/// `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
V1 { io_option: SynIoOption },
}
pub enum CtxIIoOpt {
/// `<L> "," io_option` iteration in
/// `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
V1 { io_option: SynIoOption },
}
pub trait ConfigListener {
// ...
fn init_io_options(&mut self) {}
fn exit_io_options(&mut self, ctx: CtxIoOptions) -> SynIoOptions;
fn init_i_io_opt(&mut self, ctx: InitCtxIIoOpt) -> SynIIoOpt;
fn exit_i_io_opt(&mut self, acc: &mut SynIIoOpt, ctx: CtxIIoOpt);
fn exitloop_i_io_opt(&mut self, acc: &mut SynIIoOpt) {}
}
}
There’s an extra context type InitCtxIIoOpt, and CtxIoOptions is missing the io_option field for the nonterminal in front of the repetition!
What happened here is that Lexigram recognized that io_option ("," io_option)* was a list of io_option items separated by a comma. It changed the wrapper to make our work easier with a first call init_i_io_opt(…) handing over the first value (the one before the repetition). The successive calls exit_i_io_opt(…) give the loop items inside the repetition.
Without that feature, exit_i_io_opt(…) would give items 2 to n, and finally exit_io_options(…) would give item 1 and our accumulator for items 2 to n. If star was a vector and nothing else mattered, we could get away with inserting item 1 at the front. But if we needed to check that all the items are given in the correct order or without duplicates, we’d need extra post-processing, so we’d lose the benefit of a low-latency loop. Here, we get all the items in time and in the right order.
Since it’s an <L> loop, init_i_io_opt(…) returns the first value of the accumulator, and exit_i_io_opt(…) gets a mutable reference to it while it’s still on the stack. This is more efficient than popping the accumulator from the stack, moving it into the context and then out again, and finally pushing the modified value on the stack again.
We find something similar in io_option, although it’s not as conspicuous:
#![allow(unused)]
fn main() {
pub enum CtxIoOption {
/// `io_option -> "combined" ":" value tag_opt`
V1 { value: SynValue, tag_opt: SynTagOpt },
/// `io_option -> "input" ":" value tag_opt`
V2 { value: SynValue, tag_opt: SynTagOpt },
/// `io_option -> "output" ":" value tag_opt`
V3 { value: SynValue, tag_opt: SynTagOpt },
/// `io_option -> "indent" ":" value`
V4 { value: SynValue },
/// `io_option -> "headers" ":" "{" value ("," value)* "}"`
V5 { star: SynIoOption1 },
}
pub enum CtxTagOpt {
/// `tag_opt -> "[" value "]"`
V1 { value: SynValue },
/// `tag_opt -> ε`
V2,
}
pub trait ConfigListener {
// ...
fn init_io_option(&mut self) {}
fn exit_io_option(&mut self, ctx: CtxIoOption) -> SynIoOption;
}
}
The value in front of the "headers" repetition is also missing in the context, but there’s no extra context or method this time. Since it’s a non-<L> repetition, the wrapper takes care of everything and provides us with an ordered vector. If we parse the following input:
headers: { "a", "b", "c" }
we get a vector vec![a, b, c] in star.0, where a is the SynValue for “a”, and so on.
This token-separated-items feature is only available for * repetitions, <L> and non-<L>, since it wouldn’t make much sense for +. It’s automatically detected for this pattern:
... α (β α)* ...
where
- α is a list of symbols, terminals and/or nonterminals.
- β can be any number of terminals, but they can’t hold a value. It also works if there’s no β, but you may as well write α+ in that case.
A proper syntax will likely be added in the future for those repetitions.
options
The general options are different from the io_options of the lexer and the parser, although indent and headers are common. They include, between brackets, one or more comma-separated options:
- headers and indent: define the headers and the indentation level for all the components; the syntax is the same as before.
- libs: defines extra bindings for the wrapper; e.g. libs: { "super::listener_types", "std::collections::HashMap" }
- nt-value: defines which nonterminals hold a value of the type Syn<nonterminal name>:
  - nt-value: default: the nonterminals defined in the grammar and the <L> nonterminals hold a value (which is the default behaviour).
  - nt-value: none: all the nonterminals are valueless. They’re not in the context variants, and there’s no value to return from the exit_*(…) methods. It’s a good option if all the values can easily be stored in the listener.
  - nt-value: parents: only the nonterminals defined in the grammar hold a value. Nonterminals generated from transformations like left recursion don’t hold any value.
  - nt-value: set { "a", "b", "c" }: only the nonterminals a, b, and c hold a value.
- spans: followed by a boolean value, enables or disables span generation.
We need a few extra keywords in the lexicon:
Libs : 'libs';
NTValue : 'nt-value';
Options : 'options';
Spans : 'spans';
The main grammar rules follow a similar pattern to what we did for the lexer and parser:
options:
(Options Lbracket global_options Rbracket)?
;
global_options:
global_option (<L=i_global_opt> Comma global_option)*
;
global_option:
Headers Colon Lbracket value (Comma value)* Rbracket
| Indent Colon value
| Libs Colon Lbracket value (Comma value)* Rbracket
| NTValue Colon nt_value
| Spans Colon value
;
For global_options, and for "headers" and "libs" in global_option, we’re using the same token-separated-items lists as before. Since there’s nothing new, we’re not going to look at the generated code, which we’ll see in the implementation section anyway.
The last nonterminal to define is nt_value. Here are the last keywords we need for the lexer:
Default : 'default';
None : 'none';
Parents : 'parents';
Set : 'set';
And here is the grammar part. We use value for the nonterminal names instead of Id because we want to avoid collisions with the existing keywords. For example, we wouldn’t be able to use nonterminal names like lexer and parser in that list, since the lexer wouldn’t identify them as Id, but as other tokens. There is a way around that with more advanced features of Lexigram, though, as you can see in the gen_typedef / typedef example.
nt_value:
Default
| None
| Parents
| Set Lbracket value (Comma value)* Rbracket
;
Final Grammar
lexicon Config;
fragment EscChar : '\\' [nrt"\\];
fragment Char : EscChar | ~[\n\r\t"\\];
// punctuation and operators
Colon : ':';
Comma : ',';
Equal : '=';
Lbracket : '{';
LSbracket : '[';
Rbracket : '}';
RSbracket : ']';
Semicolon : ';';
// keywords
Combined : 'combined';
Def : 'def';
Default : 'default';
Headers : 'headers';
Indent : 'indent';
Input : 'input';
Lexer : 'lexer';
Libs : 'libs';
None : 'none';
NTValue : 'nt-value';
Options : 'options';
Output : 'output';
Parents : 'parents';
Parser : 'parser';
Set : 'set';
Spans : 'spans';
Stdout : 'stdout';
// variable terminals
BoolLiteral : 'false' | 'true';
Id : [a-zA-Z][a-zA-Z_0-9]*;
NumLiteral : [0-9]+;
StrLiteral : '"' Char+ '"';
SComment : '/*' .*? '*/' -> skip;
SLineComment : '//' ~[\r\n]* -> skip;
SWhiteSpace : [ \n\r\t]+ -> skip;
grammar Config;
config:
definitions
lexer
parser
options
;
definitions:
(<L=i_def> Def Id Equal value Semicolon)*
;
lexer:
Lexer Lbracket io_options Rbracket
;
parser:
(Parser Lbracket io_options Rbracket)?
;
options:
(Options Lbracket global_options Rbracket)?
;
io_options:
io_option (<L=i_io_opt> Comma io_option)*
;
io_option:
Combined Colon value tag_opt
| Input Colon value tag_opt
| Output Colon value tag_opt
| Indent Colon value
| Headers Colon Lbracket value (Comma value)* Rbracket
;
tag_opt:
(LSbracket value RSbracket)?
;
global_options:
global_option (<L=i_global_opt> Comma global_option)*
;
global_option:
Headers Colon Lbracket value (Comma value)* Rbracket
| Indent Colon value
| Libs Colon Lbracket value (Comma value)* Rbracket
| NTValue Colon nt_value
| Spans Colon value
;
value:
BoolLiteral
| NumLiteral
| StrLiteral
| Id
| Stdout
;
nt_value:
Default
| None
| Parents
| Set Lbracket value (Comma value)* Rbracket
;
Listener Implementation
Now that the grammar has been designed, we’ll generate the code and implement the parser.
We’d like our configuration parser to give an Options object, ready to be used with try_gen_parser(…) to generate a parser from Rust code:
#![allow(unused)]
fn main() {
let options = ... ; // what's returned by the parser
match try_gen_parser(Action::Generate, options) {
Ok(log) => {
println!("Code generated");
assert!(
log.has_no_warnings(),
"unexpected warning(s):\n{}",
log.get_warnings().join("\n"));
}
Err(build_error) => panic!("{build_error}"),
}
}
Starting Point
Let’s first generate the lexer, parser, and the wrapper. It would also be good to keep a template of the user types and listener implementation; if we change the grammar at some point, it’s easier to update the listener code by checking how the templates changed. Those templates are generated alongside the code on demand.
Create a new library project, then add src/parser.rs and src/test.rs. For now, add the lexigram-core dependency.
Apart from the top-level parser definitions, we intend to have a few modules in parser.rs:
- listener: where we implement the generated listener trait (we’ll initially copy that from the template).
- listener_types: where we declare all the user types (we’ll initially copy that from the template).
- config_lexer: the generated lexer.
- config_parser: the generated low-level parser.
You can change the structure, of course, but that’s what we’ll assume in the code samples below.
Here’s what the initial parser.rs file should look like. We’re adding attributes to avoid unnecessary warnings about unused items for now; we’ll remove them later.
#![allow(unused)]
fn main() {
// top-level parser
mod listener {
// listener implementation
}
#[allow(unused)]
mod listener_types {
// listener types
}
#[allow(unused)]
mod config_lexer {
// [config_lexer]
// [config_lexer]
}
#[allow(unused)]
mod config_parser {
// [config_parser]
// [config_parser]
}
}
Create another file in the project’s root directory, templates.txt, with two pairs of tags for the templates:
// -----------------------------------------------------------------------------
// [template_user_types]
// [template_user_types]
// -----------------------------------------------------------------------------
// [template_listener_impl]
// [template_listener_impl]
// -----------------------------------------------------------------------------
Add those files to version control (e.g. Git), so you can see the differences each time you generate a new version of the parser.
Generating From Code
If you want to create the code programmatically, create an additional binary crate gen_config and add it as a workspace member. It needs lexi-gram as a dependency, too.
Write the code that generates the parser in main.rs, with paths relative to the main project’s directory (that’s where the binary should be run from):
use lexi_gram::lexigram_lib;
use lexi_gram::{gencode, genspec};
use lexi_gram::gen_parser::try_gen_parser;
use lexi_gram::options::{Action, OptionsBuilder};
use lexigram_lib::CollectJoin;
use lexigram_lib::log::LogStatus;
static LEXICON_FILE: &str = "src/config.l";
static GRAMMAR_FILE: &str = "src/config.g";
static DEST_FILE: &str = "src/parser.rs";
static LEXER_TAG: &str = "config_lexer";
static PARSER_TAG: &str = "config_parser";
static DEST_TEMPLATES: &str = "templates.txt";
static USERS_TAG: &str = "template_user_types";
static LISTENER_TAG: &str = "template_listener_impl";
fn main() {
let options = OptionsBuilder::new()
.indent(4)
.lexer(
genspec!(filename: LEXICON_FILE),
gencode!(filename: DEST_FILE, tag: LEXER_TAG))
.parser(
genspec!(filename: GRAMMAR_FILE),
gencode!(filename: DEST_FILE, tag: PARSER_TAG))
.types_code(gencode!(filename: DEST_TEMPLATES, tag: USERS_TAG))
.listener_code(gencode!(filename: DEST_TEMPLATES, tag: LISTENER_TAG))
.libs(["super::listener_types::*"])
.span_params(true)
.build()
.expect("should have no error");
match try_gen_parser(Action::Generate, options) {
Ok(log) => {
println!("Code generated in {DEST_FILE}\n{log}");
assert!(log.has_no_warnings(), "unexpected warning(s)");
}
Err(build_error) => panic!("{build_error}"),
}
}
Generating From CLI
Make sure you have installed the Lexigram binary.
If the project has a single crate, with the lexicon and grammar files in the src subdirectory, the following command should generate the same code:
lexigram --indent 4 \
-x src/config.l -l "src/parser.rs" tag config_lexer \
-g src/config.g -p "src/parser.rs" tag config_parser \
--types templates.txt tag template_user_types \
--listener templates.txt tag template_listener_impl \
--lib "super::listener_types::*" --spans
Basic Parser
Before implementing the full parser, it’s a good idea to test a minimal implementation with all the generated code.
We can start by copying the content of the templates:
- [template_user_types]: copy what’s between the tags into the listener_types module.
- [template_listener_impl]: copy what’s between the tags into the listener module.
The top-level parser needs a field each for the lexer, the low-level parser, and the wrapper. The listener is inside the wrapper, so you don’t need a field for it at the top level.
We’ll provide the input as a &str, so we need an extra lifetime 'ls:
#![allow(unused)]
fn main() {
pub struct ConfigParser<'l, 'p, 'ls> {
lexer: Lexer<'l, Cursor<&'l str>>,
parser: Parser<'p>,
wrapper: Option<Wrapper<Listener<'ls>>>,
}
}
In the listener module, let’s add a reference to the input to the listener, as a vector of references to each line. It’s not mandatory, but we’ll show a simple way to point at the error by annotating the relevant input lines. We should also add the bindings for PosSpan and Logger shown at the top of the snippet, which are not generated automatically in the template.
#![allow(unused)]
fn main() {
mod listener {
use lexigram_core::lexer::PosSpan;
use lexigram_core::log::Logger;
use super::*;
pub(super) struct Listener<'ls> {
pub log: BufLog,
lines: Option<Vec<&'ls str>>,
}
impl<'ls> Listener<'ls> {
pub fn new() -> Self {
Listener {
log: BufLog::new(),
lines: None,
}
}
pub fn attach_lines(&mut self, lines: Vec<&'ls str>) {
self.lines = Some(lines);
}
}
// ... (the rest remains as in the template)
}
}
We’re now ready to implement the top parser.
#![allow(unused)]
fn main() {
use std::io::Cursor;
use listener_types::*;
use config_lexer::build_lexer;
use config_parser::*;
use listener::Listener;
use lexigram_core::char_reader::CharReader;
use lexigram_core::lexer::{Lexer, TokenSpliterator};
use lexigram_core::log::{BufLog, Logger, LogStatus};
use lexigram_core::parser::Parser;
const VERBOSE_WRAPPER: bool = false;
impl<'l, 'ls: 'l> ConfigParser<'l, '_, 'ls> {
/// Creates a new parser
pub fn new() -> Self {
let lexer = build_lexer();
let parser = build_parser();
ConfigParser { lexer, parser, wrapper: None }
}
/// Parses a text.
///
/// On success, returns the log.
///
/// On failure, returns the log with the error messages.
pub fn parse(&mut self, text: &'ls str) -> Result<BufLog, BufLog> {
self.wrapper = Some(Wrapper::new(Listener::new(), VERBOSE_WRAPPER));
let stream = CharReader::new(Cursor::new(text));
self.lexer.attach_stream(stream);
self.wrapper.as_mut().unwrap()
.get_listener_mut()
.attach_lines(text.lines().collect());
let tokens = self.lexer.tokens().keep_channel0();
let result = self.parser.parse_stream(self.wrapper.as_mut().unwrap(), tokens);
let Listener { mut log, .. } = self.wrapper.take().unwrap().give_listener();
if let Err(e) = result {
log.add_error(e.to_string());
}
if log.has_no_errors() {
Ok(log)
} else {
Err(log)
}
}
}
}
The idea we’re following is to put the wrapper inside an Option (yes, we’re wrapping the wrapper). That way, we can keep the same parser object, even if we parse several inputs in a loop. Each time, a new wrapper/listener is created with the references to the new input, and it’s destroyed after the parsing is done. It makes it easier to deal with the lifetimes, and it avoids reusing a listener which may have remnants of the previous input.
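That pattern doesn’t depend on Lexigram; here is a minimal, self-contained sketch of it with stand-in types (MyParser, Listener, and the “empty input” check below are all hypothetical):

```rust
// Stand-in for the generated listener; in the real code it sits inside a Wrapper.
struct Listener {
    errors: Vec<String>,
}

struct MyParser {
    // Created anew for each parse, consumed when the parse is done.
    listener: Option<Listener>,
}

impl MyParser {
    fn new() -> Self {
        MyParser { listener: None }
    }

    fn parse(&mut self, input: &str) -> Vec<String> {
        // fresh listener: no remnants of a previous input
        self.listener = Some(Listener { errors: Vec::new() });
        if input.is_empty() {
            self.listener
                .as_mut()
                .unwrap()
                .errors
                .push("empty input".to_string());
        }
        // take() destroys the listener and recovers what it accumulated,
        // so the same MyParser can be reused for the next input
        self.listener.take().unwrap().errors
    }
}
```

The same MyParser value can parse several inputs in a loop, each parse getting a fresh listener.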
All we need now is a test. Add this to src/tests.rs:
#![allow(unused)]
#![cfg(test)]
fn main() {
use crate::parser::ConfigParser;
#[test]
fn test_run() {
let mut p = ConfigParser::new();
for (i, src) in [SRC1, SRC2].into_iter().enumerate() {
println!("source #{i}");
match p.parse(src) {
Ok(log) => println!("success\n{log}"),
Err(log) => panic!("error\n{log}"),
}
}
}
// ---------------------------------------------------------
static SRC1: &str = r#"
def SOURCE_FILENAME = "../watcher/src/lib.rs";
lexer {
combined: "src/watcher.lg",
output: SOURCE_FILENAME ["watcher_lexer"],
indent: 4
}
parser {
output: SOURCE_FILENAME ["watcher_parser"],
indent: 4
}
options {
nt-value: none,
spans: true
}
"#;
static SRC2: &str = r#"
def LEXICON_GRAMMAR_FILENAME = "src/microcalc.lg";
def SOURCE_FILENAME = "../microcalc/src/main.rs";
lexer {
input: LEXICON_GRAMMAR_FILENAME,
output: SOURCE_FILENAME ["microcalc_lexer"],
indent: 4
}
parser {
input: LEXICON_GRAMMAR_FILENAME,
output: SOURCE_FILENAME ["microcalc_parser"],
indent: 4
}
options {
libs: { "super::listener_types::*" },
nt-value: default
}
"#;
}
You can launch the test to verify that it works. Then corrupt one of the inputs to verify that the error is detected.
Storing the Options
Since Options and its fields are public, we’ll simply store an instance in the listener and set its fields according to what is parsed.
By default, the generated code relies only on the lexigram-core crate. That’s what we used in our first test. But Options is in lexi-gram, which also depends on lexigram-core, as we explain in the Lexigram crates chapter.
Rather than using two dependencies, we’ll use only lexi-gram, from which we’ll take lexigram-core. For example with this binding:
#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core;
}
Let’s add this binding for the lexer and the low-level parser, which also need lexigram_core.
- If you generated the parser from code, insert a .headers(…) call and generate the code again:
#![allow(unused)]
fn main() {
let options = OptionsBuilder::new()
.headers(["use lexi_gram::lexigram_lib::lexigram_core;"])
.indent(4)
.lexer(
genspec!(filename: LEXICON_FILE),
gencode!(filename: DEST_FILE, tag: LEXER_TAG))
.parser(
genspec!(filename: GRAMMAR_FILE),
gencode!(filename: DEST_FILE, tag: PARSER_TAG))
.types_code(gencode!(filename: DEST_TEMPLATES, tag: USERS_TAG))
.listener_code(gencode!(filename: DEST_TEMPLATES, tag: LISTENER_TAG))
.libs(["super::listener_types::*"])
.span_params(true)
.build()
.expect("should have no error");
}
- If you generated the parser from the command line, add a --header option and generate the code again:
lexigram --indent 4 --header "use lexi_gram::lexigram_lib::lexigram_core;" \
-x src/config.l -l "src/parser.rs" tag config_lexer \
-g src/config.g -p "src/parser.rs" tag config_parser \
--types templates.txt tag template_user_types \
--listener templates.txt tag template_listener_impl \
--lib "super::listener_types::*" --spans
In parser.rs, insert this at the top of the bindings:
#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core;
}
The listener can now have its new options field:
#![allow(unused)]
fn main() {
pub(super) struct Listener<'ls> {
pub options: Options,
pub log: BufLog,
lines: Option<Vec<&'ls str>>,
}
impl<'ls> Listener<'ls> {
pub fn new() -> Self {
Listener {
options: Options::default(),
log: BufLog::new(),
lines: None,
}
}
// ...
}
}
Check that everything compiles and that the tests still run.
Now we’re ready to modify the listener implementation.
Storing the Constants
The next task is to store the constants that are defined at the beginning. They can be used anywhere, so they could contain any type of value defined in the grammar.
In the listener_types module, we define the type that holds any type of value, and we add a field to the listener (which also needs to be initialized in new(…)):
#![allow(unused)]
fn main() {
/// User-defined type for `value`
#[derive(Clone, PartialEq, Debug)]
pub(super) enum SynValue {
Error,
Bool(bool),
Num(usize),
Str(String),
CodeStdout,
}
pub(super) struct Listener<'ls> {
// ...
consts: HashMap<String, SynValue>,
}
}
Back in the listener module, we can now modify exit_value(…):
#![allow(unused)]
fn main() {
fn exit_value(&mut self, ctx: CtxValue, spans: Vec<PosSpan>) -> SynValue {
match ctx {
// value -> BoolLiteral
CtxValue::V1 { boolliteral } => SynValue::Bool(boolliteral == "true"),
// value -> NumLiteral
CtxValue::V2 { numliteral } => {
match usize::from_str(&numliteral) {
Ok(v) => SynValue::Num(v),
Err(e) => {
self.log.add_error(format!("at {}, {e}", spans[0]));
SynValue::Error
}
}
}
// value -> StrLiteral
CtxValue::V3 { mut strliteral } => {
strliteral.remove(0); // remove quotes
strliteral.pop();
SynValue::Str(strliteral)
}
// value -> Id
CtxValue::V4 { id } => {
if let Some(v) = self.consts.get(&id) {
v.clone()
} else {
self.log.add_error(format!("at {}, {id} is not defined", spans[0]));
SynValue::Error
}
}
// value -> "stdout"
CtxValue::V5 => SynValue::CodeStdout,
}
}
}
The spans vector contains the location of the symbols in the corresponding production. Note that there are positions for symbols that don’t have any value, like fixed terminals. The type implements Display to show the location of one or more characters as “line:column”, “line:col1-col2”, or “line1:col1-line2:col2”. Later, we’ll update that to show the actual input that produces the error, but it’s enough for now.
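As an illustration of the three Display formats described (this Span type is a simplified stand-in, not the actual PosSpan):

```rust
use std::fmt;

// Simplified stand-in for a position span: start (l1, c1) and end (l2, c2),
// both 1-based and inclusive. The real PosSpan lives in lexigram_core.
struct Span {
    l1: usize,
    c1: usize,
    l2: usize,
    c2: usize,
}

impl fmt::Display for Span {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.l1 != self.l2 {
            // span crossing lines: "line1:col1-line2:col2"
            write!(f, "{}:{}-{}:{}", self.l1, self.c1, self.l2, self.c2)
        } else if self.c1 != self.c2 {
            // several characters on one line: "line:col1-col2"
            write!(f, "{}:{}-{}", self.l1, self.c1, self.c2)
        } else {
            // a single character: "line:column"
            write!(f, "{}:{}", self.l1, self.c1)
        }
    }
}
```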
The constant definitions are parsed by the definitions rule, but we’re mostly interested in exit_i_def(…):
#![allow(unused)]
fn main() {
fn exit_definitions(&mut self, _: CtxDefinitions, _: Vec<PosSpan>) -> SynDefinitions {
SynDefinitions()
}
fn init_i_def(&mut self) -> SynIDef {
SynIDef()
}
fn exit_i_def(&mut self, acc: &mut SynIDef, ctx: CtxIDef, spans: Vec<PosSpan>) {
// `<L> "def" Id "=" value ";"` iteration in
// `definitions -> ( ►► <L> "def" Id "=" value ";" ◄◄ )*`
let CtxIDef::V1 { id, value } = ctx;
if self.consts.contains_key(&id) {
self.log.add_error(format!("at {}, redefined constant {id}", spans[4]));
} else if value != SynValue::Error {
self.consts.insert(id, value);
}
}
}
Note that we use spans[4], but value is the 4th item in the production (<L> doesn’t count). Why?
In a loop, there’s an extra value at index 0, which contains the accumulated positions so far. So here, spans[0] spans all the definitions parsed, including the current one, spans[1] is "def", and so on.
To have a peek at the results, we can implement the listener’s exit(…) method:
#![allow(unused)]
fn main() {
fn exit(&mut self, config: SynConfig, span: PosSpan) {
let mut defs = self.consts.iter()
.map(|(k, v)| format!("\n- {k}: {v:?}"))
.to_vec();
defs.sort();
println!("definitions:{}", defs.join(""));
}
}
to_vec requires the binding use lexi_gram::lexigram_lib::CollectJoin;. Alternatively, you can use a plain collect.
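For reference, here is the collect-only variant of the same listing logic, sketched with a plain HashMap (list_consts is a hypothetical helper using only the standard library):

```rust
use std::collections::HashMap;

// Same listing logic as in exit(…): one "- key: value" line per constant,
// sorted by name, using collect instead of to_vec.
fn list_consts(consts: &HashMap<String, String>) -> String {
    let mut defs: Vec<String> = consts
        .iter()
        .map(|(k, v)| format!("\n- {k}: {v:?}"))
        .collect();
    defs.sort();
    format!("definitions:{}", defs.join(""))
}
```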
Run the test again to verify the definitions, and try to add new ones, including redefinitions and references to undefined constants to check the errors.
io_options
Next, we have the production rules of the lexer and parser groups, both using io_options to parse the options.
lexer:
Lexer Lbracket io_options Rbracket
;
io_options:
io_option (<L=i_io_opt> Comma io_option)*
;
io_option:
Combined Colon value tag_opt // string
| Input Colon value tag_opt // string
| Output Colon value tag_opt // string
| Indent Colon value // num
| Headers Colon Lbracket value (Comma value)* Rbracket // string
;
tag_opt:
(LSbracket value RSbracket)? // string
;
We have two possible approaches.
In the first one, we store in a new field of the listener whether we’re parsing the lexer or the parser when we enter those rules (init_lexer(…) and init_parser(…)). In io_option, we set the appropriate values of self.options depending on the production alternative and the subject (lexer/parser). The nonterminals io_options and io_option don’t hold a value.
In the second approach, io_option holds a value of a new type that contains the possible options (Specification, CodeLocation, usize, …). We fold those values into an accumulator in io_options, detecting repeated options along the way; lexer and parser then detect the remaining errors and store the options in self.options.
Both work, but we’ll choose the second approach because there’s less coupling and the responsibilities are better separated.
For io_option, we define the following type in the listener_types module (Specification and CodeLocation must be imported):
#![allow(unused)]
fn main() {
/// User-defined type for `io_option`
#[derive(Clone, PartialEq, Debug)]
pub enum SynIoOption {
Error,
Combined(Specification),
Spec(Specification),
Code(CodeLocation),
Indent(usize),
Headers(Vec<String>),
}
}
The Error variant is added in case an error is encountered when parsing the nonterminal.
tag_opt, which is used in io_option, is just an optional string for the tag:
#![allow(unused)]
fn main() {
/// User-defined type for `tag_opt`
pub type SynTagOpt = Option<String>;
}
For io_options, here is the “accumulator” type we’ll use to fold the option values:
#![allow(unused)]
fn main() {
/// User-defined type for `io_options`
#[derive(Clone, PartialEq, Debug)]
pub struct SynIoOptions {
pub is_combined: bool,
pub spec: Specification,
pub code: CodeLocation,
pub indent_opt: Option<usize>,
pub headers: Vec<String>,
}
impl Default for SynIoOptions {
fn default() -> Self {
SynIoOptions {
is_combined: false,
spec: Specification::None,
code: CodeLocation::None,
indent_opt: None,
headers: vec![],
}
}
}
}
Here is its implementation:
#![allow(unused)]
fn main() {
impl SynIoOptions {
pub fn new() -> Self {
Self::default()
}
pub fn fold(&mut self, io_option: SynIoOption) -> Result<(), &str> {
match io_option {
SynIoOption::Error => {}
SynIoOption::Combined(spec) => {
if self.spec.is_none() {
self.is_combined = true;
self.spec = spec;
} else {
return Err("more than one specification");
}
}
SynIoOption::Spec(spec) => {
if self.spec.is_none() {
self.spec = spec;
} else {
return Err("more than one specification");
}
}
SynIoOption::Code(code) => {
if self.code.is_none() {
self.code = code;
} else {
return Err("more than one code location");
}
}
SynIoOption::Indent(indent) => {
if self.indent_opt.is_none() {
self.indent_opt = Some(indent);
} else {
return Err("more than one indentation");
}
}
SynIoOption::Headers(headers) => {
self.headers.extend(headers);
}
}
Ok(())
}
}
}
We don’t check whether combined is allowed at this level; that check and its effect are handled in lexer and parser.
Here’s the implementation of exit_io_option(…), which is a little long because of the different cases and the error checking, but otherwise straightforward.
#![allow(unused)]
fn main() {
fn exit_io_option(&mut self, ctx: CtxIoOption, spans: Vec<PosSpan>) -> SynIoOption {
fn get_spec(value: SynValue, tag_opt: SynTagOpt) -> Result<Specification, &'static str> {
match value {
SynValue::Error => Ok(genspec!(none)),
SynValue::Bool(_)
| SynValue::Num(_)
| SynValue::CodeStdout => Err("string"),
SynValue::Str(s) => {
if let Some(tag) = tag_opt {
Ok(genspec!(filename: s, tag: tag))
} else {
Ok(genspec!(filename: s))
}
}
}
}
match ctx {
// io_option -> "combined" ":" value tag_opt
CtxIoOption::V1 { value, tag_opt } => {
match get_spec(value, tag_opt) {
Ok(spec) => SynIoOption::Combined(spec),
Err(expected) => {
self.log.add_error(format!("at {}, expected {expected}", spans[2]));
SynIoOption::Error
}
}
}
// io_option -> "input" ":" value tag_opt
CtxIoOption::V2 { value, tag_opt } => {
match get_spec(value, tag_opt) {
Ok(spec) => SynIoOption::Spec(spec),
Err(expected) => {
self.log.add_error(format!("at {}, expected {expected}", spans[2]));
SynIoOption::Error
}
}
}
// io_option -> "output" ":" value tag_opt
CtxIoOption::V3 { value, tag_opt } => {
match value {
SynValue::Error => SynIoOption::Error,
SynValue::Bool(_)
| SynValue::Num(_) => {
self.log.add_error(
format!("at {}, expected string or stdout", spans[2]));
SynIoOption::Error
}
SynValue::Str(s) => {
if let Some(tag) = tag_opt {
SynIoOption::Code(gencode!(filename: s, tag: tag))
} else {
SynIoOption::Code(gencode!(filename: s))
}
}
SynValue::CodeStdout => SynIoOption::Code(gencode!(stdout)),
}
}
// io_option -> "indent" ":" value
CtxIoOption::V4 { value } => {
if let SynValue::Num(indent) = value {
SynIoOption::Indent(indent)
} else {
if value != SynValue::Error {
self.log.add_error(
format!("at {}, expected number instead of {value:?}", spans[2]));
}
SynIoOption::Error
}
}
// io_option -> "headers" ":" "{" value ("," value)* "}"
CtxIoOption::V5 { star: SynIoOption1(values) } => {
match self.values_to_strings(values, &spans[3]) {
Ok(headers) => SynIoOption::Headers(headers),
Err(()) => SynIoOption::Error,
}
}
}
}
}
We also add this listener method, which converts a Vec<SynValue> to a Vec<String>, because it will be used several times:
#![allow(unused)]
fn main() {
impl<'ls> Listener<'ls> {
// ...
fn values_to_strings(&mut self, values: Vec<SynValue>, spans: &PosSpan) -> Result<Vec<String>, ()> {
let values_len = values.len();
let strings = values.into_iter().enumerate().filter_map(|(i, v)| {
if let SynValue::Str(s) = v {
Some(s)
} else {
self.log.add_error(format!("at {}, item #{} isn't a string", spans, i + 1));
None
}
}).to_vec();
if strings.len() == values_len {
Ok(strings)
} else {
Err(())
}
}
}
}
Finally, parsing the optional tag:
#![allow(unused)]
fn main() {
fn exit_tag_opt(&mut self, ctx: CtxTagOpt, spans: Vec<PosSpan>) -> SynTagOpt {
match ctx {
// tag_opt -> "[" value "]"
CtxTagOpt::V1 { value } => {
if let SynValue::Str(s) = value {
Some(s)
} else {
self.log.add_error(format!("at {}, expected a string", spans[1]));
None
}
}
// tag_opt -> ε
CtxTagOpt::V2 => None,
}
}
}
i_io_opt, which iterates on io_option items and folds them, has the same type as io_options:
#![allow(unused)]
fn main() {
/// User-defined type for `<L> "," io_option` iteration in `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
pub type SynIIoOpt = SynIoOptions;
}
Remember that io_option items are separated by a comma. They’re defined in the grammar as shown below, but Lexigram sees that the first io_option, outside the repetition, follows the same pattern as the repetition (without the comma), so it includes that first item in the repetition:
io_options: io_option (<L=i_io_opt> Comma io_option)*;
Since it’s a <L> repetition, we get the two methods below. The first one gets the first item, and the second gets the remaining items.
#![allow(unused)]
fn main() {
fn init_i_io_opt(&mut self, ctx: InitCtxIIoOpt, spans: Vec<PosSpan>) -> SynIIoOpt {
// value of `io_option` before `<L> "," io_option` iteration in
// `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
let InitCtxIIoOpt::V1 { io_option } = ctx;
let mut acc = SynIIoOpt::new();
if let Err(e) = acc.fold(io_option) {
self.log.add_error(format!("at {}, {e}", spans[0]));
}
acc
}
fn exit_i_io_opt(&mut self, acc: &mut SynIIoOpt, ctx: CtxIIoOpt, spans: Vec<PosSpan>) {
// `<L> "," io_option` iteration in
// `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
let CtxIIoOpt::V1 { io_option } = ctx;
if let Err(e) = acc.fold(io_option) {
self.log.add_error(format!("at {}, {e}", spans[2]));
}
}
}
Note
In exit_i_io_opt(…), spans[0] spans all the loop items that have already been parsed up to this point, spans[1] is the comma, so spans[2] is where the io_option input is located.
io_options: io_option (<L=i_io_opt> "," io_option)*;
            ^^^^^^^^^^^^            ^^^ ^^^^^^^^^
            [0]: accumulated
                       not in spans [1] [2]
In exit_io_options(…), spans[0], which corresponds to the first io_option nonterminal, contains the span of that first item plus all the items in the repetition.
In fact, we don’t really need to check the error in init_i_io_opt(…) because, at this point, only incorrect repetitions of some options are verified.
exit_io_options(…) has nothing else to do but to pass the value:
#![allow(unused)]
fn main() {
fn exit_io_options(&mut self, ctx: CtxIoOptions, spans: Vec<PosSpan>) -> SynIoOptions {
// io_options -> io_option (<L> "," io_option)*
let CtxIoOptions::V1 { star } = ctx;
star
}
}
lexer and parser
The exit_lexer(…) method takes the value with the folded options and sets the corresponding values in the final self.options field.
However, there is a little snag with the indent option. Normally,
- a default indentation for the lexer, parser, types, and listener can be specified in the command-line binary (and in the OptionsBuilder) by putting this option in front of everything else.
- if the indentation is specified for any of those parts of code, it takes priority over the default indentation level.
Thus, lexigram --indent 4 -x lexicon.l -l lexer.rs --indent 0 doesn’t indent the lexer code.
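That precedence amounts to “a specific value, when present, wins over the default”; as a standalone sketch (the helper name is hypothetical):

```rust
// Specific indentation (if present) takes priority over the global default.
fn effective_indent(global_default: usize, specific: Option<usize>) -> usize {
    specific.unwrap_or(global_default)
}
```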
But in our grammar, the global options are specified after lexer and parser, so we might overwrite their specific indentation with the global default value. To avoid such conflicts, we’ll store the specific values in separate fields of the listener, then resolve the conflicts once everything is parsed:
#![allow(unused)]
fn main() {
pub(super) struct Listener<'ls> {
// ...
lexer_indent: Option<usize>,
parser_indent: Option<usize>,
}
}
The only other particularity in exit_lexer(…) is copying the specifications (the location of the lexicon/grammar) into the self.options.parser_spec field:
#![allow(unused)]
fn main() {
fn exit_lexer(&mut self, ctx: CtxLexer, spans: Vec<PosSpan>) -> SynLexer {
// lexer -> "lexer" "{" io_options "}"
let CtxLexer::V1 { io_options } = ctx;
if io_options.is_combined {
self.options.parser_spec = io_options.spec.clone();
}
self.options.lexer_spec = io_options.spec;
self.options.lexer_code = io_options.code;
self.options.lexer_headers.extend(io_options.headers);
self.lexer_indent = io_options.indent_opt;
SynLexer()
}
}
For the parser, there are more verifications to perform:
- if parser options were found (V1 variant):
  - combined shouldn’t be used.
  - If combined was used in the lexer, there shouldn’t be a redundant input spec in the parser options.
  - Otherwise, input must be present.
- Finally, the parser options may be absent if one doesn’t want a parser. In this case, combined is illegal in the V2 variant.
#![allow(unused)]
fn main() {
fn exit_parser(&mut self, ctx: CtxParser, spans: Vec<PosSpan>) -> SynParser {
match ctx {
// parser -> "parser" "{" io_options "}"
CtxParser::V1 { io_options } => {
if io_options.is_combined {
self.log.add_error(
format!("at {}, 'combined' can only be used in 'lexer' options", spans[2]));
} else {
if io_options.spec.is_none() && self.options.parser_spec.is_none() {
self.log.add_error(format!("at {}, undefined grammar location", spans[2]));
} else if !io_options.spec.is_none() && !self.options.parser_spec.is_none() {
self.log.add_error(
format!("at {}, redefined grammar location ('combined' in lexer)", spans[2]));
}
}
if !io_options.spec.is_none() {
self.options.parser_spec = io_options.spec;
}
self.options.parser_code = io_options.code;
self.options.parser_headers.extend(io_options.headers);
self.parser_indent = io_options.indent_opt;
}
// parser -> ε
CtxParser::V2 => {
if !self.options.parser_spec.is_none() {
self.log.add_error(
"combined lexicon/grammar, but missing parser code location".to_string());
}
}
}
SynParser()
}
}
global_options
Finally, we have the group of the global options, which is a mix of default/global options and wrapper code options.
options:
(Options Lbracket global_options Rbracket)?
;
global_options:
global_option (<L=i_global_opt> Comma global_option)*
;
global_option:
Headers Colon Lbracket value (Comma value)* Rbracket
| Indent Colon value
| Libs Colon Lbracket value (Comma value)* Rbracket
| NTValue Colon nt_value
| Spans Colon value
;
value: /* ... */
nt_value:
Default
| None
| Parents
| Set Lbracket Id (Comma Id)* Rbracket
;
The handling of these options is quite similar to the io_options of the lexer and parser groups.
Let’s first get nt-value out of the way. We’re using the same type that Options requires, since it’s already available and fits our purpose:
#![allow(unused)]
fn main() {
/// User-defined type for `nt_value`
pub type SynNtValue = NTValue;
}
Parsing it is straightforward, but since there’s no NTValue::Error variant to use when something is wrong with the set variant, we simply return an empty list of names. In case of a parsing error, this lets the parser continue and possibly detect further errors; the errors would ultimately be reported instead of returning an Options, so it’s not important if some values aren’t accurate.
We could also use the listener abort feature if returning inaccurate data could make the code crash later.
#![allow(unused)]
fn main() {
fn exit_nt_value(&mut self, ctx: CtxNtValue, spans: Vec<PosSpan>) -> SynNtValue {
match ctx {
// nt_value -> "default"
CtxNtValue::V1 => NTValue::Default,
// nt_value -> "none"
CtxNtValue::V2 => NTValue::None,
// nt_value -> "parents"
CtxNtValue::V3 => NTValue::Parents,
// nt_value -> "set" "{" value ("," value)* "}"
CtxNtValue::V4 { star: SynNtValue1(values) } => {
match self.values_to_strings(values, &spans[2]) {
Ok(names) => NTValue::SetNames(names),
Err(()) => NTValue::SetNames(vec![]),
}
}
}
}
}
We then define the following type for global_option:
#![allow(unused)]
fn main() {
/// User-defined type for `global_option`
#[derive(Debug, PartialEq)]
pub enum SynGlobalOption {
Error,
Headers(Vec<String>),
Indent(usize),
Libs(Vec<String>),
NTValue(NTValue),
Spans(bool),
}
}
Lastly, global_options is the accumulator in which the global_option items are folded:
#![allow(unused)]
fn main() {
/// User-defined type for `global_options`
#[derive(Debug, PartialEq)]
pub struct SynGlobalOptions {
pub headers: Vec<String>,
pub indent_opt: Option<usize>,
pub libs: Vec<String>,
pub nt_value_opt: Option<NTValue>,
pub spans_opt: Option<bool>,
}
impl Default for SynGlobalOptions {
fn default() -> Self {
SynGlobalOptions {
headers: vec![],
indent_opt: None,
libs: vec![],
nt_value_opt: None,
spans_opt: None,
}
}
}
}
Its implementation is again very similar to that of SynIoOptions:
#![allow(unused)]
fn main() {
impl SynGlobalOptions {
pub fn new() -> Self {
Self::default()
}
pub fn fold(&mut self, global_option: SynGlobalOption) -> Result<(), &str> {
match global_option {
SynGlobalOption::Error => {}
SynGlobalOption::Headers(headers) => {
self.headers.extend(headers);
}
SynGlobalOption::Indent(indent) => {
if self.indent_opt.is_none() {
self.indent_opt = Some(indent);
} else {
return Err("more than one indentation");
}
}
SynGlobalOption::Libs(libs) => {
self.libs.extend(libs);
}
SynGlobalOption::NTValue(nt_value) => {
if self.nt_value_opt.is_none() {
self.nt_value_opt = Some(nt_value);
} else if let Some(current) = self.nt_value_opt.as_mut() {
match (current, nt_value) {
// we allow to grow names
(NTValue::SetNames(names), NTValue::SetNames(new)) => {
names.extend(new);
}
_ => return Err("conflicting nt-value options"),
}
}
}
SynGlobalOption::Spans(spans) => {
if self.spans_opt.is_none() {
self.spans_opt = Some(spans);
} else {
return Err("more than one `spans` option")
}
}
}
Ok(())
}
}
}
For nt-value, we allow several options in the same group if they’re all adding names with set:
options {
nt-value: set { "lexer", "parser" },
nt-value: set { "options" },
spans: true
}
The implementation of global_option is simpler than what we did for io_option, but still very similar (especially since headers and indent are there, too):
#![allow(unused)]
fn main() {
fn exit_global_option(&mut self, ctx: CtxGlobalOption, spans: Vec<PosSpan>) -> SynGlobalOption {
match ctx {
// global_option -> "headers" ":" "{" value ("," value)* "}"
CtxGlobalOption::V1 { star: SynGlobalOption1(values) } => {
match self.values_to_strings(values, &spans[3]) {
Ok(headers) => SynGlobalOption::Headers(headers),
Err(()) => SynGlobalOption::Error,
}
}
// global_option -> "indent" ":" value
CtxGlobalOption::V2 { value } => {
if let SynValue::Num(indent) = value {
SynGlobalOption::Indent(indent)
} else {
if value != SynValue::Error {
self.log.add_error(format!("at {}, expected number instead of {value:?}", spans[2]));
}
SynGlobalOption::Error
}
}
// global_option -> "libs" ":" "{" value ("," value)* "}"
CtxGlobalOption::V3 { star: SynGlobalOption2(values) } => {
match self.values_to_strings(values, &spans[3]) {
Ok(libs) => SynGlobalOption::Libs(libs),
Err(()) => SynGlobalOption::Error,
}
}
// global_option -> "nt-value" ":" nt_value
CtxGlobalOption::V4 { nt_value } => SynGlobalOption::NTValue(nt_value),
// global_option -> "spans" ":" value
CtxGlobalOption::V5 { value } => {
if let SynValue::Bool(flag) = value {
SynGlobalOption::Spans(flag)
} else {
self.log.add_error(format!("at {}, expected boolean instead of {value:?}", spans[2]));
SynGlobalOption::Error
}
}
}
}
}
The iteration in global_options is also an <L> repetition in a token-separated list. Lexigram provides the first item in init_i_global_opt(…) and the following ones in exit_i_global_opt. All we have to do is accumulate and report any error:
#![allow(unused)]
fn main() {
fn init_i_global_opt(&mut self, ctx: InitCtxIGlobalOpt, spans: Vec<PosSpan>) -> SynIGlobalOpt {
// value of `global_option` before `<L> "," global_option` iteration in `global_options -> global_option ( ►► <L> "," global_option ◄◄ )*`
let InitCtxIGlobalOpt::V1 { global_option } = ctx;
let mut acc = SynGlobalOptions::new();
let _ = acc.fold(global_option); // no repetition error is possible on the first item
acc
}
fn exit_i_global_opt(&mut self, acc: &mut SynIGlobalOpt, ctx: CtxIGlobalOpt, spans: Vec<PosSpan>) {
// `<L> "," global_option` iteration in `global_options -> global_option ( ►► <L> "," global_option ◄◄ )*`
let CtxIGlobalOpt::V1 { global_option } = ctx;
if let Err(e) = acc.fold(global_option) {
self.log.add_error(format!("at {}, {e}", spans[2]));
}
}
}
There’s nothing to do in global_options but to return the value in the context:
#![allow(unused)]
fn main() {
fn exit_global_options(&mut self, ctx: CtxGlobalOptions, spans: Vec<PosSpan>) -> SynGlobalOptions {
// global_options -> global_option (<L> "," global_option)*
let CtxGlobalOptions::V1 { star } = ctx;
star
}
}
options
To complete the listener implementation, options copies the values accumulated in global_options.
#![allow(unused)]
fn main() {
fn exit_options(&mut self, ctx: CtxOptions, spans: Vec<PosSpan>) -> SynOptions {
match ctx {
// options -> "options" "{" global_options "}"
CtxOptions::V1 { global_options } => {
self.options.lexer_headers.extend(global_options.headers.clone());
self.options.parser_headers.extend(global_options.headers);
if let Some(indent) = global_options.indent_opt {
self.options.lexer_indent = indent;
self.options.parser_indent = indent;
}
self.options.libs.extend(global_options.libs);
if let Some(nt_value) = global_options.nt_value_opt {
self.options.nt_value = nt_value;
}
if let Some(spans_opt) = global_options.spans_opt {
self.options.gen_span_params = spans_opt;
}
}
// options -> ε
CtxOptions::V2 => {}
}
SynOptions()
}
}
The global indent value is copied to all the indent options. We overwrite those values with self.lexer_indent and self.parser_indent, if they’re defined, in the exit(…) method:
#![allow(unused)]
fn main() {
fn exit(&mut self, config: SynConfig, span: PosSpan) {
if let Some(indent) = self.lexer_indent { self.options.lexer_indent = indent };
if let Some(indent) = self.parser_indent { self.options.parser_indent = indent };
}
}
Annotating the Input
The current error messages show the position of the error. It’s nice, but seeing the erroneous input is more helpful.
For example, try parsing the following input:
def SOURCE_FILENAME = "../watcher/src/lib.rs";
lexer {
combined: "src/watcher.lg",
output: SOURCE_FILENAMES ["watcher_lexer"],
indent: 4
}
You’ll get this error in the log:
- ERROR : at 5:13-28, SOURCE_FILENAMES is not defined
Let’s modify the listener to annotate the error in the input text. There are a few helpers for that in lexigram_core::text_span, but the listener must implement the GetLine trait:
#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core::text_span::GetLine;
// ...
impl<'ls> GetLine for Listener<'ls> {
fn get_line(&self, n: usize) -> &str {
self.lines.as_ref().unwrap()[n - 1]
}
}
}
Then, when errors are detected, self.extract_text(…) can be used to get the offending line(s) or self.annotate_text(…) to emphasize the erroneous parts in them.
#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core::text_span::GetTextSpan;
// ...
impl ConfigListener for Listener<'_> {
fn exit_value(&mut self, ctx: CtxValue, spans: Vec<PosSpan>) -> SynValue {
match ctx {
// ...
// value -> Id
CtxValue::V4 { id } => {
if let Some(v) = self.consts.get(&id) {
v.clone()
} else {
//self.log.add_error(format!("at {}, {id} is not defined", spans[0]));
let text = self.annotate_text(&spans[0]);
self.log.add_error(format!("{id} is not defined:\n\n{text}\n"));
SynValue::Error
}
}
// ...
}
Now, the error looks like this (in the current crate version, the actual output changes the colour of SOURCE_FILENAMES, so you need an ANSI-friendly terminal):
- ERROR : SOURCE_FILENAMES is not defined:
6: output: SOURCE_FILENAMES ["watcher_lexer"],
^^^^^^^^^^^^^^^^
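The annotation shown above boils down to printing the numbered line with a run of carets under the reported column span. Here is a self-contained sketch of the idea; it is not lexigram_core’s actual implementation:

```rust
// Toy version of text annotation: print the numbered line, then carets under
// the 1-based column span start..=end.
fn annotate(line_no: usize, line: &str, start: usize, end: usize) -> String {
    let header = format!("{line_no}: ");
    let padding = " ".repeat(header.len() + start - 1);
    let carets = "^".repeat(end - start + 1);
    format!("{header}{line}\n{padding}{carets}")
}

fn main() {
    // Columns 13..=28 cover SOURCE_FILENAMES in the input shown above.
    println!("{}", annotate(6, "    output: SOURCE_FILENAMES [\"watcher_lexer\"],", 13, 28));
}
```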
It’s not perfect: the parser and the lexer can only give the position when they detect an error, since they’re unaware of what the listener implements. In a future version, it will be possible to intercept their errors to customize the output, but it’s not implemented yet.
Wrapping It Up
The parser can now return the options instead of just the log. Maybe it’s better to put that in a single object:
#![allow(unused)]
fn main() {
pub struct ConfigResult {
pub options: Options,
pub log: BufLog,
}
impl<'l, 'ls: 'l> ConfigParser<'l, '_, 'ls> {
// ...
pub fn parse(&mut self, text: &'ls str) -> Result<ConfigResult, BufLog> {
self.wrapper = Some(Wrapper::new(Listener::new(), VERBOSE_WRAPPER));
let stream = CharReader::new(Cursor::new(text));
self.lexer.attach_stream(stream);
self.wrapper.as_mut().unwrap().get_listener_mut().attach_lines(text.lines().collect());
let tokens = self.lexer.tokens().keep_channel0();
let result = self.parser.parse_stream(self.wrapper.as_mut().unwrap(), tokens);
let Listener { mut log, options, .. } = self.wrapper.take().unwrap().give_listener();
if let Err(e) = result {
log.add_error(e.to_string());
}
if log.has_no_errors() {
Ok(ConfigResult { options, log })
} else {
Err(log)
}
}
}
}
and the tests, which should still be expanded to actually validate the result:
#![allow(unused)]
fn main() {
#[test]
fn test_run() {
let mut p = ConfigParser::new();
for (i, src) in [SRC1, SRC2].into_iter().enumerate() {
println!("source #{i}");
match p.parse(src) {
Ok(ConfigResult { options, log }) => println!("{options:#?}\n\nlog:{log}"),
Err(log) => panic!("error\n{log}"),
}
}
}
}
Possible Improvements
That’s a good first step. It’s not complete: there are other options that are not yet accessible from the configuration file parser, like the templates and some switches.
The templates of the user types and the listener behave like the lexer and parser options, so they could be additional groups with their own generated code location and indentation. Feel free to add those features to the existing grammar and implementation, using the templates.txt file to more easily spot the required changes.
Valueless Nonterminals
Some of the nonterminals don’t need a value: you can strip their values with the --nt-value command-line option:
lexigram --indent 4 \
-x src/config.l -l "src/parser.rs" tag config_lexer \
-g src/config.g -p "src/parser.rs" tag config_parser \
--types templates.txt tag template_user_types \
--listener templates.txt tag template_listener_impl \
--lib "super::listener_types::*" --spans \
--nt-value set "<default>,-config,-definitions,-i_def,-lexer,-parser,-options"
or with the .set_nt_value(…) method of the OptionsBuilder:
#![allow(unused)]
fn main() {
// ... (other constants)
static NT_NAMES: [&str; 7] = [
"<default>",
"-config",
"-definitions",
"-i_def",
"-lexer",
"-parser",
"-options",
];
fn gen_source_config_l_g(action: Action) {
let options = OptionsBuilder::new()
.headers(["use lexi_gram::lexigram_lib::lexigram_core;"])
// ...
.span_params(true)
.set_nt_value(NTValue::SetNames(NT_NAMES.into_iter().map(|s| s.to_string()).collect()))
.build()
.expect("should have no error");
// ...
}
With the help of the template, removing the Syn* types and updating the relevant exit_*(…) methods should be easy.
Trailing Comma in Lists
The options in the lexer, parser, and options group are separated by a comma, and so are some list elements like headers, libs, and nt-value set.
However, the grammar doesn’t allow for a trailing comma at the end of the list:
options {
nt-value: set {
"lexer",
"parser",
"options", // <-- trailing comma
},
spans: true
}
If you modify the grammar to allow it, Lexigram warns you that the grammar is ambiguous. For example, if you modify the following rule:
nt_value:
Default
| None
| Parents
| Set Lbracket value (Comma value)* comma_opt Rbracket
;
comma_opt:
Comma
|
;
you get this message, which we have split over several lines for clarity (it’s indeed a little obscure!):
Warning: - calc_table: ambiguity for NT 'nt_value_1', T ',':
<"," value nt_value_1> or <ε>
=> <"," value nt_value_1> has been chosen
To understand what it means, you must inspect the table with the rule alternatives, at the end of the log:
32: nt_value -> "default"
33: nt_value -> "none"
34: nt_value -> "parents"
35: nt_value -> "set" "{" value nt_value_1 comma_opt "}"
44: . nt_value_1 -> "," value nt_value_1 <- starts with ","
45: . nt_value_1 -> ε
36: comma_opt -> "," <- starts with ","
37: comma_opt -> ε
The repetition of comma-separated items is handled by nt_value_1. The new optional comma (comma_opt in production 35) is problematic: when the parser sees a ",", it cannot tell whether it’s the separating comma of production 44 or the trailing comma of production 36.
- In the first case, annotated <"," value nt_value_1>, the iteration continues with production 44.
- In the second case, annotated <ε>, the iteration ends with production 45, which is then followed by production 36.
If the parser could predict the next production by looking two tokens ahead instead of one, that is, if it were LL(2), it could check whether the 2nd token is "}" or one of the tokens that start value ("false", "true", Id, …).
Here, the parser is LL(1), so it removes the ambiguity by assuming it’s always production 44: it takes a new iteration. If your list has a trailing comma, the parser generates a syntax error when it gets a "}" instead of a token that can start value.
Is it possible to have trailing commas in an LL(1) grammar?
Yes, but you must reword the rules. Let’s consider a simpler, generic rule:
a -> "{" Id ( "," Id )* ","? "}";
which is transformed into:
a -> "{" Id a_1 a_2 "}";
a_1 -> "," Id a_1 | ε;
a_2 -> "," | ε;
It can be transformed into:
a -> "{" Id a_1 "}";
a_1 -> "," a_2 | ε;
a_2 -> Id a_1 | ε;
Lexigram doesn’t do that transformation automatically for you. You could write those rules, but you’d have to process the list on your own, which would be less convenient than the vector we receive for free in our current grammar.
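To see why the reworked rules need only one token of lookahead, here is a hand-written recursive-descent sketch of a/a_1/a_2. It is only an illustration over pre-lexed tokens, unrelated to the code Lexigram generates:

```rust
// LL(1) sketch of the reworked rules:
//   a   -> "{" Id a_1 "}";
//   a_1 -> "," a_2 | ε;
//   a_2 -> Id a_1 | ε;
#[derive(Debug, PartialEq)]
enum Tok { LBrace, RBrace, Comma, Id(String) }

fn parse_list(toks: &[Tok]) -> Result<Vec<String>, String> {
    let mut i = 0;
    let mut ids = Vec::new();
    if toks.get(i) != Some(&Tok::LBrace) { return Err("expected '{'".into()); }
    i += 1;
    match toks.get(i) {
        Some(Tok::Id(s)) => { ids.push(s.clone()); i += 1; }
        _ => return Err("expected Id".into()),
    }
    while toks.get(i) == Some(&Tok::Comma) {
        i += 1; // a_1 -> "," a_2
        match toks.get(i) {
            Some(Tok::Id(s)) => { ids.push(s.clone()); i += 1; } // a_2 -> Id a_1
            _ => break, // a_2 -> ε: that comma was a trailing one
        }
    }
    if toks.get(i) != Some(&Tok::RBrace) { return Err("expected '}'".into()); }
    Ok(ids)
}

fn main() {
    use Tok::*;
    // "{ a, b, }" parses, trailing comma included.
    let toks = [LBrace, Id("a".into()), Comma, Id("b".into()), Comma, RBrace];
    println!("{:?}", parse_list(&toks));
}
```

After a comma, a single token of lookahead decides between a_2 -> Id a_1 (another item) and a_2 -> ε (the comma was trailing), which is exactly what the grammar rewrite achieves.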
An alternative is to end all the options in a group with a semicolon, and not to allow a trailing comma in lists like libs.
That’s quite typical of the occasional compromise or transformation you must make when working with an LL(1) parser.
What Next?
Hopefully, this tutorial has provided a good overview of the Lexigram tool and how to use it to build a parser.
The reference documentation provides a complete overview of the features.
Don’t miss the crate documentation, which briefly presents the different crates.
To see practical examples, you can look at the GitHub repository. You’ll find an example directory with a few illustrations of common and more advanced features. Each example is divided into two parts: a gen_<name> crate with the grammar and the parser generator, and the <name> crate with the parser implementation. At the time of writing, the following examples are available:
- microcalc/ impl: a simple, bare-bones parser.
- config/ impl: a configuration parser on which this tutorial is based.
- terminate/ impl: a parser that uses Terminate::Conclude to discard the rest of the input.
- watcher/ impl: a parser that shows the latency between reading the input and calling the listener methods (check the Cargo.toml file to reduce the latency further).
- pandemonium/ impl: a test that includes all the grammar transformations. It’s probably less instructive.
Another example is a parser for the simplified lexicon/grammar (a -> "let" Id "=" expr ";";), which is used for a series of unit tests. Some parts may be harder to understand because it uses the full library to generate a parser.
There are of course the lexers and parsers of Lexi and Gram, but they’re even less straightforward to follow because they’re built in two stages and also use the full library to build the lexer and the parser.
Lexigram Crates
The Lexigram project is split into several crates. The following sections give a quick summary of what you can find in them.
The dependency tree is as follows:
- lexigram: the command-line lexer & parser generator
- lexi-gram: the high-level methods and objects to generate a lexer & parser
- lexigram-lib: the low-level code that generates the lexer & parser
- lexigram-core: the minimal library required by the generated parser
lexigram
This crate contains the command-line executable lexigram, a lexer & parser generator.
Link to the crate documentation.
You can install it with
cargo install lexigram
Lexigram is able to read the lexicon and the grammar either separately or from the same location. The location can be either a file or the part of a file found between a pair of custom tags.
The options for the location of the language specifications are as follows:
- -x|--lexicon: location of the lexicon
  - -x calc.l reads from the file ./calc.l
  - -x calc.lg tag lexicon reads from the file ./calc.lg the lexicon located between a pair of tags [lexicon]
- -g|--grammar: location of the grammar (same principle as above)
- -c|--combined: location of the combined lexicon/grammar (same principle as above)
The tags can be in a comment or in a string; Lexigram simply checks whether they appear on two lines of the file. If it can’t find both tags, Lexigram stops before going any further; e.g. it won’t start writing to the file, possibly erasing important code.
Warning
Always leave a blank line after the first tag and before the second when you write the lexicon and the grammar, because those lines are skipped.
You can find examples of lexicons and grammars, either in their own file or between tags, in the repository. Look for *.l, *.g, and *.lg files in the gen_* crates; the extension doesn’t really matter to Lexigram, but we stuck to this convention for the sake of consistency.
The same system as above applies to the location where the generated code should be written, although you can also output it to stdout. Using the tags allows you to have your grammar in the same file as the generated code, if you want something minimal, or in one of the crate’s Rust files. The generated lexer and parser can even share the same module, although it’s tidier to keep them apart.
- -l|--lexer: location of the generated lexer
  - -l src/calc_lex.rs overwrites the file ./src/calc_lex.rs
  - -l src/calc.rs tag lexer writes to the file ./src/calc.rs between the tags [lexer] (overwriting anything previously written between those tags)
  - -l -: writes the generated lexer to stdout
-p|--parser: location of the generated parser and the wrapper for the user listener (same principle as above)
Alongside the code of the lexer and parser, Lexigram can optionally generate a code template for the user types and the listener implementation. The location follows the same syntax as the options --lexer and --parser:
- --types: location of the user types template
- --listener: location of the listener implementation template
There are a few global options that give you control over the generated code. Mainly:
- --header: extra headers to be put in front of the generated code
- --indent: the default indentation
- --nt-value: list or category of nonterminals which hold a value
- --lib: extra library required by the listener; typically the location where the types of the nonterminals are defined, when they hold a value. Can be used multiple times to add several libraries
- --log: detailed log, which contains useful information like the list of terminals and nonterminals, or code that can be copied/pasted the first time, like the nonterminal types used in the generated code and a skeleton implementation of the listener
- -v|--verify: verification that the code previously generated is still the same, which can be used in validation tests
The complete list of options can be obtained with
lexigram -h
Warning
It’s worth noting that:
- The options --lexicon and --lexer must be given before --grammar and --parser.
- The template options --types and --listener are best given after --parser.
- If the lexicon and the grammar are combined, --combined must be given first, then --lexer, and finally --parser.
- The options --header and --indent apply to the previously mentioned part: they apply to the lexer after --lexer and to the parser after --parser. If you want the same headers or indentation for both, you can put those options in front.
In short, follow this order: --lexer, --parser, --types, --listener, then other global options; --indent in front as a default value; --header in front if it must apply to both lexer and parser.
Example:
lexigram -c calc.lg --indent 4 -l lexer.rs --header "#![cfg(test)]" \
    -p parser.rs --header "#![allow(unused)]"
This command:
- indents both the lexer and the parser by 4 spaces
- puts #![cfg(test)] in front of the lexer code
- puts #![allow(unused)] in front of the parser code
Examples
This command reads the combined lexicon/grammar from ./grammar/calc.lg and writes the generated code to ./src/lexer.rs and ./src/parser.rs, respectively. The log is output to stderr.
lexigram -c grammar/calc.lg -l src/lexer.rs -p src/parser.rs --log
Adding -v verifies that the generated code is still the same (once it’s been written, of course).
lexigram-core
This crate contains the minimal library required by the generated lexers and parsers.
Link to the crate documentation.
Some public items may be of interest; among others:
- log is the log object used by the parser, which you may also use in your listener so that all the notes, warnings, and errors are in the same place.
- text_span defines two traits which can be used in the listener implementation to extract the text corresponding to a terminal or a nonterminal and show it in the context of the parsed input. This could be used to indicate where an error has occurred, for instance.
- char_reader is the object that sends characters to the lexer; the module has a few other methods related to UTF-8.
- lexer includes the lexer and associated types, including PosSpan, the type of an optional parameter of the listener methods (token position in the text).
- parser includes the parser.
The lexigram-core crate has an optional feature, delay_stream_interception, which is disabled by default. That feature delays the capture of the next tokens coming from the lexer, in order to reduce the latency between the capture and the calls to the listener even further. It can be used in some situations:
- When the parser must ignore the remaining part of an input stream, if that remaining part contains text the lexer cannot scan. By delaying the capture, it prevents the lexer from issuing a lexical error before the listener can stop the parser. An example is given in watcher, where you can modify the Cargo.toml to change that optional feature and see the impact.
- When the listener intercepts the tokens with intercept_token(…) to change them, if that change depends on another exit_*(…) listener method that may not be called before the token is captured by the parser.
Note that, in both cases, it is usually possible to work around those latency problems by introducing an extra nonterminal whose exit_*(…) method will be called earlier. This is illustrated in the watcher (impl) example: the shutdown nonterminal was added to the grammar for that purpose.
lexi-gram
This crate contains Lexi and Gram, the parsers used to read the lexicon and the grammar. We do indeed require a parser to read those definitions, which led to some interesting recursion during the development of Lexigram, since those parsers generate themselves.
Link to the crate documentation
It also provides methods and objects to easily generate the parser source programmatically. This is a viable alternative to the command-line executable if you need to generate the code repeatedly or from a list of lexicons/grammars. The main elements are:
- gen_parser, a module with methods to generate and to get the source of a parser in a string, or to write it to a file.
- options, a module with an option builder used with the gen_parser.
- A couple of macros that can be used with the options.
You will find examples in the repository; look at the gen_* crates to see how gen_parser is used to generate the code in the other crates.
To illustrate one case, the parser used in the microcalc crate is generated with the following code when action = Action::Generate (Action::Verify is used to verify the generated code, similarly to the -v option of the executable).
#![allow(unused)]
fn main() {
static LEXICON_GRAMMAR_FILENAME: &str = "src/microcalc.lg";
static SOURCE_FILENAME: &str = "../microcalc/src/main.rs";
static LEXER_TAG: &str = "microcalc_lexer";
static PARSER_TAG: &str = "microcalc_parser";
const LEXER_INDENT: usize = 4;
const PARSER_INDENT: usize = 4;
fn gen_source_microcalc_lg(action: Action) {
let options = OptionsBuilder::new()
.combined_spec(genspec!(filename: LEXICON_GRAMMAR_FILENAME))
.lexer_code(gencode!(filename: SOURCE_FILENAME, tag: LEXER_TAG))
.indent(LEXER_INDENT)
.parser_code(gencode!(filename: SOURCE_FILENAME, tag: PARSER_TAG))
.indent(PARSER_INDENT)
.libs(["super::listener_types::*"])
.build()
.expect("should have no error");
match try_gen_parser(action, options) {
Ok(log) => {
if action == Action::Generate {
println!("Code generated in {SOURCE_FILENAME}\n{log}");
}
assert!(log.has_no_warnings(), "no warning expected");
}
Err(build_error) => panic!("{build_error}"),
}
}
}
This corresponds to the following command line:
lexigram -c src/microcalc.lg \
-l ../microcalc/src/main.rs tag microcalc_lexer --indent 4 \
-p ../microcalc/src/main.rs tag microcalc_parser --indent 4 \
--lib "super::listener_types::*"
lexigram-lib
This crate contains the code that generates a lexer and a parser, except for the high-level API provided in lexi-gram.
Link to the crate documentation
lexigram-lib is normally not meant to be used for any purpose other than as a dependency of lexigram and lexi-gram, unless you want to create the lexicon and the grammar yourself as objects; for example, if you want to generate them programmatically.
Lexer Generator Reference
Lexi is the part of Lexigram that generates a lexer.
The lexer, or scanner, is the very front-end of a parser. It reads the input text, recognizes symbols, and outputs a stream of tokens corresponding to those symbols. The lexer isn’t aware of the syntax of the input language, only of its lexical rules. The stream of tokens is generally read by a parser, but it can also be used directly in a user application.
The lexical rules are defined in a lexicon, in a human-readable language described in this reference. The code of the lexer proper already exists in the lexigram-core crate; its main object is Lexer. Lexi takes a lexicon and generates the data required by Lexer to scan the language.
The language used to describe a lexicon is very similar to that of the ANTLR tool, though there are a few differences.
Notation
We’re using the following conventions in the lexicon:
- Terminals (or tokens, which means the same in this context) are written in upper camel case: Assign, Id, StringLiteral.
- Fragments are written in upper snake case: DIGIT, HEX_DIGIT.
- Modes are written in upper snake case and end with _MODE: DEFAULT_MODE, TAG_MODE.
- Channels are written in upper snake case and end with _CHANNEL: COMMENTS_CHANNEL.
Those conventions are not enforced by Lexi.
Main Concepts
The lexicon defines the tokens emitted by the lexer, using simple regular expressions:
Assign : '=';
Add : '+';
If : 'if';
While : 'while';
Id : [_a-z][_a-z0-9]+;
StringLiteral : '"' (~('"'|'\n'|'\r') | '""')* '"';
From the lexicon, Lexi generates the source code of the lexer:
- The symbol and finite state machine tables, and other supporting data.
- A build_lexer() method that creates a Lexer from those data.
Lexicon Structure
A lexicon can be made up of several elements.
It begins with an optional header:
lexicon LexiLexer;
After the optional header, the lexicon is a list of items, each of which is an option, a rule, or a declaration.
The only option currently implemented is a list of additional channels:
channels { COMMENTS_CHANNEL, ATTRIBUTE_CHANNEL }
A rule is the definition of a token, which is what the lexer outputs, or a fragment, which is akin to an alias or a constant:
fragment DIGIT : [0-9];
Number : DIGIT+;
The only declaration currently implemented is the mode section, in which rules are active only when that mode is the current one. The starting mode, DEFAULT_MODE, is implicitly declared at the top and is thus the default mode of the lexer.
mode COMMENT_MODE;
Matching Rule
Among the rules of the current mode, the one that matches the longest input is selected to issue a token. If you need the keywords "down", "downto", and "to" in your language, you can safely define the three tokens in whichever order you prefer. The resulting lexer will be the same, and it won’t cut the input "downto" into two tokens "down" and "to".
Down: 'down';
DownTo: 'downto';
To: 'to';
When there is a conflict between two rules, the first token defined in the lexicon takes precedence. This typically occurs with variable tokens:
Down: 'down';
DownTo: 'downto';
To: 'to';
Id: [_a-z][_a-z0-9]*;
- "to" is scanned as To. There is a conflict between To and Id("to"), so To is selected because it’s defined first.
- "total" is scanned as Id("total"). There is no conflict after reading the first two characters "to" because * and + are greedy by default, so Id captures the whole word "total". The lexer chooses the token matching the longer input, which is Id("total").
Were Id defined first, none of the other tokens would ever be selected. However, Lexigram would see that some tokens are never output, so it would produce an error before generating the lexer, suggesting that you re-order the definitions.
With a non-greedy repetition, we would get a different result for "total": Once the first two characters "to" are read by the lexer, To and Id("to") are in conflict this time:
To: 'to';
Id: [_a-z][_a-z0-9]*?;
- "to" is scanned as To. Like above, there is a conflict between To and Id, which is resolved by taking the first definition.
- "total" is scanned as To, Id("tal"). Once the first two characters "to" are read, there is a conflict because of the non-greedy *?. To is selected because it’s defined first, which leaves the rest of the input "tal". That remaining input is scanned as Id("tal").
In conclusion, the usual way to design a lexicon is to put the fixed tokens first and the variable tokens last, unless there is no risk of conflict.
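The two resolution rules, longest match wins and the first-defined rule wins on ties, can be sketched in a few lines of plain Rust. This is a toy illustration, not the table-driven lexer Lexigram generates, and the Id rule is simplified to letters and underscores:

```rust
// Toy scanner for: Down 'down'; DownTo 'downto'; To 'to'; Id [_a-z]+ (simplified).
// Returns the selected rule name and the length of the match.
fn next_token(input: &str) -> Option<(&'static str, usize)> {
    // Fixed rules, in definition order.
    let fixed = [("Down", "down"), ("DownTo", "downto"), ("To", "to")];
    let mut best: Option<(&'static str, usize)> = None;
    for (name, lit) in fixed {
        // Strictly longer matches win; an equal-length later rule is ignored.
        if input.starts_with(lit) && best.map_or(true, |(_, len)| lit.len() > len) {
            best = Some((name, lit.len()));
        }
    }
    // Variable rule Id, defined last: the same tie-breaking applies.
    let id_len = input.chars().take_while(|c| c.is_ascii_lowercase() || *c == '_').count();
    if id_len > 0 && best.map_or(true, |(_, len)| id_len > len) {
        best = Some(("Id", id_len));
    }
    best
}

fn main() {
    println!("{:?}", next_token("downto")); // DownTo: longest match, not Down + To
    println!("{:?}", next_token("total")); // Id: the greedy repetition outgrows To
    println!("{:?}", next_token("to")); // To: on a tie, the first definition wins
}
```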
Fragments
The fragment keyword defines a named piece of a lexeme that may be used several times, like an alias or a constant. It can also be used to split a rule into more explicit components.
fragment DIGIT : [0-9];
fragment HEX_DIGIT : ( DIGIT | [A-Fa-f] );
fragment DEC_INTEGER : DIGIT ( '_' | DIGIT )* DIGIT | DIGIT;
fragment HEX_INTEGER : HEX_DIGIT ('_' | HEX_DIGIT)* HEX_DIGIT | HEX_DIGIT;
Integer : DEC_INTEGER | '0x' HEX_INTEGER;
Fragments are globally accessible, so they can be used in any mode section.
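To make the shape of DEC_INTEGER concrete, here is a small hand-written predicate that accepts the same strings. It is only an illustration, not generated code:

```rust
// Equivalent of: DEC_INTEGER : DIGIT ( '_' | DIGIT )* DIGIT | DIGIT;
// i.e. the string starts and ends with a digit, with any mix of digits and
// underscores in between (a lone digit also matches).
fn is_dec_integer(s: &str) -> bool {
    let b = s.as_bytes();
    !b.is_empty()
        && b[0].is_ascii_digit()
        && b[b.len() - 1].is_ascii_digit()
        && b.iter().all(|c| c.is_ascii_digit() || *c == b'_')
}

fn main() {
    for s in ["7", "1_000", "1__2", "_1", "1_"] {
        println!("{s}: {}", is_dec_integer(s));
    }
}
```

Note that the fragment allows consecutive underscores ("1__2") but rejects a leading or trailing underscore.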
Regular Expressions
The regular expressions used in each rule or fragment can include the following elements:
- 'a', 'script': string literals that must be matched entirely (single quotes are used for both characters and strings)
  - \n, \r, \t, \', and \\: common escape codes for characters inside string literals
  - \u{0}-\u{d7ff}, \u{e000}-\u{10ffff}: general Unicode escape codes
- character sets:
  - inside square brackets:
    - [abc], [a-z0-9_], [+\-*/]: explicit character set; ‘-’ is used for inclusive ranges inside square brackets
    - \n, \r, \t, \\, \], \-: escape codes for characters inside square brackets (both [ and \[ are accepted)
  - outside square brackets:
    - 'a'..'z': inclusive range of characters
    - .: represents any character
  - valid both inside and outside square brackets:
    - \w: represents a single word character: [_0-9A-Za-z]
    - \d: represents a decimal digit: [0-9]
- ( and ) around item(s): groups the content to avoid conflicts of precedence with other operators
- ~: negates the character set on the right:
  - ~[ \t\n\r]: any non-space character
  - ~( 'a' | 'b' | 'c' ): any character but ‘a’, ‘b’, or ‘c’, equivalent to ~'a'..'c'
- *: zero or more instances of the item on the left:
  - [a-z]*: zero or several letters
  - ab*: ‘a’ followed by zero or more ‘b’ characters
  - ('--'|'==')*: zero or more pairs of ‘-’ or ‘=’
- +: one or more instances of the item on the left (same principle as *)
- ?: zero or one instance of the item on the left (same principle as *)
- ? after * or +: lazy repetition; for example, '/*' .*? '*/' (prevents the ending "*/" characters from being captured by .*)
- |: alternative between several items; for example, 'if' | 'while' | [a-z]+
- fragment reference; for example, ( DIGIT | [a-f] )+ where DIGIT has been defined as a fragment
Channels
The lexer can output tokens on different channels, if parts of the input must be processed differently. For example, it can be used to isolate a sublanguage that is better parsed with another parser, or it can extract comments that must be put back in the output of a transpiler but would be too complex to handle in the grammar.
Tokens are normally output on the predefined DEFAULT_CHANNEL (channel #0). If other channels are required, they must be declared with the channels option, and the channel(<channel name>) action must then be used in rules to output the corresponding tokens to that channel.
channels { COMMENTS_CHANNEL };
Comma : ',';
Keyword : [a-zA-Z]+;
CommentLine : '//' ~[\r\n]* -> channel(COMMENTS_CHANNEL);
WhiteSpaces : [ \t\r\n]* -> skip;
- Comma and Keyword tokens are output on channel #0.
- CommentLine tokens are output on channel #1 (COMMENTS_CHANNEL).
The lexer provides two methods to output the tokens:
- get_token(&mut self) -> Result<Option<LexerToken>, LexerError>
- tokens(&mut self) -> LexInterpretIter, which returns an iterator of LexerToken items.
Both methods provide LexerToken items, defined as the following tuple, where the channel is the 2nd element:
#![allow(unused)]
fn main() {
pub type LexerToken = (TokenId, ChannelId, String, PosSpan);
}
The TokenSpliterator trait defines a few convenient methods which can be used with tokens() to filter the tokens. In particular, keep_channel0() discards all the channels but the default one, and its output can be directly connected to a parser generated by Lexigram. An example can be found in the tutorial.
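As an illustration of what the channel filtering boils down to, here is a simplified sketch. The real LexerToken also carries a PosSpan, elided here, and TokenId/ChannelId are types defined in lexigram-core; plain usize indices are an assumption for this example:

```rust
// Simplified model of a token: (token id, channel id, captured text).
type Token = (usize /* TokenId */, usize /* ChannelId */, String);

// What keep_channel0() amounts to: drop every token that is not on the
// default channel #0.
fn keep_channel0(tokens: Vec<Token>) -> Vec<Token> {
    tokens.into_iter().filter(|&(_, channel, _)| channel == 0).collect()
}

fn main() {
    let tokens = vec![
        (0, 0, "keyword".to_string()),
        (1, 1, "// a comment".to_string()), // e.g. COMMENTS_CHANNEL
        (2, 0, ",".to_string()),
    ];
    println!("{:?}", keep_channel0(tokens));
}
```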
Modes
The lexicon can be divided into mode sections where the rules are independent of the other modes. It is very useful to scan a sublanguage.
For example, Lexi uses square brackets to define a character set: [a-z]. The lexicon used to generate Lexi handles the content of square brackets in a separate mode to simplify the definition and to avoid any conflict with the other rules:
fragment ESC_SET_CHAR : '\\' [nrt\\[\]\-]; // simplified for this example
// ...
Minus : '-';
FixedSet : ('\\w' | '\\d');
LSbracket : '[' -> push(SET_CHAR_MODE);
mode SET_CHAR_MODE;
RSbracket : ']' -> pop;
Minus2 : '-' -> type(Minus);
SetChar : (ESC_SET_CHAR | ~[\n\r\t\\\]\-]);
FixedSet2 : ('\\w' | '\\d') -> type(FixedSet);
When the next input character is "[", the LSbracket token is emitted and the lexer switches to SET_CHAR_MODE. In that mode:
- \w, \d, and - emit the same tokens as they would in the default mode
- an unescaped "]" emits RSbracket and returns to the default mode
- other valid characters and escape sequences emit SetChar.
If this had to be written as a single rule, it would require a complex repetition of both escaped and unescaped codes of different lengths, while avoiding an accidental capture of the ending "]". It would also emit a single token for the whole expression, so the parser would have to interpret it on its side.
With the use of this separate mode, the definitions are simple, and the separate components are sent to the parser, simplifying its job considerably. Here is the relevant part of Lexi’s grammar:
char_set:
LSbracket (char_set_one)+ RSbracket
// ...
;
char_set_one:
SetChar Minus SetChar // adds the range to the current set
| SetChar // adds the character to the current set
| FixedSet // adds the set to the current set
;
Actions
When a rule is matched, the default action of the lexer is to emit the token. Other actions can be specified by placing them on the right side of a -> symbol. Several actions are separated by a comma:
TagOpen: '<' -> mode(TAG_MODE), more;
List of actions:
- channel(<channel>) emits the token on a specific channel (which must be defined in a channels rule). DEFAULT_CHANNEL is the name of channel 0, which is the default output channel for all tokens:
  channels { COMMENTS_CHANNEL, WHITESPACES_CHANNEL };
  CommentLine : '//' ~[\r\n]* -> channel(COMMENTS_CHANNEL);
- skip discards the match:
  WhiteSpaces : [ \t\r\n]* -> skip;
- mode(<mode>) changes the mode:
  TagOpen : '<' -> mode(TAG_MODE), more;
- push(<mode>) changes the mode after pushing the current mode on the stack:
  ScriptOpen : '<script' -> push(SCRIPT_MODE);
  Id : [A-Za-z][A-Za-z0-9]+;
  mode SCRIPT_MODE;
  ScriptClose : '>' -> pop;
  ScriptInclude : 'include';
  ScriptExecute : 'execute';
  ScriptId : [a-z][a-z0-9]+;
- pop pops a mode from the stack and changes to that mode.
- more continues evaluating the rules but keeps all the characters that have been matched so far for the next token. Below, the Tag token is associated with the < and > delimiters and all the characters in-between:
  TagOpen : '<' -> mode(TAG_MODE), more;
  mode TAG_MODE;
  TagMore : [a-zA-Z][a-zA-Z0-9]* -> more;
  Tag : '>' -> mode(DEFAULT_MODE);
- type(<token>) emits another defined token instead of one with the rule’s name. This action can be used if the same token must be emitted from different modes. Below, both Minus in DEFAULT_MODE and Minus2 in SET_CHAR_MODE emit a Minus token for the "-" input:
  fragment ESC_SET_CHAR : '\\' [nrt\\[\]\-];
  Minus : '-';
  LSbracket : '[' -> push(SET_CHAR_MODE);
  mode SET_CHAR_MODE;
  RSbracket : ']' -> pop;
  Minus2 : '-' -> type(Minus);
  SetChar : (ESC_SET_CHAR | ~[\n\r\t\\\]\-]);
  FixedSet2 : ('\\w' | '\\d') -> type(FixedSet);
- hook marks a token as interceptable. Whenever there’s a possibility that this token is parsed next, the parser calls a specific listener method hook(…), and the user implementation can change the token before the parsing continues. See also Token Interception.
  Id : [a-zA-Z][a-zA-Z_0-9]*; (Type) -> hook;
If skip or more is in the list of actions, no token is emitted. Otherwise, the rule emits a token corresponding to the rule name or, if specified, to the type(<token>) action. In the example below, the String and StringContent rules don’t emit any token; they accumulate the input characters. The StringClose rule emits the final String token with all the accumulated characters, including the surrounding double quotes.
String : '"' -> push(STRING_MODE), more;
mode STRING_MODE;
StringClose : '"' -> pop, type(String);
StringContent : [ -z]* -> more;
Fixed / Variable Tokens
Tokens can correspond to a fixed string literal, like "-", or to an expression that can match different strings, like [a-z]+ or "a" | "A". In the documentation, we call them fixed and variable tokens (or terminals), respectively. The distinction is important if you get the symbol table from Lexi, or when implementing the listener generated by Lexigram.
Fixed tokens have an associated (constant) string literal in the symbol table, and they’re not represented in the context argument of the exit_*(…) listener methods. Variable tokens don’t have an associated string in the symbol table, but they’re represented in the context arguments with the captured input, which is generally useful for the parser: an identifier’s name, an integer literal, and so on.
For example, with the following lexicon / grammar:
lexicon Example;
Equal : '=';
Id : [a-z]+;
Num : [0-9]+;
grammar Example;
assign:
Id Equal Num
;
The generated listener has the following method, where Equal is not present in the context because it’s fixed:
#![allow(unused)]
fn main() {
fn exit_assign(&mut self, ctx: CtxAssign) -> SynAssign {
// assign -> Id "=" Num
let CtxAssign::V1 { id, num } = ctx; // id, num: String
SynAssign()
}
}
Note
If the same token is emitted by several rules, the corresponding token is fixed only if all the literals are equal:
Interval : '..';
LSbracket : '[' -> push(SET_CHAR_MODE);
mode SET_CHAR_MODE;
RSbracket : ']' -> pop;
Interval2 : '-' -> type(Interval);
SetChar : [a-z];
Since Interval can be either ".." or "-", it is a variable token.
Only emitted tokens are considered. In this somewhat contrived example,
A0 : 'a' -> type(A);
A : 'A' -> skip;
the generated lexer is such that the input "a" emits the token 0 = (A, Some("a")), not (A, None), so it won't be in the context of the listener methods. "A" is valid but emits no token.
Advanced Concepts
Token Indices
Each token has an index, which is the index in the symbol table and the value used internally by the lexer and the parser. Normally, it’s transparent to the Lexigram user, who doesn’t need to deal with those values, except when the tokens must be intercepted. In that case, an option of Lexigram generates an optional type that represents the tokens in an enum: --token-enums (or the option builder method token_enums(true), if the lexer is generated from Rust code). See Token Interception for more details.
As a general rule, the index corresponds to the position of the rule in the lexicon. There are a few nuances:
- Rules that don’t produce a token don’t count in the indexing, in order to avoid gaps in the indices.
- If a rule has a type(Other) action, its definition is used for matching the input, but the output token is Other and its index is the index given to the Other rule. Note that even if the Other rule itself doesn't produce anything, it's considered "productive" because it is referenced by another productive rule. For example,
Zero: '0';
NotAToken: 'x' -> type(Other);
One: '1';
Other: 'y' -> skip;
produces a lexer such that the input
- "0" emits token 0 = (Zero, Some("0"))
- "1" emits token 1 = (One, Some("1"))
- "x" emits token 2 = (Other, Some("x"))
- "y" is valid but doesn't emit any token
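As an illustration, here is a hand-written sketch of the kind of enum that --token-enums could generate for the lexicon above; the derive attributes and helper function are assumptions for this example, not the exact generated code.

```rust
// Hypothetical sketch of the enum produced by `--token-enums` for the
// lexicon above; the actual generated names and attributes may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(u16)]
pub enum Term {
    Zero = 0,  // Zero: '0';
    One = 1,   // One: '1';  (NotAToken doesn't count: it emits Other)
    Other = 2, // Other, emitted by NotAToken via type(Other)
}

// The lexer and parser work with raw `u16` indices internally;
// the enum gives those indices readable names.
pub fn index_of(t: Term) -> u16 {
    t as u16
}
```

Note how NotAToken doesn't occupy an index of its own: its type(Other) action redirects its output to the Other token.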
Token Interception
If a rule has the name of the token between parentheses and no definition, it’s a reserved token. Lexi reserves the token index but doesn’t create any lexer rule that outputs the token, unless another rule specifies the token with the type action.
Reserved tokens are mainly used when the tokens are intercepted dynamically by the listener, where they can be changed to another token.
In this example lexicon, Id takes all the identifiers, even those that may be types. Type is reserved, and it's used in the hook(…) method when the identifier is indeed a language type or a user-defined type.
Id : [a-zA-Z][a-zA-Z_0-9]*;
(Type) -> hook;
Its accompanying grammar has, among other rules, decl:
program:
(<L=decl_i> decl)* (<L=inst_i> inst)+
;
decl:
Type Id (<L=id_i> Comma Id)* SemiColon
| Typedef Type Id SemiColon
;
inst:
Let Id Eq expr SemiColon
| Print expr SemiColon
;
// ...
The two alternatives include Type, though at a different position.
- When decl is the next nonterminal to derive, or when there's a possibility that decl is predicted, hook(…) is called to determine the actual token, in order to decide whether the first production should be chosen.
- Also, if the second alternative has been predicted, hook(…) is called after Typedef has been successfully matched. At this point, if the method returns anything other than the token Type, the parser issues a syntax error.
#![allow(unused)]
fn main() {
impl TypedefListener for TypeListener<'_> {
// ...
fn hook(&mut self, token: TokenId, text: &str, span: &PosSpan) -> TokenId {
match text {
"int" | "float" | "double" => Term::Type as u16,
t => {
if self.types.contains_key(t) {
Term::Type as u16
} else {
token
}
}
}
}
}
}
(Term is an enum generated with the --token-enums option)
Warning
Since the method is called whenever the hooked token may occur, at any point in the grammar, it will potentially receive all the other tokens that could occur at that point instead.
In the grammatical definition of decl above, the Type in the first alternative is "competing" with the Typedef of the second alternative. Therefore, the method could get Term::Type or Term::Typedef. That's not all: inst is competing with decl as the next predicted nonterminal in program, so Let and Print are two other arguments that hook(…) receives.
The listener offers another way to intercept all the tokens: the parser calls the listener's intercept_token(…) method each time it receives a token from the lexer. Its default implementation is empty, so it's optimized out by the compiler. Implementing it to intercept tokens may have a bigger performance impact than hook(…), and it requires taking all the incoming tokens into account.
See also the gen_typedef and typedef examples, which illustrate the two interception methods.
Formal Grammar Definition
Here are the lexicon and the grammar of Lexi, using the Lexigram syntax. Those files can be found in the repository, in the build-stage1 directory.
Lexi’s Lexicon
lexicon LexiLexer;
fragment BLOCK_COMMENT : '/*' .*? '*/';
fragment LINE_COMMENT : '//' ~[\r\n]*;
fragment HEX_DIGIT : [0-9a-fA-F];
fragment UNICODE_ESC : 'u{' HEX_DIGIT+ '}';
fragment ESC_CHAR : '\\' ([nrt'\\] | UNICODE_ESC);
fragment CHAR : ESC_CHAR | ~[\n\r\t'\\];
fragment CHAR_LITERAL : '\'' CHAR '\'';
fragment STR_LITERAL : '\'' CHAR CHAR+ '\'';
// Char inside a '[' ']' set
fragment ESC_SET_CHAR : '\\' ([nrt\\[\]\-] | UNICODE_ESC);
Arrow : '->'; /* // first token // */
Colon : ':';
Comma : ',';
Dot : '.';
Ellipsis : '..';
Lbracket : '{';
Lparen : '(';
Negate : '~';
Minus : '-';
Plus : '+';
Or : '|';
Question : '?';
Rbracket : '}';
Rparen : ')';
Semicolon : ';';
Star : '*';
Channels : 'channels';
Fragment : 'fragment';
Lexicon : 'lexicon';
Mode : 'mode';
Pop : 'pop';
Push : 'push';
More : 'more';
Skip : 'skip';
Type : 'type';
Channel : 'channel';
Hook : 'hook';
SComment : BLOCK_COMMENT -> skip;
SLineComment : LINE_COMMENT -> skip;
SWhiteSpace : [ \n\r\t]+ -> skip;
Id : [a-zA-Z][a-zA-Z_0-9]*;
CharLit : CHAR_LITERAL;
StrLit : STR_LITERAL;
FixedSet : ('\\w' | '\\d');
LSbracket : '[' -> push(SET_CHAR_MODE);
mode SET_CHAR_MODE;
RSbracket : ']' -> pop;
Minus2 : '-' -> type(Minus);
SetChar : (ESC_SET_CHAR | ~[\n\r\t\\\]\-]);
FixedSet2 : ('\\w' | '\\d') -> type(FixedSet);
Lexi’s Grammar
grammar LexiParser;
file: header? file_item*;
file_item:
option | declaration | rule
;
header:
Lexicon Id Semicolon
;
declaration:
Mode Id Semicolon
;
option:
Channels Lbracket Id (Comma Id)* Rbracket
;
rule:
rule_fragment_name Colon match Semicolon
| rule_terminal_name Colon match (Arrow actions)? Semicolon
// only reserves the token, but doesn't add a rule to scan it:
| Lparen rule_terminal_name Rparen opt_str_lit (Arrow Hook)? Semicolon
// "grammar" Id ";" => start of grammar section
| rule_terminal_name Id Semicolon
;
opt_str_lit:
(Colon StrLit)?
;
rule_fragment_name:
Fragment Id
;
rule_terminal_name:
Id
;
actions:
action (Comma action)*
;
action:
Mode Lparen Id Rparen
| Push Lparen Id Rparen
| Pop
| Skip
| More
| Type Lparen Id Rparen
| Channel Lparen Id Rparen
| Hook
;
match:
alt_items
;
alt_items:
alt_item (Or alt_item)*
;
alt_item:
repeat_item+
;
repeat_item:
item Star Question?
| item Plus Question?
| item Question?
;
item:
Id
| CharLit (Ellipsis CharLit)?
| StrLit
| char_set
| Lparen alt_items Rparen
| Negate item
;
char_set:
LSbracket (char_set_one)+ RSbracket
| Dot
| FixedSet
;
char_set_one:
SetChar Minus SetChar
| SetChar
| FixedSet
;
Parser Generator Reference
Gram is the part of Lexigram that generates a parser.
The parser consumes an input stream of tokens issued by the lexer and verifies that it can be generated by the grammar. During that process, the derivations let the parser call grammar-related trait methods, which allows the user to retrieve the correspondence between grammar elements (nonterminals and terminals) and the input text.
The grammar production rules are defined in a human-readable language described in this reference. Similarly to the lexer, the code of the “low-level” parser already exists in the lexigram-core crate; its main object is Parser. Gram takes a grammar and generates the data required by Parser and the source code that allows the user to interact with Parser while it parses the input.
The language used to describe the grammar is very similar to that of ANTLR. However, the generated code is very different, since Lexigram generates a predictive, non-recursive LL(1) parser, not a recursive-descent, adaptive LL(*) parser.
Notation
We’re using the following conventions in the grammar:
- Terminals are written in upper camel case (since they come from the lexicon): Colon, SymEof.
- Nonterminals are written in lower snake case: rule, rule_name.
- We usually write each production alternative on one line, indented, with the | in the first column:
prod:
  prod_alt
| prod Or prod_alt
;
Those conventions are not enforced by Gram.
We also use a “light” notation for the sake of clarity, when it would be cumbersome to show both the lexicon and the grammar. In this notation, we represent the fixed terminals by their string literal, and we don’t put each alternative on a separate line. To distinguish it from the canonical grammar, we use the symbol -> to separate the left nonterminal from its definition.
prod -> prod_alt | prod "|" prod_alt;
Main Concepts
The grammar defines the structure of the language to be parsed. From the grammar, Gram generates the source code of the parser, which comprises the following parts:
- The parsing table for a predictive top-down parser, the symbol table, and other data.
- The build_parser() method that creates a Parser instance from those data, what we call the "low-level parser".
- The listener, a trait that the user must implement to interact with the Parser instance.
- The wrapper, the connection between the Parser instance and the listener.
- Optionally, a template of the nonterminal types, and a template of a basic listener implementation.
Grammar Structure
The grammar begins with a header that defines the name of the parser:
grammar LexiParser;
The rest of the grammar is a set of rules, where the first defines the top level of the grammar—the parser’s starting rule.
Each rule defines a nonterminal, whose name is on the left-hand side of :. The productions on the right-hand side are strings of symbols (terminals defined in the lexicon and nonterminals defined in the grammar) and operators.
file:
header? file_item*
;
file_item:
option
| declaration
| rule
;
header:
Lexicon Id Semicolon
;
declaration:
Mode Id Semicolon
;
option:
Channels Lbracket Id (Comma Id)* Rbracket
;
// ...
A normalized grammar rule is made of one or several production alternatives. The only operator allowed in a normalized rule is the “or” symbol |, which separates the alternatives:
file_item:
option
| declaration
| rule
;
Gram accepts non-normalized grammars, where other operators can be used:
- * and + for repetitions of an item:
actions: action (Comma action)*;
- ? for an optional item:
file: header? file_item*;
- ( and ) for grouping items:
message: (Note | Warning | Error) Message;
- <...> for attributes that guide the parser generation:
expr:
  Sub expr
| expr <R> Exp expr // right-associative
| expr (Mul | <P> Div) expr // same precedence
| expr (Add | <P> Sub) expr
| Id
| Num
;
The original grammar undergoes a series of transformations to obtain a normalized LL(1) grammar, before Gram generates the parsing table and the source code of the listener and the wrapper.
Listener
The listener is mainly a trait that declares methods for each rule. The user implements those trait methods with the actions that must be performed when the rules are processed during the parsing. To some extent, it corresponds to the source code written on the right-hand side of grammar rules in some parser generators like Yacc and Bison.
Each nonterminal (rule name) can be associated to a value of a user-defined type. The parser calls a trait method bound to each nonterminal once its rule has been fully parsed, and the user implementation code of the method returns the nonterminal value based on the given context, which includes the values of the rule items: other nonterminals and variable tokens like string and numerical literals.
As an example, let’s consider a simple lexicon and grammar. Id and Num are variable tokens representing an identifier name and a numerical value, respectively, and the grammar defines two rules for the nonterminals s and val:
// lexicon file
Equal : "="; // fixed token
Exit : "exit"; // fixed token
Return : "return"; // fixed token
Id : [a-z][a-z0-9]*; // variable token
Num : [1-9][0-9]*; // variable token
// grammar file
s:
Id Equal val
| Exit
| Return val
;
val:
Id
| Num
;
Init and Exit Methods
Gram generates the following listener trait, that the user must implement on a listener object:
#![allow(unused)]
fn main() {
pub trait TestListener {
// ...
fn init_s(&mut self) {}
fn exit_s(&mut self, ctx: CtxS) -> SynS;
fn init_val(&mut self) {}
fn exit_val(&mut self, ctx: CtxVal) -> SynVal;
}
}
It also generates the following types, used in the callback methods:
#![allow(unused)]
fn main() {
pub enum CtxS {
/// `s -> Id "=" val`
V1 { id: String, val: SynVal },
/// `s -> "exit"`
V2,
/// `s -> "return" val`
V3 { val: SynVal },
}
pub enum CtxVal {
/// `val -> Id`
V1 { id: String },
/// `val -> Num`
V2 { num: String },
}
}
To make the examples easier to read, we omit the Rust attributes generated for each type, typically #[derive(Debug)], which the user might need.
For simple rules like the ones above, a first call is made when the next rule is determined by the parser; for example, if the "return" token is received when s is one of the potential next nonterminals, the rule s -> Return val is selected and the init_s(&mut self) method is called. It can be used to initialize any data before the rule is actually parsed, so before any subrule (like val) is triggered. The initialization methods have default empty implementations, so they're optional.
Another call is made once the rule has been entirely parsed, including all the nonterminals it contains (here, val). This second call gives all the necessary information about what has been parsed in a context object, which specifies which rule alternative was parsed and the values of the relevant nonterminals and tokens. In the example above, the exit_s(&mut self, ctx: CtxS) method is called with ctx equal to CtxS::V3 { val }, val being the value of the nonterminal.
If the parsed text is “return 5”, the following calls are made:
#![allow(unused)]
fn main() {
listener.init_s(); // "return" selected the rule `s -> Return val`
listener.init_val(); // "5" selected the rule `val -> Num`
let val = listener.exit_val(CtxVal::V2 { num: "5".to_string() });
let s = listener.exit_s(CtxS::V3 { val });
}
Each nonterminal can have a value, but it's optional. This choice is specified in the options given to Lexigram when it generates the parser source code. If the nonterminal s has a value, exit_s is generated with a return value of type SynS, this type being defined by the user. If s has no value, exit_s has no return value and the trait definition includes a default empty implementation.
Other General Methods
There are a few other methods that the parser calls in some circumstances:
#![allow(unused)]
fn main() {
pub trait TestListener {
fn check_abort_request(&self) -> bool { false }
fn get_mut_log(&mut self) -> &mut impl Logger;
fn exit(&mut self, _s: SynS) {}
// ...
}
}
The check_abort_request(&self) method is called after each init and exit call. It can be used to return true if an error is detected; for example, if Num cannot be parsed as a number or Id hasn't been declared. This requests the parser to abort its parsing procedure immediately. It's the user's choice to use this request or to manage the errors on their own, but it's often easier to abort if there's a risk that an undefined symbol causes too many problems down the line.
A simple way to use the abort request is to set an abort flag in the listener object and to return its value in that method:
impl TestListener for MyTestListener {
fn check_abort_request(&self) -> bool {
self.abort
}
// ...
}
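For instance, a listener could set the flag when a Num token fails a semantic check. The sketch below assumes SynVal is an alias for i64 and reduces the context enum to the val rule of the earlier example; the abort field and the overflow check are assumptions for this illustration.

```rust
// Sketch: a listener sets its `abort` flag when a semantic check fails.
// `CtxVal` mirrors the generated context; the `abort` field and the
// use of `i64` as the `SynVal` type are assumptions for this example.
pub enum CtxVal {
    V1 { id: String },  // val -> Id
    V2 { num: String }, // val -> Num
}

pub struct MyTestListener {
    pub abort: bool,
}

impl MyTestListener {
    pub fn check_abort_request(&self) -> bool {
        self.abort
    }

    // Corresponds to the generated `exit_val` listener method.
    pub fn exit_val(&mut self, ctx: CtxVal) -> i64 {
        match ctx {
            CtxVal::V2 { num } => num.parse::<i64>().unwrap_or_else(|_| {
                // The `Num` token overflows an i64: request an abort.
                self.abort = true;
                0
            }),
            CtxVal::V1 { id: _ } => 0, // look up the identifier's value here
        }
    }
}
```

After the next check_abort_request() call returns true, the parser stops and parse_stream returns an error.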
For practical reasons, the listener must own a log object, in which all the warnings and errors will be recorded. The get_mut_log(&mut self) method returns a reference to that log object whenever the parser needs to record something.
pub struct MyTestListener {
log: BufLog,
// ...
}
impl TestListener for MyTestListener {
fn get_mut_log(&mut self) -> &mut impl Logger {
&mut self.log
}
// ...
}
Finally, the exit(&mut self, s: SynS) method is called once the top rule exits, and gives the value of its nonterminal, if it exists, as a SynS type. There is a default empty implementation, which makes it optional. It can be used to store the final result of the parsing in the listener object or to call any post-processing required at that level, like finalizing the AST.
So the entire series of calls, if the parsed text is “return 5”, is actually closer to the following sequence:
#![allow(unused)]
fn main() {
listener.init_s(); // "return" selected the rule `s -> Return val`
listener.init_val(); // "5" selected the rule `val -> Num`
let val = listener.exit_val(CtxVal::V2 { num: "5".to_string() });
if listener.check_abort_request() { return Err(ParserError::AbortRequest) }
let s = listener.exit_s(CtxS::V3 { val });
if listener.check_abort_request() { return Err(ParserError::AbortRequest) };
listener.exit(s);
return Ok(());
}
Wrapper
The wrapper, which is entirely implemented in generated source code, calls the listener methods and keeps track of the data: nonterminal and terminal values, and position of the symbols in the original input text.
#![allow(unused)]
fn main() {
pub struct Wrapper<T> {
listener: T, // user's listener instance
stack: Vec<EnumSynValue>, // nonterminal values
stack_t: Vec<String>, // terminal values
stack_span: Vec<PosSpan>, // positions
// ...
}
impl<T: TestListener> Wrapper<T> {
pub fn new(listener: T, verbose: bool) -> Self {
Wrapper {
listener,
// ...
}
}
pub fn get_listener(&self) -> &T {
&self.listener
}
pub fn get_listener_mut(&mut self) -> &mut T {
&mut self.listener
}
pub fn give_listener(self) -> T {
self.listener
}
// ...
}
impl<T: TestListener> ListenerWrapper for Wrapper<T> {
fn switch(&mut self, call: Call, nt: VarId, alt_id: AltId, t_data: Option<Vec<String>>) {
// ...
match call {
Call::Enter => {
match nt {
0 => self.listener.init_s(), // s
1 => self.listener.init_val(), // val
_ => panic!("unexpected enter nonterminal id: {nt}")
}
}
Call::Exit => {
match alt_id {
0 | // s -> Id "=" val
1 | // s -> "exit"
2 => self.exit_s(alt_id), // s -> "return" val
3 | // val -> Id
4 => self.exit_val(alt_id), // val -> Num
_ => panic!("unexpected exit alternative id: {alt_id}")
}
}
// ...
}
}
}
}
The listener is given to the wrapper before parsing the input; it can be recovered later if necessary with wrapper.give_listener(). A mutable reference of that wrapper is then given to the parse(…) method of the Parser instance.
The top-level user parser can be summarized by the following steps:
#![allow(unused)]
fn main() {
pub fn parse(text: &str) -> Result<BufLog, BufLog> {
// instantiates the lexer and low-level parser:
let mut lexer = build_lexer();
let mut parser = build_parser();
// instantiates the user listener and the wrapper:
let listener = MyTestListener::new();
let mut wrapper = Wrapper::new(listener, VERBOSE_WRAPPER);
// creates a stream from the input text and attaches it to the lexer:
let stream = CharReader::new(Cursor::new(text));
lexer.attach_stream(stream);
// passes the tokens from the lexer to the parser:
    let tokens = lexer.tokens().keep_channel0();
    // parses the tokens using the wrapper and the listener:
    let result = parser.parse_stream(&mut wrapper, tokens);
// handles the results:
let MyTestListener { mut log, .. } = wrapper.give_listener();
if let Err(e) = result {
log.add_error(e.to_string());
}
if log.has_no_errors() {
Ok(log)
} else {
Err(log)
}
}
}
Nonterminal Values
Nonterminals can have a value, though this isn’t mandatory. When a nonterminal has no value, the generated code still includes init and exit methods, but the nonterminals are not present in the contexts of other nonterminals that include them.
For example, consider the following rules:
a -> b c | c
b -> Op c
c -> Id
where Op and Id are variable tokens that contain string data.
If a, b, and c have a value, the following code is generated:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> b c`
V1 { b: SynB, c: SynC },
/// `a -> c`
V2 { c: SynC },
}
pub enum CtxB {
/// `b -> Op c`
V1 { op: String, c: SynC },
}
pub enum CtxC {
/// `c -> Id`
V1 { id: String },
}
pub trait TestListener {
// ... (other general methods)
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
fn init_b(&mut self) {}
fn exit_b(&mut self, ctx: CtxB) -> SynB;
fn init_c(&mut self) {}
fn exit_c(&mut self, ctx: CtxC) -> SynC;
}
}
As expected, the context given for the first alternative of a includes the data for the nonterminals b and c, with user-defined types SynB and SynC, while the second alternative only includes c. Similarly, CtxB includes a field for the c nonterminal.
If c has no value, the code becomes:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> b c`
V1 { b: SynB },
/// `a -> c`
V2,
}
pub enum CtxB {
/// `b -> Op c`
V1 { op: String },
}
pub enum CtxC {
/// `c -> Id`
V1 { id: String },
}
pub trait TestListener {
// ... (other general methods)
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
fn init_b(&mut self) {}
fn exit_b(&mut self, ctx: CtxB) -> SynB;
fn init_c(&mut self) {}
fn exit_c(&mut self, _ctx: CtxC) {}
}
}
The variant CtxA::V2 is empty since there is no data, but it still exists so that the user knows which alternative of the a rule was parsed.
The trait still allows for callbacks before and after c is parsed, but there is no return value. This also makes the implementation of exit_c optional, so a default empty implementation is provided.
If both b and c have no value, the code becomes:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> b c`
V1,
/// `a -> c`
V2,
}
pub enum CtxB {
/// `b -> Op c`
V1 { op: String },
}
pub enum CtxC {
/// `c -> Id`
V1 { id: String },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
fn init_b(&mut self) {}
fn exit_b(&mut self, _ctx: CtxB) {}
fn init_c(&mut self) {}
fn exit_c(&mut self, _ctx: CtxC) {}
}
}
Here, all the variants of the CtxA context are empty. If a only had one alternative, the context would still have one empty variant, even though it carries no information at all. It would be generated to keep the API consistent, but it could be ignored entirely.
When a nonterminal has a value, the user must define its type. By default, the type name is Syn<nonterminal>, where <nonterminal> is the upper camel case of the nonterminal’s name. For example, a is associated to the SynA type, which can be a genuine type or an alias.
Lexigram generates a template on demand, which can be copied the first time or when the grammar is changed:
/// User-defined type for `a`
#[derive(Debug, PartialEq)]
pub struct SynA();
Finally, Lexigram takes care of generating the types of the nonterminals it adds when transforming the grammar. We indicate that in the generated code, in the following sections.
Grammar Transformations
Lexigram accepts non-LL(1) grammars if it’s able to transform them into LL(1). It tries to provide some degree of transparency by not showing the transformation in the listener trait and types, but some patterns require specific listener calls.
The transformations done by Lexigram are:
- ? operator: a -> X? becomes a -> X | ε
- * operator: a -> X* becomes a -> a_1, a_1 -> X a_1 | ε
- + operator: a -> X+ becomes a -> a_1, a_1 -> X a_1 | X
- left recursion: a -> a X | Y becomes a -> Y a_1, a_1 -> X a_1 | ε; a -> a X a | Y and more complex ambiguities, typically used in expression parsing, are transformed with the Clarke method, which is detailed later
- left factorization: a -> A B | A C becomes a -> A a_1, a_1 -> B | C
A few rules have two implementations: normal or low-latency (<L> attribute). Firstly, in the + and * repetitions:
- a -> X* becomes a -> a_1, a_1 -> X a_1 | ε, but the a_1 iterator nonterminal is not visible to the user. The generated code gathers all the values and puts them in the context of the exit_a method.
- a -> (<L=i> X)* becomes a -> i, i -> X i | ε, as above, but a callback method is generated for i so that the user can manage the iterations manually, almost as if the user had written the transformed rules in the original grammar. Note that the iterator nonterminal is explicitly named by the user.
- Similar transformations apply to a -> X+ and a -> (<L=i> X)+.
Secondly, in right-recursive rules:
- a -> X a | Y behaves as expected: since it's right recursive, a text "XXY" is decomposed as a -> X (a -> X (a -> Y)). The first exit_a call occurs at the end of the iterations, with a -> Y, and the next two calls backtrack on the second "X" then on the first "X". This means that the first call occurs after the parser has already parsed the entire sequence and put all the intermediate values on the stack, which it empties one element at a time when calling exit_a.
- a -> X <L> a | Y offers a shortcut to avoid the latency of the normal behaviour, by calling exit_a on each newly found value, a little as if it were a left-recursive rule. The first exit_a call occurs on the first "X", the second on the next "X", and the last on the "Y". It is then up to the user to rebuild the AST in the desired order.
The purpose of the <L> attribute is to offer the ability of parsing a potentially infinite stream without storing all the intermediate items, or to produce an immediate output before reaching the end of the sequence in the stream.
Repetitions With *
a -> A B* C
is transformed into
a -> A a_1 C
a_1 -> B a_1 | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A B* C`
V1 { a: String, star: SynA1, c: String },
}
pub struct SynA1(pub Vec<String>);
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}
Sequence of the operations:
- init_a(…) is called before the rule is parsed.
- exit_a(…) is called after a -> A B* C is parsed. The repetition is available in the star field of the ctx: CtxA argument. It wraps a vector containing all the parsed values.
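As a sketch of the user side, an exit_a implementation could consume the collected vector like this; treating the token values as plain strings and returning a String instead of a user-defined SynA type are assumptions for the example.

```rust
// Sketch: consuming the `B*` vector in `exit_a`.
// `SynA1` mirrors the generated wrapper type; using `String` in place
// of a user-defined `SynA` type is an assumption for this example.
pub struct SynA1(pub Vec<String>);

pub enum CtxA {
    /// `a -> A B* C`
    V1 { a: String, star: SynA1, c: String },
}

// Joins the repeated `B` values between the `A` and `C` delimiters.
pub fn exit_a(ctx: CtxA) -> String {
    match ctx {
        CtxA::V1 { a, star, c } => format!("{a}[{}]{c}", star.0.join(",")),
    }
}
```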
Repetitions With +
a -> A B+ C
is transformed into
a -> A a_1 C
a_1 -> B a_2
a_2 -> a_1 | ε
The generated code is very similar to the * repetition:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A B+ C`
V1 { a: String, plus: SynA1, c: String },
}
/// Computed `B+` array in `a -> A ►► B+ ◄◄ C`
pub struct SynA1(pub Vec<String>);
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}
Token-Separated Items
A special case of the * repetition recognized by Lexigram is a list of identical items separated by one or several fixed tokens:
a -> "var" Id ("," Id)* ";";
is transformed into
a -> "var" Id a_1 ";"
a_1 -> "," Id a_1
a_1 -> ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> "var" Id ("," Id)* ";"`
V1 { star: SynA1 },
}
pub struct SynA1(pub Vec<String>);
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
}
}
The differences from a normal * repetition are:
- The first Id item (before ("," Id)*) is moved into the vector, as star.0[0].
- That first item isn't present in the CtxA::V1 variant.
Repetitions With <L>*
a -> A (<L=i> B)* C
is transformed into
a -> A i C
i -> B i | ε
Note that the nonterminal used to represent the iterations must be defined by the user, since its type is user-defined (here, i of type SynI).
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A (<L> B)* C`
V1 { a: String, star: SynI, c: String },
}
pub enum CtxI {
/// `<L> B` iteration in `a -> A ( ►► <L> B ◄◄ )* C`
V1 { b: String },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
fn init_i(&mut self) -> SynI;
fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
fn exitloop_i(&mut self, acc: &mut SynI) {}
}
}
- init_a(…) is called before the rule a is parsed.
- init_i(…) is called before the rule i is parsed for the first time. This is a good place to initialize the whole series; the method must return the accumulator variable that will be updated at each iteration.
- exit_i(…) is called after parsing each item of the repetition, if there are any items. It receives the current value of the accumulator acc and the values of the parsed items in the context variant, here b. The accumulator can be updated thanks to the mutable reference.
- exitloop_i(…) is called after the last iteration. It can be used if the accumulator requires post-processing once the whole series has been parsed.
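The following sketch mimics that sequence of calls for numeric B items. SynI as a small sum/count struct and the run(…) driver, which stands in for the parser, are assumptions for the example.

```rust
// Sketch: low-latency iteration over `a -> A (<L=i> B)* C`,
// accumulating a running sum. `SynI` is an assumption for this example.
pub struct SynI {
    pub sum: i64,
    pub count: usize,
}

pub enum CtxI {
    V1 { b: String },
}

pub fn init_i() -> SynI {
    // Initialize the whole series before the first iteration.
    SynI { sum: 0, count: 0 }
}

pub fn exit_i(acc: &mut SynI, ctx: CtxI) {
    match ctx {
        CtxI::V1 { b } => {
            // Update the accumulator on each parsed item.
            acc.sum += b.parse::<i64>().unwrap_or(0);
            acc.count += 1;
        }
    }
}

pub fn exitloop_i(acc: &mut SynI) {
    // Post-processing after the last iteration (nothing to do here).
    let _ = acc;
}

// Simulates the calls the parser would make for the given items.
pub fn run(items: &[&str]) -> SynI {
    let mut acc = init_i();
    for b in items {
        exit_i(&mut acc, CtxI::V1 { b: b.to_string() });
    }
    exitloop_i(&mut acc);
    acc
}
```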
Repetitions With <L>+
a -> A (<L=i> B)+ C
is transformed into
a -> A i C
i -> B a_1
a_1 -> i | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A (<L> B)+ C`
V1 { a: String, plus: SynI, c: String },
}
pub enum CtxI {
/// `<L> B` iteration in `a -> A ( ►► <L> B ◄◄ )+ C`
V1 { b: String, last_iteration: bool },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
fn init_i(&mut self) -> SynI;
fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
}
}
The differences from a repetition with <L>* are:
- A last_iteration field that flags the last iteration, when the a_1 -> ε alternative is reached.
- No exitloop_i(…) method; it has been replaced by the flag above because the last iteration contains the same data as the previous iterations.
- exit_i(…) is guaranteed to be called at least once.
- The resulting field in the CtxA::V1 variant is named plus instead of star.
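A sketch of the <L>+ flow, where the last_iteration flag takes the place of the exitloop_i(…) call. SynI as a String wrapper and the run(…) driver, which stands in for the parser, are assumptions for the example.

```rust
// Sketch: `<L>+` iteration using the `last_iteration` flag.
// `SynI` as a `String` wrapper is an assumption for this example.
pub struct SynI(pub String);

pub enum CtxI {
    V1 { b: String, last_iteration: bool },
}

pub fn init_i() -> SynI {
    SynI(String::new())
}

pub fn exit_i(acc: &mut SynI, ctx: CtxI) {
    match ctx {
        CtxI::V1 { b, last_iteration } => {
            acc.0.push_str(&b);
            if !last_iteration {
                acc.0.push('+'); // separator between items
            }
        }
    }
}

// Simulates the parser's calls for the given items (at least one).
pub fn run(items: &[&str]) -> SynI {
    let mut acc = init_i();
    let n = items.len();
    for (k, b) in items.iter().enumerate() {
        exit_i(
            &mut acc,
            CtxI::V1 { b: b.to_string(), last_iteration: k + 1 == n },
        );
    }
    acc
}
```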
Token-Separated Items With <L>
A list of identical items separated by one or several fixed tokens is also recognized in a <L>* repetition:
a -> "var" Id (<L=i> "," Id)* ";";
is transformed into
a -> "var" Id i ";"
i -> "," Id i
i -> ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> "var" Id (<L> "," Id)* ";"`
V1 { star: SynI },
}
pub enum InitCtxI {
/// value of `Id` before `<L> "," Id` iteration in
/// `a -> "var" Id ( ►► <L> "," Id ◄◄ )* ";"`
V1 { id: String },
}
pub enum CtxI {
/// `<L> "," Id` iteration in
/// `a -> "var" Id ( ►► <L> "," Id ◄◄ )* ";"`
V1 { id: String },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
fn init_i(&mut self, ctx: InitCtxI, spans: Vec<PosSpan>) -> SynI;
fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
fn exitloop_i(&mut self, acc: &mut SynI) {}
}
}
The differences from a normal repetition with `<L>*` are:
- The first `Id` item (before `(<L> "," Id)*`) is given in the `ctx: InitCtxI` argument of the `init_i(…)` method.
- That first item isn't present in the exit method context `CtxI`.

The subsequent `Id` items (in `(<L> "," Id)*`) are given in the `ctx: CtxI` argument of the `exit_i(…)` method, as usual.
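A listener collecting the identifiers of such a `var` list could look as follows. The context enums and the trait are copied from the listing above; the payload types, the unit `PosSpan`, and the `demo` driver replaying the parser's calls for `var x, y, z;` are hypothetical stand-ins:

```rust
// Hypothetical stand-ins for the generated types.
#[derive(Clone, Copy)]
pub struct PosSpan;
pub type SynI = Vec<String>;
pub type SynA = Vec<String>;

pub enum CtxA { V1 { star: SynI } }
pub enum InitCtxI { V1 { id: String } }
pub enum CtxI { V1 { id: String } }

pub trait TestListener {
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
    fn init_i(&mut self, ctx: InitCtxI, spans: Vec<PosSpan>) -> SynI;
    fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
    fn exitloop_i(&mut self, _acc: &mut SynI) {}
}

struct VarList;

impl TestListener for VarList {
    fn exit_a(&mut self, ctx: CtxA, _spans: Vec<PosSpan>) -> SynA {
        let CtxA::V1 { star } = ctx;
        star
    }
    // The first identifier arrives here, before any `"," Id` iteration.
    fn init_i(&mut self, ctx: InitCtxI, _spans: Vec<PosSpan>) -> SynI {
        let InitCtxI::V1 { id } = ctx;
        vec![id]
    }
    // The comma-separated identifiers arrive one by one.
    fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, _spans: Vec<PosSpan>) {
        let CtxI::V1 { id } = ctx;
        acc.push(id);
    }
}

// Hypothetical driver simulating the parser's calls for `var x, y, z;`.
fn demo() -> Vec<String> {
    let mut l = VarList;
    l.init_a();
    let mut acc = l.init_i(InitCtxI::V1 { id: "x".into() }, vec![]);
    l.exit_i(&mut acc, CtxI::V1 { id: "y".into() }, vec![]);
    l.exit_i(&mut acc, CtxI::V1 { id: "z".into() }, vec![]);
    l.exitloop_i(&mut acc);
    l.exit_a(CtxA::V1 { star: acc }, vec![])
}

fn main() {
    assert_eq!(demo(), vec!["x", "y", "z"]);
}
```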
Right Recursion
expr -> Id "." expr | "(" Num ")"
Right-recursive rules don’t need to be modified. The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxExpr {
/// `expr -> Id "." expr`
V1 { id: String, expr: SynExpr },
/// `expr -> "(" Num ")"`
V2 { num: String },
}
pub trait TestListener {
// ...
fn init_expr(&mut self) {}
fn exit_expr(&mut self, ctx: CtxExpr) -> SynExpr;
}
}
The non-recursive alternative is given first by `exit_expr(…)`, then the recursive alternatives are given in reverse order, in accordance with the rule.
Id0.Id1(5)
yields the following calls:
- `init_expr()`
- `exit_expr(ctx = CtxExpr::V2 { num: "5" }) -> x`
- `exit_expr(ctx = CtxExpr::V1 { id: "Id1", expr: x }) -> y`
- `exit_expr(ctx = CtxExpr::V1 { id: "Id0", expr: y }) -> z`
- (the production that includes `expr` receives the value `z`)
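This call order can be reproduced with a small stand-alone sketch. The `CtxExpr` enum and the trait are copied from the listing above; `SynExpr = String` and the `demo` driver replaying the calls for `Id0.Id1(5)` are hypothetical stand-ins:

```rust
// Hypothetical stand-in for the generated synthesized type.
pub type SynExpr = String;

pub enum CtxExpr {
    V1 { id: String, expr: SynExpr },
    V2 { num: String },
}

pub trait TestListener {
    fn init_expr(&mut self) {}
    fn exit_expr(&mut self, ctx: CtxExpr) -> SynExpr;
}

struct Printer;

impl TestListener for Printer {
    fn exit_expr(&mut self, ctx: CtxExpr) -> SynExpr {
        match ctx {
            // Recursive alternative: the inner value has already been computed.
            CtxExpr::V1 { id, expr } => format!("{id}.{expr}"),
            CtxExpr::V2 { num } => format!("({num})"),
        }
    }
}

// Hypothetical driver simulating the calls for `Id0.Id1(5)`:
// the non-recursive alternative first, then the recursive ones right to left.
fn demo() -> String {
    let mut l = Printer;
    l.init_expr();
    let x = l.exit_expr(CtxExpr::V2 { num: "5".into() });
    let y = l.exit_expr(CtxExpr::V1 { id: "Id1".into(), expr: x });
    l.exit_expr(CtxExpr::V1 { id: "Id0".into(), expr: y })
}

fn main() {
    assert_eq!(demo(), "Id0.Id1.(5)");
}
```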
Right Recursion With <L>
expr -> <L> Id "." expr | "(" Num ")"
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxExpr {
/// `expr -> <L> Id "." expr`
V1 { id: String },
/// `expr -> "(" Num ")"`
V2 { num: String },
}
pub trait TestListener {
// ...
fn init_expr(&mut self) -> SynExpr;
fn exit_expr(&mut self, acc: &mut SynExpr, ctx: CtxExpr, spans: Vec<PosSpan>);
}
}
Unlike the `*` and `+` repetitions, the user must not define a name in the attribute, since there is no transformation adding another nonterminal.
Since this form uses an accumulator, it is necessary to initialize it with init_expr(…).
This <L> version gives the values in the incoming order:
Id0.Id1(5)
yields the following calls:
- `init_expr() -> w`
- `exit_expr(*acc: w, ctx = CtxExpr::V1 { id: "Id0" })`, `*acc = x`
- `exit_expr(*acc: x, ctx = CtxExpr::V1 { id: "Id1" })`, `*acc = y`
- `exit_expr(*acc: y, ctx = CtxExpr::V2 { num: "5" })`, `*acc = z`
- (the production that includes `expr` receives the value `z`)
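The accumulator-based variant can be sketched the same way. The `CtxExpr` enum and the trait are copied from the listing above; the payload types, the unit `PosSpan`, and the `demo` driver replaying the calls in source order for `Id0.Id1(5)` are hypothetical stand-ins:

```rust
// Hypothetical stand-ins for the generated types.
#[derive(Clone, Copy)]
pub struct PosSpan;
pub type SynExpr = Vec<String>;

pub enum CtxExpr {
    V1 { id: String },
    V2 { num: String },
}

pub trait TestListener {
    fn init_expr(&mut self) -> SynExpr;
    fn exit_expr(&mut self, acc: &mut SynExpr, ctx: CtxExpr, spans: Vec<PosSpan>);
}

struct PathCollector;

impl TestListener for PathCollector {
    // The accumulator must be created before the first exit call.
    fn init_expr(&mut self) -> SynExpr {
        Vec::new()
    }
    fn exit_expr(&mut self, acc: &mut SynExpr, ctx: CtxExpr, _spans: Vec<PosSpan>) {
        match ctx {
            CtxExpr::V1 { id } => acc.push(id),
            CtxExpr::V2 { num } => acc.push(format!("({num})")),
        }
    }
}

// Hypothetical driver simulating the calls for `Id0.Id1(5)`:
// with `<L>`, the values arrive in source (left-to-right) order.
fn demo() -> Vec<String> {
    let mut l = PathCollector;
    let mut acc = l.init_expr();
    l.exit_expr(&mut acc, CtxExpr::V1 { id: "Id0".into() }, vec![]);
    l.exit_expr(&mut acc, CtxExpr::V1 { id: "Id1".into() }, vec![]);
    l.exit_expr(&mut acc, CtxExpr::V2 { num: "5".into() }, vec![]);
    acc
}

fn main() {
    assert_eq!(demo(), vec!["Id0", "Id1", "(5)"]);
}
```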
Left Recursion
e -> f | e "." Id
is transformed into
e -> f e_1
e_1 -> "." Id e_1 | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxE {
/// `e -> f`
V1 { f: SynF },
/// `e -> e "." Id`
V2 { e: SynE, id: String },
}
pub trait TestListener {
// ...
fn init_e(&mut self) {}
fn exit_e(&mut self, ctx: CtxE) -> SynE;
fn exitloop_e(&mut self, _e: &mut SynE) {}
}
}
This is similar to an `<L>` repetition, except for the initialization part. The accumulator is not returned by `init_e`; instead, it is created at the first iteration, when the rule matches one of its non-recursive alternatives; in this case, `e -> f`.
- `init_e(…)` is called before the rule is parsed.
- `exit_e(…)` is called after each variant is parsed.
  - The first time, it's a non-recursive alternative; here, `e -> f`. It must return the initial value of the accumulator (which may be its final value if there are no iterations).
  - If there are iterations of the recursive alternative, `exit_e(…)` is called for each iteration, with the current value of the accumulator in the context and the parsed values of that iteration. It must return the updated accumulator.
- `exitloop_e(…)` is called after the last iteration. It can be used if the accumulator requires post-processing once the whole series has been parsed.
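The steps above can be sketched as follows. The `CtxE` enum and the trait are copied from the listing above; the payload types and the `demo` driver replaying the calls for an input equivalent to `f.Id1.Id2` are hypothetical stand-ins:

```rust
// Hypothetical stand-ins for the generated synthesized types.
pub type SynF = String;
pub type SynE = String;

pub enum CtxE {
    V1 { f: SynF },
    V2 { e: SynE, id: String },
}

pub trait TestListener {
    fn init_e(&mut self) {}
    fn exit_e(&mut self, ctx: CtxE) -> SynE;
    fn exitloop_e(&mut self, _e: &mut SynE) {}
}

struct PathBuilder;

impl TestListener for PathBuilder {
    fn exit_e(&mut self, ctx: CtxE) -> SynE {
        match ctx {
            // First call: the non-recursive alternative seeds the accumulator.
            CtxE::V1 { f } => f,
            // Later calls: extend the accumulator received in the context.
            CtxE::V2 { e, id } => format!("{e}.{id}"),
        }
    }
}

// Hypothetical driver simulating the calls for `f.Id1.Id2`.
fn demo() -> String {
    let mut l = PathBuilder;
    l.init_e();
    let mut e = l.exit_e(CtxE::V1 { f: "f".into() });
    e = l.exit_e(CtxE::V2 { e, id: "Id1".into() });
    e = l.exit_e(CtxE::V2 { e, id: "Id2".into() });
    l.exitloop_e(&mut e);
    e
}

fn main() {
    assert_eq!(demo(), "f.Id1.Id2");
}
```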
Left Factorization
a -> A | A B | A B C | A B D | E
which can be rewritten as
a -> A (ε | B (ε | C | D)) | E
              ----------- a_2
       ------------------- a_1
is transformed into
a -> A a_1
a -> E
a_1 -> B a_2
a_1 -> ε
a_2 -> C
a_2 -> D
a_2 -> ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A`
V1 { a: String },
/// `a -> A B`
V2 { a: String, b: String },
/// `a -> A B C`
V3 { a: String, b: String, c: String },
/// `a -> A B D`
V4 { a: String, b: String, d: String },
/// `a -> E`
V5 { e: String },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}
Advanced Transformations
Ambiguous Left and Right Recursion
Rules that are both left- and right-recursive are ambiguous in an LL(1) grammar, since they define neither the precedence nor the associativity. The typical use of such rules is for expressions containing binary operators.
Lexigram automatically transforms those rules using the Clarke method, which preserves the precedence and left-/right-associativity. The same rule can contain binary and prefixed / suffixed unary operators (e.g. expr "+" expr, "-" expr and expr "?", respectively).
The precedence is determined by the order in which the operators are defined in the rule:
- The alternatives defined first have higher precedence; in this example, `*` has higher precedence than `+`:

  expr -> expr "*" expr | expr "+" expr | Num

- The `<P>` attribute indicates an operator with the same level of precedence as the previous one on its left. For example, it can be used in a rule like

  expr -> expr "+" expr | <P> expr "-" expr | Id | Num
By default, binary operators are left-associative. The <R> attribute indicates a right-associative operator. For example,
expr -> <R> expr "^" expr | expr "*" expr | expr "/" expr | Num
The attributes can be placed anywhere in the rule alternative, but they are usually placed at the beginning for consistency.
The rule
e -> e "*" e | <R> e "!" e | e "+" e | Num
is transformed into
e -> e_4 e_1
e_1 -> "*" e_4 e_1 | "!" e_2 e_1 | "+" e_2 e_1 | ε
e_2 -> e_4 e_3
e_3 -> "*" e_4 e_3 | "!" e_2 e_3 | ε
e_4 -> Num
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxE {
/// `e -> e "*" e`
V1 { e: [SynE; 2] },
/// `e -> <R> e "!" e`
V2 { e: [SynE; 2] },
/// `e -> e "+" e`
V3 { e: [SynE; 2] },
/// `e -> Num`
V4 { num: String },
}
pub trait TestListener {
// ...
fn init_e(&mut self) {}
fn exit_e(&mut self, ctx: CtxE) -> SynE;
}
}
Since there are two identical symbols in the same alternative, they are regrouped in an array `e: [SynE; 2]`, where `e[0]` is the leftmost nonterminal and `e[1]` is the rightmost one.
The first variant `V1` is related to the operator of highest precedence, and the last operator variant is related to the lowest precedence.
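To make the variant/precedence correspondence concrete, here is a tiny evaluator over these contexts. The `CtxE` enum and the trait are copied from the listing above (the `"!"` operator is given an arbitrary meaning); `SynE = i64` and the `demo` driver are hypothetical stand-ins, and the driver assumes the parser reduces `2*3` before `+`, as the precedence dictates:

```rust
// Hypothetical stand-in for the generated synthesized type.
pub type SynE = i64;

#[allow(dead_code)]
pub enum CtxE {
    V1 { e: [SynE; 2] }, // `e -> e "*" e`
    V2 { e: [SynE; 2] }, // `e -> <R> e "!" e`
    V3 { e: [SynE; 2] }, // `e -> e "+" e`
    V4 { num: String },  // `e -> Num`
}

pub trait TestListener {
    fn init_e(&mut self) {}
    fn exit_e(&mut self, ctx: CtxE) -> SynE;
}

struct Eval;

impl TestListener for Eval {
    fn exit_e(&mut self, ctx: CtxE) -> SynE {
        match ctx {
            CtxE::V1 { e: [l, r] } => l * r,
            CtxE::V2 { e: [l, r] } => l - r, // arbitrary meaning for "!"
            CtxE::V3 { e: [l, r] } => l + r,
            CtxE::V4 { num } => num.parse().unwrap(),
        }
    }
}

// Hypothetical driver for `2*3+4`: the higher-precedence `*` is reduced first.
fn demo() -> i64 {
    let mut l = Eval;
    l.init_e();
    let a = l.exit_e(CtxE::V4 { num: "2".into() });
    let b = l.exit_e(CtxE::V4 { num: "3".into() });
    let p = l.exit_e(CtxE::V1 { e: [a, b] });
    let c = l.exit_e(CtxE::V4 { num: "4".into() });
    l.exit_e(CtxE::V3 { e: [p, c] })
}

fn main() {
    assert_eq!(demo(), 10);
}
```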
Repetitions of Longer Patterns
a -> (b A b B A)+
is transformed into
a -> a_1
a_1 -> b A b B A a_2
a_2 -> a_1 | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> (b A b B A)+`
V1 { plus: SynA1 },
}
/// Computed `(b A b B A)+` array in `a -> ►► (b A b B A)+ ◄◄ `
pub struct SynA1(pub Vec<SynA1Item>);
/// `b A b B A` item in `a -> ( ►► b A b B A ◄◄ )+`
pub struct SynA1Item { pub b: [SynB; 2], pub a: [String; 2], pub b1: String }
pub trait TestListener {
// ..
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}
In comparison to the base * repetition, we see that the type of the items in the list, SynA1Item, is a struct rather than a string. The same symbols have been regrouped in arrays:
- `b: [SynB; 2]` for the two nonterminals `b`.
- `a: [String; 2]` for the two tokens `A` (note that the field must be lowercase, despite representing a token), each containing the token's string.
Also, the token `B` is represented by `b1`, since the name `b` was already taken. The fields appear in the same order as the first occurrence of the symbol they represent in the rule; here, `b` comes first, `A` second, and `B` third. Their type is another indication that helps tell them apart: a token's type is a string, while a nonterminal has a custom type whose name begins with "Syn", for synthesized.
Nested Repetitions
a -> (A (b ",")* ";")* C
is transformed into
a -> a_2 C
b -> B
a_1 -> b "," a_1 | ε
a_2 -> A a_1 ";" a_2 | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> (A (b ",")* ";")* C`
V1 { star: SynA2, c: String },
}
/// Computed `(b ",")*` array in `a -> (A ►► (b ",")* ◄◄ ";")* C`
pub struct SynA1(pub Vec<SynB>);
/// Computed `(A (b ",")* ";")*` array in `a -> ►► (A (b ",")* ";")* ◄◄ C`
pub struct SynA2(pub Vec<SynA2Item>);
/// `A (b ",")* ";"` item in `a -> ( ►► A (b ",")* ";" ◄◄ )* C`
pub struct SynA2Item { pub a: String, pub star: SynA1 }
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}
When a repetition is nested in another, the resulting vector becomes a field of the outer repetition’s type. Here, SynA2Item::star of type SynA1 contains a Vec<SynB>, where SynB contains the only data of the inner loop, which is the nonterminal b. Each generated type has a doc comment that emphasizes the part of the original rule it represents between “►►” and “◄◄” arrows.
It’s also possible to nest any combination of non-<L> and <L> repetitions. The principle is always the same: non-<L> repetitions are represented by a Vec of values gathered automatically by the parser, and an <L> repetition is managed manually by the user through extra nonterminal init(…)/exit(…) methods and is represented by a user-defined type.
Alternatives in Repetitions
a -> A (B | b C b B C | E)* F
is transformed into
a -> A a_1 F
a_1 -> B a_1 | b C b B C a_1 | E a_1 | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A (B | b C b B C | E)* F`
V1 { a: String, star: SynA1, f: String },
}
pub enum CtxB {
/// `b -> D`
V1 { d: String },
}
/// Computed `(B | b C b B C | E)*` array in `a -> A ►► (B | b C b B C | E)* ◄◄ F`
pub struct SynA1(pub Vec<SynA1Item>);
pub enum SynA1Item {
/// `B` item in `a -> A ( ►► B ◄◄ | b C b B C | E)* F`
V1 { b: String },
/// `b C b B C` item in `a -> A (B | ►► b C b B C ◄◄ | E)* F`
V2 { b: [SynB; 2], c: [String; 2], b1: String },
/// `E` item in `a -> A (B | b C b B C | ►► E ◄◄ )* F`
V3 { e: String },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}
Since each iteration can be one of three symbol strings, its type `SynA1Item` is an enum with three variants, each containing the corresponding data.
Alternatives in Repetitions with <L>
a -> A (<L=i> (<L=j> b C b B C | D)* E | F)* G
is transformed into:
a -> A i G
i -> j E i | F i | ε
j -> b C b B C j | D j | ε
The generated code is as follows:
#![allow(unused)]
fn main() {
pub enum CtxA {
/// `a -> A (<L> (<L> b C b B C | D)* E | F)* G`
V1 { a: String, star: SynI, g: String },
}
pub enum CtxI {
/// `<L> (<L> b C b B C | D)* E` iteration in `a -> A ( ►► <L> (<L> b C b B C | D)* E ◄◄ | F)* G`
V1 { star: SynJ, e: String },
/// `F` iteration in `a -> A (<L> (<L> b C b B C | D)* E | ►► F ◄◄ )* G`
V2 { f: String },
}
pub enum CtxJ {
/// `<L> b C b B C` iteration in `a -> A (<L> ( ►► <L> b C b B C ◄◄ | D)* E | F)* G`
V1 { b: [SynB; 2], c: [String; 2], b1: String },
/// `D` iteration in `a -> A (<L> (<L> b C b B C | ►► D ◄◄ )* E | F)* G`
V2 { d: String },
}
pub trait TestListener {
// ...
fn init_a(&mut self) {}
fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
fn init_i(&mut self) -> SynI;
fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
fn exitloop_i(&mut self, acc: &mut SynI) {}
fn init_j(&mut self) -> SynJ;
fn exit_j(&mut self, acc: &mut SynJ, ctx: CtxJ, spans: Vec<PosSpan>);
fn exitloop_j(&mut self, acc: &mut SynJ) {}
}
}
Notes:
- In the case of `<L>+` repetitions, a `last_iteration` boolean field flags the last iteration instead of an `exitloop_*(…)` method call, as seen before.
Formal Grammar Definition
Here are the lexicon and the grammar of Gram, using the Lexigram syntax. Those files can be found in the repository, in the build-stage1 directory.
Gram’s Lexicon
lexicon GramLexer;
fragment COMMENT : '/*' .*? '*/';
fragment LINECOMMENT : '//' ~[\r\n]*;
fragment WHITESPACE : [ \n\r\t]+;
fragment ID : [a-zA-Z][a-zA-Z_0-9]*;
Comment : COMMENT -> skip;
LineComment : LINECOMMENT -> skip;
WhiteSpace : WHITESPACE -> skip;
Colon : ':';
Lparen : '(';
Or : '|';
Plus : '+';
Question : '?';
Rparen : ')';
Semicolon : ';';
Star : '*';
Grammar : 'grammar';
SymEof : 'EOF';
Lform : '<L' ('=' ID)? '>';
Rform : '<R>';
Pform : '<P>';
Greedy : '<G>';
Id : ID;
Gram’s Grammar
grammar GramParser;
file: header rules;
header:
Grammar Id Semicolon
;
rules:
rule
| rules rule
;
rule:
rule_name Colon prod SymEof? Semicolon
;
rule_name:
Id
;
prod:
prod_alt
| prod Or prod_alt
;
prod_alt:
prod_factor*
;
prod_factor:
prod_atom (Plus | Star | Question)?
;
prod_atom:
Id
| Lform
| Rform
| Pform
| Greedy
| Lparen prod Rparen
;