Grammar
Let’s start with the top structure of the grammar. In this section, we’ll build the grammar and get a peek at the generated code. It’s a good opportunity to show off some features and to provide a little guidance. We’ll implement the listener for the generated code in the next section, once the grammar is fully defined.
The top level is a succession of four groups:
- definitions ("def")
- lexer specifications ("lexer")
- parser specifications ("parser")
- general options ("options")
The definitions contain zero or more items that all have the same format, so it’s pretty simple. We want to be able to use those “constants” for any value, so there are different types, like a string literal (filename or tag name), an integer literal (indent level), and other possibilities we’ll discover as we come across all the options.
The lexer, parser, and general options contain between brackets a list of comma-separated options. We’ll see the inner options when we tackle those nonterminals.
How are the top-level items organized? They could either follow one another in the order shown above, or there could just be a succession of those items in no particular order and with possible repetitions. The latter would offer more flexibility, but we’ll choose the former because it’s cleaner.
Optional Items
The starting nonterminal, which we’re naming config, simply lists the four parts:
config:
definitions
lexer
parser
options
;
lexer is mandatory, but parser and options are optional; we can request either a lexer alone or the whole lexer + parser + wrapper code. Since Lexigram offers a few elements of syntactic sugar for optional and repetitive items, such as α?, α*, and α+, we can use it in the definitions of parser and options (the actual content of the grammatical rules is omitted for now):
definitions: …;
lexer: …;
parser: (…)?;
options: (…)?;
The first task of Lexigram is to normalize the rules, transforming
- α? to α | ε (where ε means “empty”, nothing)
- α* to a new nonterminal x, where x -> α x | ε
- α+ to a new nonterminal x, where x -> α x | α
This is transformed into the final productions below:
config -> definitions lexer parser options
definitions -> …
lexer -> …
parser -> …
parser -> ε
options -> …
options -> ε
parser and options have an empty alternative, which is fine. It’s reflected directly in the listener trait methods as two variants of an enum context type, so the user knows immediately whether the option is used or not.
There’s another way to write the grammar: it could be tempting to put the ? directly into config:
config:
definitions
lexer
parser?
options?
;
definitions: …;
lexer: …;
parser: …;
options: …;
And it would work, but it wouldn’t be optimal. Why?
Each of those options would split the original production when it’s normalized, which would end up as a list of all the possible combinations:
config -> definitions lexer parser options
config -> definitions lexer parser
config -> definitions lexer options
config -> definitions lexer
It’s not too long—though we’d have eight alternatives if the lexer specs were also optional—but it’s cumbersome. And that’s not all; since those alternatives need to be left-factorized, the rule would be transformed again later:
config -> definitions lexer config_1
. config_1 -> parser config_2
. config_1 -> options
. config_1 -> ε
. config_2 -> options
. config_2 -> ε
definitions -> …
lexer -> …
parser -> …
options -> …
The user implementation of the trait method would reflect the same pattern, as shown below. It’s possible to work with it, but it’s annoying to handle the repetitions, especially if we need to handle nonterminal values for definitions, lexer, parser, and options.
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn exit_config(&mut self, ctx: CtxConfig) -> SynConfig {
match ctx {
// config -> definitions lexer parser options
CtxConfig::V1 { definitions, lexer, parser, options } => {}
// config -> definitions lexer parser
CtxConfig::V2 { definitions, lexer, parser } => {}
// config -> definitions lexer options
CtxConfig::V3 { definitions, lexer, options } => {}
// config -> definitions lexer
CtxConfig::V4 { definitions, lexer } => {}
}
SynConfig()
}
// ...
}
}
In this case, we probably don’t need a value for those nonterminals. In each option category, we’ll store the parsed result for each of them in the listener instead, so it’s not a big problem, but it’s not great from the perspective of the final LL(1) grammar.
Compared to the first solution, which yields one production for config and each non-optional nonterminal, and two productions for each of the n optional nonterminals, the second yields 2^n productions for config (plus left factorizations!) and one production for each of the others. So we see why it’s better to use the ? in the secondary nonterminals:
- For 2 optional items out of 4, we get 7 productions instead of 10.
- For 3 optional items out of 4, we get 8 productions instead of 16.
The take-away is to avoid using this particular syntactic sugar several times in the same alternative, because the desugaring makes it grow exponentially when it’s distributed:
x: α? β? γ?;
becomes
x -> (α | ε) (β | ε) (γ | ε) -> α β γ | α β | α γ | α | β γ | β | γ | ε
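The exponential growth of the distribution can be checked mechanically. Here is a small stand-alone sketch (the symbol names are mere placeholders) that enumerates the alternatives produced by distributing n optional items; each item is either present or absent, giving 2^n alternatives:

```rust
// Enumerate the alternatives of (α | ε) (β | ε) (γ | ε) … by treating each
// bit of a mask as "item present / absent".
fn alternatives(items: &[&str]) -> Vec<String> {
    let n = items.len();
    (0..1u32 << n)
        .map(|mask| {
            let alt: Vec<&str> = items
                .iter()
                .enumerate()
                .filter(|(i, _)| mask & (1 << *i) != 0)
                .map(|(_, s)| *s)
                .collect();
            if alt.is_empty() { "ε".to_string() } else { alt.join(" ") }
        })
        .collect()
}

fn main() {
    let alts = alternatives(&["α", "β", "γ"]);
    assert_eq!(alts.len(), 8); // 2^3 alternatives, matching the expansion above
    assert!(alts.contains(&"ε".to_string()));
    assert!(alts.contains(&"α β γ".to_string()));
}
```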
If it’s used in separate alternatives, it’s fine because it only adds one ε, which is shared between them:
x: α? | β? | γ?;
becomes
x -> α | β | γ | ε
Loops
Now that we have config, let’s see the secondary nonterminals, starting with definitions.
Each definition has the syntax below, where we give a temporary definition for value:
lexer Config;
Equal : '=';
Semicolon : ';';
Id : [a-zA-Z][a-zA-Z_0-9]*;
NumLiteral : [0-9]+;
StrLiteral : '"' Char+ '"';
parser Config;
definition:
Def Id Equal value Semicolon
;
value:
NumLiteral
| StrLiteral
| Id
;
Or, if we’re using the simple grammar notation to make terminals like "def" and "=" more visible, instead of using Def, Equal, and so on:
definition -> "def" Id "=" value ";"
value -> NumLiteral | StrLiteral | Id;
But how can we express 0 or more repetitions of definition? It’s time to see the different ways of doing grammar loops.
Right Recursion
Right recursion is the only valid loop in an LL(1) grammar.
If we use this rule for definitions:
definitions -> "def" Id "=" value ";" definitions | ;
The absence of symbols on the right of | simply means that alternative is empty. Since the rule is recursive, we need an alternative to end the recursion: here, we use an empty production. In practice, the parser will check the next token after ";" to predict which alternative follows.
- If the token is "def", it predicts the left, recursive alternative.
- If it’s "lexer", it predicts the right, empty alternative. Indeed, when the alternative is “nothing”, the parser looks at the next possibilities in the config nonterminal. Here, config expects lexer after definitions, and the lexer syntax begins with the terminal "lexer" (even though we haven’t written that explicitly yet).
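This prediction step can be sketched as a simple match; the token kinds and function below are hypothetical stand-ins, as the real generated parser consults an LL(1) parsing table instead:

```rust
// Hypothetical token kinds; only the two that matter for this decision.
enum Token {
    Def,
    Lexer,
}

// Return the predicted production number for `definitions`.
fn predict_definitions(next: &Token) -> u32 {
    match next {
        // definitions -> "def" Id "=" value ";" definitions
        Token::Def => 1,
        // definitions -> ε, because "lexer" is in FOLLOW(definitions)
        Token::Lexer => 2,
    }
}

fn main() {
    assert_eq!(predict_definitions(&Token::Def), 1);
    assert_eq!(predict_definitions(&Token::Lexer), 2);
}
```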
There is no transformation for definitions:
definitions -> "def" Id "=" value ";" definitions
definitions -> ε
We get to implement this listener method:
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn init_definitions(&mut self) {}
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
match ctx {
// definitions -> "def" Id "=" value ";" definitions
CtxDefinitions::V1 { id, value, definitions } => {}
// definitions -> ε
CtxDefinitions::V2 => {}
}
SynDefinitions()
}
// ...
}
}
This appears to be fine, but there is a catch: since it’s a right-recursive rule, we’ll get the last item first!
Let’s consider the following input, which has three definitions:
def a = 0;
def b = 1;
def c = 2;
The reductions are performed as follows:
definitions
─────┬─────
│
─────────────────────┴──────────────────────
"def" Id("a") "=" value("0") ";" definitions
─────┬─────
│
─────────────────────┴──────────────────────
"def" Id("b") "=" value("1") ";" definitions
─────┬─────
│
─────────────────────┴──────────────────────
"def" Id("c") "=" value("2") ";" definitions
─────┬─────
│
ε
Since the exit_definitions(…) method is called after all the items have been parsed in the production, we must wait until all the productions have been parsed. The successive calls are, if we simplify the notation somewhat:
- init_definitions()
- exit_definitions(CtxDefinitions::V2), returns w
- exit_definitions(CtxDefinitions::V1 { id: "c", value: "2", definitions: w }) -> x
- exit_definitions(CtxDefinitions::V1 { id: "b", value: "1", definitions: x }) -> y
- exit_definitions(CtxDefinitions::V1 { id: "a", value: "0", definitions: y }) -> z
And the value of definitions in exit_config(…), once it’s called, will be z.
If we accumulated the values in a vector, we would ultimately get the items in reverse order, unless we used a VecDeque and pushed the items at the front. It would also be awkward to check that an identifier hadn’t already been declared. And if a definition referenced another identifier (def c = a), you’d need another pass to resolve it.
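The reversal problem can be simulated outside the parser. The sketch below uses simplified stand-in types (the real generated types differ) and mimics the right-recursive call order, where the last item arrives first; pushing each incoming item at the front of a VecDeque restores source order without a final reversal pass:

```rust
use std::collections::VecDeque;

// Hypothetical, simplified stand-in for the generated SynDefinitions type.
struct SynDefinitions(VecDeque<(String, String)>);

fn exit_definitions_item(id: &str, value: &str, mut rest: SynDefinitions) -> SynDefinitions {
    // The right recursion delivers the *last* item first, so push to the
    // front to keep the items in source order.
    rest.0.push_front((id.to_string(), value.to_string()));
    rest
}

fn main() {
    // Calls arrive innermost-first: c, then b, then a.
    let w = SynDefinitions(VecDeque::new());
    let x = exit_definitions_item("c", "2", w);
    let y = exit_definitions_item("b", "1", x);
    let z = exit_definitions_item("a", "0", y);
    let ids: Vec<&str> = z.0.iter().map(|(id, _)| id.as_str()).collect();
    assert_eq!(ids, ["a", "b", "c"]);
}
```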
Since we don’t have any way of knowing when exit_definitions(…) is called for the last time, any post-processing to fix those issues must be done wherever definitions is used. In our case, it’s used only once in exit_config(…), but if it was used in several productions, we’d have to call a post-processing method from each location (and not forget about it if new instances are made!). Or we would have to add an intermediate nonterminal to keep things tidier.
Another consequence, which isn’t obvious at first, is that all the SynDefinitions values are stored on the wrapper stack until the end of the list, and all the production symbols are stored on the parser stack, too. Not only do you get the last item first, after the whole list has been parsed, but storing the intermediate data and steps also consumes more memory.
Right-recursive loops are simple and useful when you need to preserve that reverse order, but otherwise it’s not something you’ll want to use in a grammar very often.
Star
We’ve already seen the syntactic sugar for repetitions of zero or more items: definition*. We can even put it directly into the definitions nonterminal:
definitions -> ("def" Id "=" value ";")*;
Lexigram transforms it as follows:
definitions -> definitions_1
. definitions_1 -> "def" Id "=" value ";" definitions_1
. definitions_1 -> ε
The generated code hides that transformation by giving the end result once definitions is entirely parsed. There are no listener calls corresponding to the nonterminal definitions_1.
#![allow(unused)]
fn main() {
pub enum CtxDefinitions {
/// `definitions -> ("def" Id "=" value ";")*`
V1 { star: SynDefinitions1 },
}
pub struct SynDefinitions1(pub Vec<SynDefinitions1Item>);
pub struct SynDefinitions1Item { pub id: String, pub value: SynValue }
// user-code template:
impl ConfigListener for Listener {
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
// definitions -> ("def" Id "=" value ";")*
let CtxDefinitions::V1 { star } = ctx;
SynDefinitions()
}
// ...
}
}
The wrapper implements the glue logic to collect each item of the repetition and push it into a Vec. When we implement exit_definitions(…), we get a vector star containing all the (id, value) pairs. It’s often the most convenient way for small lists of items.
As we’ve already seen in the overview, valueless symbols like "def", "=", and ";" are not given in the context variants, so they’re not stored in repetition vectors either.
If there are alternatives in the repetition, like x -> (α | β)*, the type inside the vector is an enum with the variants, so that each item can be either α or β. It can be very handy in simple situations, but it’s not to be abused; when the repeated item becomes too complex, another loop type may be preferable.
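As an illustration of working with the collected vector, here is a hedged sketch of what an exit_definitions(…) body might do with star, using a simplified stand-in for the generated SynDefinitions1Item; building a map makes duplicate detection a single pass:

```rust
use std::collections::HashMap;

// Hypothetical stand-in mirroring the generated SynDefinitions1Item.
struct Item {
    id: String,
    value: String,
}

// Turn the ordered (id, value) items into a map, rejecting duplicates.
fn collect(star: Vec<Item>) -> Result<HashMap<String, String>, String> {
    let mut map = HashMap::new();
    for item in star {
        if map.insert(item.id.clone(), item.value).is_some() {
            return Err(format!("duplicate definition: {}", item.id));
        }
    }
    Ok(map)
}

fn main() {
    let star = vec![
        Item { id: "a".into(), value: "0".into() },
        Item { id: "b".into(), value: "1".into() },
    ];
    let map = collect(star).unwrap();
    assert_eq!(map["a"], "0");

    let dup = vec![
        Item { id: "a".into(), value: "0".into() },
        Item { id: "a".into(), value: "1".into() },
    ];
    assert!(collect(dup).is_err());
}
```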
At this point, you might be wondering: isn’t definitions_1 right-recursive? And didn’t we say that a right recursion produced the results in reverse order and had a few other downsides? Are we getting the items in reverse order here too?
While it’s indeed right-recursive, which is the only way to loop in LL(1), the original rule wasn’t meant as a right recursion. Therefore, Lexigram knows that the trailing definitions_1 of production 1 below doesn’t need to be known before using the items Id and value.
0: definitions -> definitions_1
1: . definitions_1 -> "def" Id "=" value ";" definitions_1
2: . definitions_1 -> ε
Let’s see what happens in more detail:
- When production 1 is predicted, the wrapper pushes an empty “loop” vector on the value stack.
- The production symbols are then parsed, and those that have a value are pushed on the wrapper stack; here, Id and value.
- When the definitions_1 symbol at the end of production 1 is met, it isn’t processed as a normal nonterminal to be derived (which would put another empty vector on the stack), but as a loop nonterminal. When the wrapper receives a loop call for that nonterminal, it pops the values from their respective stacks and pushes them as a new item into the loop vector, which is now at the top of the value stack.
- This process iterates until all the items have been parsed.
- When the loop is finished, production 0 is fully parsed, and the wrapper knows that the value of definitions_1, which stands for the original * repetition, is at the top of the value stack.
Note
For the curious minds who know how table-driven parsers work in general.
When a production like definitions_1 -> "def" Id "=" value ";" definitions_1 is predicted, the usual algorithm description tells us that the symbols on the right-hand side are put in reverse order on a stack. Here, it would push definitions_1, ";", value, "=", Id, "def". The parser would then pop the next symbol and, depending on its nature:
- if it’s a terminal, like "def", it would check if the next token matches,
- if it’s a nonterminal, like value, it would predict, from the parsing table, which alternative of that nonterminal’s productions must be added onto the stack.
Lexigram adds other symbols on the stack to call the wrapper:
- exit symbols for the nonterminals, so that the exit_*(…) methods are called at the right time,
- loop symbols to manage the iterations, as we saw above,
- hook symbols to let the user intercept and modify specific tokens.
These symbols are shown in the log (“opcodes”).
In conclusion, the iterations of the parsing loop insert the items in the correct order, and nothing accumulates on the stacks. It’s more straightforward and not as resource-consuming.
The same mechanism is used in other types of loops, like the left recursion and the low-latency loops.
Plus
For repetitions of one or more items, + can be used; the result is identical to * from the point of view of the listener implementation, except the name plus is used instead of star in the context variant.
definitions -> ("def" Id "=" value ";")+;
What we don’t see in the listener trait is that the final LL(1) grammar is slightly more complicated “internally”. The + is first normalized:
definitions -> definitions_1
. definitions_1 -> "def" Id "=" value ";" definitions_1
. definitions_1 -> "def" Id "=" value ";"
Since the productions begin with identical symbols, "def", Id, "=", and so on, this must be left-factorized, too:
definitions -> definitions_1
. definitions_1 -> "def" Id "=" value ";" definitions_2
. definitions_2 -> definitions_1
. definitions_2 -> ε
Left Recursion
As we saw in the LL(1) grammar limitations, left recursion is transformed into a right recursion, too.
definitions ->
definitions "def" Id "=" value ";"
|
;
The transformed rules look identical to what we saw with the * repetition:
0: definitions -> definitions_1
1: . definitions_1 -> "def" Id "=" value ";" definitions_1
2: . definitions_1 -> ε
Warning
Don’t confuse the ε in production alternative 2 above, which corresponds to the end of the loop, with the ε we had in the right recursion, which corresponded to the “empty” alternative of the original rule. That “empty” is invisible in the case of the left recursion because it’s simplified during the transformation: production 0 starts as definitions -> ε definitions_1, and the ε is removed, leaving definitions -> definitions_1.
Remember that a -> a α | β is transformed into a -> β a_1; a_1 -> α a_1 | ε. Here, β is ε.
The only difference from a * loop is in the wrapper. The context variants are shown as left-recursive, and, unlike the * repetition, the wrapper calls the listener trait method at each iteration, leaving you the choice of the loop value type and how to use it.
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
match ctx {
// definitions -> definitions "def" Id "=" value ";"
CtxDefinitions::V1 { definitions, id, value } => {}
// definitions -> ε
CtxDefinitions::V2 => {}
}
SynDefinitions()
}
fn exitloop_definitions(&mut self, definitions: &mut SynDefinitions) {}
// ...
}
}
Another important difference is the availability of another method called after the last iteration, exitloop_definitions(…). It’s easy to do thanks to the transformation in the grammar: when the alternative 2 is predicted on the last iteration, the wrapper calls the method. In comparison, this wasn’t possible in the case of a right-recursive rule because, when the loop finally exits, the parser must still make all the pending exit calls for the items on the stack. And since there is no separate nonterminal for the loop, it would make calling such an exitloop_*(…) method more complicated.
When Lexigram sees a left recursion, it knows the rightmost nonterminal can be treated differently, and since the original recursion is on the left term, the user expects to see the reductions on the left. For the same input, the reduction is done as follows:
definitions
─────┬─────
│
─────────────────────┴──────────────────────
definitions "def" Id("c") "=" value("2") ";"
─────┬─────
│
─────────────────────┴──────────────────────
definitions "def" Id("b") "=" value("1") ";"
─────┬─────
│
─────────────────────┴──────────────────────
definitions "def" Id("a") "=" value("0") ";"
─────┬─────
│
ε
The successive calls are, in this case:
- init_definitions()
- exit_definitions(CtxDefinitions::V2), returns w
- exit_definitions(CtxDefinitions::V1 { id: "a", value: "0", definitions: w }) -> x
- exit_definitions(CtxDefinitions::V1 { id: "b", value: "1", definitions: x }) -> y
- exit_definitions(CtxDefinitions::V1 { id: "c", value: "2", definitions: y }) -> z
- exitloop_definitions(&mut z)
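The left-recursive order can be simulated with plain functions, again using simplified stand-ins for the generated types: items arrive first-to-last, so a plain push keeps the source order, and a post-processing hook can run once after the last item:

```rust
// Hypothetical, simplified stand-in for the generated SynDefinitions type.
struct SynDefinitions(Vec<(String, String)>);

fn exit_item(mut acc: SynDefinitions, id: &str, value: &str) -> SynDefinitions {
    // Items arrive in source order, so pushing at the back is enough.
    acc.0.push((id.to_string(), value.to_string()));
    acc
}

fn exitloop_definitions(acc: &mut SynDefinitions) {
    // Called once after the last iteration; e.g. final validation goes here.
    assert!(!acc.0.is_empty());
}

fn main() {
    let w = SynDefinitions(Vec::new());
    let x = exit_item(w, "a", "0");
    let y = exit_item(x, "b", "1");
    let mut z = exit_item(y, "c", "2");
    exitloop_definitions(&mut z);
    let ids: Vec<&str> = z.0.iter().map(|(id, _)| id.as_str()).collect();
    assert_eq!(ids, ["a", "b", "c"]);
}
```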
Left vs Right
Our need to encode the loop for definitions didn’t quite illustrate a subtlety between left and right recursion because we had an empty alternative. To remove any confusion and stress the difference between left and right, let’s introduce a non-empty alternative in each case.
Here is a right-recursive version that ends with an “end”:
defs -> "def" Id "=" value ";" defs | "end" Id;
It corresponds to those productions (no transformation):
0: defs -> "def" Id "=" value ";" defs
1: defs -> "end" Id
It can parse an input that ends with “end”, since that’s the only admissible rightmost token in the original rule:
def a = 0;
def b = 1;
end loop
This would generate those successive calls:
- init_defs()
- exit_defs(CtxDefs::V2 { id: "loop" }) -> x
- exit_defs(CtxDefs::V1 { id: "b", value: "1", defs: x }) -> y
- exit_defs(CtxDefs::V1 { id: "a", value: "0", defs: y }) -> z
And here is a left-recursive rule that begins with “begin”:
defs -> defs "def" Id "=" value ";" | "begin" Id; // defs α | β
or, if that’s easier to visualize,
defs -> "begin" Id | defs "def" Id "=" value ";"; // β | defs α
Lexigram transforms it into those productions:
0: defs -> "begin" Id defs_1                   // β defs_1
1: . defs_1 -> "def" Id "=" value ";" defs_1   // α defs_1
2: . defs_1 -> ε                               // (loop end)
It can parse an input that begins with “begin”, since that’s the only admissible leftmost token in the original rule:
begin loop
def a = 0;
def b = 1;
This would generate those successive calls:
- init_defs()
- exit_defs(CtxDefs::V2 { id: "loop" }) -> x
- exit_defs(CtxDefs::V1 { id: "a", value: "0", defs: x }) -> y
- exit_defs(CtxDefs::V1 { id: "b", value: "1", defs: y }) -> z
- exitloop_defs(&mut z)
Latency
To close the loop, we’ll have a quick look at a variant of right recursion and */+ repetitions. We saw that both those loop constructs had a higher latency. The right recursion puts everything in the stack until the last iteration, and only then starts calling the listener methods. And the repetitions deliver the vector of collected items once the loop has ended.
Yet they provide useful alternatives to left recursion, by allowing independent terms at the end for the right recursion and none at all for the repetition. Wouldn’t it be worth having a way to get the loop items earlier?
That’s why the “low-latency” attribute was introduced.
In a low-latency right recursion, the rule has to be written as follows (the <L> can be placed anywhere in the recursive alternative):
definitions -> "def" Id "=" value ";" <L> definitions | ;
Lexigram doesn’t change the productions:
1: definitions -> "def" Id "=" value ";" definitions
2: definitions -> ε
But the listener methods do change:
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn init_definitions(&mut self) -> SynDefinitions {
SynDefinitions(vec![])
}
fn exit_definitions(&mut self, acc: &mut SynDefinitions, ctx: CtxDefinitions) {
match ctx {
// definitions -> "def" Id "=" value ";" <L> definitions
CtxDefinitions::V1 { id, value } => {
acc.0.push((id, value));
}
// definitions -> ε
CtxDefinitions::V2 => {}
}
}
// ...
}
}
Now,
- self.init_definitions() returns the initial value of the loop variable; for example, an empty vector.
- self.exit_definitions(acc, ctx) is called after parsing the independent term and each iteration.
At first, calling init_definitions(…), then exit_definitions(…) on the independent term may sound redundant, since both could provide the initial value, but this separates the initialization of the nonterminal, before any part of it is parsed, from the adjustment made with the independent term, if necessary. The other reason is the mutable reference to the loop variable, acc, which is usually more efficient than moving its value at each iteration.
Since it’s a right recursion, there is still no way to know when the last iteration occurs, but now the data can be processed directly in incoming order. The code example above provides a vector containing the items in the correct order. It also allows you to check that an Id hasn’t already been defined and, if so, to return an appropriate error message, or to use an already defined Id as a value, without a second pass.
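As a stand-alone sketch of those on-the-fly checks (the real listener method returns nothing and would record errors on the listener instead; the types and the Result-based error handling below are illustrative):

```rust
use std::collections::HashMap;

// Hypothetical accumulator: each definition is checked and identifier
// references are resolved as soon as the item arrives.
struct SynDefinitions(HashMap<String, String>);

fn exit_definitions(acc: &mut SynDefinitions, id: &str, value: &str) -> Result<(), String> {
    if acc.0.contains_key(id) {
        return Err(format!("'{}' is already defined", id));
    }
    // Resolve a reference to an earlier definition immediately (def c = a).
    let resolved = acc.0.get(value).cloned().unwrap_or_else(|| value.to_string());
    acc.0.insert(id.to_string(), resolved);
    Ok(())
}

fn main() {
    let mut acc = SynDefinitions(HashMap::new());
    exit_definitions(&mut acc, "a", "0").unwrap();
    exit_definitions(&mut acc, "c", "a").unwrap(); // def c = a resolves to "0"
    assert_eq!(acc.0["c"], "0");
    assert!(exit_definitions(&mut acc, "a", "1").is_err()); // duplicate
}
```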
The low-latency repetition is written a little differently, since the name of the loop nonterminal is given explicitly this time:
definitions -> (<L=i_def> "def" Id "=" value ";")*;
The resulting productions are the same, except maybe for the nonterminal name:
1: definitions -> i_def
2: . i_def -> "def" Id "=" value ";" i_def
3: . i_def -> ε
But the listener methods change, too:
#![allow(unused)]
fn main() {
// user-code template:
impl ConfigListener for Listener {
fn init_definitions(&mut self) {}
fn exit_definitions(&mut self, ctx: CtxDefinitions) -> SynDefinitions {
// definitions -> (<L> "def" Id "=" value ";")*
let CtxDefinitions::V1 { star } = ctx;
SynDefinitions()
}
fn init_i_def(&mut self) -> SynIDef {
SynIDef(vec![])
}
fn exit_i_def(&mut self, acc: &mut SynIDef, ctx: CtxIDef) {
// `<L> "def" Id "=" value ";"` iteration in
// `definitions -> ( ►► <L> "def" Id "=" value ";" ◄◄ )*`
let CtxIDef::V1 { id, value } = ctx;
acc.0.push((id, value));
}
// ...
}
}
This variant allows you to define the type of the loop variable and to handle it directly.
- self.init_i_def() is called after any symbol preceding the repetition in the original production is parsed, and before the first loop production is parsed.
- self.exit_i_def(acc, ctx) is called after each iteration is parsed.
There is no indication of the last iteration, but since there is an exit_definitions(…), it’s not really necessary; you can do any post-processing on star in that method if necessary.
Note
In the case of the + repetition, there is an additional flag in the context that indicates the last iteration. Unlike the * repetition, it’s very easy to provide.
CtxIDef::V1 { id: String, value: SynValue, last_iteration: bool }
This concludes the quick tour of the art of loops in LL(1) grammars using Lexigram.
Since definitions has no independent term, the * repetition, with or without <L>, seems a good way to handle that language element. The low-latency version offers the advantage of directly performing operations that would otherwise require post-processing the final vector or dictionary, so it may be the better choice here.
value & Recap
The values that our constants can represent are varied:
- string literals, for file names, tag names
- unsigned integer literals, for indentation levels
- boolean literals, to enable optional features like spans generation
- identifiers, when one constant refers to another
- keywords like stdout, which is a possible output code location
The grammar elements so far are as follows, where some details are not yet defined (the …). This time, we show the full lexicon definitions instead of a simplified grammar:
lexicon Config;
fragment EscChar : '\\' [nrt"\\];
fragment Char : EscChar | ~[\n\r\t"\\];
// punctuation and operators
Equal : '=';
Semicolon : ';';
// keywords
Def : 'def';
Lexer : 'lexer';
Options : 'options';
Parser : 'parser';
Stdout : 'stdout';
// variable terminals
BoolLiteral : 'false' | 'true';
Id : [a-zA-Z][a-zA-Z_0-9]*;
NumLiteral : [0-9]+;
StrLiteral : '"' Char+ '"';
SComment : '/*' .*? '*/' -> skip;
SLineComment : '//' ~[\r\n]* -> skip;
SWhiteSpace : [ \n\r\t]+ -> skip;
grammar Config;
config:
definitions
lexer
parser
options
;
definitions:
(<L=i_def> Def Id Equal value Semicolon)* // any type
;
lexer:
Lexer …
;
parser:
(Parser …)?
;
options:
(Options …)?
;
value:
BoolLiteral
| NumLiteral
| StrLiteral
| Id
| Stdout
;
A few comments on the lexicon:
- The fragment acts as an alias for lexical rules, or parts of lexical rules, that you intend to use multiple times or that you’d rather write separately for the sake of clarity.
- The rules are expressed like simple regular expressions:
  - Matching string literals are between single quotes: 'def'.
  - Sets of characters can be defined between square brackets ([a-z]) or as ellipsis-separated characters ('a'..'z'). The dot . means any character at all.
  - An item can be negated with the prefix ~ (normally used with character sets).
  - Items can be optional (?) or repeated (* or +); non-greedy repetitions are noted *? and +?.
  - Escape codes can be used for usual characters: \n, \r, \t, \\, \', and for UTF-8 codes: \u{3b5}.
- The fixed keywords are defined before variable tokens like Id, which could conflict with them. For example, “def” corresponds to both the definition of Def and Id. In that case, the first defined keyword prevails, which is what we want here. It means we can’t use “def” as an Id, but that’s a usual limitation in programming languages.
- The last three tokens have a -> skip action—everything after -> is called an action. skip discards the token, so the parser will never receive it. In this case, we use that to discard all the comments, white space, and line breaks between the other tokens. They’re placed after everything else so we’re sure those rules have the lowest priority.
The value productions are straightforward. The first four tokens are variable, so the parsed input is available in the context variants:
#![allow(unused)]
fn main() {
pub enum CtxValue {
V1 { boolliteral: String }, // value -> BoolLiteral
V2 { numliteral: String }, // value -> NumLiteral
V3 { strliteral: String }, // value -> StrLiteral
V4 { id: String }, // value -> Id
V5, // value -> "stdout"
}
pub trait ConfigListener {
// ...
fn init_value(&mut self) {}
fn exit_value(&mut self, ctx: CtxValue) -> SynValue;
}
}
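A possible exit_value(…) body could convert the raw token strings into a typed value right away. The CtxValue enum below is copied from the generated code above; the Value enum and the error handling are illustrative, not part of the generated code:

```rust
// Context enum as generated above.
enum CtxValue {
    V1 { boolliteral: String },
    V2 { numliteral: String },
    V3 { strliteral: String },
    V4 { id: String },
    V5,
}

// Hypothetical typed value; names and shapes are illustrative.
#[derive(Debug, PartialEq)]
enum Value {
    Bool(bool),
    Num(u32),
    Str(String),
    Ref(String),
    Stdout,
}

fn to_value(ctx: CtxValue) -> Result<Value, String> {
    match ctx {
        CtxValue::V1 { boolliteral } => Ok(Value::Bool(boolliteral == "true")),
        CtxValue::V2 { numliteral } => numliteral
            .parse()
            .map(Value::Num)
            .map_err(|e| format!("bad number: {e}")),
        // Strip the surrounding double quotes of the string literal.
        CtxValue::V3 { strliteral } => Ok(Value::Str(strliteral.trim_matches('"').to_string())),
        CtxValue::V4 { id } => Ok(Value::Ref(id)),
        CtxValue::V5 => Ok(Value::Stdout),
    }
}

fn main() {
    assert_eq!(to_value(CtxValue::V2 { numliteral: "4".into() }).unwrap(), Value::Num(4));
    assert_eq!(to_value(CtxValue::V1 { boolliteral: "true".into() }).unwrap(), Value::Bool(true));
}
```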
lexer and parser
The lexer and the parser specifications are similar: they include, between brackets, one or more comma-separated options of the following types:
- combined and input: define the location of the lexicon; in the case of combined, it also contains the grammar.
- output: defines the location of the generated code.
- headers: defines header(s) put in front of the generated code, e.g. inner attributes, comments, … (headers: { "#![allow(unused)]", "#![cfg(tests)]" })
- indent: defines the indentation (indent: 4).
An input location is either
- a filename (input: "config.l"), or
- a filename and the name of a tag between square brackets (input: "config.lg" ["lexicon"]).
An output location can be either
- a filename or stdout, or
- a filename or stdout, and the name of a tag between square brackets.
Since both lexer and parser use the same types of options, we’ll use a common io_options rule to parse them.
It would be possible to force the grammar to accept either combined or input, and even to accept combined only for the lexer and not the parser, but it would complicate it significantly. It’s much easier to accept a list of any option and to verify them in the listener.
Similarly, we’ll use value for each argument, even if "indent" only accepts a number (or a constant Id holding a number) and "input" only accepts a string literal. Again, it’s easier to check the correct type in the listener than to overload the grammar.
Here are a few small additions to the lexicon:
Colon : ':';
Comma : ',';
Lbracket : '{';
LSbracket : '[';
Rbracket : '}';
RSbracket : ']';
and to the grammar:
lexer:
Lexer Lbracket io_options Rbracket
;
parser:
(Parser Lbracket io_options Rbracket)?
;
io_options:
io_option (<L=i_io_opt> Comma io_option)*
;
io_option:
Combined Colon value tag_opt
| Input Colon value tag_opt
| Output Colon value tag_opt
| Indent Colon value
| Headers Colon Lbracket value (Comma value)* Rbracket
;
tag_opt:
(LSbracket value RSbracket)?
;
The lexer part is simple:
#![allow(unused)]
fn main() {
pub enum CtxLexer {
/// `lexer -> "lexer" "{" io_options "}"`
V1 { io_options: SynIoOptions },
}
pub trait ConfigListener {
// ...
fn init_lexer(&mut self) {}
fn exit_lexer(&mut self, ctx: CtxLexer) -> SynLexer;
}
}
The parser part adds an alternative used when there is no parser (the listener methods are similar to the lexer’s):
#![allow(unused)]
fn main() {
pub enum CtxParser {
/// `parser -> "parser" "{" io_options "}"`
V1 { io_options: SynIoOptions },
/// `parser -> ε`
V2,
}
}
The io_options rule is a low-latency repetition and has a little surprise in store for us:
#![allow(unused)]
fn main() {
pub enum CtxIoOptions {
/// `io_options -> io_option (<L> "," io_option)*`
V1 { star: SynIIoOpt },
}
pub enum InitCtxIIoOpt {
/// value of `io_option` before `<L> "," io_option` iteration in
/// `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
V1 { io_option: SynIoOption },
}
pub enum CtxIIoOpt {
/// `<L> "," io_option` iteration in
/// `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
V1 { io_option: SynIoOption },
}
pub trait ConfigListener {
// ...
fn init_io_options(&mut self) {}
fn exit_io_options(&mut self, ctx: CtxIoOptions) -> SynIoOptions;
fn init_i_io_opt(&mut self, ctx: InitCtxIIoOpt) -> SynIIoOpt;
fn exit_i_io_opt(&mut self, acc: &mut SynIIoOpt, ctx: CtxIIoOpt);
fn exitloop_i_io_opt(&mut self, acc: &mut SynIIoOpt) {}
}
}
There’s an extra context type InitCtxIIoOpt, and CtxIoOptions is missing the io_option field for the nonterminal in front of the repetition!
What happened here is that Lexigram recognized that io_option ("," io_option)* was a list of io_option items separated by a comma. It changed the wrapper to make our work easier with a first call init_i_io_opt(…) handing over the first value (the one before the repetition). The successive calls exit_i_io_opt(…) give the loop items inside the repetition.
Without that feature, exit_i_io_opt(…) would give items 2 to n, and finally exit_io_options(…) would give item 1 and our accumulator for items 2 to n. If star was a vector and nothing else mattered, we could get away with inserting item 1 at the front. But if we needed to check that all the items are given in the correct order or without duplicates, we’d need extra post-processing, so we’d lose the benefit of a low-latency loop. Here, we get all the items in time and in the right order.
Since it’s an <L> loop, init_i_io_opt(…) returns the first value of the accumulator, and exit_i_io_opt(…) gets a mutable reference to it while it’s still on the stack. This is more efficient than popping the accumulator from the stack, moving it into the context and then out again, and finally pushing the modified value on the stack again.
We find something similar in io_option, although it’s not as conspicuous:
pub enum CtxIoOption {
    /// `io_option -> "combined" ":" value tag_opt`
    V1 { value: SynValue, tag_opt: SynTagOpt },
    /// `io_option -> "input" ":" value tag_opt`
    V2 { value: SynValue, tag_opt: SynTagOpt },
    /// `io_option -> "output" ":" value tag_opt`
    V3 { value: SynValue, tag_opt: SynTagOpt },
    /// `io_option -> "indent" ":" value`
    V4 { value: SynValue },
    /// `io_option -> "headers" ":" "{" value ("," value)* "}"`
    V5 { star: SynIoOption1 },
}
pub enum CtxTagOpt {
    /// `tag_opt -> "[" value "]"`
    V1 { value: SynValue },
    /// `tag_opt -> ε`
    V2,
}
pub trait ConfigListener {
    // ...
    fn init_io_option(&mut self) {}
    fn exit_io_option(&mut self, ctx: CtxIoOption) -> SynIoOption;
}
The value in front of the "headers" repetition is also missing from the context, but there’s no extra context type or method this time. Since it’s a non-<L> repetition, the wrapper takes care of everything and provides us with an ordered vector. If we parse the following input:
headers: { "a", "b", "c" }
we get a vector vec![a, b, c] in star.0, where a is the SynValue for “a”, and so on.
This token-separated-items feature is only available for * repetitions, both <L> and non-<L>, since it wouldn’t make much sense for +. It’s automatically detected for this pattern:
... α (β α)* ...
where
- α is a sequence of symbols, terminals and/or nonterminals.
- β can be any number of terminals, but they can’t hold a value. It also works if there’s no β, but you may as well write α+ in that case.
A proper syntax will likely be added in the future for those repetitions.
options
The general options are different from the io_options of the lexer and the parser, although indent and headers are common. They include, between brackets, one or more comma-separated options:
- headers and indent: define the headers and the indentation level for all the components; the syntax is the same as before.
- libs: defines extra bindings for the wrapper, e.g. libs: { "super::listener_types", "std::collections::HashMap" }.
- nt-value: defines which nonterminals hold a value of the type Syn<nonterminal name>:
  - nt-value: default: the nonterminals defined in the grammar and the <L> nonterminals hold a value (the default behaviour).
  - nt-value: none: all the nonterminals are valueless. They don’t appear in the context variants, and there’s no value to return from the exit_*(…) methods. It’s a good option if all the values can easily be stored in the listener.
  - nt-value: parents: only the nonterminals defined in the grammar hold a value. Nonterminals generated by transformations like left recursion don’t hold any value.
  - nt-value: set { "a", "b", "c" }: only the nonterminals a, b, and c hold a value.
- spans: followed by a boolean value.
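Put together, an options section accepted by these rules might look like this (all values are illustrative):

```
options {
    headers: { "// generated file" },
    indent: 4,
    libs: { "super::listener_types" },
    nt-value: set { "config", "io_option" },
    spans: true
}
```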
We need a few extra keywords in the lexicon:
Libs : 'libs';
NTValue : 'nt-value';
Options : 'options';
Spans : 'spans';
The main grammar rules follow a similar pattern to what we did for the lexer and parser:
options:
(Options Lbracket global_options Rbracket)?
;
global_options:
global_option (<L=i_global_opt> Comma global_option)*
;
global_option:
Headers Colon Lbracket value (Comma value)* Rbracket
| Indent Colon value
| Libs Colon Lbracket value (Comma value)* Rbracket
| NTValue Colon nt_value
| Spans Colon value
;
For global_options, and for "headers" and "libs" in global_option, we’re using the same token-separated-items lists as before. Since there’s nothing new, we won’t look at the generated code here; we’ll see it in the implementation section anyway.
The last nonterminal to define is nt_value. Here are the last keywords we need for the lexer:
Default : 'default';
None : 'none';
Parents : 'parents';
Set : 'set';
And here is the grammar part. We use value for the nonterminal names instead of Id because we want to avoid collisions with the existing keywords. For example, we wouldn’t be able to use nonterminal names like lexer and parser in that list, since the lexer wouldn’t identify them as Id, but as other tokens. There is a way around that with more advanced features of Lexigram, though, as you can see in the gen_typedef / typedef example.
nt_value:
Default
| None
| Parents
| Set Lbracket value (Comma value)* Rbracket
;
Final Grammar
lexicon Config;
fragment EscChar : '\\' [nrt"\\];
fragment Char : EscChar | ~[\n\r\t"\\];
// punctuation and operators
Colon : ':';
Comma : ',';
Equal : '=';
Lbracket : '{';
LSbracket : '[';
Rbracket : '}';
RSbracket : ']';
Semicolon : ';';
// keywords
Combined : 'combined';
Def : 'def';
Default : 'default';
Headers : 'headers';
Indent : 'indent';
Input : 'input';
Lexer : 'lexer';
Libs : 'libs';
None : 'none';
NTValue : 'nt-value';
Options : 'options';
Output : 'output';
Parents : 'parents';
Parser : 'parser';
Set : 'set';
Spans : 'spans';
Stdout : 'stdout';
// variable terminals
BoolLiteral : 'false' | 'true';
Id : [a-zA-Z][a-zA-Z_0-9]*;
NumLiteral : [0-9]+;
StrLiteral : '"' Char+ '"';
SComment : '/*' .*? '*/' -> skip;
SLineComment : '//' ~[\r\n]* -> skip;
SWhiteSpace : [ \n\r\t]+ -> skip;
grammar Config;
config:
definitions
lexer
parser
options
;
definitions:
(<L=i_def> Def Id Equal value Semicolon)*
;
lexer:
Lexer Lbracket io_options Rbracket
;
parser:
(Parser Lbracket io_options Rbracket)?
;
options:
(Options Lbracket global_options Rbracket)?
;
io_options:
io_option (<L=i_io_opt> Comma io_option)*
;
io_option:
Combined Colon value tag_opt
| Input Colon value tag_opt
| Output Colon value tag_opt
| Indent Colon value
| Headers Colon Lbracket value (Comma value)* Rbracket
;
tag_opt:
(LSbracket value RSbracket)?
;
global_options:
global_option (<L=i_global_opt> Comma global_option)*
;
global_option:
Headers Colon Lbracket value (Comma value)* Rbracket
| Indent Colon value
| Libs Colon Lbracket value (Comma value)* Rbracket
| NTValue Colon nt_value
| Spans Colon value
;
value:
BoolLiteral
| NumLiteral
| StrLiteral
| Id
| Stdout
;
nt_value:
Default
| None
| Parents
| Set Lbracket value (Comma value)* Rbracket
;