Parser Generator Reference

Gram is the part of Lexigram that generates a parser.

The parser consumes an input stream of tokens issued by the lexer and verifies that it can be generated by the grammar. As the derivations are applied, the parser calls grammar-related trait methods, which allows the user to retrieve the correspondence between grammar elements (nonterminals and terminals) and the input text.

The grammar production rules are defined in a human-readable language described in this reference. Similarly to the lexer, the code of the “low-level” parser already exists in the lexigram-core crate; its main object is Parser. Gram takes a grammar and generates the data required by Parser and the source code that allows the user to interact with Parser while it parses the input.

The language used to describe the grammar is very similar to ANTLR’s. However, the generated code is very different, since Lexigram generates a predictive, non-recursive LL(1) parser, not a recursive-descent, adaptive LL(*) parser.

Notation

We’re using the following conventions in the grammar:

  • Terminals are written in upper camel case (since they come from the lexicon): Colon, SymEof.
  • Nonterminals are written in lower snake case: rule, rule_name.
  • We usually write each production alternative on one line, indented, with the | in the first column:
    prod:
        prod_alt
    |   prod Or prod_alt
    ;
    

Those conventions are not enforced by Gram.

We also use a “light” notation for the sake of clarity, when it would be cumbersome to show both the lexicon and the grammar. In this notation, we represent the fixed terminals by their string literal, and we don’t put each alternative on a separate line. To distinguish it from the canonical grammar, we use the symbol -> to separate the left nonterminal from its definition.

prod -> prod_alt | prod "|" prod_alt;

Main Concepts

The grammar defines the structure of the language to be parsed. From the grammar, Gram generates the source code of the parser, which comprises the following parts:

  • The parsing table for a predictive top-down parser, the symbol table, and other data.
  • The build_parser() method that creates a Parser instance from those data—what we call the “low-level parser”.
  • The listener, a trait that the user must implement to interact with the Parser instance.
  • The wrapper, the connection between the Parser instance and the listener.
  • Optionally, a template of the nonterminal types, and a template of a basic listener implementation.

Grammar Structure

The grammar begins with a header that defines the name of the parser:

grammar LexiParser;

The rest of the grammar is a set of rules, where the first defines the top level of the grammar—the parser’s starting rule.

Each rule defines a nonterminal, whose name appears on the left-hand side of :. The productions on the right-hand side are strings of symbols (terminals defined in the lexicon and nonterminals defined in the grammar) and operators.

file: 
    header? file_item*
;

file_item:
    option 
|   declaration 
|   rule
;

header:
    Lexicon Id Semicolon
;

declaration:
    Mode Id Semicolon
;

option:
    Channels Lbracket Id (Comma Id)* Rbracket
;

// ...

A normalized grammar rule is made of one or several production alternatives. The only operator allowed in a normalized rule is the “or” symbol |, which separates the alternatives:

file_item:
    option 
|   declaration 
|   rule
;

Gram accepts non-normalized grammars, where other operators can be used:

  • * and + for repetitions of an item:
    actions: action (Comma action)*;
    
  • ? for an optional item:
    file: header? file_item*;
    
  • ( and ) for grouping items:
    message: (Note | Warning | Error) Message;
    
  • <...> for attributes that guide the parser generation:
    expr:
        Sub expr
    |   expr <R> Exp expr          // right-associative
    |   expr (Mul | <P> Div) expr  // same precedence
    |   expr (Add | <P> Sub) expr
    |   Id | Num
    ;
    

The original grammar undergoes a series of transformations to obtain a normalized LL(1) grammar, before Gram generates the parsing table and the source code of the listener and the wrapper.

Listener

The listener is mainly a trait that declares methods for each rule. The user implements those trait methods with the actions that must be performed when the rules are processed during the parsing. To some extent, it corresponds to the source code written on the right-hand side of grammar rules in some parser generators like Yacc and Bison.

Each nonterminal (rule name) can be associated with a value of a user-defined type. The parser calls a trait method bound to each nonterminal once its rule has been fully parsed, and the user implementation of the method returns the nonterminal value based on the given context, which includes the values of the rule items: other nonterminals and variable tokens like string and numerical literals.

As an example, let’s consider a simple lexicon and grammar. Id and Num are variable tokens representing an identifier name and a numerical value, respectively, and the grammar defines two rules for the nonterminals s and val:

// lexicon file
Equal   : "=";                // fixed token
Exit    : "exit";             // fixed token
Return  : "return";           // fixed token
Id      : [a-z][a-z0-9]*;     // variable token
Num     : [1-9][0-9]*;        // variable token
 
// grammar file
s:   
    Id Equal val 
|   Exit 
|   Return val
;

val: 
    Id 
|   Num
;

Init and Exit Methods

Gram generates the following listener trait, which the user must implement on a listener object:

#![allow(unused)]
fn main() {
pub trait TestListener {
    // ...
    fn init_s(&mut self) {}
    fn exit_s(&mut self, ctx: CtxS) -> SynS;
    fn init_val(&mut self) {}
    fn exit_val(&mut self, ctx: CtxVal) -> SynVal;
}
}

It also generates the following types, used in the callback methods:

#![allow(unused)]
fn main() {
pub enum CtxS {
    /// `s -> Id "=" val`
    V1 { id: String, val: SynVal },
    /// `s -> "exit"`
    V2,
    /// `s -> "return" val`
    V3 { val: SynVal },
}

pub enum CtxVal {
    /// `val -> Id`
    V1 { id: String },
    /// `val -> Num`
    V2 { num: String },
}
}

To make it easier to read, we omit the Rust attributes generated for each type, typically #[derive(Debug)], which might be necessary for the user.

For simple rules like the ones above, a first call is made when the next rule is determined by the parser; for example, if the “return” token is received when s is one of the potential next nonterminals, the rule s -> Return val is selected and the init_s(&mut self) method is called. It can be used to initialize any data before the rule is actually parsed, and thus before any subrule, like val, is triggered. There are default empty implementations for the initialization methods, so they’re optional.

Another call is made once the rule has been entirely parsed, including all the nonterminals it contains (here, val). This second call gives all the necessary information about what has been parsed in a context object, which specifies which rule alternative was parsed and the values of the relevant nonterminals and tokens. In the example above, the exit_s(&mut self, ctx: CtxS) method is called with ctx equal to CtxS::V3 { val }, where val is the value of the nonterminal.

If the parsed text is “return 5”, the following calls are made:

#![allow(unused)]
fn main() {
listener.init_s();   // "return" selected the rule `s -> Return val`
listener.init_val(); // "5" selected the rule `val -> Num`
let val = listener.exit_val(CtxVal::V2 { num: "5".to_string() });
let s = listener.exit_s(CtxS::V3 { val });
}

Each nonterminal can have a value, but it’s optional. This choice is specified in the options given to Lexigram when it generates the parser source code. If the nonterminal s has a value, exit_s is generated with a return value of type SynS, this type being defined by the user. If s has no value, exit_s has no return value and the trait definition includes a default empty implementation.

Other General Methods

There are a few other methods that the parser calls in some circumstances:

#![allow(unused)]
fn main() {
pub trait TestListener {
    fn check_abort_request(&self) -> bool { false }
    fn get_mut_log(&mut self) -> &mut impl Logger;
    fn exit(&mut self, _s: SynS) {}
    // ...
}
}

The check_abort_request(&self) method is called after each init and exit call. It can be used to return true if an error is detected; for example, if Num cannot be parsed as a number or Id hasn’t been declared. This requests the parser to abort its parsing procedure immediately. It’s the user’s choice to use this request or to manage the errors on their own, but it’s often easier to abort if there’s a risk that an undefined symbol causes too many problems down the line.

A simple way to use the abort request is to set an abort flag in the listener object and to return its value in that method:

impl TestListener for MyTestListener {
    fn check_abort_request(&self) -> bool {
        self.abort
    }
    
    // ...
}
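For example, the abort flag can be set from an exit method when a token fails to convert. The following is a minimal, self-contained sketch: the CtxVal definition mirrors the example above, but the listener type, the abort field, and the choice of i64 for SynVal are assumptions for illustration.

```rust
// Hypothetical sketch: setting an abort flag when `Num` fails to
// parse as an integer. `MyTestListener` and `SynVal = i64` are
// assumptions, not generated names.
pub enum CtxVal {
    V1 { id: String },
    V2 { num: String },
}

pub type SynVal = i64;

pub struct MyTestListener {
    pub abort: bool,
}

impl MyTestListener {
    // Corresponds to the generated `exit_val` trait method.
    pub fn exit_val(&mut self, ctx: CtxVal) -> SynVal {
        match ctx {
            CtxVal::V1 { id: _ } => 0, // look up the identifier's value here
            CtxVal::V2 { num } => num.parse::<i64>().unwrap_or_else(|_| {
                self.abort = true; // request an abort on the next check
                0
            }),
        }
    }

    // Corresponds to the generated `check_abort_request` trait method.
    pub fn check_abort_request(&self) -> bool {
        self.abort
    }
}
```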

For practical reasons, the listener must own a log object, in which all the warnings and errors will be recorded. The get_mut_log(&mut self) method returns a reference to that log object whenever the parser needs to record something.

pub struct MyTestListener {
    log: BufLog,
    
    // ...
}

impl TestListener for MyTestListener {
    fn get_mut_log(&mut self) -> &mut impl Logger {
        &mut self.log
    }
    
    // ...
}

Finally, the exit(&mut self, s: SynS) method is called once the top rule exits, and gives the value of its nonterminal, if it exists, as a SynS type. There is a default empty implementation, which makes it optional. It can be used to store the final result of the parsing in the listener object or to call any post-processing required at that level, like finalizing the AST.
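As a sketch, a listener can simply store the final value in one of its fields. The SynS wrapper and the result field are assumptions for illustration:

```rust
// Hypothetical sketch: storing the final result of the parsing in the
// listener when the top rule exits.
pub struct SynS(pub String);

pub struct MyTestListener {
    pub result: Option<SynS>,
}

impl MyTestListener {
    // Mirrors the generated `exit` method, called once at the end.
    pub fn exit(&mut self, s: SynS) {
        self.result = Some(s);
    }
}
```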

So the entire series of calls, if the parsed text is “return 5”, is actually closer to the following sequence:

#![allow(unused)]
fn main() {
listener.init_s();   // "return" selected the rule `s -> Return val`
listener.init_val(); // "5" selected the rule `val -> Num`
let val = listener.exit_val(CtxVal::V2 { num: "5".to_string() });
if listener.check_abort_request() { return Err(ParserError::AbortRequest) }
let s = listener.exit_s(CtxS::V3 { val });
if listener.check_abort_request() { return Err(ParserError::AbortRequest) };
listener.exit(s);
return Ok(());
}

Wrapper

The wrapper, which is entirely implemented in generated source code, calls the listener methods and keeps track of the data: nonterminal and terminal values, and position of the symbols in the original input text.

#![allow(unused)]
fn main() {
    pub struct Wrapper<T> {
        listener: T,                // user's listener instance
        stack: Vec<EnumSynValue>,   // nonterminal values
        stack_t: Vec<String>,       // terminal values
        stack_span: Vec<PosSpan>,   // positions
        // ...
    }

    impl<T: TestListener> Wrapper<T> {
        pub fn new(listener: T, verbose: bool) -> Self {
            Wrapper { 
              listener,
              // ...
            }
        }
      
        pub fn get_listener(&self) -> &T {
            &self.listener
        }

        pub fn get_listener_mut(&mut self) -> &mut T {
            &mut self.listener
        }

        pub fn give_listener(self) -> T {
            self.listener
        }
        // ...
    }

    impl<T: TestListener> ListenerWrapper for Wrapper<T> {
        fn switch(&mut self, call: Call, nt: VarId, alt_id: AltId, t_data: Option<Vec<String>>) {
            // ...
            match call {
                Call::Enter => {
                    match nt {
                        0 => self.listener.init_s(),                // s
                        1 => self.listener.init_val(),              // val
                        _ => panic!("unexpected enter nonterminal id: {nt}")
                    }
                }
                Call::Exit => {
                    match alt_id {
                        0 |                                         // s -> Id "=" val
                        1 |                                         // s -> "exit"
                        2 => self.exit_s(alt_id),                   // s -> "return" val
                        3 |                                         // val -> Id
                        4 => self.exit_val(alt_id),                 // val -> Num
                        _ => panic!("unexpected exit alternative id: {alt_id}")
                    }
                }
                // ...
            }
        }
    }
}

The listener is given to the wrapper before parsing the input; it can be recovered later if necessary with wrapper.give_listener(). A mutable reference to that wrapper is then given to the parse(…) method of the Parser instance.

The top-level user parser can be summarized by the following steps:

#![allow(unused)]
fn main() {
pub fn parse(text: &str) -> Result<BufLog, BufLog> {
    // instantiates the lexer and low-level parser:
    let mut lexer = build_lexer();
    let mut parser = build_parser();
  
    // instantiates the user listener and the wrapper:
    let listener = MyTestListener::new();
    let mut wrapper = Wrapper::new(listener, VERBOSE_WRAPPER);
  
    // creates a stream from the input text and attaches it to the lexer:
    let stream = CharReader::new(Cursor::new(text));
    lexer.attach_stream(stream);
  
    // passes the tokens from the lexer to the parser:
    let tokens = lexer.tokens().keep_channel0();
  
    // parses the tokens using the wrapper and the listener:
    let result = parser.parse_stream(&mut wrapper, tokens);
  
    // handles the results:
    let MyTestListener { mut log, .. } = wrapper.give_listener();
    if let Err(e) = result {
        log.add_error(e.to_string());
    }
    if log.has_no_errors() {
        Ok(log)
    } else {
        Err(log)
    }
}
}

Nonterminal Values

Nonterminals can have a value, though this isn’t mandatory. When a nonterminal has no value, the generated code still includes init and exit methods, but the nonterminals are not present in the contexts of other nonterminals that include them.

For example, consider the following rules:

a -> b c | c
b -> Op c
c -> Id

where Op and Id are variable tokens that contain string data.

If a, b, and c have a value, the following code is generated:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> b c`
    V1 { b: SynB, c: SynC },
    /// `a -> c`
    V2 { c: SynC },
}

pub enum CtxB {
    /// `b -> Op c`
    V1 { op: String, c: SynC },
}

pub enum CtxC {
    /// `c -> Id`
    V1 { id: String },
}

pub trait TestListener {
    // ... (other general methods)
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
    fn init_b(&mut self) {}
    fn exit_b(&mut self, ctx: CtxB) -> SynB;
    fn init_c(&mut self) {}
    fn exit_c(&mut self, ctx: CtxC) -> SynC;
}
}

As expected, the context given for the first alternative of a includes the data for the nonterminals b and c, with user-defined types SynB and SynC, while the second alternative only includes c. Similarly, CtxB includes a field for the c nonterminal.
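A minimal sketch of a listener implementation for these rules, with hypothetical Syn types wrapping strings (the real trait also declares the general methods shown earlier, and the Rebuilder name is an assumption):

```rust
// Illustrative sketch: rebuilding the parsed text from the context
// values. The Syn types and the formatting are assumptions.
pub struct SynA(pub String);
pub struct SynB(pub String);
pub struct SynC(pub String);

pub enum CtxA {
    V1 { b: SynB, c: SynC },
    V2 { c: SynC },
}
pub enum CtxB {
    V1 { op: String, c: SynC },
}
pub enum CtxC {
    V1 { id: String },
}

pub struct Rebuilder;

impl Rebuilder {
    pub fn exit_a(&mut self, ctx: CtxA) -> SynA {
        match ctx {
            CtxA::V1 { b, c } => SynA(format!("{} {}", b.0, c.0)),
            CtxA::V2 { c } => SynA(c.0),
        }
    }
    pub fn exit_b(&mut self, ctx: CtxB) -> SynB {
        let CtxB::V1 { op, c } = ctx;
        SynB(format!("{}{}", op, c.0))
    }
    pub fn exit_c(&mut self, ctx: CtxC) -> SynC {
        let CtxC::V1 { id } = ctx;
        SynC(id)
    }
}
```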

If c has no value, the code becomes:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> b c`
    V1 { b: SynB },
    /// `a -> c`
    V2,
}

pub enum CtxB {
    /// `b -> Op c`
    V1 { op: String },
}

pub enum CtxC {
    /// `c -> Id`
    V1 { id: String },
}

pub trait TestListener {
    // ... (other general methods)
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
    fn init_b(&mut self) {}
    fn exit_b(&mut self, ctx: CtxB) -> SynB;
    fn init_c(&mut self) {}
    fn exit_c(&mut self, _ctx: CtxC) {}
}
}

The variant CtxA::V2 is empty since there is no data, but it still exists so that the user knows which alternative of the a rule was parsed.

The trait still allows for callbacks before and after c is parsed, but there is no return value. This also makes the implementation of exit_c optional, so a default empty implementation is provided.

If both b and c have no value, the code becomes:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> b c`
    V1,
    /// `a -> c`
    V2,
}

pub enum CtxB {
    /// `b -> Op c`
    V1 { op: String },
}

pub enum CtxC {
    /// `c -> Id`
    V1 { id: String },
}

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
    fn init_b(&mut self) {}
    fn exit_b(&mut self, _ctx: CtxB) {}
    fn init_c(&mut self) {}
    fn exit_c(&mut self, _ctx: CtxC) {}
}
}

Here, all the variants of the CtxA context are empty. If a only had one alternative, the context would still have one empty variant, even though it carries no information at all. It would be generated to keep the API consistent, but it could be ignored entirely.

When a nonterminal has a value, the user must define its type. By default, the type name is Syn<nonterminal>, where <nonterminal> is the upper camel case of the nonterminal’s name. For example, a is associated with the SynA type, which can be a genuine type or an alias.

Lexigram generates a template on demand, which can be copied the first time or when the grammar is changed:

/// User-defined type for `a`
#[derive(Debug, PartialEq)]
pub struct SynA();

Finally, Lexigram takes care of generating the types of the nonterminals it adds when transforming the grammar. We point this out in the generated code shown in the following sections.

Grammar Transformations

Lexigram accepts non-LL(1) grammars if it’s able to transform them into LL(1). It tries to provide some degree of transparency by not showing the transformation in the listener trait and types, but some patterns require specific listener calls.

The transformations done by Lexigram are:

  • ? operator: a -> X? becomes a -> X | ε
  • * operator: a -> X* becomes a -> a_1, a_1 -> X a_1 | ε
  • + operator: a -> X+ becomes a -> a_1, a_1 -> X a_1 | X
  • left recursion:
    • a -> a X | Y becomes a -> Y a_1, a_1 -> X a_1 | ε
    • a -> a X a | Y and more complex ambiguities, typically used in expression parsing, are transformed with the Clarke method, which is detailed later
  • left factorization: a -> A B | A C becomes a -> A a_1, a_1 -> B | C

A few rules have two implementations: normal or low-latency (<L> attribute). First, in the + and * repetitions:

  • a -> X* becomes a -> a_1, a_1 -> X a_1 | ε, but the a_1 iterator nonterminal is not visible to the user. The generated code gathers all the values and puts them in the context of the exit_a method.
  • a -> (<L=i> X)* becomes a -> i, i -> X i | ε, as above, but a callback method is generated for i so that the user can manage the iterations manually, almost as if the user had written the transformed rules in the original grammar. Note that the iterator nonterminal is explicitly named by the user.
  • Similar transformations apply to a -> X+ and a -> (<L=i> X)+.

Secondly, in right-recursive rules:

  • a -> X a | Y behaves as expected: since it’s right-recursive, a text “XXY” is decomposed as a -> X (a -> X (a -> Y)). The first exit_a call occurs at the end of the iterations, with a -> Y, and the next two calls backtrack on the second “X”, then on the first “X”. This means that the first call occurs after the parser has already parsed the entire sequence and put all the intermediate values on the stack, which it empties one element at a time when calling exit_a.
  • a -> X <L> a | Y offers a shortcut that avoids the latency of the normal behaviour by calling exit_a on each newly found value, somewhat as if it were a left-recursive rule. The first exit_a call occurs on the first “X”, the second on the next “X”, and the last on the “Y”. It is then up to the user to rebuild the AST in the desired order.

The purpose of the <L> attribute is to make it possible to parse a potentially infinite stream without storing all the intermediate items, or to produce immediate output before reaching the end of the sequence in the stream.

Repetitions With *

a -> A B* C

is transformed into

a -> A a_1 C
a_1 -> B a_1 | ε    

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A B* C`
    V1 { a: String, star: SynA1, c: String },
}

pub struct SynA1(pub Vec<String>);

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}

Sequence of the operations:

  • init_a(…) is called before the rule is parsed.
  • exit_a(…) is called after a -> A B* C is parsed. The repetition is available in the star field of the ctx: CtxA argument. It wraps a vector containing all the parsed values.
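A sketch of an exit_a implementation for this rule, with the generated types copied from above (the bracketed formatting itself is purely illustrative):

```rust
// Sketch: consuming the `star` vector produced by a `B*` repetition.
pub struct SynA(pub String);
pub struct SynA1(pub Vec<String>);

pub enum CtxA {
    V1 { a: String, star: SynA1, c: String },
}

pub fn exit_a(ctx: CtxA) -> SynA {
    let CtxA::V1 { a, star, c } = ctx;
    // star.0 holds the value of every parsed `B`, in order
    SynA(format!("{}[{}]{}", a, star.0.join(","), c))
}
```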

Repetitions With +

a -> A B+ C

is transformed into

a -> A a_1 C
a_1 -> B a_2
a_2 -> a_1 | ε    

The generated code is very similar to the * repetition:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A B+ C`
    V1 { a: String, plus: SynA1, c: String },
}

/// Computed `B+` array in `a -> A  ►► B+ ◄◄  C`
pub struct SynA1(pub Vec<String>);

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}

Token-Separated Items

A special case of the * repetition recognized by Lexigram is a list of identical items separated by one or several fixed tokens:

a -> "var" Id ("," Id)* ";"; 

is transformed into

a -> "var" Id a_1 ";"
a_1 -> "," Id a_1    
a_1 -> ε

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> "var" Id ("," Id)* ";"`
    V1 { star: SynA1 },
}

pub struct SynA1(pub Vec<String>);

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
}
}

The differences from a normal * repetition are:

  • The first Id item (before ("," Id)*) is moved into the vector, as star.0[0].
  • That first item isn’t present in the CtxA::V1 variant.

Repetitions With <L>*

a -> A (<L=i> B)* C

is transformed into

a -> A i C
i -> B i | ε    

Note that the nonterminal used to represent the iterations must be defined by the user, since its type is user-defined (here, i of type SynI).

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A (<L> B)* C`
    V1 { a: String, star: SynI, c: String },
}

pub enum CtxI {
    /// `<L> B` iteration in `a -> A ( ►► <L> B ◄◄ )* C`
    V1 { b: String },
}

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
    fn init_i(&mut self) -> SynI;
    fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
    fn exitloop_i(&mut self, acc: &mut SynI) {}
}
}
Sequence of the operations:

  • init_a(…) is called before the rule a is parsed.
  • init_i(…) is called before the rule i is parsed for the first time. This is a good place to initialize the whole series; the method must return the accumulator variable that will be updated at each iteration.
  • exit_i(…) is called after parsing each item of the repetition, if there are any items. It receives the current value of the accumulator acc and the values of the parsed items in the context variant, here b. The accumulator can be updated thanks to the mutable reference.
  • exitloop_i(…) is called after the last iteration. It can be used if the accumulator requires post-processing once the whole series has been parsed.
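As a sketch, an accumulator that sums numeric B items could look like this. The Summer struct, the i64 wrapper for SynI, and the omission of the spans arguments are simplifications for illustration:

```rust
// Hypothetical sketch of the <L>* accumulator callbacks: summing the
// items of the repetition as they are parsed.
pub struct SynI(pub i64);

pub enum CtxI {
    V1 { b: String },
}

pub struct Summer;

impl Summer {
    // Called once, before the first iteration: create the accumulator.
    pub fn init_i(&mut self) -> SynI {
        SynI(0)
    }

    // Called after each parsed item: fold it into the accumulator.
    pub fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI) {
        let CtxI::V1 { b } = ctx;
        acc.0 += b.parse::<i64>().unwrap_or(0);
    }

    // Called once after the last iteration: post-processing hook.
    pub fn exitloop_i(&mut self, _acc: &mut SynI) {}
}
```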

Repetitions With <L>+

a -> A (<L=i> B)+ C

is transformed into

a -> A i C
i -> B a_1
a_1 -> i | ε  

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A (<L> B)+ C`
    V1 { a: String, plus: SynI, c: String },
}

pub enum CtxI {
    /// `<L> B` iteration in `a -> A ( ►► <L> B ◄◄ )+ C`
    V1 { b: String, last_iteration: bool },
}

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
    fn init_i(&mut self) -> SynI;
    fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
}
}

The differences from a repetition with <L>* are:

  • A last_iteration field that flags the last iteration, when the a_1 -> ε alternative is reached.
  • No exitloop_i(…) method, which has been replaced by the flag above because the last iteration contains the same data as the previous iterations.
  • exit_i(…) is sure to be called at least once.
  • The resulting field in the CtxA::V1 variant is named plus instead of star.
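The last_iteration flag is handy when the last item must be treated differently. This sketch joins items with a separator, with no trailing separator after the last one (the string-building and the free function standing in for the listener method are illustrative):

```rust
// Illustrative sketch: using `last_iteration` in a <L>+ repetition.
pub struct SynI(pub String);

pub enum CtxI {
    V1 { b: String, last_iteration: bool },
}

pub fn exit_i(acc: &mut SynI, ctx: CtxI) {
    let CtxI::V1 { b, last_iteration } = ctx;
    acc.0.push_str(&b);
    if !last_iteration {
        acc.0.push('+'); // separator between items, none after the last one
    }
}
```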

Token-Separated Items With <L>

A list of identical items separated by one or several fixed tokens is also recognized in a <L>* repetition:

a -> "var" Id (<L=i> "," Id)* ";"; 

is transformed into

a -> "var" Id i ";"
i -> "," Id i    
i -> ε

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> "var" Id (<L> "," Id)* ";"`
    V1 { star: SynI },
}

pub enum InitCtxI {
    /// value of `Id` before `<L> "," Id` iteration in 
    ///   `a -> "var" Id ( ►► <L> "," Id ◄◄ )* ";"`
    V1 { id: String },
}

pub enum CtxI {
    /// `<L> "," Id` iteration in 
    ///   `a -> "var" Id ( ►► <L> "," Id ◄◄ )* ";"`
    V1 { id: String },
}
pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
    fn init_i(&mut self, ctx: InitCtxI, spans: Vec<PosSpan>) -> SynI;
    fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
    fn exitloop_i(&mut self, acc: &mut SynI) {}
}
}

The differences from a normal repetition with <L>* are:

  • The first Id item (before (<L> "," Id)*) is given in the ctx: InitCtxI argument of the init_i(…) method.
  • That first item isn’t present in the exit method context CtxI.

The subsequent Id items (in (<L> "," Id)*) are given in the ctx: CtxI argument of the exit_i(…) method, as usual.

Right Recursion

expr -> Id "." expr | "(" Num ")"

Right-recursive rules don’t need to be modified. The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxExpr {
    /// `expr -> Id "." expr`
    V1 { id: String, expr: SynExpr },
    /// `expr -> "(" Num ")"`
    V2 { num: String },
}

pub trait TestListener {
    // ...
    fn init_expr(&mut self) {}
    fn exit_expr(&mut self, ctx: CtxExpr) -> SynExpr;
}
}

The independent alternative is given first by exit_expr(…), then the recursive alternatives are given in reverse order, according to the structure of the rule.

Id0.Id1(5)

yields the following calls:

  • init_expr()
  • exit_expr(ctx = CtxExpr::V2 { num: "5" }) -> x
  • exit_expr(ctx = CtxExpr::V1 { id: "Id1", expr: x }) -> y
  • exit_expr(ctx = CtxExpr::V1 { id: "Id0", expr: y }) -> z
  • (the production that includes expr receives the value z)
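A sketch showing this order in action: each exit_expr call wraps the value built by the previous (inner) call. The string-building is purely illustrative.

```rust
pub struct SynExpr(pub String);

pub enum CtxExpr {
    /// `expr -> Id "." expr`
    V1 { id: String, expr: SynExpr },
    /// `expr -> "(" Num ")"`
    V2 { num: String },
}

// The innermost alternative exits first, so the dotted path is rebuilt
// right to left by prepending each `Id`.
pub fn exit_expr(ctx: CtxExpr) -> SynExpr {
    match ctx {
        CtxExpr::V1 { id, expr } => SynExpr(format!("{}.{}", id, expr.0)),
        CtxExpr::V2 { num } => SynExpr(format!("({})", num)),
    }
}
```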

Right Recursion With <L>

expr -> <L> Id "." expr | "(" Num ")"

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxExpr {
    /// `expr -> <L> Id "." expr`
    V1 { id: String },
    /// `expr -> "(" Num ")"`
    V2 { num: String },
}

pub trait TestListener {
    // ...
    fn init_expr(&mut self) -> SynExpr;
    fn exit_expr(&mut self, acc: &mut SynExpr, ctx: CtxExpr, spans: Vec<PosSpan>);
}
}

Contrary to the * and + repetitions, the user mustn’t define a name in the attribute, since there is no transformation adding another nonterminal.

Since this form uses an accumulator, it is necessary to initialize it with init_expr(…).

This <L> version gives the values in the incoming order:

Id0.Id1(5)

yields the following calls:

  • init_expr() -> w
  • exit_expr(*acc: w, ctx = CtxExpr::V1 { id: "Id0" }), *acc = x
  • exit_expr(*acc: x, ctx = CtxExpr::V1 { id: "Id1" }), *acc = y
  • exit_expr(*acc: y, ctx = CtxExpr::V2 { num: "5" }), *acc = z
  • (the production that includes expr receives the value z)
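A sketch of the accumulator version (string-building illustrative; the spans argument is omitted): it produces the same value as the plain right-recursive version, but builds it left to right as the tokens arrive.

```rust
pub struct SynExpr(pub String);

pub enum CtxExpr {
    /// `expr -> <L> Id "." expr`
    V1 { id: String },
    /// `expr -> "(" Num ")"`
    V2 { num: String },
}

// Called first: create the accumulator (here, an empty string).
pub fn init_expr() -> SynExpr {
    SynExpr(String::new())
}

// Called on each alternative, in incoming (left-to-right) order.
pub fn exit_expr(acc: &mut SynExpr, ctx: CtxExpr) {
    match ctx {
        CtxExpr::V1 { id } => {
            acc.0.push_str(&id);
            acc.0.push('.');
        }
        CtxExpr::V2 { num } => {
            acc.0.push('(');
            acc.0.push_str(&num);
            acc.0.push(')');
        }
    }
}
```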

Left Recursion

e -> f | e "." Id

is transformed into

e -> f e_1       
e_1 -> "." Id e_1 | ε         

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxE {
    /// `e -> f`
    V1 { f: SynF },
    /// `e -> e "." Id`
    V2 { e: SynE, id: String },
}

pub trait TestListener {
    // ...
    fn init_e(&mut self) {}
    fn exit_e(&mut self, ctx: CtxE) -> SynE;
    fn exitloop_e(&mut self, _e: &mut SynE) {}
}
}

This is similar to an <L> repetition, except for the initialization. The accumulator is not returned by init_e; instead, it’s created at the first exit_e call, when a non-recursive alternative of the rule is matched; in this case, e -> f.

  • init_e(…) is called before the rule is parsed.
  • exit_e(…) is called after each variant is parsed.
    • The first time, it’s a non-recursive alternative; here, e -> f. It must return the initial value of the accumulator (which may be its final value if there are no iterations).
    • If there are iterations of the recursive alternative, exit_e(…) is called for each iteration, with the current value of the accumulator in context and the parsed values of that iteration. It must return the updated accumulator.
  • exitloop_e(…) is called after the last iteration. It can be used if the accumulator requires post-processing once the whole series has been parsed.
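These steps can be sketched as follows (Syn types wrapping strings are assumptions; the free function stands in for the listener method):

```rust
pub struct SynF(pub String);
pub struct SynE(pub String);

pub enum CtxE {
    /// `e -> f`
    V1 { f: SynF },
    /// `e -> e "." Id`
    V2 { e: SynE, id: String },
}

pub fn exit_e(ctx: CtxE) -> SynE {
    match ctx {
        // first call: the non-recursive alternative seeds the accumulator
        CtxE::V1 { f } => SynE(f.0),
        // each iteration: extend the accumulator received in the context
        CtxE::V2 { e, id } => SynE(format!("{}.{}", e.0, id)),
    }
}
```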

Left Factorization

a -> A | A B | A B C | A B D | E

which can be rewritten as

a -> A (ε | B (ε | C | D)) | E
              ------------ a2
       ------------------- a1

is transformed into

a -> A a_1  
a -> E      
a_1 -> B a_2
a_1 -> ε    
a_2 -> C    
a_2 -> D    
a_2 -> ε    

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A`
    V1 { a: String },
    /// `a -> A B`
    V2 { a: String, b: String },
    /// `a -> A B C`
    V3 { a: String, b: String, c: String },
    /// `a -> A B D`
    V4 { a: String, b: String, d: String },
    /// `a -> E`
    V5 { e: String },
}

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}

Advanced Transformations

Ambiguous Left and Right Recursion

Rules that are both left- and right-recursive are ambiguous in an LL(1) grammar, since they define neither the precedence nor the associativity. The typical use of such rules is for expressions containing binary operators.

Lexigram automatically transforms those rules using the Clarke method, which preserves the precedence and left-/right-associativity. The same rule can contain binary and prefixed / suffixed unary operators (e.g. expr "+" expr, "-" expr and expr "?", respectively).

The precedence is determined by the order in which the operators are defined in the rule:

  • The alternatives defined first have higher precedence; in this example, * has higher precedence than +

    expr -> expr "*" expr | expr "+" expr | Num

  • The <P> attribute indicates an operator with the same level of precedence as the previous one on its left. For example, it can be used in a rule like

    expr -> expr "+" expr | <P> expr "-" expr | Id | Num

By default, binary operators are left-associative. The <R> attribute indicates a right-associative operator. For example,

expr -> <R> expr "^" expr | expr "*" expr | expr "/" expr | Num

The attributes can be placed anywhere in the rule alternative, but we usually place them at the beginning to be consistent.

The rule

e -> e "*" e | <R> e "!" e | e "+" e | Num

is transformed into

e -> e_4 e_1      
e_1 -> "*" e_4 e_1 | "!" e_2 e_1 | "+" e_2 e_1 | ε          
e_2 -> e_4 e_3    
e_3 -> "*" e_4 e_3 | "!" e_2 e_3 | ε          
e_4 -> Num        

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxE {
    /// `e -> e "*" e`
    V1 { e: [SynE; 2] },
    /// `e -> <R> e "!" e`
    V2 { e: [SynE; 2] },
    /// `e -> e "+" e`
    V3 { e: [SynE; 2] },
    /// `e -> Num`
    V4 { num: String },
}

pub trait TestListener {
    // ...
    fn init_e(&mut self) {}
    fn exit_e(&mut self, ctx: CtxE) -> SynE;
}
}

Since the same nonterminal appears twice in each alternative, the two occurrences are grouped into an array e: [SynE; 2], where e[0] is the leftmost nonterminal and e[1] is the rightmost one.

The first variant, V1, corresponds to the operator with the highest precedence; the last operator variant (V3 here) corresponds to the lowest.

Repetitions of Longer Patterns

a -> (b A b B A)+

is transformed into

a -> a_1            
a_1 -> b A b B A a_2
a_2 -> a_1 | ε            

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> (b A b B A)+`
    V1 { plus: SynA1 },
}

/// Computed `(b A b B A)+` array in `a ->  ►► (b A b B A)+ ◄◄ `
pub struct SynA1(pub Vec<SynA1Item>);

/// `b A b B A` item in `a -> ( ►► b A b B A ◄◄ )+`
pub struct SynA1Item { pub b: [SynB; 2], pub a: [String; 2], pub b1: String }

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}

Compared to the base * repetition, the type of the items in the list, SynA1Item, is a struct rather than a string. Identical symbols have been grouped into arrays:

  • b: [SynB; 2] for the two nonterminals b.
  • a: [String; 2] for the two tokens A, each containing the matched string (note that the field name is lowercase, even though it represents a token).

Also, the token B is represented by b1, since the name b was already taken. The fields appear in the order of the first occurrence of the symbol they represent in the rule; here, b comes first, A second, and B third. Their type is another indication that helps tell them apart: a token is represented by a string, while a nonterminal has a custom type whose name begins with “Syn”, for synthesized.

Nested Repetitions

a -> (A (b ",")* ";")* C

is transformed into

a -> a_2 C          
b -> B              
a_1 -> b "," a_1 | ε            
a_2 -> A a_1 ";" a_2 | ε            

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> (A (b ",")* ";")* C`
    V1 { star: SynA2, c: String },
}

/// Computed `(b ",")*` array in `a -> (A  ►► (b ",")* ◄◄  ";")* C`
pub struct SynA1(pub Vec<SynB>);

/// Computed `(A (b ",")* ";")*` array in `a ->  ►► (A (b ",")* ";")* ◄◄  C`
pub struct SynA2(pub Vec<SynA2Item>);

/// `A (b ",")* ";"` item in `a -> ( ►► A (b ",")* ";" ◄◄ )* C`
pub struct SynA2Item { pub a: String, pub star: SynA1 }

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}

When a repetition is nested in another, the resulting vector becomes a field of the outer repetition’s type. Here, SynA2Item::star of type SynA1 contains a Vec<SynB>, where SynB contains the only data of the inner loop, which is the nonterminal b. Each generated type has a doc comment that emphasizes the part of the original rule it represents between “►►” and “◄◄” arrows.

It’s also possible to nest any combination of non-<L> and <L> repetitions. The principle is always the same: a non-<L> repetition is represented by a Vec of values gathered automatically by the parser, while an <L> repetition is managed manually by the user through the extra init_*(…)/exit_*(…) methods of its generated nonterminal and is represented by a user-defined type.

Alternatives in Repetitions

a -> A (B | b C b B C | E)* F

is transformed into

a -> A a_1 F        
a_1 -> B a_1 | b C b B C a_1 | E a_1 | ε            

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A (B | b C b B C | E)* F`
    V1 { a: String, star: SynA1, f: String },
}

pub enum CtxB {
    /// `b -> D`
    V1 { d: String },
}

/// Computed `(B | b C b B C | E)*` array in `a -> A  ►► (B | b C b B C | E)* ◄◄  F`
pub struct SynA1(pub Vec<SynA1Item>);

pub enum SynA1Item {
    /// `B` item in `a -> A ( ►► B ◄◄  | b C b B C | E)* F`
    V1 { b: String },
    /// `b C b B C` item in `a -> A (B |  ►► b C b B C ◄◄  | E)* F`
    V2 { b: [SynB; 2], c: [String; 2], b1: String },
    /// `E` item in `a -> A (B | b C b B C |  ►► E ◄◄ )* F`
    V3 { e: String },
}

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA) -> SynA;
}
}

Since each iteration can match one of three symbol strings, its type SynA1Item is an enum with three variants, each carrying the corresponding data.

Alternatives in Repetitions with <L>

a -> A (<L=i> (<L=j> b C b B C | D)* E | F)* G

is transformed into:

a -> A i G      
i -> j E i | F i | ε          
j -> b C b B C j | D j | ε          

The generated code is as follows:

#![allow(unused)]
fn main() {
pub enum CtxA {
    /// `a -> A (<L> (<L> b C b B C | D)* E | F)* G`
    V1 { a: String, star: SynI, g: String },
}

pub enum CtxI {
    /// `<L> (<L> b C b B C | D)* E` iteration in `a -> A ( ►► <L> (<L> b C b B C | D)* E ◄◄  | F)* G`
    V1 { star: SynJ, e: String },
    /// `F` iteration in `a -> A (<L> (<L> b C b B C | D)* E |  ►► F ◄◄ )* G`
    V2 { f: String },
}

pub enum CtxJ {
    /// `<L> b C b B C` iteration in `a -> A (<L> ( ►► <L> b C b B C ◄◄  | D)* E | F)* G`
    V1 { b: [SynB; 2], c: [String; 2], b1: String },
    /// `D` iteration in `a -> A (<L> (<L> b C b B C |  ►► D ◄◄ )* E | F)* G`
    V2 { d: String },
}

pub trait TestListener {
    // ...
    fn init_a(&mut self) {}
    fn exit_a(&mut self, ctx: CtxA, spans: Vec<PosSpan>) -> SynA;
    fn init_i(&mut self) -> SynI;
    fn exit_i(&mut self, acc: &mut SynI, ctx: CtxI, spans: Vec<PosSpan>);
    fn exitloop_i(&mut self, acc: &mut SynI) {}
    fn init_j(&mut self) -> SynJ;
    fn exit_j(&mut self, acc: &mut SynJ, ctx: CtxJ, spans: Vec<PosSpan>);
    fn exitloop_j(&mut self, acc: &mut SynJ) {}
}
}

Notes:

  • In the case of <L> + repetitions, a last_iteration boolean field flags the last iteration instead of an exitloop_*(…) method call, as seen before.

Formal Grammar Definition

Here are the lexicon and the grammar of Gram, using the Lexigram syntax. These files can be found in the repository, in the build-stage1 directory.

Gram’s Lexicon

lexicon GramLexer;

fragment COMMENT        : '/*' .*? '*/';
fragment LINECOMMENT    : '//' ~[\r\n]*;
fragment WHITESPACE     : [ \n\r\t]+;
fragment ID             : [a-zA-Z][a-zA-Z_0-9]*;

Comment         : COMMENT                   -> skip;
LineComment     : LINECOMMENT               -> skip;
WhiteSpace      : WHITESPACE                -> skip;

Colon           : ':';
Lparen          : '(';
Or              : '|';
Plus            : '+';
Question        : '?';
Rparen          : ')';
Semicolon       : ';';
Star            : '*';

Grammar         : 'grammar';
SymEof          : 'EOF';

Lform           : '<L' ('=' ID)? '>';
Rform           : '<R>';
Pform           : '<P>';
Greedy          : '<G>';

Id              : ID;

Gram’s Grammar

grammar GramParser;

file: header rules;

header:
    Grammar Id Semicolon
;

rules:
    rule
|   rules rule
;

rule:
    rule_name Colon prod SymEof? Semicolon
;

rule_name:
    Id
;

prod:
    prod_alt
|   prod Or prod_alt
;

prod_alt:
    prod_factor*
;

prod_factor:
    prod_atom (Plus | Star | Question)?
;

prod_atom:
    Id
|   Lform
|   Rform
|   Pform
|   Greedy
|   Lparen prod Rparen
;