
Overview

Developing a parser is generally done in several steps:

  • writing the grammar
  • generating the code
  • implementing the listener
  • integrating it into your top-level parser

Designing the Grammar

Writing the grammar is an important step, and one that shouldn’t be underestimated. If the language you need to parse already exists, there’s not a great deal of latitude, but there is still an infinite number of grammars that can describe that language. If you’re using an LL(1) parser, you’ll need to design an LL(1) grammar; if the language can’t be described by any LL(1) grammar, you must use another kind of parser. Many programming languages can be, though, so hopefully that won’t be the case.

If you’re designing the language, you’ll find that designing the grammar may sometimes push you to adapt the language in order to make it easier (or possible) to parse.

We’ll start with this simplified grammar to show how the parts come together:

s -> Id "=" val | "exit" | "return" val;
val -> Id | Num;
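To see why this grammar is LL(1), note that the first token always determines which production to choose. Here’s a hypothetical hand-written recognizer for these two rules (a sketch for illustration only; in practice, Lexigram generates a table-driven parser from the grammar):

```rust
// Toy recognizer for the simplified grammar; the comments show which
// single token of lookahead predicts each production.
enum Tok { Id(String), Num(String), Equal, Exit, Return }

fn parse_s(toks: &[Tok]) -> bool {
    match toks.first() {
        // s -> Id "=" val    (predicted by Id)
        Some(Tok::Id(_)) => {
            matches!(toks.get(1), Some(Tok::Equal)) && parse_val(&toks[2..])
        }
        // s -> "exit"        (predicted by "exit")
        Some(Tok::Exit) => toks.len() == 1,
        // s -> "return" val  (predicted by "return")
        Some(Tok::Return) => parse_val(&toks[1..]),
        _ => false,
    }
}

fn parse_val(toks: &[Tok]) -> bool {
    // val -> Id | Num (predicted by the token itself)
    matches!(toks, [Tok::Id(_)] | [Tok::Num(_)])
}
```

For instance, `x = 42`, `return x` and `exit` are all accepted by this grammar.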

Note

Notation

As in the introduction, for the sake of clarity we’ll often use a simplified notation for the grammar which doesn’t represent the full lexicon.

In this simplified notation, the terminals are either fixed string literals or capitalized identifiers that represent a variable terminal, like a numeric value. The lowercase identifiers are the nonterminals. In the full notation required by Lexigram, string literals aren’t allowed in the grammar; terminal names must be used instead.

To distinguish that simplified notation from the full notation, we use -> in the production rules instead of :.

Here’s the full notation that Lexigram would expect, either in a combined form as below or with the lexicon and the grammar in separate locations:

lexicon Example;
Equal:  '=';
Exit:   'exit';
Return: 'return';
Id:     [a-z][_0-9a-z]*;
Num:    [0-9]+;

grammar Example;
s:   Id Equal val | Exit | Return val;
val: Id | Num;

Generating the Code

Lexigram generates the code of a lexer/parser based on their “specifications”: a lexicon for the lexer, and a grammar for the parser. There are two main options to generate the code: with the command-line executable or programmatically.

If you prefer the command-line option, you should install the lexigram executable.

Alternatively, you could generate the code programmatically in a unit test or a binary crate that you can launch directly from the IDE, without having to retype the same command each time. The build method is described briefly in the lexi-gram crate overview, and you can see several examples in the repository, like examples/gen_microcalc. There, you’ll find one version with the lexicon and the grammar in separate files, and one version with a combined lexicon/grammar.

Or you could create a build script that generates the code. Again, there’s a choice: the script may either launch the command-line executable, call its public method lexigram::execute, or use the build method as described in the previous paragraph.
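For the build-script route via the command-line executable, a minimal build.rs might look like this (a sketch, assuming the lexigram executable is installed and on PATH, and using the flags shown later in this chapter):

```rust
// build.rs (sketch): regenerate the lexer/parser by invoking the
// `lexigram` CLI. The file names are illustrative.
use std::process::Command;

fn lexigram_command() -> Command {
    let mut cmd = Command::new("lexigram");
    cmd.args(["-c", "example.lg", "-l", "src/lexer.rs", "-p", "src/parser.rs"]);
    cmd
}

fn main() {
    // Rerun only when the combined lexicon/grammar changes.
    println!("cargo:rerun-if-changed=example.lg");
    match lexigram_command().status() {
        Ok(status) => assert!(status.success(), "lexigram reported errors"),
        // In a real build you would probably fail here instead of warning.
        Err(e) => println!("cargo:warning=could not launch lexigram: {e}"),
    }
}
```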

Finally, there’s the more advanced possibility of creating the lexicon and/or the grammar dynamically in a Rust program by using the lower-level API of the lexigram-lib crate, but that’s completely outside the scope of this documentation.

Tip

If you generate the parser programmatically without a build.rs script, it’s recommended to proceed with two crates: one for the generator and one for the final application that includes the generated code. While it’s possible to put both in the same crate, it would have two drawbacks:

  • Once the generated code is written to its destination file, the crate likely won’t compile until you have completed the parser or updated it with the latest changes to the grammar. Until then, if you need to modify the grammar again, the code that generates the parser won’t compile either.
  • The generator pulls in bigger dependencies, since it needs all the crates (through lexi-gram). You could put the generator in a test config, as we did for the examples in the repository, so that only the test dependencies ([dev-dependencies]) require lexi-gram, while the production dependencies ([dependencies]) only need lexigram-core. This, however, might create conflicts, since lexi-gram has lexigram-core among its own dependencies. In theory there’s no such problem between test code and production code, but you’ll have to be careful not to bind lexigram-core directly in your test code; instead, you must bind lexi_gram::lexigram_lib::lexigram_core.

That’s why we split all the examples in two crates: gen_microcalc and microcalc, gen_watcher and watcher, and so on.

Let’s assume that the combined lexicon/grammar are in a file, example.lg:

lexicon Example;
Equal:  '=';
Exit:   'exit';
Return: 'return';
Id:     [a-z][_0-9a-z]*;
Num:    [0-9]+;

grammar Example;
s:   Id Equal val | Exit | Return val;
val: Id | Num;

The parser can be generated with this command:

$ lexigram -c example.lg -l lexer.rs -p parser.rs \
  --listener listener.txt --types types.txt
55 note(s)
23 info(s)
0 warning(s)
0 error(s)
OK

It creates four files:

  • lexer.rs: the lexer
  • parser.rs: the parser and the wrapper, which is the interface between the parser and the user-defined listener
  • listener.txt: a template of the listener implementation, that you can use as a starting point
  • types.txt: a template of the nonterminal types, that you can also copy and modify

Tip

Adding those two templates to your repository can prove helpful. Each time the grammar is modified and the code regenerated, you can more easily spot the differences and copy the new bits into your code.

Implementing the Listener

At the heart of the code generated by Lexigram is a listener trait that you must implement.

The generated code includes, among other things, the listener trait and the custom parameter types:

pub enum CtxS {
    /// `s -> Id "=" val`
    V1 { id: String, val: SynVal },
    /// `s -> "exit"`
    V2,
    /// `s -> "return" val`
    V3 { val: SynVal },
}

pub enum CtxVal {
    /// `val -> Id`
    V1 { id: String },
    /// `val -> Num`
    V2 { num: String },
}

enum EnumSynValue { S(SynS), Val(SynVal) }

pub trait ExampleListener {
    // ...
    fn init_s(&mut self) {}
    fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS;
    fn init_val(&mut self) {}
    fn exit_val(&mut self, ctx: CtxVal, spans: Vec<PosSpan>) -> SynVal;
}

The principle is simple:

  • Each time a production is predicted to come next, the init_<nonterminal>(…) method is called by the parser.
  • Each time the production has been fully parsed, exit_<nonterminal>(…) is called, giving you the contextual information about the parsed elements in ctx.

Implementing these trait methods allows you to process the input in a structured way.

Here, we assumed that both s and val had a value, which is represented by the types SynS and SynVal. It’s up to you to define their type, and which value they take in the exit_*(…) methods. You can declare that some nonterminals don’t have a value, if you want. Here, if we declare that val has no value, exit_val(…) won’t return any type, and val won’t appear in the CtxS::V1 and CtxS::V3 variants. Similarly, terminals that don’t hold a value, like "=", are not present in the context variants.
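For instance, if val were declared valueless, the context for s could shrink to something like this (a hypothetical sketch; the CtxSNoVal name is ours, not generated by Lexigram):

```rust
// Hypothetical shape of CtxS if `val` carried no value: the `val`
// field disappears from the variants that reference it.
pub enum CtxSNoVal {
    /// `s -> Id "=" val`
    V1 { id: String },
    /// `s -> "exit"`
    V2,
    /// `s -> "return" val`
    V3,
}
```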

Lexigram optionally produces a template to help with the implementation, which you can copy and complete (the attributes and some details have been omitted):

/// User-defined type for `s`
pub struct SynS();

/// User-defined type for `val`
pub struct SynVal();

struct Listener {
    log: BufLog,
}

impl ExampleListener for Listener {
    // ...
    fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS {
        match ctx {
            // s -> Id "=" val
            CtxS::V1 { id, val } => {}
            // s -> "exit"
            CtxS::V2 => {}
            // s -> "return" val
            CtxS::V3 { val } => {}
        }
        SynS()
    }

    fn exit_val(&mut self, ctx: CtxVal, spans: Vec<PosSpan>) -> SynVal {
        match ctx {
            // val -> Id
            CtxVal::V1 { id } => {}
            // val -> Num
            CtxVal::V2 { num } => {}
        }
        SynVal()
    }
}

If you want to keep track of the Id local variables, you can add a field to Listener that maps each name to a variable ID. Here’s the implementation of a very simplified AST generator:

pub enum SynS {
    Assign(usize, SynVal),
    Exit,
    Return(SynVal),
}

pub enum SynVal {
    Id(usize),
    Val(u32),
    Error,
}

struct Listener {
    log: BufLog,
    vars: HashMap<String, usize>,
}

#[allow(unused)]
impl ExampleListener for Listener {
    // ...
    fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS {
        match ctx {
            // s -> Id "=" val
            CtxS::V1 { id, val } => {
                let var_id = self.vars.len();
                self.vars.insert(id, var_id);
                SynS::Assign(var_id, val)
            }
            // s -> "exit"
            CtxS::V2 => SynS::Exit,
            // s -> "return" val
            CtxS::V3 { val } => SynS::Return(val),
        }
    }

    fn exit_val(&mut self, ctx: CtxVal, spans: Vec<PosSpan>) -> SynVal {
        match ctx {
            // val -> Id
            CtxVal::V1 { id } =>
                if let Some(var_id) = self.vars.get(&id) {
                    SynVal::Id(*var_id)
                } else {
                    self.get_log_mut().add_error(
                        format!("{id} is undefined"));
                    SynVal::Error
                }
            // val -> Num
            CtxVal::V2 { num } =>
                match u32::from_str(&num) {
                    Ok(value) => SynVal::Val(value),
                    Err(e) => {
                        self.get_log_mut().add_error(
                            format!("cannot parse {num:?} as an integer: {e:?}"));
                        SynVal::Error
                    }
                }
        }
    }
}

Top Integration

In the generated code, you’ll find the following components:

  • a build_lexer() function that returns a lexer of type Lexer. That object is generic, but its tables are specific to the lexicon.
  • a build_parser() function that returns the table-driven parser of type Parser. Like the lexer, that object is generic, but its tables are specific to the lexicon and the grammar.
  • a Wrapper<T> type, a wrapper for your listener of type T. It’s very specific, and its code is entirely based on the grammar.

Note

We use the term “Parser” to refer either to the low-level parser, i.e. the table-driven LL(1) parser of type Parser defined in lexigram-core, or to the “top-level” parser, which is your own, complete application. The context should normally make clear which is which, but we’ll say “low-level” or “top-level” when it needs clarifying.

The typical way to implement the top-level parser is to create an object that keeps the lexer, the parser, the wrapper and the listener (the listener being itself a field of the wrapper):

pub struct ExampleParser<'l, 'p> {
    lexer: Lexer<'l, Cursor<String>>,
    parser: Parser<'p>,
    wrapper: Wrapper<Listener>,
}

impl ExampleParser<'_, '_> {
    pub fn parse_text(text: String) -> Result<BufLog, BufLog> {
        let mut parser = ExampleParser::new();
        parser.parse(text)
    }

    pub fn new() -> Self {
        let lexer = build_lexer();
        let parser = build_parser();
        let wrapper = Wrapper::new(Listener::new(), VERBOSE_WRAPPER);
        ExampleParser { lexer, parser, wrapper }
    }

    pub fn parse(&mut self, text: String) -> Result<BufLog, BufLog> {
        let stream = CharReader::new(Cursor::new(text));
        self.lexer.attach_stream(stream);
        let tokens = self.lexer.tokens().keep_channel0();
        if let Err(e) = self.parser.parse_stream(&mut self.wrapper, tokens) {
            self.wrapper.get_listener_mut().get_log_mut().add_error(e.to_string());
        }
        let log = std::mem::take(&mut self.wrapper.get_listener_mut().log);
        if log.has_no_errors() {
            Ok(log)
        } else {
            Err(log)
        }
    }
}

Before Going Any Further…

The following subsections give an overview of the top-level components, including the listener. If you’d like to get a better idea of how the components are integrated, you should find them useful.

If you’d rather see more practical examples before tackling those aspects, just have a quick glance, or skip them entirely for now.

Lifetimes

The lexer type has two generic parameters:

pub struct Lexer<'a, R: Read>

  • 'a is simply the lifetime of the table references, which are static if you use build_lexer(), since the tables are static in the code.
  • R: Read is the input type. If you read from a string, it’s usually Cursor<String>. If the input is itself a reference, like &str, you may have a Cursor<&str> and thus another lifetime in your top-level parser type. You’ll find an example of that in examples/typedef.
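The difference between the two input styles can be seen with plain std::io::Cursor, independently of Lexigram:

```rust
use std::io::{Cursor, Read};

// Cursor<String>: the reader owns its input, so no extra lifetime
// appears in the type that stores it.
fn read_owned(text: String) -> String {
    let mut cursor: Cursor<String> = Cursor::new(text);
    let mut out = String::new();
    cursor.read_to_string(&mut out).unwrap();
    out
}

// Cursor<&str>: the input is borrowed, so a lifetime parameter
// propagates into any type that stores the cursor.
fn read_borrowed(text: &str) -> String {
    let mut cursor: Cursor<&str> = Cursor::new(text);
    let mut out = String::new();
    cursor.read_to_string(&mut out).unwrap();
    out
}
```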

The parser has only a lifetime generic parameter:

pub struct Parser<'a>

Like the lexer, this lifetime is for the table references, which are static with build_parser().

Interaction Between Components

Given an input like return x, the lexer extracts the tokens "return" and Id("x") and sends them to the parser on demand. Lexer::tokens(…) returns an iterator, so you can intercept the tokens if you like. In the code example above, we used the keep_channel0(…) method before giving the tokens to the parser:

let tokens = self.lexer.tokens().keep_channel0();

The lexer can indeed output tokens on several channels, using the -> channel(<channel ID>) action after a token definition. This can be used to keep comments or other ancillary information apart from the grammar. Here, we don’t use that feature, so we simply filter out everything that doesn’t belong to the default channel (ID = 0).
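For instance, a lexicon entry could route whitespace to a secondary channel (a sketch: the -> channel(…) action comes from this chapter, but the token definition itself is illustrative):

```
Space: [ ]+ -> channel(1);
```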

The (low-level) generic table-driven parser consumes the tokens, predicts the production rules, and calls the wrapper at key moments, like when a production has been predicted and is about to be parsed, or when it has been fully parsed.

The wrapper’s code is generated according to the grammar. It calls the trait methods of the listener, so that you can get the values of all the symbols in a production once it’s fully parsed. For example, the values of Id and val once s -> Id "=" val is parsed.

Those values are either the content of variable terminals like Id, or the value of nonterminals like val and s, which you determine in exit_val(…) and exit_s(…). They are kept on a value stack between those calls. Symbols that have no value, like fixed terminals ("=") and valueless nonterminals, are not managed by the wrapper.
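Conceptually, the wrapper’s bookkeeping can be pictured like this (a toy model of the value stack, not the actual Wrapper code):

```rust
// Toy model of the wrapper's value stack: exiting an inner production
// pushes its value, and exiting the enclosing production pops it.
#[derive(Debug, PartialEq)]
enum Value { Val(u32) }

struct ValueStack(Vec<Value>);

impl ValueStack {
    fn new() -> Self { ValueStack(Vec::new()) }

    // Called when `val -> Num` completes: keep the value for later.
    fn on_exit_val(&mut self, num: &str) {
        self.0.push(Value::Val(num.parse().unwrap()));
    }

    // Called when `s -> "return" val` completes: pop the value of `val`.
    fn on_exit_s(&mut self) -> Value {
        self.0.pop().expect("val was parsed before s completed")
    }
}
```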

Let’s see what happens on a short example of input, with the same grammar as before:

return x

Here is how the components interact with one another:

  • The parser first expects a nonterminal s, which is the starting nonterminal of the grammar. The first token is "return", so the parser predicts the production s -> "return" val, and queues "return", val.
    • The parser calls the wrapper with this information: we enter that production.
    • The wrapper calls your listener method init_s(…).
  • The next item in the queue is "return", which matches the token (normal, since it was used to predict the production!), so it’s consumed.
  • The next item is val. Since the next token is Id("x"), the production val -> Id is predicted. Id is queued.
    • The parser calls the wrapper.
    • The wrapper calls your init_val(…).
  • The next item is Id, which matches the next token.
  • At this point, the production val is complete.
    • The parser calls the wrapper with the information: we exit that production, and we get an Id("x").
    • The wrapper calls exit_val(CtxVal::V1 { id: "x" }), which returns a value for val of type SynVal, and the wrapper keeps it on the value stack.
  • The production s is complete, too.
    • The parser calls the wrapper: we exit that production.
    • The wrapper calls exit_s(CtxS::V3 { val }), where val is the value popped from the value stack, previously returned by exit_val(…).
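The sequence of listener calls above can be summarized as a trace (simulated here for illustration; the real calls are made by the parser and the wrapper):

```rust
// Simulated call trace for the input `return x`, following the steps
// described above.
fn trace_return_x() -> Vec<&'static str> {
    vec![
        "init_s",               // s -> "return" val predicted from "return"
        "consume \"return\"",
        "init_val",             // val -> Id predicted from Id("x")
        "consume Id(\"x\")",
        "exit_val(CtxVal::V1)", // value of val pushed on the value stack
        "exit_s(CtxS::V3)",     // value of val popped from the value stack
    ]
}
```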

The Wrapper

The wrapper implements a few public methods, some of which you’ll recognize from the top-level code excerpt given earlier (T is the listener type):

  • new(listener: T, verbose: bool) -> Self

    Creates a new wrapper around the listener. The listener is moved inside the wrapper before parsing any input, and it may be moved back (by destroying the wrapper) later.

  • get_listener(&self) -> &T

  • get_listener_mut(&mut self) -> &mut T

    Give a reference to the listener. You can use them after parsing the input to access the log, or any other method of your listener.

  • give_listener(self) -> T

    Gives the listener back and destroys the wrapper.

  • set_verbose(&mut self, verbose: bool)

    Changes the verbosity of the wrapper; this is only a debug tool.

Since the listener is owned by the wrapper while the input is parsed, it’s easier to keep all the data you need during that phase in the listener itself. Then, you can take the result back in the top parser when the parsing is complete.

If you want to parse several inputs, you can choose how long the wrapper and the listener live: create them anew each time, keep the same listener and move it back and forth each time, or keep the same wrapper/listener pair.

The Listener

Besides the per-nonterminal init/exit methods already discussed, the listener trait provides a few general methods:

  • get_log_mut(&mut self) -> &mut impl Logger

    Gives the parser and the wrapper access to the log. The log object itself, or at least a mutable reference to it, must be owned by your listener. The log must implement the Logger trait; the BufLog type of the lexigram-core crate can be used, or you can create your own.

    We could have instantiated the log inside the wrapper, but putting it into the listener allows you to use it to add your own messages. It also allows you to destroy the wrapper and keep only the listener after the input is parsed, since it contains all the relevant data.

  • exit(&mut self, s: SynS) {}

    Is called after the exit method of the starting nonterminal, once everything is parsed. You can implement this method if you want to separate the post-processing from what is done in exit_<starting nonterminal>(…).

    The method isn’t called if the parsing is aborted (see below), or if a lexical/syntax error is detected.

  • check_abort_request(&self) -> Terminate { Terminate::None }

    Checks whether the listener has requested an abort. The method is called after each step of the low-level parser, typically after each nonterminal’s init/exit method. If you detect an irrecoverable error in the listener, you can set an abort flag and return the appropriate value from this method:

    struct Listener {
        log: BufLog,
        vars: HashMap<String, usize>,
        abort: bool,
    }

    // ...

    impl ExampleListener for Listener {
        fn check_abort_request(&self) -> Terminate {
            if self.abort { Terminate::Abort } else { Terminate::None }
        }

        fn exit_s(&mut self, ctx: CtxS, spans: Vec<PosSpan>) -> SynS {
            match ctx {
                // s -> Id "=" val
                CtxS::V1 { id, val } => {
                    self.abort = matches!(val, SynVal::Error);
                    let var_id = self.vars.len();
                    self.vars.insert(id, var_id);
                    SynS::Assign(var_id, val)
                }
                // ...
            }
        }

        // ...
    }

    Another return value, Terminate::Conclude, makes the parser stop as if everything had been successfully parsed. It can be used if you intend to parse an input that may be longer than what you need, and if it’s too difficult to cut that input before giving it to the parser. There’s an example of that in examples/watcher.

    Keep in mind that aborting the parsing, even with Terminate::Conclude, will not call the exit methods of the pending nonterminals, nor the final exit() method. Instead, abort() is called. For example, if you aborted in exit_val(…), the exit_s(…) method wouldn’t be called.

  • abort(&mut self, terminate: Terminate) {}

    Is called when check_abort_request(…) has returned something other than the default Terminate::None. It lets you do any post-processing required in that eventuality.

    Note that, as we said before, if you conclude the parsing with Terminate::Conclude, abort(…) is called instead of exit(…), even though the parsing may be considered as successful. However, the terminate code is given as argument, so you’re able to decide what to do, even call exit(…) yourself if you want to.

  • intercept_token(&mut self, token: TokenId, text: &str) -> TokenId { token }

    Is called each time a token is taken from the lexer. By default, this is optimized away, but you can use it if you need to change the value of tokens dynamically. This is an advanced feature, so we won’t elaborate on it in this tutorial.

  • hook(&mut self, token: TokenId, text: &str) -> TokenId { token }

    Is an optional method, generated only if hooks were declared in the lexicon. Like intercept_token(…), it allows you to intercept tokens, but only at selected spots in the grammar, which is more efficient. This is also an advanced feature, so we won’t elaborate on it in this tutorial either.