Listener implementation

Now that the grammar has been designed, we’ll generate the code and implement the parser.

We’d like our configuration parser to produce an Options object, ready to be passed to try_gen_parser(…) to generate the parser’s Rust code:

#![allow(unused)]
fn main() {
let options = ... ; // what's returned by the parser
match try_gen_parser(Action::Generate, options) {
    Ok(log) => {
        println!("Code generated");
        assert!(
            log.has_no_warnings(),
            "unexpected warning(s):\n{}",
            log.get_warnings().join("\n"));
    }
    Err(build_error) => panic!("{build_error}"),
}
}

Starting Point

Let’s first generate the lexer, parser, and the wrapper. It would also be good to keep a template of the user types and listener implementation; if we change the grammar at some point, it’s easier to update the listener code by checking how the templates changed. Those templates are generated alongside the code on demand.

Create a new library project, then add src/parser.rs and src/tests.rs. For now, add the lexigram-core dependency.

Apart from the top-level parser definitions, we intend to have a few modules in parser.rs:

  • listener: where we implement the generated listener trait (we’ll initially copy that from the template).
  • listener_types: where we declare all the user types (we’ll initially copy that from the template).
  • config_lexer: the generated lexer.
  • config_parser: the generated low-level parser.

You can change the structure, of course, but that’s what we’ll assume in the code samples below.

Here’s what the initial parser.rs file should look like. We’re adding attributes to avoid unnecessary warnings about unused items for now; we’ll remove them later.

#![allow(unused)]
fn main() {
// top-level parser

mod listener {
    // listener implementation
}

#[allow(unused)]
mod listener_types {
    // listener types
}

#[allow(unused)]
mod config_lexer {
    // [config_lexer]
    // [config_lexer]
}

#[allow(unused)]
mod config_parser {
    // [config_parser]
    // [config_parser]
}
}

Create another file in the project’s root directory, templates.txt, with two pairs of tags for the templates:

// -----------------------------------------------------------------------------
// [template_user_types]
// [template_user_types]
// -----------------------------------------------------------------------------
// [template_listener_impl]
// [template_listener_impl]
// -----------------------------------------------------------------------------

Add those files to version control (e.g. Git), so you can see the differences each time you generate a new version of the parser.

Generating From Code

If you want to create the code programmatically, create an additional binary crate gen_config and add it as a workspace member. It needs lexi-gram as a dependency, too.

Write the code that generates the parser in main.rs, with paths relative to the main project’s directory (that’s where the binary should be run from):

use lexi_gram::lexigram_lib;

use lexi_gram::{gencode, genspec};
use lexi_gram::gen_parser::try_gen_parser;
use lexi_gram::options::{Action, OptionsBuilder};
use lexigram_lib::CollectJoin;
use lexigram_lib::log::LogStatus;

static LEXICON_FILE: &str = "src/config.l";
static GRAMMAR_FILE: &str = "src/config.g";
static DEST_FILE: &str = "src/parser.rs";
static LEXER_TAG: &str = "config_lexer";
static PARSER_TAG: &str = "config_parser";
static DEST_TEMPLATES: &str = "templates.txt";
static USERS_TAG: &str = "template_user_types";
static LISTENER_TAG: &str = "template_listener_impl";

fn main() {
    let options = OptionsBuilder::new()
        .indent(4)
        .lexer(
            genspec!(filename: LEXICON_FILE), 
            gencode!(filename: DEST_FILE, tag: LEXER_TAG))
        .parser(
            genspec!(filename: GRAMMAR_FILE), 
            gencode!(filename: DEST_FILE, tag: PARSER_TAG))
        .types_code(gencode!(filename: DEST_TEMPLATES, tag: USERS_TAG))
        .listener_code(gencode!(filename: DEST_TEMPLATES, tag: LISTENER_TAG))
        .libs(["super::listener_types::*"])
        .span_params(true)
        .build()
        .expect("should have no error");
    match try_gen_parser(Action::Generate, options) {
        Ok(log) => {
            println!("Code generated in {DEST_FILE}\n{log}");
            assert!(log.has_no_warnings(), "unexpected warning(s)");
        }
        Err(build_error) => panic!("{build_error}"),
    }
}

Generating From CLI

Make sure you have installed the Lexigram binary.

If the project has a single crate, with the lexicon and grammar files in the src subdirectory, the following command should generate the same code:

lexigram --indent 4 \
  -x src/config.l -l "src/parser.rs" tag config_lexer \
  -g src/config.g -p "src/parser.rs" tag config_parser \
  --types templates.txt tag template_user_types \
  --listener templates.txt tag template_listener_impl \
  --lib "super::listener_types::*" --spans

Basic Parser

Before implementing the full parser, it’s a good idea to test a minimal implementation with all the generated code.

We can start by copying the content of the templates:

  • [template_user_types]: copy what’s between the tags into the listener_types module.
  • [template_listener_impl]: copy what’s between the tags into the listener module.

The parser needs fields for the lexer, the parser, and the wrapper. The listener lives inside the wrapper, so you don’t need a separate field for it at the top level.

We’ll provide the input as a &str, so we need an extra lifetime 'ls:

#![allow(unused)]
fn main() {
pub struct ConfigParser<'l, 'p, 'ls> {
    lexer: Lexer<'l, Cursor<&'l str>>,
    parser: Parser<'p>,
    wrapper: Option<Wrapper<Listener<'ls>>>,
}
}

In the listener module, let’s add a reference to the input in the listener, as a vector of references to its lines. It’s not mandatory, but we’ll show a simple way to point at an error by annotating the relevant input lines. We should also add bindings for PosSpan, Logger, and the items of the enclosing module (via use super::*), which are not generated automatically in the template.

#![allow(unused)]
fn main() {
mod listener {
    use lexigram_core::lexer::PosSpan;
    use lexigram_core::log::Logger;
    use super::*;

    pub(super) struct Listener<'ls> {
        pub log: BufLog,
        lines: Option<Vec<&'ls str>>,
    }
  
    impl<'ls> Listener<'ls> {
        pub fn new() -> Self {
            Listener {
                log: BufLog::new(),
                lines: None,
            }
        }

        pub fn attach_lines(&mut self, lines: Vec<&'ls str>) {
            self.lines = Some(lines);
        }
    }

    //  ... (the rest remains as in the template)
}
}

We’re now ready to implement the top parser.

#![allow(unused)]
fn main() {
use std::io::Cursor;
use listener_types::*;
use config_lexer::build_lexer;
use config_parser::*;
use listener::Listener;
use lexigram_core::char_reader::CharReader;
use lexigram_core::lexer::{Lexer, TokenSpliterator};
use lexigram_core::log::{BufLog, Logger, LogStatus};
use lexigram_core::parser::Parser;

const VERBOSE_WRAPPER: bool = false;

impl<'l, 'ls: 'l> ConfigParser<'l, '_, 'ls> {
    /// Creates a new parser
    pub fn new() -> Self {
        let lexer = build_lexer();
        let parser = build_parser();
        ConfigParser { lexer, parser, wrapper: None }
    }

    /// Parses a text.
    ///
    /// On success, returns the log.
    /// 
    /// On failure, returns the log with the error messages.
    pub fn parse(&mut self, text: &'ls str) -> Result<BufLog, BufLog> {
        self.wrapper = Some(Wrapper::new(Listener::new(), VERBOSE_WRAPPER));
        let stream = CharReader::new(Cursor::new(text));
        self.lexer.attach_stream(stream);
        self.wrapper.as_mut().unwrap()
            .get_listener_mut()
            .attach_lines(text.lines().collect());
        let tokens = self.lexer.tokens().keep_channel0();
        let result = self.parser.parse_stream(self.wrapper.as_mut().unwrap(), tokens);
        let Listener { mut log, .. } = self.wrapper.take().unwrap().give_listener();
        if let Err(e) = result {
            log.add_error(e.to_string());
        }
        if log.has_no_errors() {
            Ok(log)
        } else {
            Err(log)
        }
    }
}
}

The idea we’re following is to put the wrapper inside an Option (yes, we’re wrapping the wrapper). That way, we can keep the same parser object, even if we parse several inputs in a loop. Each time, a new wrapper/listener is created with the references to the new input, and it’s destroyed after the parsing is done. It makes it easier to deal with the lifetimes, and it avoids reusing a listener which may have remnants of the previous input.

All we need now is a test. Add this to src/tests.rs:

#![allow(unused)]
#![cfg(test)]

fn main() {
use crate::parser::ConfigParser;

#[test]
fn test_run() {
    let mut p = ConfigParser::new();
    for (i, src) in [SRC1, SRC2].into_iter().enumerate() {
        println!("source #{i}");
        match p.parse(src) {
            Ok(log) => println!("success\n{log}"),
            Err(log) => panic!("error\n{log}"),
        }
    }
}


// ---------------------------------------------------------
static SRC1: &str = r#"
def SOURCE_FILENAME = "../watcher/src/lib.rs";

lexer {
    combined: "src/watcher.lg",
    output: SOURCE_FILENAME ["watcher_lexer"],
    indent: 4
}
parser {
    output: SOURCE_FILENAME ["watcher_parser"],
    indent: 4
}
options {
    nt-value: none,
    spans: true
}
"#;

static SRC2: &str = r#"
def LEXICON_GRAMMAR_FILENAME = "src/microcalc.lg";
def SOURCE_FILENAME = "../microcalc/src/main.rs";

lexer {
    input: LEXICON_GRAMMAR_FILENAME,
    output: SOURCE_FILENAME ["microcalc_lexer"],
    indent: 4
}
parser {
    input: LEXICON_GRAMMAR_FILENAME,
    output: SOURCE_FILENAME ["microcalc_parser"],
    indent: 4
}
options {
    libs: { "super::listener_types::*" },
    nt-value: default
}
"#;
}

You can launch the test to verify that it works. Then corrupt one of the inputs to verify that the error is detected.

Storing the Options

Since Options and its fields are public, we’ll simply store an instance in the listener and set its fields according to what is parsed.

By default, the generated code relies only on the lexigram-core crate. That’s what we used in our first test. But Options is in lexi-gram, which also depends on lexigram-core, as we explain in the Lexigram crates chapter.

Rather than using two dependencies, we’ll use only lexi-gram, from which we’ll take lexigram-core. For example with this binding:

#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core;
}

Let’s add this binding for the lexer and the low-level parser, which also need lexigram_core.

  • If you generated the parser from code, insert a .headers(…), and generate the code again:

    #![allow(unused)]
    fn main() {
        let options = OptionsBuilder::new()
            .headers(["use lexi_gram::lexigram_lib::lexigram_core;"])
            .indent(4)
            .lexer(
                genspec!(filename: LEXICON_FILE), 
                gencode!(filename: DEST_FILE, tag: LEXER_TAG))
            .parser(
                genspec!(filename: GRAMMAR_FILE), 
                gencode!(filename: DEST_FILE, tag: PARSER_TAG))
            .types_code(gencode!(filename: DEST_TEMPLATES, tag: USERS_TAG))
            .listener_code(gencode!(filename: DEST_TEMPLATES, tag: LISTENER_TAG))
            .libs(["super::listener_types::*"])
            .span_params(true)
            .build()
            .expect("should have no error");
    }
  • If you generated the parser from the command-line, add a --header and generate the code again:

    lexigram --indent 4 --header "use lexi_gram::lexigram_lib::lexigram_core;" \
      -x src/config.l -l "src/parser.rs" tag config_lexer \
      -g src/config.g -p "src/parser.rs" tag config_parser \
      --types templates.txt tag template_user_types \
      --listener templates.txt tag template_listener_impl \
      --lib "super::listener_types::*" --spans
    

In parser.rs, insert this at the top of the bindings:

#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core;
}

The listener can now have its new options field:

#![allow(unused)]
fn main() {
pub(super) struct Listener<'ls> {
    pub options: Options,
    pub log: BufLog,
    lines: Option<Vec<&'ls str>>,
}

impl<'ls> Listener<'ls> {
    pub fn new() -> Self {
        Listener {
            options: Options::default(),
            log: BufLog::new(),
            lines: None,
        }
    }
    // ...
}
}

Check that everything compiles and that the tests still run.

Now we’re ready to modify the listener implementation.

Storing the Constants

The next task is to store the constants that are defined at the beginning. They can be used anywhere, so they could contain any type of value defined in the grammar.

In the listener_types module, we define the type that holds any type of value, and we add a field to the listener (which also needs to be initialized in new(…)):

#![allow(unused)]
fn main() {
/// User-defined type for `value`
#[derive(Clone, PartialEq, Debug)]
pub(super) enum SynValue {
    Error,
    Bool(bool),
    Num(usize),
    Str(String),
    CodeStdout,
}

pub(super) struct Listener<'ls> {
    // ...
    consts: HashMap<String, SynValue>,
}
}

Back in the listener module, we can now modify exit_value(…):

#![allow(unused)]
fn main() {
fn exit_value(&mut self, ctx: CtxValue, spans: Vec<PosSpan>) -> SynValue {
    match ctx {
        // value -> BoolLiteral
        CtxValue::V1 { boolliteral } => SynValue::Bool(boolliteral == "true"),
        // value -> NumLiteral
        CtxValue::V2 { numliteral } => {
            match usize::from_str(&numliteral) {
                Ok(v) => SynValue::Num(v),
                Err(e) => {
                    self.log.add_error(format!("at {}, {e}", spans[0]));
                    SynValue::Error
                }
            }
        }
        // value -> StrLiteral
        CtxValue::V3 { mut strliteral } => {
            strliteral.remove(0); // remove quotes
            strliteral.pop();
            SynValue::Str(strliteral)
        }
        // value -> Id
        CtxValue::V4 { id } => {
            if let Some(v) = self.consts.get(&id) {
                v.clone()
            } else {
                self.log.add_error(format!("at {}, {id} is not defined", spans[0]));
                SynValue::Error
            }
        }
        // value -> "stdout"
        CtxValue::V5 => SynValue::CodeStdout,
    }
}
}

The spans vector contains the locations of the symbols in the corresponding production. Note that there are positions even for symbols that don’t carry a value, like fixed terminals. The PosSpan type implements Display to show the location of one or more characters as “line:column”, “line:col1-col2”, or “line1:col1-line2:col2”. Later, we’ll update that to show the actual input that produces the error, but it’s enough for now.
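The error-annotation idea mentioned above can be sketched in isolation. The annotate helper below is hypothetical (it is not part of the generated API): given the stored input lines and a 1-based line/column range, it reproduces the offending line with a caret marker underneath.

```rust
// Hypothetical sketch: point at an error location in the input.
// `line`, `col1`, and `col2` are 1-based, matching the "line:col1-col2" display.
fn annotate(lines: &[&str], line: usize, col1: usize, col2: usize) -> String {
    let src = lines.get(line - 1).copied().unwrap_or("");
    // spaces up to the first column, then one caret per spanned column
    let marker = " ".repeat(col1 - 1) + &"^".repeat(col2 - col1 + 1);
    format!("{src}\n{marker}")
}

fn main() {
    let lines: Vec<&str> = "def X = 12;\ndef X = 34;".lines().collect();
    // pretend the error span is line 2, columns 5-5 (the redefined `X`)
    assert_eq!(annotate(&lines, 2, 5, 5), "def X = 34;\n    ^");
    println!("ok");
}
```

A real implementation would take the columns from a PosSpan, but the formatting logic stays the same.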

The constant definitions are parsed by the definitions rule, but we’re mostly interested in exit_i_def(…):

#![allow(unused)]
fn main() {
fn exit_definitions(&mut self, _: CtxDefinitions, _: Vec<PosSpan>) -> SynDefinitions {
    SynDefinitions()
}

fn init_i_def(&mut self) -> SynIDef {
    SynIDef()
}

fn exit_i_def(&mut self, acc: &mut SynIDef, ctx: CtxIDef, spans: Vec<PosSpan>) {
    // `<L> "def" Id "=" value ";"` iteration in 
    // `definitions -> ( ►► <L> "def" Id "=" value ";" ◄◄ )*`
    let CtxIDef::V1 { id, value } = ctx;
    if self.consts.contains_key(&id) {
        self.log.add_error(format!("at {}, redefined constant {id}", spans[4]));
    } else if value != SynValue::Error {
        self.consts.insert(id, value);
    }
}
}

Note that we use spans[4], but value is the 4th item in the production (<L> doesn’t count). Why?

In a loop, there’s an extra span at index 0 that contains the accumulated positions so far. So here, spans[0] spans all the definitions parsed, including the current one. spans[1] is "def", and so on.

To have a peek at the results, we can implement the listener’s exit(…) method:

#![allow(unused)]
fn main() {
fn exit(&mut self, config: SynConfig, span: PosSpan) {
    let mut defs = self.consts.iter()
        .map(|(k, v)| format!("\n- {k}: {v:?}"))
        .to_vec();
    defs.sort();
    println!("definitions:{}", defs.join(""));
}
}

to_vec requires a use lexi_gram::lexigram_lib::CollectJoin; binding. Alternatively, you can use a plain collect.

Run the test again to verify the definitions, and try to add new ones, including redefinitions and references to undefined constants to check the errors.

io_options

Next, we have the production rules of the lexer and parser groups, both using io_options to parse the options.

lexer:
    Lexer Lbracket io_options Rbracket
;

io_options:
    io_option (<L=i_io_opt> Comma io_option)*
;

io_option:
    Combined Colon value tag_opt            // string
|   Input    Colon value tag_opt            // string
|   Output   Colon value tag_opt            // string
|   Indent   Colon value                    // num
|   Headers  Colon Lbracket value (Comma value)* Rbracket // string
;

tag_opt:
    (LSbracket value RSbracket)?            // string
;

We have two possible approaches.

In the first one, when entering the lexer and parser rules (init_lexer(…) and init_parser(…)), we record in a new listener field whether we’re parsing the lexer or the parser. In io_option, we set the appropriate fields of self.options depending on the production alternative and on that subject (lexer/parser). The nonterminals io_options and io_option don’t hold a value.

In the second approach, io_option holds a new type that covers the possible options (Specification, CodeLocation, usize, …). We fold those values while iterating in io_options, detecting repeated options along the way; io_options then holds the folded result, and lexer/parser detect the remaining errors and store the options in self.options.

Both work, but we’ll choose the second approach because there’s less coupling and the responsibilities are better separated.

For io_option, we define the following type in the listener_types module (Specification and CodeLocation must be imported):

#![allow(unused)]
fn main() {
/// User-defined type for `io_option`
#[derive(Clone, PartialEq, Debug)]
pub enum SynIoOption {
    Error,
    Combined(Specification),
    Spec(Specification),
    Code(CodeLocation),
    Indent(usize),
    Headers(Vec<String>),
}
}

The Error variant is added in case an error is encountered when parsing the nonterminal.

tag_opt, which is used in io_option, is just an optional string for the tag:

#![allow(unused)]
fn main() {
/// User-defined type for `tag_opt`
pub type SynTagOpt = Option<String>;
}

For io_options, here is the “accumulator” type we’ll use to fold the option values:

#![allow(unused)]
fn main() {
/// User-defined type for `io_options`
#[derive(Clone, PartialEq, Debug)]
pub struct SynIoOptions {
    pub is_combined: bool,
    pub spec: Specification,
    pub code: CodeLocation,
    pub indent_opt: Option<usize>,
    pub headers: Vec<String>,
}

impl Default for SynIoOptions {
    fn default() -> Self {
        SynIoOptions {
            is_combined: false,
            spec: Specification::None,
            code: CodeLocation::None,
            indent_opt: None,
            headers: vec![],
        }
    }
}
}

Here is its implementation:

#![allow(unused)]
fn main() {
impl SynIoOptions {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn fold(&mut self, io_option: SynIoOption) -> Result<(), &str> {
        match io_option {
            SynIoOption::Error => {}
            SynIoOption::Combined(spec) => {
                if self.spec.is_none() {
                    self.is_combined = true;
                    self.spec = spec;
                } else {
                    return Err("more than one specification");
                }
            }
            SynIoOption::Spec(spec) => {
                if self.spec.is_none() {
                    self.spec = spec;
                } else {
                    return Err("more than one specification");
                }
            }
            SynIoOption::Code(code) => {
                if self.code.is_none() {
                    self.code = code;
                } else {
                    return Err("more than one code location");
                }
            }
            SynIoOption::Indent(indent) => {
                if self.indent_opt.is_none() {
                    self.indent_opt = Some(indent);
                } else {
                    return Err("more than one indentation");
                }
            }
            SynIoOption::Headers(headers) => {
                self.headers.extend(headers);
            }
        }
        Ok(())
    }
}
}
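The duplicate detection in fold can be illustrated with a standalone sketch. Spec and Acc below are stand-ins for Lexigram’s Specification and SynIoOptions (the names are illustrative, not the real API): folding a second specification-like option returns an error, just as SynIoOptions::fold does.

```rust
// Stand-in for `Specification`: either absent or a file location.
#[derive(PartialEq)]
enum Spec {
    None,
    File(String),
}

// Stand-in accumulator mirroring the `spec` field of `SynIoOptions`.
struct Acc {
    spec: Spec,
}

impl Acc {
    // Accept the first specification, reject any repetition.
    fn fold_spec(&mut self, spec: Spec) -> Result<(), &'static str> {
        if self.spec == Spec::None {
            self.spec = spec;
            Ok(())
        } else {
            Err("more than one specification")
        }
    }
}

fn main() {
    let mut acc = Acc { spec: Spec::None };
    // first `input`/`combined` option: accepted
    assert!(acc.fold_spec(Spec::File("config.l".into())).is_ok());
    // a second one is a repetition: rejected
    assert!(acc.fold_spec(Spec::File("other.l".into())).is_err());
    println!("ok");
}
```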

We don’t check if we can use combined at this level; this and its effect are handled in lexer and parser.

Here’s the implementation of exit_io_option(…), which is a little long because of the different cases and the error checking, but otherwise straightforward.

#![allow(unused)]
fn main() {
fn exit_io_option(&mut self, ctx: CtxIoOption, spans: Vec<PosSpan>) -> SynIoOption {
    fn get_spec(value: SynValue, tag_opt: SynTagOpt) -> Result<Specification, &'static str> {
        match value {
            SynValue::Error => Ok(genspec!(none)),
            SynValue::Bool(_)
            | SynValue::Num(_)
            | SynValue::CodeStdout => Err("string"),
            SynValue::Str(s) => {
                if let Some(tag) = tag_opt {
                    Ok(genspec!(filename: s, tag: tag))
                } else {
                    Ok(genspec!(filename: s))
                }
            }
        }
    }

    match ctx {
        // io_option -> "combined" ":" value tag_opt
        CtxIoOption::V1 { value, tag_opt } => {
            match get_spec(value, tag_opt) {
                Ok(spec) => SynIoOption::Combined(spec),
                Err(expected) => {
                    self.log.add_error(format!("at {}, expected {expected}", spans[2]));
                    SynIoOption::Error
                }
            }
        }
        // io_option -> "input" ":" value tag_opt
        CtxIoOption::V2 { value, tag_opt } => {
            match get_spec(value, tag_opt) {
                Ok(spec) => SynIoOption::Spec(spec),
                Err(expected) => {
                    self.log.add_error(format!("at {}, expected {expected}", spans[2]));
                    SynIoOption::Error
                }
            }
        }
        // io_option -> "output" ":" value tag_opt
        CtxIoOption::V3 { value, tag_opt } => {
            match value {
                SynValue::Error => SynIoOption::Error,
                SynValue::Bool(_)
                | SynValue::Num(_) => {
                    self.log.add_error(
                        format!("at {}, expected string or stdout", spans[2]));
                    SynIoOption::Error
                }
                SynValue::Str(s) => {
                    if let Some(tag) = tag_opt {
                        SynIoOption::Code(gencode!(filename: s, tag: tag))
                    } else {
                        SynIoOption::Code(gencode!(filename: s))
                    }
                }
                SynValue::CodeStdout => SynIoOption::Code(gencode!(stdout)),
            }
        }
        // io_option -> "indent" ":" value
        CtxIoOption::V4 { value } => {
            if let SynValue::Num(indent) = value {
                SynIoOption::Indent(indent)
            } else {
                if value != SynValue::Error {
                    self.log.add_error(
                      format!("at {}, expected number instead of {value:?}", spans[2]));
                }
                SynIoOption::Error
            }
        }
        // io_option -> "headers" ":" "{" value ("," value)* "}"
        CtxIoOption::V5 { star: SynIoOption1(values) } => {
            match self.values_to_strings(values, &spans[3]) {
                Ok(headers) => SynIoOption::Headers(headers),
                Err(()) => SynIoOption::Error,
            }
        }
    }
}
}

We add this listener method to convert a Vec<SynValue> into a Vec<String>, because it will be used several times:

#![allow(unused)]
fn main() {
impl<'ls> Listener<'ls> {
    // ...
    fn values_to_strings(&mut self, values: Vec<SynValue>, spans: &PosSpan) -> Result<Vec<String>, ()> {
        let values_len = values.len();
        let strings = values.into_iter().enumerate().filter_map(|(i, v)| {
            if let SynValue::Str(s) = v {
                Some(s)
            } else {
                self.log.add_error(format!("at {}, item #{} isn't a string", spans, i + 1));
                None
            }
        }).to_vec();
        if strings.len() == values_len {
            Ok(strings)
        } else {
            Err(())
        }
    }
}
}

Finally, parsing the optional tag:

#![allow(unused)]
fn main() {
fn exit_tag_opt(&mut self, ctx: CtxTagOpt, spans: Vec<PosSpan>) -> SynTagOpt {
    match ctx {
        // tag_opt -> "[" value "]"
        CtxTagOpt::V1 { value } => {
            if let SynValue::Str(s) = value {
                Some(s)
            } else {
                self.log.add_error(format!("at {}, expected a string", spans[1]));
                None
            }
        }
        // tag_opt -> ε
        CtxTagOpt::V2 => None,
    }
}
}

i_io_opt, which iterates on io_option items and folds them, has the same type as io_options:

#![allow(unused)]
fn main() {
/// User-defined type for `<L> "," io_option` iteration in `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
pub type SynIIoOpt = SynIoOptions;
}

Remember that io_option items are separated by commas, so they’re defined in the grammar as shown below. Lexigram sees that the first io_option item, outside the repetition, matches the same pattern as the repetition (without the comma), so it includes that first item in the repetition:

io_options: io_option (<L=i_io_opt> Comma io_option)*;

Since it’s a <L> repetition, we get the two methods below. The first one gets the first item, and the second gets the remaining items.

#![allow(unused)]
fn main() {
fn init_i_io_opt(&mut self, ctx: InitCtxIIoOpt, spans: Vec<PosSpan>) -> SynIIoOpt {
    // value of `io_option` before `<L> "," io_option` iteration in 
    // `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
    let InitCtxIIoOpt::V1 { io_option } = ctx;
    let mut acc = SynIIoOpt::new();
    if let Err(e) = acc.fold(io_option) {
        self.log.add_error(format!("at {}, {e}", spans[0]));
    }
    acc
}

fn exit_i_io_opt(&mut self, acc: &mut SynIIoOpt, ctx: CtxIIoOpt, spans: Vec<PosSpan>) {
    // `<L> "," io_option` iteration in 
    // `io_options -> io_option ( ►► <L> "," io_option ◄◄ )*`
    let CtxIIoOpt::V1 { io_option } = ctx;
    if let Err(e) = acc.fold(io_option) {
        self.log.add_error(format!("at {}, {e}", spans[2]));
    }
}
}

Note

In exit_i_io_opt(…), spans[0] spans all the loop items that have already been parsed up to this point, spans[1] is the comma, so spans[2] is where the io_option input is located.

io_options: io_option (<L=i_io_opt> "," io_option)*;
                       ^^^^^^^^^^^^ ^^^ ^^^^^^^^^
[0]: accumulated       not in spans [1]    [2]

In exit_io_options(…), spans[0], which corresponds to the first io_option nonterminal, contains the span of that first item plus all the items in the repetition.

In fact, we don’t really need to check for an error in init_i_io_opt(…): at this point, fold only reports repeated options, and the first item folded into a fresh accumulator can’t repeat anything.

exit_io_options(…) has nothing else to do but to pass the value:

#![allow(unused)]
fn main() {
fn exit_io_options(&mut self, ctx: CtxIoOptions, spans: Vec<PosSpan>) -> SynIoOptions {
    // io_options -> io_option (<L> "," io_option)*
    let CtxIoOptions::V1 { star } = ctx;
    star
}
}

lexer and parser

The exit_lexer(…) method takes the value with the folded options and sets the corresponding values in the final self.options field.

However, there is a little snag with the indent option. Normally,

  • a default indentation for the lexer, parser, types, and listener can be specified in the command-line binary (and in the OptionsBuilder) by putting this option in front of everything else.
  • if the indentation is specified for any of those parts of the code, it takes priority over the default indentation level.

Thus, lexigram --indent 4 -x lexicon.l -l lexer.rs --indent 0 doesn’t indent the lexer code.

But in our grammar, the global options are specified after lexer and parser, so we might overwrite their specific indentation with the global default value. To avoid that, we’ll store the specific values in separate fields of the listener, then resolve the conflicts after everything has been parsed:

#![allow(unused)]
fn main() {
pub(super) struct Listener<'ls> {
    // ...
    lexer_indent: Option<usize>,
    parser_indent: Option<usize>,
}
}
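The resolution step described above can be sketched as follows. The resolve_indent helper and the built-in default of 4 are illustrative assumptions, not part of the generated code: a section-specific value wins over the global one, which in turn wins over the built-in default.

```rust
// Hypothetical post-parsing resolution: specific indentation beats the
// global default, which beats a built-in fallback (4 is assumed here).
fn resolve_indent(specific: Option<usize>, global: Option<usize>) -> usize {
    specific.or(global).unwrap_or(4)
}

fn main() {
    // global indent 4, lexer-specific indent 0: the specific value wins
    assert_eq!(resolve_indent(Some(0), Some(4)), 0);
    // only the global default is given
    assert_eq!(resolve_indent(None, Some(2)), 2);
    // neither given: built-in fallback
    assert_eq!(resolve_indent(None, None), 4);
    println!("ok");
}
```

This mirrors the command-line behavior where lexigram --indent 4 … --indent 0 leaves the lexer code unindented.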

The only other particularity in exit_lexer(…) is copying the specifications (the location of the lexicon/grammar) into the self.options.parser_spec field:

#![allow(unused)]
fn main() {
fn exit_lexer(&mut self, ctx: CtxLexer, spans: Vec<PosSpan>) -> SynLexer {
    // lexer -> "lexer" "{" io_options "}"
    let CtxLexer::V1 { io_options } = ctx;
    if io_options.is_combined {
        self.options.parser_spec = io_options.spec.clone();
    }
    self.options.lexer_spec = io_options.spec;
    self.options.lexer_code = io_options.code;
    self.options.lexer_headers.extend(io_options.headers);
    self.lexer_indent = io_options.indent_opt;
    SynLexer()
}
}

For the parser, there are more verifications to perform:

  • If the parser options were found (V1 variant):
    • combined shouldn’t be used.
    • If combined was used in the lexer, there shouldn’t be a redundant input spec in the parser options.
    • Otherwise, input must be present.
  • Finally, the parser options may be absent if one doesn’t want a parser (V2 variant). In that case, a combined specification in the lexer is an error, since there’s no parser code location.
#![allow(unused)]
fn main() {
fn exit_parser(&mut self, ctx: CtxParser, spans: Vec<PosSpan>) -> SynParser {
    match ctx {
        // parser -> "parser" "{" io_options "}"
        CtxParser::V1 { io_options } => {
            if io_options.is_combined {
                self.log.add_error(
                    format!("at {}, 'combined' can only be used in 'lexer' options", spans[2]));
            } else {
                if io_options.spec.is_none() && self.options.parser_spec.is_none() {
                    self.log.add_error(format!("at {}, undefined grammar location", spans[2]));
                } else if !io_options.spec.is_none() && !self.options.parser_spec.is_none() {
                    self.log.add_error(
                        format!("at {}, redefined grammar location ('combined' in lexer)", spans[2]));
                }
            }
            if !io_options.spec.is_none() {
                self.options.parser_spec = io_options.spec;
            }
            self.options.parser_code = io_options.code;
            self.options.parser_headers.extend(io_options.headers);
            self.parser_indent = io_options.indent_opt;
        }
        // parser -> ε
        CtxParser::V2 => {
            if !self.options.parser_spec.is_none() {
                self.log.add_error(
                    "combined lexicon/grammar, but missing parser code location".to_string());
            }
        }
    }
    SynParser()
}
}

global_options

Finally, we have the group of the global options, which is a mix of default/global options and wrapper code options.

options:
    (Options Lbracket global_options Rbracket)?
;

global_options:
     global_option (<L=i_global_opt> Comma global_option)*
;

global_option:
    Headers Colon Lbracket value (Comma value)* Rbracket
|   Indent  Colon value
|   Libs    Colon Lbracket value (Comma value)* Rbracket
|   NTValue Colon nt_value
|   Spans   Colon value
;

value: /* ... */

nt_value:
    Default
|   None
|   Parents
|   Set Lbracket value (Comma value)* Rbracket
;

The handling of these options is quite similar to the io_options of the lexer and parser groups.

Let’s first get nt-value out of the way. We’re using the same type that Options requires, since it’s already available and fits our purpose:

#![allow(unused)]
fn main() {
/// User-defined type for `nt_value`
pub type SynNtValue = NTValue;
}

Parsing it is straightforward. Since NTValue has no Error variant, when something is wrong with the set variant we simply return an empty list of names; this lets the parser keep processing the input and possibly detect further errors. Since the errors will ultimately be reported instead of an Options being returned, it doesn't matter if some values are inaccurate.

We could also use the listener abort feature if returning inaccurate data could make the code crash later.

#![allow(unused)]
fn main() {
fn exit_nt_value(&mut self, ctx: CtxNtValue, spans: Vec<PosSpan>) -> SynNtValue {
    match ctx {
        // nt_value -> "default"
        CtxNtValue::V1 => NTValue::Default,
        // nt_value -> "none"
        CtxNtValue::V2 => NTValue::None,
        // nt_value -> "parents"
        CtxNtValue::V3 => NTValue::Parents,
        // nt_value -> "set" "{" value ("," value)* "}"
        CtxNtValue::V4 { star: SynNtValue1(values) } => {
            match self.values_to_strings(values, &spans[2]) {
                Ok(names) => NTValue::SetNames(names),
                Err(()) => NTValue::SetNames(vec![]),
            }
        }
    }
}
}

We then define the following type for global_option:

#![allow(unused)]
fn main() {
/// User-defined type for `global_option`
#[derive(Debug, PartialEq)]
pub enum SynGlobalOption {
    Error,
    Headers(Vec<String>),
    Indent(usize),
    Libs(Vec<String>),
    NTValue(NTValue),
    Spans(bool),
}
}

Lastly, global_options is the accumulator in which the global_option items are folded:

#![allow(unused)]
fn main() {
/// User-defined type for `global_options`
#[derive(Debug, PartialEq)]
pub struct SynGlobalOptions {
    pub headers: Vec<String>,
    pub indent_opt: Option<usize>,
    pub libs: Vec<String>,
    pub nt_value_opt: Option<NTValue>,
    pub spans_opt: Option<bool>,
}

impl Default for SynGlobalOptions {
    fn default() -> Self {
        SynGlobalOptions {
            headers: vec![],
            indent_opt: None,
            libs: vec![],
            nt_value_opt: None,
            spans_opt: None,
        }
    }
}
}
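Incidentally, since every field's default is its type's own default, the hand-written impl Default above is equivalent to deriving it. If you maintain this type by hand instead of copying it from the template, the shorter form below works just as well (the NTValue stand-in is only there to make the snippet self-contained):

```rust
// Stand-in for lexigram's NTValue, only so this snippet compiles on its own.
#[derive(Debug, PartialEq)]
pub enum NTValue {
    Default,
    None,
    Parents,
    SetNames(Vec<String>),
}

/// Same struct as above, with `Default` derived instead of hand-written.
#[derive(Debug, PartialEq, Default)]
pub struct SynGlobalOptions {
    pub headers: Vec<String>,
    pub indent_opt: Option<usize>,
    pub libs: Vec<String>,
    pub nt_value_opt: Option<NTValue>,
    pub spans_opt: Option<bool>,
}
```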

Its implementation is again very similar to that of SynIoOptions:

#![allow(unused)]
fn main() {
impl SynGlobalOptions {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn fold(&mut self, global_option: SynGlobalOption) -> Result<(), &str> {
        match global_option {
            SynGlobalOption::Error => {}
            SynGlobalOption::Headers(headers) => {
                self.headers.extend(headers);
            }
            SynGlobalOption::Indent(indent) => {
                if self.indent_opt.is_none() {
                    self.indent_opt = Some(indent);
                } else {
                    return Err("more than one indentation");
                }
            }
            SynGlobalOption::Libs(libs) => {
                self.libs.extend(libs);
            }
            SynGlobalOption::NTValue(nt_value) => {
                if self.nt_value_opt.is_none() {
                    self.nt_value_opt = Some(nt_value);
                } else if let Some(current) = self.nt_value_opt.as_mut() {
                    match (current, nt_value) {
                        // we allow the set of names to grow
                        (NTValue::SetNames(names), NTValue::SetNames(new)) => {
                            names.extend(new);
                        }
                        _ => return Err("conflicting nt-value options"),
                    }
                }
            }
            SynGlobalOption::Spans(spans) => {
                if self.spans_opt.is_none() {
                    self.spans_opt = Some(spans);
                } else {
                    return Err("more than one `spans` option");
                }
            }
        }
        Ok(())
    }
}
}

For nt-value, we allow several options in the same group if they’re all adding names with set:

options {
    nt-value: set { "lexer", "parser" },
    nt-value: set { "options" },
    spans: true
}
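The merging rule can be checked in isolation. Here is a small self-contained reproduction of the accumulator, trimmed down to the indent and nt-value options (the type and method names are ours; the fold semantics are those of SynGlobalOptions above):

```rust
#[derive(Debug, PartialEq)]
enum NTValue {
    SetNames(Vec<String>),
}

#[derive(Default)]
struct Acc {
    indent_opt: Option<usize>,
    nt_value_opt: Option<NTValue>,
}

impl Acc {
    // a second `indent` option is an error
    fn fold_indent(&mut self, indent: usize) -> Result<(), &'static str> {
        if self.indent_opt.is_none() {
            self.indent_opt = Some(indent);
            Ok(())
        } else {
            Err("more than one indentation")
        }
    }

    // several `set` options in the same group just grow the name list
    fn fold_set(&mut self, new: Vec<String>) {
        match self.nt_value_opt.as_mut() {
            Some(NTValue::SetNames(names)) => names.extend(new),
            None => self.nt_value_opt = Some(NTValue::SetNames(new)),
        }
    }
}
```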

The implementation of global_option is simpler than what we did for io_option, but still very similar (especially since headers and indent are there, too):

#![allow(unused)]
fn main() {
fn exit_global_option(&mut self, ctx: CtxGlobalOption, spans: Vec<PosSpan>) -> SynGlobalOption {
    match ctx {
        // global_option -> "headers" ":" "{" value ("," value)* "}"
        CtxGlobalOption::V1 { star: SynGlobalOption1(values) } => {
            match self.values_to_strings(values, &spans[3]) {
                Ok(headers) => SynGlobalOption::Headers(headers),
                Err(()) => SynGlobalOption::Error,
            }
        }
        // global_option -> "indent" ":" value
        CtxGlobalOption::V2 { value } => {
            if let SynValue::Num(indent) = value {
                SynGlobalOption::Indent(indent)
            } else {
                if value != SynValue::Error {
                    self.log.add_error(format!("at {}, expected number instead of {value:?}", spans[2]));
                }
                SynGlobalOption::Error
            }
        }
        // global_option -> "libs" ":" "{" value ("," value)* "}"
        CtxGlobalOption::V3 { star: SynGlobalOption2(values) } => {
            match self.values_to_strings(values, &spans[3]) {
                Ok(libs) => SynGlobalOption::Libs(libs),
                Err(()) => SynGlobalOption::Error,
            }
        }
        // global_option -> "nt-value" ":" nt_value
        CtxGlobalOption::V4 { nt_value } => SynGlobalOption::NTValue(nt_value),
        // global_option -> "spans" ":" value
        CtxGlobalOption::V5 { value } => {
            if let SynValue::Bool(flag) = value {
                SynGlobalOption::Spans(flag)
            } else {
                if value != SynValue::Error {
                    self.log.add_error(format!("at {}, expected boolean instead of {value:?}", spans[2]));
                }
                SynGlobalOption::Error
            }
        }
    }
}
}

The iteration in global_options is also an <L> repetition in a token-separated list. Lexigram provides the first item in init_i_global_opt(…) and the following ones in exit_i_global_opt. All we have to do is accumulate and report any error:

#![allow(unused)]
fn main() {
fn init_i_global_opt(&mut self, ctx: InitCtxIGlobalOpt, spans: Vec<PosSpan>) -> SynIGlobalOpt {
    // value of `global_option` before `<L> "," global_option` iteration in `global_options -> global_option ( ►► <L> "," global_option ◄◄ )*`
    let InitCtxIGlobalOpt::V1 { global_option } = ctx;
    let mut acc = SynGlobalOptions::new();
    // folding into a fresh accumulator cannot fail, so the result can be ignored
    let _ = acc.fold(global_option);
    acc
}

fn exit_i_global_opt(&mut self, acc: &mut SynIGlobalOpt, ctx: CtxIGlobalOpt, spans: Vec<PosSpan>) {
    // `<L> "," global_option` iteration in `global_options -> global_option ( ►► <L> "," global_option ◄◄ )*`
    let CtxIGlobalOpt::V1 { global_option } = ctx;
    if let Err(e) = acc.fold(global_option) {
        self.log.add_error(format!("at {}, {e}", spans[2]));
    }
}
}
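The division of labour between the two callbacks mirrors an ordinary left fold over a separated list: the first element seeds the accumulator, and every following element is folded into it. In plain iterator terms (an analogy, not Lexigram code):

```rust
// init_i_…: the first item creates the accumulator;
// exit_i_…: each following item is folded into it.
fn fold_list(items: Vec<u32>) -> Vec<u32> {
    let mut iter = items.into_iter();
    let mut acc = vec![iter.next().expect("the grammar guarantees at least one item")];
    for item in iter {
        acc.push(item);
    }
    acc
}
```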

There’s nothing to do in global_options but to return the value in the context:

#![allow(unused)]
fn main() {
fn exit_global_options(&mut self, ctx: CtxGlobalOptions, spans: Vec<PosSpan>) -> SynGlobalOptions {
    // global_options -> global_option (<L> "," global_option)*
    let CtxGlobalOptions::V1 { star } = ctx;
    star
}
}

options

To complete the listener implementation, options copies the values accumulated in global_options.

#![allow(unused)]
fn main() {
fn exit_options(&mut self, ctx: CtxOptions, spans: Vec<PosSpan>) -> SynOptions {
    match ctx {
        // options -> "options" "{" global_options "}"
        CtxOptions::V1 { global_options } => {
            self.options.lexer_headers.extend(global_options.headers.clone());
            self.options.parser_headers.extend(global_options.headers);
            if let Some(indent) = global_options.indent_opt {
                self.options.lexer_indent = indent;
                self.options.parser_indent = indent;
            }
            self.options.libs.extend(global_options.libs);
            if let Some(nt_value) = global_options.nt_value_opt {
                self.options.nt_value = nt_value;
            }
            if let Some(spans_opt) = global_options.spans_opt {
                self.options.gen_span_params = spans_opt;
            }
        }
        // options -> ε
        CtxOptions::V2 => {}
    }
    SynOptions()
}
}

The global indent value is copied to all the indent options. We overwrite those values with self.lexer_indent and self.parser_indent, if they’re defined, in the exit(…) method:

#![allow(unused)]
fn main() {
fn exit(&mut self, config: SynConfig, span: PosSpan) {
    if let Some(indent) = self.lexer_indent { self.options.lexer_indent = indent };
    if let Some(indent) = self.parser_indent { self.options.parser_indent = indent };
}
}

Annotating the Input

The current error messages show the position of the error. It’s nice, but seeing the erroneous input is more helpful.

For example, try parsing the following input:

def SOURCE_FILENAME = "../watcher/src/lib.rs";

lexer {
    combined: "src/watcher.lg",
    output: SOURCE_FILENAMES ["watcher_lexer"],
    indent: 4
}

You’ll get this error in the log:

- ERROR  : at 5:13-28, SOURCE_FILENAMES is not defined

Let’s modify the listener to annotate the error in the input text. There are a few helpers for that in lexigram_core::text_span, but the listener must implement the GetLine trait:

#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core::text_span::GetLine;

// ...

impl<'ls> GetLine for Listener<'ls> {
    fn get_line(&self, n: usize) -> &str {
        self.lines.as_ref().unwrap()[n - 1]
    }
}
}
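This assumes the listener stores the input lines somewhere; the lines field and the attach_lines(…) helper used below are our own additions, not part of lexigram_core. A stand-alone sketch of that storage:

```rust
// `lines` holds the source split into lines; `get_line` is 1-based,
// which is why the index is `n - 1`.
struct LineStore<'ls> {
    lines: Option<Vec<&'ls str>>,
}

impl<'ls> LineStore<'ls> {
    fn attach_lines(&mut self, lines: Vec<&'ls str>) {
        self.lines = Some(lines);
    }

    fn get_line(&self, n: usize) -> &str {
        self.lines.as_ref().unwrap()[n - 1]
    }
}
```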

Then, when errors are detected, self.extract_text(…) can be used to get the offending line(s) or self.annotate_text(…) to emphasize the erroneous parts in them.

#![allow(unused)]
fn main() {
use lexi_gram::lexigram_lib::lexigram_core::text_span::GetTextSpan;

// ...
impl ConfigListener for Listener<'_> {

    fn exit_value(&mut self, ctx: CtxValue, spans: Vec<PosSpan>) -> SynValue {
        match ctx {
            // ...
          
            // value -> Id
            CtxValue::V4 { id } => {
                if let Some(v) = self.consts.get(&id) {
                    v.clone()
                } else {
                    //self.log.add_error(format!("at {}, {id} is not defined", spans[0]));
                    let text = self.annotate_text(&spans[0]);
                    self.log.add_error(format!("{id} is not defined:\n\n{text}\n"));
                    SynValue::Error
                }
            }
            // ...
        }
    }
    // ...
}
}

Now, the error looks like this (in the current crate version, the actual output changes the colour of SOURCE_FILENAMES, so you need an ANSI-friendly terminal):

- ERROR  : SOURCE_FILENAMES is not defined:

   5:     output: SOURCE_FILENAMES ["watcher_lexer"],
                  ^^^^^^^^^^^^^^^^

It’s not perfect: the parser and the lexer can only give the position when they detect an error, since they’re unaware of what the listener implements. In a future version, it will be possible to intercept their errors to customize the output, but it’s not implemented yet.
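As a rough idea of what the annotation helpers compute (a simplified sketch, not the actual lexigram_core::text_span code), underlining a 1-based, inclusive column span such as the 13-28 reported in the log boils down to:

```rust
/// Return `line` followed by a caret line underlining columns
/// `start..=end` (1-based, inclusive).
fn annotate(line: &str, start: usize, end: usize) -> String {
    format!("{line}\n{}{}", " ".repeat(start - 1), "^".repeat(end - start + 1))
}
```

For the span 13-28 of the output line above, this yields 12 spaces followed by 16 carets, lining up under SOURCE_FILENAMES.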

Wrapping It Up

The parser can now return the options as well as the log. It's cleaner to bundle them in a single object:

#![allow(unused)]
fn main() {
pub struct ConfigResult {
    pub options: Options,
    pub log: BufLog,
}

impl<'l, 'ls: 'l> ConfigParser<'l, '_, 'ls> {
    // ...
    pub fn parse(&mut self, text: &'ls str) -> Result<ConfigResult, BufLog> {
        self.wrapper = Some(Wrapper::new(Listener::new(), VERBOSE_WRAPPER));
        let stream = CharReader::new(Cursor::new(text));
        self.lexer.attach_stream(stream);
        self.wrapper.as_mut().unwrap().get_listener_mut().attach_lines(text.lines().collect());
        let tokens = self.lexer.tokens().keep_channel0();
        let result = self.parser.parse_stream(self.wrapper.as_mut().unwrap(), tokens);
        let Listener { mut log, options, .. } = self.wrapper.take().unwrap().give_listener();
        if let Err(e) = result {
            log.add_error(e.to_string());
        }
        if log.has_no_errors() {
            Ok(ConfigResult { options, log })
        } else {
            Err(log)
        }
    }
}
}

and the tests, which should still be expanded to actually validate the result:

#![allow(unused)]
fn main() {
#[test]
fn test_run() {
    let mut p = ConfigParser::new();
    for (i, src) in [SRC1, SRC2].into_iter().enumerate() {
        println!("source #{i}");
        match p.parse(src) {
            Ok(ConfigResult { options, log }) => println!("{options:#?}\n\nlog:{log}"),
            Err(log) => panic!("error\n{log}"),
        }
    }
}
}

Possible Improvements

That’s a good first step. It’s not complete: there are other options that are not yet accessible from the configuration file parser, like the templates and some switches.

The templates of the user types and the listener behave like the lexer and parser options, so they could be additional groups with their own generated code location and indentation. Feel free to add those features to the existing grammar and implementation, using the templates.txt file to more easily spot the required changes.

Valueless Nonterminals

Some of the nonterminals don’t need a value: you can remove them with the --nt-value command-line option:

lexigram --indent 4\
  -x src/config.l -l "src/parser.rs" tag config_lexer\
  -g src/config.g -p "src/parser.rs" tag config_parser\
  --types templates.txt tag template_user_types\
  --listener templates.txt tag template_listener_impl\
  --lib "super::listener_types::*" --spans\
  --nt-value set "<default>,-config,-definitions,-i_def,-lexer,-parser,-options" 

or with the .set_nt_value(…) method of the OptionsBuilder:

#![allow(unused)]
fn main() {
// ... (other constants)
static NT_NAMES: [&str; 7] = [
    "<default>",
    "-config",
    "-definitions",
    "-i_def",
    "-lexer",
    "-parser",
    "-options",
];

fn gen_source_config_l_g(action: Action) {
    let options = OptionsBuilder::new()
        .headers(["use lexi_gram::lexigram_lib::lexigram_core;"])
        // ...
        .span_params(true)
        .set_nt_value(NTValue::SetNames(NT_NAMES.into_iter().map(|s| s.to_string()).collect()))
        .build()
        .expect("should have no error");
    // ...
}
}

With the help of the template, removing the Syn* types and updating the relevant exit_*(…) methods should be easy.

Trailing Comma in Lists

The options in the lexer, parser, and options group are separated by a comma, and so are some list elements like headers, libs, and nt-value set.

However, the grammar doesn’t allow for a trailing comma at the end of the list:

options {
    nt-value: set { 
        "lexer", 
        "parser",
        "options",    // <-- trailing comma 
    },
    spans: true
}

If you modify the grammar to allow it, Lexigram warns you that the grammar is ambiguous. For example, if you modify the following rule:

nt_value:
    Default
|   None
|   Parents
|   Set Lbracket value (Comma value)* comma_opt Rbracket
;

comma_opt:
    Comma
|
;

you get this message, which we split over several lines for clarity (it's indeed a little obscure!):

Warning: - calc_table: ambiguity for NT 'nt_value_1', T ',': 
           <"," value nt_value_1> or <ε> 
           => <"," value nt_value_1> has been chosen

To understand what it means, you must inspect the table with the rule alternatives, at the end of the log:

32: nt_value -> "default"
33: nt_value -> "none"
34: nt_value -> "parents"
35: nt_value -> "set" "{" value nt_value_1 comma_opt "}"
44: . nt_value_1 -> "," value nt_value_1                  <- starts with ","
45: . nt_value_1 -> ε
36: comma_opt -> ","                                      <- starts with ","
37: comma_opt -> ε

The repetition of comma-separated items is handled by nt_value_1. The optional comma after nt_value_1 in production 35 is problematic: when the parser sees a ",", it cannot tell whether it's the separator of production 44 or the trailing comma of production 36.

  • In the first case, annotated <"," value nt_value_1>, the iteration continues with production 44.
  • In the second case, annotated <ε>, the iteration ends with production 45, and the comma is consumed by production 36.

If the parser could predict the next production by looking two tokens ahead instead of one, that is, if it were LL(2), it could check whether the second token is "}" or one of the tokens that can start value ("false", "true", Id, …).

Here, the parser is LL(1), so it resolves the ambiguity by always choosing production 44 and starting a new iteration. If your list has a trailing comma, the parser reports a syntax error when it finds a "}" instead of a token that can start value.

Is it possible to have trailing commas in an LL(1) grammar?

Yes, but you must reword the rules. Let’s consider a simpler, generic rule:

a -> "{" Id ( "," Id )* ","? "}";

which is transformed into:

a   -> "{" Id a_1 a_2 "}";
a_1 -> "," Id a_1 | ε; 
a_2 -> "," | ε; 

The ambiguity disappears if the rules are reworded as:

a   -> "{" Id a_1 "}";
a_1 -> "," a_2 | ε; 
a_2 -> Id a_1 | ε; 

Lexigram doesn’t do that transformation automatically for you. You could write those rules, but you’d have to process the list on your own, which would be less convenient than the vector we receive for free in our current grammar.
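To convince yourself that the reworded rules really are LL(1) and accept a trailing comma, here is a hand-rolled recursive-descent recognizer for them, sketched with single-character tokens (any lowercase letter stands for an Id):

```rust
// Recognizes the reworded grammar:
//   a   -> "{" Id a_1 "}"
//   a_1 -> "," a_2 | ε
//   a_2 -> Id a_1 | ε
fn accepts(input: &str) -> bool {
    fn id(c: char) -> bool {
        c.is_ascii_lowercase()
    }
    let mut toks = input.chars().peekable();
    // a -> "{" Id a_1 "}"
    if toks.next() != Some('{') {
        return false;
    }
    if !toks.next().map_or(false, id) {
        return false;
    }
    // a_1 and a_2, flattened into a loop
    loop {
        match toks.peek().copied() {
            Some(',') => {
                toks.next(); // a_1 -> "," a_2
                match toks.peek().copied() {
                    Some(c) if id(c) => {
                        toks.next(); // a_2 -> Id a_1
                    }
                    _ => break, // a_2 -> ε  (the comma was a trailing one)
                }
            }
            _ => break, // a_1 -> ε
        }
    }
    toks.next() == Some('}') && toks.next().is_none()
}
```

Each decision only needs the next token: after a ",", an Id continues the list (a_2 -> Id a_1), and anything else means the comma was the trailing one (a_2 -> ε).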

An alternative is to end every option in a group with a semicolon, and to keep disallowing trailing commas in lists like libs.

That’s quite typical of the occasional compromise or transformation you must do when you’re working with an LL(1) parser.