Lexigram crates

The Lexigram project is split into several crates. The following sections give a quick summary of what you can find in them.

The dependency tree is as follows:

  • lexigram: the command-line lexer & parser generator
    • lexi-gram: the high-level methods and objects to generate a lexer & parser
      • lexigram-lib: the low-level code that generates the lexer & parser
        • lexigram-core: the minimal library required by the generated parser

lexigram

This crate contains the command-line executable lexigram, a lexer & parser generator.

Link to the crate documentation.

You can install it with

cargo install lexigram

Lexigram is able to read the lexicon and the grammar either separately or from the same location. The location can be either a file or the part of a file found between a pair of custom tags.

The options for the location of the language specifications are as follows:

  • -x|--lexicon: location of the lexicon
    • -x calc.l reads from the file ./calc.l
    • -x calc.lg tag lexicon reads from the file ./calc.lg the lexicon located between a pair of tags [lexicon]
  • -g|--grammar: location of the grammar (same principle as above)
  • -c|--combined: location of the combined lexicon/grammar (same principle as above)

The tags can appear in a comment or in a string; Lexigram simply checks whether the tag text is contained in two lines of the file. If it can’t find both tags, Lexigram stops before going any further; for example, it won’t start writing to the file and risk erasing important code.

Warning

Always leave a blank line after the first tag and before the second when you write the lexicon and the grammar, because those lines are skipped.
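The rules above can be sketched with a hypothetical combined file. Everything here is illustrative: the file name calc.lg, the `//` comment style, and the assumption that the tags appear in the file as [lexicon] and [grammar] (as the option examples above suggest); the placeholder comments stand in for real lexicon and grammar rules, which are not shown.

```shell
# Create a hypothetical combined file. Each tag appears on exactly two
# lines, with a blank line after the first tag and before the second
# (those blank lines are skipped, per the warning above).
cat > calc.lg <<'EOF'
// [lexicon]

// ... lexicon rules would go here ...

// [lexicon]

// [grammar]

// ... grammar rules would go here ...

// [grammar]
EOF

# Each tag pair is findable on two lines of the file:
grep -c '\[lexicon\]' calc.lg   # prints 2
grep -c '\[grammar\]' calc.lg   # prints 2
```

With such a file, the lexicon and grammar could be read with `-x calc.lg tag lexicon -g calc.lg tag grammar`, following the options described above.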

You can find examples of lexicons and grammars, either in their own files or between tags, in the repository. Look for *.l, *.g, and *.lg files in the gen_* crates; the extension doesn’t really matter to Lexigram, but we stuck to this convention for the sake of consistency.

The same system as above applies to the location where the generated code should be written, although you can also output it to stdout. Using tags allows you to keep your grammar in the same file as the generated code, if you want something minimal, or in one of the crate’s Rust files. The generated lexer and parser can even share the same module, although it’s tidier to keep them apart.

  • -l|--lexer: location of the generated lexer
    • -l src/calc_lex.rs overwrites the file ./src/calc_lex.rs
    • -l src/calc.rs tag lexer writes to the file ./src/calc.rs between the tags [lexer] (overwriting anything previously written between those tags)
    • -l -: writes the generated lexer to stdout
  • -p|--parser: location of the generated parser and the wrapper for the user listener (same principle as above)
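To make the output-tag mechanism concrete, here is a sketch of what a target Rust file might look like before generation. The layout is an assumption extrapolated from the input-tag rules above; the file name src/calc.rs and the placement of the comments are illustrative.

```shell
# Hypothetical target file: Lexigram overwrites whatever currently sits
# between each pair of tags and leaves the surrounding code untouched.
mkdir -p src
cat > src/calc.rs <<'EOF'
// Hand-written code outside the tag pairs is preserved.

// [lexer]

// (generated lexer replaces this region)

// [lexer]

// [parser]

// (generated parser replaces this region)

// [parser]
EOF

# A command such as the following would then fill in both regions:
#   lexigram -c calc.lg -l src/calc.rs tag lexer -p src/calc.rs tag parser
grep -c '\[lexer\]' src/calc.rs    # prints 2
grep -c '\[parser\]' src/calc.rs   # prints 2
```

Because Lexigram refuses to write when it cannot find both tags of a pair, a typo in a tag name fails fast instead of clobbering the file.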

Alongside the code of the lexer and parser, Lexigram can optionally generate a code template for the user types and the listener implementation. The location follows the same syntax as the options --lexer and --parser:

  • --types: location of the user types template
  • --listener: location of the listener implementation

There are a few global options that give you control over the generated code. Mainly:

  • --header: extra headers to be put in front of the generated code
  • --indent: the default indentation
  • --nt-value: list or category of nonterminals which hold a value
  • --lib: extra library required by the listener; typically the location where the types of the nonterminals are defined, when they hold a value. Can be used multiple times to add several libraries
  • --log: detailed log, which contains useful information like the list of terminals and nonterminals, or code that can be copied/pasted the first time, like the nonterminal types used in the generated code and a skeleton implementation of the listener
  • -v|--verify: verification that the code previously generated is still the same, which can be used in validation tests

The complete list of options can be obtained with

lexigram -h

Warning

It’s worth noting that

  • The options --lexicon and --lexer must be given before --grammar and --parser.
  • The template options --types and --listener are best given after --parser.
  • If the lexicon and the grammar are combined, --combined must be given first, then --lexer, and finally --parser.
  • The options --header and --indent apply to the previously mentioned part: they apply to the lexer after --lexer and to the parser after --parser. If you want the same headers or indentation for both, you can put those options in front.

In short, follow this order:

  • --lexer, --parser, --types, --listener, then other global options.
  • --indent in front as a default value.
  • --header in front if it must apply to both lexer and parser.

Example:

lexigram -c calc.lg --indent 4 -l lexer.rs --header "#![cfg(test)]" \
  -p parser.rs --header "#![allow(unused)]"

This command:

  • indents both the lexer and the parser by 4 spaces
  • puts #![cfg(test)] in front of the lexer code
  • puts #![allow(unused)] in front of the parser code

Examples

This command reads the combined lexicon/grammar from ./grammar/calc.lg and writes the generated lexer and parser to ./src/lexer.rs and ./src/parser.rs, respectively. The log is output to stderr.

lexigram -c grammar/calc.lg -l src/lexer.rs -p src/parser.rs --log

Adding -v verifies that the generated code is still the same (once it’s been written, of course).

lexigram-core

This crate contains the minimal library required by the generated lexers and parsers.

Link to the crate documentation.

Some public items may be of interest; among others:

  • log is the log object used by the parser, which you may also use in your listener so that all the notes, warnings, and errors are in the same place.
  • text_span defines two traits which can be used in the listener implementation to extract the text corresponding to a terminal or a nonterminal and show it in context of the parsed input. This could be used to indicate where an error has occurred, for instance.
  • char_reader is the object that sends characters to the lexer; the module has a few other methods related to UTF-8.
  • lexer includes the lexer and associated types, including PosSpan, the type of an optional parameter of the listener methods (token position in the text).
  • parser includes the parser.

The lexigram-core crate has an optional feature, delay_stream_interception, which is disabled by default. This feature delays the capture of the next tokens coming from the lexer, in order to further reduce the latency between the capture and the calls to the listener. It can be useful in the following situations:

  • When the parser must ignore the remaining part of an input stream, if that remaining part contains text the lexer cannot scan. Delaying the capture prevents the lexer from issuing a lexical error before the listener can stop the parser. An example is given in watcher, where you can modify the Cargo.toml to toggle that optional feature and see the impact.
  • When the listener intercepts the tokens with intercept_token(…) to change them, if that change depends on another exit_*(…) listener method that may not be called before the token is captured by the parser.

Note that, in both cases, it is usually possible to work around those latency problems by introducing an extra nonterminal whose exit_*(…) method will be called earlier. This is illustrated in the watcher (impl) example: the shutdown nonterminal was added to the grammar for that purpose.

lexi-gram

This crate contains Lexi and Gram, the parsers used to read the lexicon and the grammar. We do indeed require a parser to read those definitions, which led to some interesting recursion during the development of Lexigram, since those parsers generate themselves.

Link to the crate documentation.

It also provides methods and objects to easily generate the parser source programmatically. This is a viable alternative to the command-line executable if you need to generate the code repeatedly or from a list of lexicons/grammars. The main elements are:

  • gen_parser, a module with methods to generate a parser and either return its source as a string or write it to a file.
  • options, a module with an option builder used with gen_parser.
  • A couple of macros that can be used with the options.

You will find examples in the repository; look at the gen_* crates to see how gen_parser is used to generate the code in the other crates.

To illustrate one case, the parser used in the microcalc crate is generated with the following code when action = Action::Generate (Action::Verify is used to verify the generated code, similarly to the -v option of the executable).

static LEXICON_GRAMMAR_FILENAME: &str = "src/microcalc.lg";
static SOURCE_FILENAME: &str = "../microcalc/src/main.rs";
static LEXER_TAG: &str = "microcalc_lexer";
static PARSER_TAG: &str = "microcalc_parser";
const LEXER_INDENT: usize = 4;
const PARSER_INDENT: usize = 4;

fn gen_source_microcalc_lg(action: Action) {
    let options = OptionsBuilder::new()
        .combined_spec(genspec!(filename: LEXICON_GRAMMAR_FILENAME))
        .lexer_code(gencode!(filename: SOURCE_FILENAME, tag: LEXER_TAG))
        .indent(LEXER_INDENT)
        .parser_code(gencode!(filename: SOURCE_FILENAME, tag: PARSER_TAG))
        .indent(PARSER_INDENT)
        .libs(["super::listener_types::*"])
        .build()
        .expect("should have no error");
    match try_gen_parser(action, options) {
        Ok(log) => {
            if action == Action::Generate {
                println!("Code generated in {SOURCE_FILENAME}\n{log}");
            }
            assert!(log.has_no_warnings(), "no warning expected");
        }
        Err(build_error) => panic!("{build_error}"),
    }
}

This would correspond to the command line:

lexigram -c src/microcalc.lg \
  -l ../microcalc/src/main.rs tag microcalc_lexer --indent 4 \
  -p ../microcalc/src/main.rs tag microcalc_parser --indent 4 \
  --lib "super::listener_types::*"

lexigram-lib

This crate contains the code that generates a lexer and a parser, except for the high-level API provided in lexi-gram.

Link to the crate documentation.

lexigram-lib is normally not meant to be used other than as a dependency of lexigram and lexi-gram, unless you want to create the lexicon and the grammar yourself as objects; for example, if you want to generate them programmatically.