
Lexer Generator Reference

Lexi is the part of Lexigram that generates a lexer.

The lexer, or scanner, is the very front-end of a parser. It reads the input text, recognizes symbols, and outputs a stream of tokens corresponding to those symbols. The lexer isn’t aware of the syntax of the input language, only of its lexical rules. The stream of tokens is generally read by a parser, but it can also be used directly in a user application.

The lexical rules are defined in a lexicon, in a human-readable language described in this reference. The code of the lexer proper already exists in the lexigram-core crate; its main object is Lexer. Lexi takes a lexicon and generates the data required by Lexer to scan the language.

The language used to describe a lexicon is very similar to that of the ANTLR tool, though there are a few differences.

Notation

We’re using the following conventions in the lexicon:

  • Terminals (or tokens, which means the same in this context) are written in upper camel case: Assign, Id, StringLiteral.
  • Fragments are written in upper snake case: DIGIT, HEX_DIGIT.
  • Modes are written in upper snake case and end with _MODE: DEFAULT_MODE, TAG_MODE.
  • Channels are written in upper snake case and end with _CHANNEL: COMMENTS_CHANNEL.

Those conventions are not enforced by Lexi.

Main Concepts

The lexicon defines the tokens emitted by the lexer, using simple regular expressions:

Assign        : '=';
Add           : '+';
If            : 'if';
While         : 'while';
Id            : [_a-z][_a-z0-9]+;
StringLiteral : '"' (~('"'|'\n'|'\r') | '""')* '"';

From the lexicon, Lexi generates the source code of the lexer:

  • The symbol table, the finite-state-machine tables, and other supporting data.
  • A build_lexer() method that creates a Lexer from that data.

Lexicon Structure

A lexicon can be made up of several elements.

It begins with an optional header:

lexicon LexiLexer;

After the optional header, the lexicon is a list of items, each of which is an option, a rule, or a declaration.

The only option currently implemented is a list of additional channels:

channels { COMMENTS_CHANNEL, ATTRIBUTE_CHANNEL }

A rule is the definition of a token, which is what the lexer outputs, or a fragment, which is akin to an alias or a constant:

fragment DIGIT  : [0-9];
Number          : DIGIT+;           

The only declaration currently implemented is the mode section, whose rules are active only when that mode is the current one. The starting mode, DEFAULT_MODE, is implicitly declared at the top and is thus the default mode of the lexer.

mode COMMENT_MODE; 

Matching Rule

Among the rules of the current mode, the one that matches the longest input is selected to issue a token. If you need the keywords "down", "downto", and "to" in your language, you can safely define the three tokens in whichever order you prefer. The resulting lexer will be the same, and it won’t cut the input "downto" into the two tokens "down" and "to".

Down:   'down';
DownTo: 'downto';
To:     'to';

When there is a conflict between two rules, the first token defined in the lexicon takes precedence. This typically occurs with variable tokens:

Down:   'down';
DownTo: 'downto';
To:     'to';
Id:     [_a-z][_a-z0-9]*;
  • "to" is scanned as To. There is a conflict between To and Id("to"), so To is selected because it’s defined first.
  • "total" is scanned as Id("total"). There is no conflict after reading the first two characters "to" because * and + are greedy by default, so Id captures the whole word "total". The lexer chooses the token matching the longer input, which is Id("total").

Were Id defined first, none of the other tokens would ever be selected. However, Lexigram would detect that some tokens are never output and would produce an error before generating the lexer, suggesting that the definitions be reordered.

With a non-greedy repetition, we would get a different result for "total": once the first two characters "to" have been read by the lexer, To and Id("to") are in conflict:

To:     'to';
Id:     [_a-z][_a-z0-9]*?;
  • "to" is scanned as To. Like above, there is a conflict between To and Id, which is resolved by taking the first definition.
  • "total" is scanned as To, Id("tal"). Once the first two characters "to" are read, there is a conflict because of the non-greedy *?. To is selected because it’s defined first, which leaves the rest of the input "tal". That remaining input is scanned as Id("tal")

In conclusion, the usual way to design a lexicon is to put the fixed tokens first and the variable tokens last, unless there is no risk of conflict.
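The selection policy described above can be sketched in a few lines of Rust (a hypothetical model, not lexigram-core's implementation): each rule reports how long an input prefix it matches, the longest match wins, and ties go to the earliest definition.

```rust
// Hypothetical model of token selection: each rule reports the length of the
// input prefix it matches; the longest match wins, ties go to the earliest
// rule. This sketches the policy only, not lexigram-core's implementation.
fn longest_match<'a>(
    input: &str,
    rules: &[(&'a str, fn(&str) -> usize)],
) -> Option<(&'a str, usize)> {
    let mut best: Option<(&'a str, usize)> = None;
    for &(name, matcher) in rules {
        let len = matcher(input);
        // Strictly longer wins; on a tie, the earlier rule is kept.
        if len > 0 && best.map_or(true, |(_, l)| len > l) {
            best = Some((name, len));
        }
    }
    best
}

fn match_down(s: &str) -> usize { if s.starts_with("down") { 4 } else { 0 } }
fn match_downto(s: &str) -> usize { if s.starts_with("downto") { 6 } else { 0 } }
fn match_to(s: &str) -> usize { if s.starts_with("to") { 2 } else { 0 } }
fn match_id(s: &str) -> usize {
    // Simplified Id: [_a-z0-9]+ (the real rule constrains the first char).
    s.chars()
        .take_while(|c| *c == '_' || c.is_ascii_lowercase() || c.is_ascii_digit())
        .count()
}

fn scan_first(input: &str) -> Option<(&'static str, usize)> {
    let rules: [(&'static str, fn(&str) -> usize); 4] = [
        ("Down", match_down),
        ("DownTo", match_downto),
        ("To", match_to),
        ("Id", match_id),
    ];
    longest_match(input, &rules)
}

fn main() {
    // "downto" is a single DownTo token, never Down followed by To.
    assert_eq!(scan_first("downto"), Some(("DownTo", 6)));
    // "to" ties between To and Id("to"); To is defined first and wins.
    assert_eq!(scan_first("to"), Some(("To", 2)));
    // "total" is fully captured by the greedy Id.
    assert_eq!(scan_first("total"), Some(("Id", 5)));
}
```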

Fragments

The fragment keyword defines a named sub-expression that may be reused several times, like an alias or a constant. It can also be used to split a rule into more explicit components.

fragment DIGIT        : [0-9];
fragment HEX_DIGIT    : ( DIGIT | [A-Fa-f] );
fragment DEC_INTEGER  : DIGIT ( '_' | DIGIT )* DIGIT | DIGIT;
fragment HEX_INTEGER  : HEX_DIGIT ('_' | HEX_DIGIT)* HEX_DIGIT | HEX_DIGIT;
                      
Integer               : DEC_INTEGER | '0x' HEX_INTEGER;

Fragments are globally accessible, so they can be used in any mode section.
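As an illustration of how fragments compose, here is a hypothetical Rust sketch of the Integer rule above, with each fragment expanded into a small helper function (hand-written, simplified logic, not generated code):

```rust
// Sketch of the Integer rule with each fragment expanded into a helper
// function; names mirror the fragments (simplified, hand-written logic).
fn is_digit(c: char) -> bool { c.is_ascii_digit() }        // DIGIT
fn is_hex_digit(c: char) -> bool { c.is_ascii_hexdigit() } // HEX_DIGIT

// DIGIT ('_' | DIGIT)* DIGIT | DIGIT: digits and underscores, starting
// and ending with a digit.
fn is_dec_integer(s: &str) -> bool {
    s.chars().all(|c| is_digit(c) || c == '_')
        && s.chars().next().map_or(false, is_digit)
        && s.chars().last().map_or(false, is_digit)
}

fn is_hex_integer(s: &str) -> bool {
    s.chars().all(|c| is_hex_digit(c) || c == '_')
        && s.chars().next().map_or(false, is_hex_digit)
        && s.chars().last().map_or(false, is_hex_digit)
}

// Integer : DEC_INTEGER | '0x' HEX_INTEGER;
fn is_integer(s: &str) -> bool {
    match s.strip_prefix("0x") {
        Some(rest) => is_hex_integer(rest),
        None => is_dec_integer(s),
    }
}

fn main() {
    assert!(is_integer("7"));
    assert!(is_integer("1_234"));
    assert!(is_integer("0xFF_a0"));
    assert!(!is_integer("1_")); // must end with a digit
    assert!(!is_integer("0x")); // empty hex part
}
```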

Regular Expressions

The regular expressions used in each rule or fragment can include the following elements:

  • 'a', 'script': string literals that must be matched entirely (single quotes are used for both characters and strings)
    • \n, \r, \t, \', and \\: common escape codes for characters inside string literals
    • \u{0}-\u{d7ff}, \u{e000}-\u{10ffff}: general Unicode escape codes
  • character sets:
    • inside square brackets:
      • [abc], [a-z0-9_], [+\-*/]: explicit character set; ‘-’ is used for inclusive ranges inside square brackets
      • \n, \r, \t, \\, \], \-: escape codes for characters inside square brackets (both [ and \[ are accepted)
    • outside square brackets:
      • 'a'..'z': inclusive range of characters
      • .: represents any character
    • valid both inside and outside square brackets:
      • \w: represents a single word character: [_0-9A-Za-z]
      • \d: represents a decimal digit: [0-9]
  • ( and ) around item(s): groups the content to avoid conflicts of precedence with other operators
  • ~: negates the character set on the right:
    • ~[ \t\n\r]: any non-space character
    • ~( 'a' | 'b' | 'c' ): any character but ‘a’, ‘b’, or ‘c’, equivalent to ~'a'..'c'
  • *: zero or more instances of the item on the left:
    • [a-z]*: zero or several letters
    • ab*: ‘a’ followed by zero or more ‘b’ characters
    • ('--'|'==')*: zero or more pairs of ‘-’ or ‘=’
  • +: one or more instances of the item on the left (same principle as *)
  • ?: zero or one instance of the item on the left (same principle as *)
    • after * or +: lazy repetition; for example, '/*' .*? '*/' (prevents the closing "*/" from being captured by .*)
  • |: alternative between several items; for example, 'if' | 'while' | [a-z]+
  • a fragment reference; for example, ( DIGIT | [a-f] )+ where DIGIT has been defined as a fragment

Channels

The lexer can output tokens on different channels when parts of the input must be processed differently. For example, channels can isolate a sublanguage that is better parsed by another parser, or extract comments that must be reinserted into the output of a transpiler but would be too complex to handle in the grammar.

Tokens are normally output on the predefined DEFAULT_CHANNEL (channel #0). If other channels are required, they must be declared with the channels option, and the channel(<channel name>) action must then be used in rules to output the corresponding tokens to that channel.

  channels { COMMENTS_CHANNEL };
  
  Comma           : ',';
  Keyword         : [a-zA-Z]+;
  
  CommentLine     : '//' ~[\r\n]*     -> channel(COMMENTS_CHANNEL);
  WhiteSpaces     : [ \t\r\n]+        -> skip;
  • Comma and Keyword tokens are output on channel #0.
  • CommentLine tokens are output on channel #1 (COMMENTS_CHANNEL).

The lexer provides two methods to output the tokens:

  • get_token(&mut self) -> Result<Option<LexerToken>, LexerError>
  • tokens(&mut self) -> LexInterpretIter, which returns an iterator of LexerToken items.

Both methods provide LexerToken items, defined as the following tuple, where the channel is the second element:

pub type LexerToken = (TokenId, ChannelId, String, PosSpan);

The TokenSpliterator trait defines a few convenient methods which can be used with tokens() to filter the tokens. In particular, keep_channel0() discards all the channels but the default one, and its output can be directly connected to a parser generated by Lexigram. An example can be found in the tutorial.
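A minimal sketch of the channel-filtering idea, modeling LexerToken as a plain (token id, channel id, text) tuple with the position span omitted (the real keep_channel0 operates on the iterator returned by tokens()):

```rust
// Sketch of channel filtering; LexerToken is modeled here as
// (token_id, channel_id, text), dropping the PosSpan for brevity.
type Tok = (u16, u16, String);

fn keep_channel0(tokens: impl IntoIterator<Item = Tok>) -> Vec<Tok> {
    // Keep only tokens on DEFAULT_CHANNEL (channel 0).
    tokens.into_iter().filter(|t| t.1 == 0).collect()
}

fn main() {
    let stream = vec![
        (1, 0, "while".to_string()),   // Keyword on DEFAULT_CHANNEL
        (2, 1, "// note".to_string()), // CommentLine on COMMENTS_CHANNEL
        (0, 0, ",".to_string()),       // Comma on DEFAULT_CHANNEL
    ];
    let kept = keep_channel0(stream);
    assert_eq!(kept.len(), 2);
    assert!(kept.iter().all(|t| t.1 == 0));
}
```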

Modes

The lexicon can be divided into mode sections whose rules are independent of the other modes. This is very useful for scanning a sublanguage.

For example, Lexi uses square brackets to define a character set: [a-z]. The lexicon used to generate Lexi handles the content of square brackets in a separate mode to simplify the definition and to avoid any conflict with the other rules:

fragment ESC_SET_CHAR   : '\\' [nrt\\[\]\-];  // simplified for this example

// ...
Minus           : '-';
FixedSet        : ('\\w' | '\\d');
LSbracket       : '['                       -> push(SET_CHAR_MODE);

mode SET_CHAR_MODE;

RSbracket       : ']'                       -> pop;
Minus2          : '-'                       -> type(Minus);
SetChar         : (ESC_SET_CHAR | ~[\n\r\t\\\]\-]);
FixedSet2       : ('\\w' | '\\d')           -> type(FixedSet);

When the next input character is "[", the LSbracket token is emitted and the lexer switches to SET_CHAR_MODE. In that mode:

  • \w, \d, and - emit the same tokens as they would in the default mode;
  • an unescaped "]" emits RSbracket and returns to the default mode;
  • other valid characters and escape sequences emit SetChar.

If this had to be written as a single rule, it would require a complex repetition of both escaped and unescaped codes of different lengths, while avoiding an accidental capture of the ending "]". It would also emit a single token for the whole expression, leaving the parser to interpret it on its side.

With the use of this separate mode, the definitions are simple, and the separate components are sent to the parser, simplifying its job considerably. Here is the relevant part of Lexi’s grammar:

char_set:
    LSbracket (char_set_one)+ RSbracket
// ...
;

char_set_one:
    SetChar Minus SetChar // adds the range to the current set
|   SetChar               // adds the character to the current set
|   FixedSet              // adds the set to the current set
;

Actions

When a rule is matched, the default action of the lexer is to emit the token. Other actions can be specified by placing them on the right side of a -> symbol; multiple actions are separated by commas:

TagOpen: '<' -> mode(TAG_MODE), more;

List of actions:

  • channel(<channel>) emits the token on a specific channel (which must be declared with the channels option). DEFAULT_CHANNEL is the name of channel 0, the default output channel for all tokens:
    channels { COMMENTS_CHANNEL, WHITESPACES_CHANNEL };
    CommentLine     : '//' ~[\r\n]*         -> channel(COMMENTS_CHANNEL);
    
  • skip discards the match:
    WhiteSpaces     : [ \t\r\n]+            -> skip;
    
  • mode(<mode>) changes the mode:
    TagOpen: '<'                            -> mode(TAG_MODE), more;
    
  • push(<mode>) changes the mode after pushing the current mode on the stack:
    ScriptOpen      : '<script'             -> push(SCRIPT_MODE);
    Id              : [A-Za-z][A-Za-z0-9]+;
    
    mode SCRIPT_MODE;
    ScriptClose     : '>'                   -> pop;
    ScriptInclude   : 'include';
    ScriptExecute   : 'execute';
    ScriptId        : [a-z][a-z0-9]+;
    
  • pop pops a mode from the stack and changes to that mode.
  • more continues evaluating the rules but keeps all the characters that have been matched so far for the next token. Below, the Tag token is associated with the < and > delimiters and all the characters in-between:
    TagOpen         : '<'                   -> mode(TAG_MODE), more;
    
    mode TAG_MODE;
    TagMore         : [a-zA-Z][a-zA-Z0-9]*  -> more;
    Tag             : '>'                   -> mode(DEFAULT_MODE);
    
  • type(<token>) emits another defined token instead of one with the rule’s name. This action can be used if the same token must be emitted from different modes. Below, both Minus in DEFAULT_MODE and Minus2 in SET_CHAR_MODE emit a Minus token for the "-" input:
    fragment ESC_SET_CHAR : '\\' [nrt\\[\]\-];
    Minus                 : '-';
    LSbracket             : '['             -> push(SET_CHAR_MODE);
    
    mode SET_CHAR_MODE; 
    RSbracket             : ']'             -> pop;
    Minus2                : '-'             -> type(Minus);
    SetChar               : (ESC_SET_CHAR | ~[\n\r\t\\\]\-]);
    FixedSet2             : ('\\w' | '\\d') -> type(FixedSet);
    
  • hook marks a token as interceptable. Whenever there’s a possibility that this token is parsed next, the parser calls a specific listener method hook(…), and the user implementation can change the token before the parsing continues. See also Token Interception.
    Id              : [a-zA-Z][a-zA-Z_0-9]*;
    (Type)                                  -> hook;
    

If skip or more is in the list of actions, no token is emitted. Otherwise, the rule emits a token corresponding to the rule name or, if specified, to the type(<token>) action. In the example below, the String and StringContent rules don’t emit any token; they accumulate the input characters. The StringClose rule emits the final String token with all the accumulated characters, including the surrounding double quotes.

String          : '"'         -> push(STRING_MODE), more;

mode STRING_MODE;
StringClose     : '"'         -> pop, type(String);
StringContent   : [ -z]*      -> more;
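How more and the mode stack cooperate here can be sketched as follows (a hypothetical hand-written model of the three rules above, not generated code): nothing is emitted until StringClose fires, and the pending characters carry over from rule to rule.

```rust
// Hand-written model of the String / StringClose / StringContent rules:
// `pending` plays the role of `more`, and the explicit states play the role
// of the mode stack. Not generated code; names are illustrative.
fn scan_string(input: &str) -> Option<(&'static str, String)> {
    let mut chars = input.chars();
    let mut pending = String::new(); // text accumulated by `more`
    // DEFAULT_MODE -- String : '"' -> push(STRING_MODE), more;
    match chars.next() {
        Some('"') => pending.push('"'),
        _ => return None,
    }
    // STRING_MODE
    loop {
        match chars.next() {
            // StringClose : '"' -> pop, type(String); emits everything.
            Some('"') => {
                pending.push('"');
                return Some(("String", pending));
            }
            // StringContent : [ -z]* -> more; keeps accumulating.
            Some(c) if (' '..='z').contains(&c) => pending.push(c),
            _ => return None, // unterminated string or invalid character
        }
    }
}

fn main() {
    // The emitted String token includes the surrounding double quotes.
    assert_eq!(scan_string("\"abc\" rest"),
               Some(("String", "\"abc\"".to_string())));
    assert_eq!(scan_string("abc"), None); // no leading quote: no match
}
```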

Fixed / Variable Tokens

Tokens can correspond to a fixed string literal, like "-", or to an expression that can match different strings, like [a-z]+ or "a" | "A". In the documentation, we call them fixed and variable tokens (or terminals), respectively. The distinction is important if you get the symbol table from Lexi, or when implementing the listener generated by Lexigram.

Fixed tokens have an associated (constant) string literal in the symbol table, and they’re not represented in the context argument of the exit_*(…) listener methods. Variable tokens don’t have an associated string in the symbol table, but they’re represented in the context arguments with the captured input, which is generally useful for the parser: an identifier’s name, an integer literal, and so on.

For example, with the following lexicon / grammar:

lexicon Example;
Equal : '=';
Id    : [a-z]+;
Num   : [0-9]+;

grammar Example;
assign:
    Id Equal Num
;

The generated listener has the following method, where Equal is not present in the context because it’s fixed:

fn exit_assign(&mut self, ctx: CtxAssign) -> SynAssign {
    // assign -> Id "=" Num
    let CtxAssign::V1 { id, num } = ctx;    // id, num: String
    SynAssign()
}

Note

If the same token is emitted by several rules, the corresponding token is fixed only if all the literals are equal:

Interval              : '..';
LSbracket             : '['             -> push(SET_CHAR_MODE);

mode SET_CHAR_MODE;
RSbracket             : ']'             -> pop;
Interval2             : '-'             -> type(Interval);
SetChar               : [a-z];

Since Interval can be either ".." or "-", it is a variable token.
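The decision can be modeled as a small predicate (an illustrative sketch with hypothetical inputs): collect, for each rule that emits a given token, the plain string literal it matches, if any; the token is fixed only when every emitting rule is a literal and all the literals agree.

```rust
// Illustrative predicate: for each rule that emits a given token, record the
// plain string literal it matches (None if the rule isn't a literal). The
// token is fixed only if every emitting rule is a literal and they all agree.
fn is_fixed(literals: &[Option<&str>]) -> bool {
    if literals.is_empty() {
        return false; // no emitting rule at all
    }
    let first = literals[0];
    first.is_some() && literals.iter().all(|l| *l == first)
}

fn main() {
    // Interval is emitted as ".." and, via type(Interval), as "-": variable.
    assert!(!is_fixed(&[Some(".."), Some("-")]));
    // A token emitted by a single literal rule is fixed.
    assert!(is_fixed(&[Some("=")]));
    // Any non-literal emitting rule makes the token variable.
    assert!(!is_fixed(&[Some("down"), None]));
}
```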

Only emitted tokens are considered. In this somewhat contrived example,

A0 : 'a' -> type(A);
A  : 'A' -> skip;

the generated lexer is such that the input

  • "a" emits the token 0 = (A, Some("a")), not (A, None), so it won’t be in the context of the listener methods.
  • "A" is valid but emits no token.

Advanced Concepts

Token Indices

Each token has an index, which is the index in the symbol table and the value used internally by the lexer and the parser. Normally, it’s transparent to the Lexigram user, who doesn’t need to deal with those values, except when the tokens must be intercepted. In that case, an option of Lexigram generates an optional type that represents the tokens in an enum: --token-enums (or the option builder method token_enums(true), if the lexer is generated from Rust code). See Token Interception for more details.

As a general rule, the index corresponds to the position of the rule in the lexicon. There are a few nuances:

  • Rules that don’t produce a token don’t count in the indexing, in order to avoid gaps in the indices.
  • If a rule has a type(Other) action, its definition is used for matching the input, but the output token is Other and its index is the one given to the Other rule. Note that even if the Other rule itself doesn’t produce anything, it’s still considered “productive” because it is referenced by another productive rule.
    Zero: '0';
    NotAToken: 'x' -> type(Other);
    One: '1';
    Other: 'y' -> skip;
    
    produces a lexer such that the input
    • "0" emits token 0 = (Zero, Some("0"))
    • "1" emits token 1 = (One, Some("1"))
    • "x" emits token 2 = (Other, Some("x"))
    • "y" is valid but doesn’t emit any token

Token Interception

If a rule has the name of the token between parentheses and no definition, it’s a reserved token. Lexi reserves the token index but doesn’t create any lexer rule that outputs the token, unless another rule specifies the token with the type action.

Reserved tokens are mainly used when the tokens are intercepted dynamically by the listener, where they can be changed to another token.

In this example lexicon, Id matches all identifiers, even those that may be types. Type is reserved, and the hook(…) method turns an Id into a Type when the identifier is indeed a language type or a user-defined type.

Id              : [a-zA-Z][a-zA-Z_0-9]*;
(Type)                                      -> hook;

Its accompanying grammar has, among other rules, decl:

program:
    (<L=decl_i> decl)* (<L=inst_i> inst)+
;
decl:
    Type Id (<L=id_i> Comma Id)* SemiColon
|   Typedef Type Id SemiColon
;
inst:
    Let Id Eq expr SemiColon
|   Print expr SemiColon
;
// ...

The two alternatives include Type, though at a different position.

  • When decl is the next nonterminal to derive, or when there’s a possibility that decl is predicted, hook(…) is called to determine the actual token and decide whether the first production should be chosen.
  • Also, if the second alternative has been predicted, hook(…) is called after Typedef has been successfully matched. At this point, if the method returns anything other than the token Type, the parser issues a syntax error.
impl TypedefListener for TypeListener<'_> {
    // ...
    fn hook(&mut self, token: TokenId, text: &str, span: &PosSpan) -> TokenId {
        match text {
            "int" | "float" | "double" => Term::Type as u16,
            t => {
                if self.types.contains_key(t) {
                    Term::Type as u16
                } else {
                    token
                }
            }
        }
    }
}

(Term is an enum generated with the --token-enums option.)

Warning

Since the method is called whenever the hooked token may occur at some point in the grammar, it will potentially receive all the other tokens that could also occur at that point.

In the grammatical definition of decl above, the Type of the first alternative is “competing” with the Typedef of the second alternative. Therefore, the method could get Term::Type or Term::Typedef. That’s not all: inst competes with decl as the next predicted nonterminal in program, so Let and Print are two other tokens that hook(…) may receive.

The listener offers another way to intercept all the tokens: the parser calls the listener’s intercept_token(…) method each time it receives a token from the lexer. Its default implementation is empty, so it’s optimized out by the compiler. Implementing it to intercept tokens may have a bigger performance impact than hook(…), and it requires taking all the incoming tokens into account.
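The intercept_token pattern can be sketched like this (an illustrative trait, not lexigram-core’s actual API; the token constants and helper are assumptions):

```rust
// Illustrative sketch of the intercept_token pattern; the trait, constants,
// and helper below are hypothetical, not lexigram-core's API.
use std::collections::HashSet;

type TokenId = u16;
const ID: TokenId = 0;
const TYPE: TokenId = 1;

trait Listener {
    // Empty default: listeners that don't intercept pay nothing.
    fn intercept_token(&mut self, token: TokenId, _text: &str) -> TokenId {
        token
    }
}

struct TypeListener {
    types: HashSet<String>, // user-defined type names
}

impl Listener for TypeListener {
    fn intercept_token(&mut self, token: TokenId, text: &str) -> TokenId {
        // Promote Id to Type for built-in and user-defined type names.
        if token == ID
            && (matches!(text, "int" | "float" | "double") || self.types.contains(text))
        {
            TYPE
        } else {
            token
        }
    }
}

fn listener_with(types: &[&str]) -> TypeListener {
    TypeListener { types: types.iter().map(|t| t.to_string()).collect() }
}

fn main() {
    let mut l = listener_with(&["point"]);
    assert_eq!(l.intercept_token(ID, "int"), TYPE);
    assert_eq!(l.intercept_token(ID, "point"), TYPE);
    assert_eq!(l.intercept_token(ID, "x"), ID);
}
```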

See also the gen_typedef and typedef examples, which illustrate the two interception methods.

Formal Grammar Definition

Here are the lexicon and the grammar of Lexi, using the Lexigram syntax. Those files can be found in the repository, in the build-stage1 directory.

Lexi’s Lexicon

lexicon LexiLexer;

fragment BLOCK_COMMENT  : '/*' .*? '*/';
fragment LINE_COMMENT   : '//' ~[\r\n]*;
fragment HEX_DIGIT      : [0-9a-fA-F];
fragment UNICODE_ESC    : 'u{' HEX_DIGIT+ '}';
fragment ESC_CHAR       : '\\' ([nrt'\\] | UNICODE_ESC);
fragment CHAR           : ESC_CHAR | ~[\n\r\t'\\];
fragment CHAR_LITERAL   : '\'' CHAR '\'';
fragment STR_LITERAL    : '\'' CHAR CHAR+ '\'';

// Char inside a '[' ']' set
fragment ESC_SET_CHAR   : '\\' ([nrt\\[\]\-] | UNICODE_ESC);

Arrow           : '->'; /* // first token // */
Colon           : ':';
Comma           : ',';
Dot             : '.';
Ellipsis        : '..';
Lbracket        : '{';
Lparen          : '(';
Negate          : '~';
Minus           : '-';
Plus            : '+';
Or              : '|';
Question        : '?';
Rbracket        : '}';
Rparen          : ')';
Semicolon       : ';';
Star            : '*';

Channels        : 'channels';
Fragment        : 'fragment';
Lexicon         : 'lexicon';
Mode            : 'mode';
Pop             : 'pop';
Push            : 'push';
More            : 'more';
Skip            : 'skip';
Type            : 'type';
Channel         : 'channel';
Hook            : 'hook';

SComment        : BLOCK_COMMENT             -> skip;
SLineComment    : LINE_COMMENT              -> skip;
SWhiteSpace     : [ \n\r\t]+                -> skip;

Id              : [a-zA-Z][a-zA-Z_0-9]*;

CharLit         : CHAR_LITERAL;
StrLit          : STR_LITERAL;

FixedSet        : ('\\w' | '\\d');

LSbracket       : '['                       -> push(SET_CHAR_MODE);

mode SET_CHAR_MODE;

RSbracket       : ']'                       -> pop;
Minus2          : '-'                       -> type(Minus);
SetChar         : (ESC_SET_CHAR | ~[\n\r\t\\\]\-]);
FixedSet2       : ('\\w' | '\\d')           -> type(FixedSet);

Lexi’s Grammar

grammar LexiParser;

file: header? file_item*;

file_item:
    option | declaration | rule
;

header:
    Lexicon Id Semicolon
;

declaration:
    Mode Id Semicolon
;

option:
    Channels Lbracket Id (Comma Id)* Rbracket
;

rule:
    rule_fragment_name Colon match Semicolon
|   rule_terminal_name Colon match (Arrow actions)? Semicolon
    // only reserves the token, but doesn't add a rule to scan it:
|   Lparen rule_terminal_name Rparen opt_str_lit (Arrow Hook)? Semicolon
    // "grammar" Id ";" => start of grammar section
|   rule_terminal_name Id Semicolon
;

opt_str_lit:
    (Colon StrLit)?
;

rule_fragment_name:
    Fragment Id
;

rule_terminal_name:
    Id
;

actions:
    action (Comma action)*
;

action:
    Mode Lparen Id Rparen
|   Push Lparen Id Rparen
|   Pop
|   Skip
|   More
|   Type Lparen Id Rparen
|   Channel Lparen Id Rparen
|   Hook
;

match:
    alt_items
;

alt_items:
    alt_item (Or alt_item)*
;

alt_item:
    repeat_item+
;

repeat_item:
    item Star Question?
|   item Plus Question?
|   item Question?
;

item:
    Id
|   CharLit (Ellipsis CharLit)?
|   StrLit
|   char_set
|   Lparen alt_items Rparen
|   Negate item
;

char_set:
    LSbracket (char_set_one)+ RSbracket
|   Dot
|   FixedSet
;

char_set_one:
    SetChar Minus SetChar
|   SetChar
|   FixedSet
;