Lexer rule evaluation in ANTLR

glenb · February 11, 2022, 7:37pm

I’m using ANTLR for the first time in a parser project. Do you know if ANTLR tries only the lexer rules which would be relevant based on the parser rule it is evaluating at the time? Perhaps ‘no’ is the answer because it conflicts with the definition of context-free grammar? I don’t know.

I have quite a bit of experience with lex/yacc where, unless you take some special steps, all the lexer rules are evaluated in top-down order, and the first to find a match will produce a token in the token stream. Is ANTLR the same way, across all possible lexer rules, even those embedded within a parser rule?
Thanks.

kevinjmackey · February 11, 2022, 8:56pm

Hi,

From the ANTLR Mega Tutorial found here: The ANTLR Mega Tutorial

Blockquote
It’s also important to remember that lexer rules are analyzed in the order that they appear, and they can be ambiguous.
The typical example is the identifier: in many programming languages it can be any string of letters, but certain combinations, such as “class” or “function” are forbidden because they indicate a class or a function. So the order of the rules solves the ambiguity by using the first match and that’s why the tokens identifying keywords such as class or function are defined first, while the one for the identifier is put last.

Hope this helps.

I typically use a separate Lexer Grammar file, so there are no lexer rules defined within the parser rule file. As far as I am aware, within its “first come, first served” approach, ANTLR’s lexer will try to match the largest span of input it can when tokenizing.

glenb · February 11, 2022, 9:59pm

Thank you. I think I was mentally adding: “lexer rules are analyzed in the order that they appear (within a parser rule)”. But it seems that ANTLR works like lex/yacc in that the lexer finds the token regardless of the rule it might be “inside” at the time. Since that’s the case, I think I’ll do the same as you suggest and put all the lexer rules in one file to keep them separate. I understand there are modes and maybe other ways to group lexer rules to add context. I haven’t done that yet, but maybe will for comments and maybe strings. I’m not sure yet about that.

mike · February 17, 2022, 6:33pm

Some clarification:

The Lexer is only “first some first serve” when multiple Lexer rules match an equal length sequence of characters. All Lexer rules are evaluated, and the rule that matches the longest sequence of characters from the input stream will “win”. It’s only when there is a tie for the length that the order of the rules matter.

And specifically to the original question, In ANTLR, the input stream is passed through the Lexer (aka Tokenizer) which will use Lexer rules to recognize tokens and produce a stream of tokens which the parser rules take as input. Since input is tokenized before parser rules act on it, there is nothing a parser rule can do to impact how the Lexer tokenizes the input. While ANTLR is recursive-descent, it is only the parser processing of the token stream that is recursive-descent and can impact which rules are even considered. The Lexer evaluates all Lexer rules giving precedence to the longest matching sequence and breaks ties by their position in the ANTLR source.

Lotes · February 22, 2022, 8:45am

I messaged Glenn about PEG. I was not sure about posting it, because he was asking for ANTLR.
One area where PEG is currently used afaik is Invisible XML. A conversion of source code to XML.
There are a lot of implementations on Github.