A little context first:
I’m working on a grammar for a language called AL, used in Microsoft’s ERP product (built using Antlr4). It’s a bit funky in that it has an object language with a code language nested inside it. The two are not 100% consistent, so I started using a lexer mode for the code, which lets me create very specific lexer rules like TABLE, PAGE, FILTER, etc. The language has a ridiculous number of keywords; however, in practice many of these are not truly reserved by the compiler, since you can use them as variable names in code.
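For reference, the mode split looks roughly like this (heavily simplified, and the rule names are illustrative rather than my actual ones):

    lexer grammar ALLexerSketch;

    // Object-language level
    TABLE : 'table' ;
    PAGE  : 'page' ;
    BEGIN : 'begin' -> pushMode(CODE) ;   // switch into the code language
    ID    : [a-zA-Z_][a-zA-Z0-9_]* ;
    WS    : [ \t\r\n]+ -> skip ;

    mode CODE;
    FILTER  : 'filter' ;
    END     : 'end' -> popMode ;          // back to the object language
    CODE_ID : [a-zA-Z_][a-zA-Z0-9_]* ;
    CODE_WS : [ \t\r\n]+ -> skip ;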
I started defining a ton of lexer rules, but then I began to wonder: is this the best way? Would it perhaps be better to tokenize many of these words as a simple IDENTIFIER, and then in the parser rules add a check to ensure that the identifier token’s text is equal to “table”, for instance? I’m unsure what the most common and recommended practice is in these situations. If I define them all as lexer rules, I think I will have close to 300 of them; granted, if I do text checks instead, I’ll be inserting code segments into a ton of parser rules. Also, I suspect that defining all the explicit rules will bite me in the butt later: no one should ever name a table field “table”, but I think the language technically allows it. In that instance the parser would blow up when someone references <mytable.table> in a field rule somewhere, since “table” would be tokenized as a TABLE token and not as an identifier named “table”.
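To make that failure concrete, here’s a stripped-down sketch (the grammar and rule names are made up for illustration):

    grammar Sketch;

    // Blows up on mytable.table: the second name lexes as TABLE, never as ID.
    fieldRef : ID '.' ID ;

    TABLE : 'table' ;
    ID    : [a-zA-Z_][a-zA-Z0-9_]* ;
    WS    : [ \t\r\n]+ -> skip ;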
You have to work inside the boundaries imposed by the parsing tool you use, so you would probably get better support from the community if you specified which tool you are using.
The problem you are facing is one of lexing ambiguity due to missing context.
There is a lexing technique called context-aware lexing, where lexing is done on the fly during parsing: when the lexer is called, the parser supplies the tokens it expects at the current location. This makes it possible to have, e.g., variables named the same as keywords without having to specify special additional rules. There is also scannerless parsing, which is similar in lexing/parsing power, except that parsing is usually done all the way down to the character level. While either technique could resolve most of the issues you are facing, provided you could use a tool that supports it, semantic ambiguities would remain; those are IMHO best handled, as you described, by relaxed lexing/parsing and deciding later during semantic analysis.
Ah, sorry, I added a tag for Antlr4 but failed to specify that I was using Antlr4. I updated the original post to reflect that.
Thank you for the explanation. I did not realize there were parsers that only tokenize at parse time (though I suppose I should have guessed). Thus far I have only written recursive descent parsers from scratch or used parser-generator frameworks like Antlr.
I failed to see the tag, sorry. But I suspected ANTLR.
I don’t use ANTLR, but after a quick skim through the docs I agree with you that lexer modes would quickly get messy. They seem good for sub-languages, but not that good for handling ambiguity between IDs and keywords. I would go with lexing all keywords that can be IDs as IDs, as you suggested, and using semantic predicates to check whether the keyword is right for the given parser rule/alternative.
Something along these lines:
    table : {_input.LT(1).getText().equals("table")}? ID /* rest of the rule */ ;
You might be able to make things simpler by using parser rule parameters, so it might end up as:
    table : ident["table"] /* rest of the rule */ ;
where ident is a parser rule with a String parameter that does the check.
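Spelled out in full ANTLR4 syntax (assuming the Java target; the rule names here are mine, not from any real grammar), that helper might look like:

    grammar SoftKeywords;

    // A rule that needs the soft keyword "table" just passes it in.
    tableDecl : ident["table"] ID ;

    // Matches an ID only when its text equals the expected keyword;
    // if the predicate fails, the token is rejected here.
    ident[String expected]
        : {_input.LT(1).getText().equals($expected)}? ID
        ;

    // Soft keywords are lexed as ordinary ID tokens.
    ID : [a-zA-Z_][a-zA-Z0-9_]* ;
    WS : [ \t\r\n]+ -> skip ;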
This is just a hunch after reading some docs, no guarantee it would work, so take it with a grain of salt.
Thank you very much. I’m grateful for the advice from someone with a lot more experience. I will try this route and do some more reading on Antlr as well.