I have created a parser generator, so I watch the market for
this kind of tool. It seems that ANTLR is the dominant tool
in use at this time; however, Bison/Flex is getting a huge
number of downloads at SourceForge. So I’m curious to
know which tools are being used by the members of this
community. Or maybe you don’t use a tool. I would like to
know that also.
I’ve been using Coco/R (http://www.ssw.uni-linz.ac.at/coco/). It generates an LL(k) parser: LL(1) conflicts are resolved by multi-symbol look-ahead and/or semantic checks.
P.S. As a part of my own research on formal grammars and parsing algorithms, I’ve implemented a prototype parser generator for an extension of context-free grammars with contexts: http://users.utu.fi/mikbar/gwrc/
What is your preferred programming language for parsing?
I use Rascal and SDF2; both generate scannerless generalized parsers, the first using the SGLL algorithm and the second using SGLR. I use them for DSLs as well as (legacy) programming languages. Although they are not the fastest algorithms, their generality gives me modularity and the flexibility to model the syntax of languages which were grown (rather than designed).
I use hand-written parsers solely because I think the learning curve of most parser generators is rather steep. Also, my personal development style, which avoids GUIs and complicated build systems, tends to lend itself much more to self-written parsers than to parser generator tools.
I have looked at ANTLR a number of times, but don’t seem to be able to get my brain to wrap around how to implement Occam/Python-style indentation using ANTLR. In particular, I’d love to get a parser generator to help with the rather big task of creating and maintaining the AST nodes for the parse tree.
Personally, I’m adamant about using LL(k) parsing techniques because I think LALR(1) tends to give the rather unreadable and illogical syntax that you see in most C-inspired languages. I think Niklaus Wirth, the creator of Pascal, did a lot to show that LL(1) or LL(k) generally yields much more readable code than LALR(1).
The above is probably just a personal bias, though.
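For what it’s worth, the Occam/Python-style indentation problem is usually solved before the parser ever runs, in a lexer pre-pass. A minimal sketch of that technique (the token names and shapes here are invented for illustration, not taken from ANTLR or any other tool): compare each line’s leading whitespace against a stack of open indentation levels and emit synthetic INDENT/DEDENT tokens, so the grammar itself only sees ordinary block delimiters.

```python
# Sketch of an indentation pre-pass for Python/Occam-style layout.
# Each line's leading spaces are compared against a stack of open
# indentation levels; INDENT/DEDENT tokens are emitted on changes.
# Token names ("INDENT", "DEDENT", "LINE") are illustrative only.

def indent_tokens(source):
    tokens = []
    stack = [0]                          # open indentation levels
    for line in source.splitlines():
        if not line.strip():             # blank lines carry no layout info
            continue
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:            # deeper: open one block
            stack.append(width)
            tokens.append(("INDENT", width))
        while width < stack[-1]:         # shallower: close blocks
            stack.pop()
            tokens.append(("DEDENT", width))
        if width != stack[-1]:
            raise SyntaxError("inconsistent indentation")
        tokens.append(("LINE", line.strip()))
    while stack[-1] > 0:                 # close blocks still open at EOF
        stack.pop()
        tokens.append(("DEDENT", 0))
    return tokens
```

With this in front, the parser proper can treat INDENT/DEDENT exactly like `{` and `}` in a conventional grammar, which is how the indentation issue is normally kept out of the grammar rules.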
I’ve only used ANTLR, if we exclude some long-forgotten toying with packrat parsers in Common Lisp.
Parsing Python with ANTLR is definitely possible - Federico knows
See above; I replied to archfrog there.
I agree with your preference to avoid the GUIs and more complicated stuff.
I would like to know what you mean by “LALR(1) tends to give unreadable syntax …”.
Do you mean unreadable grammar rules? Or unreadable parser code?
Unreadable grammar rules are the fault of the guy who wrote the grammar.
Unreadable parser code is the fault of the state-machine parser algorithm.
Why do you care about the parser code, when the user-supplied code deals with the AST and symbol table after parsing is finished? Or do you like to see the rules of the grammar in the parser code?
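On that last question: seeing the grammar in the parser code is the usual argument for hand-written recursive descent, since each rule becomes one function. A minimal sketch, using a small invented two-rule grammar (nothing here comes from any particular tool):

```python
# Hand-written recursive descent for a tiny invented grammar:
#   Expr ::= Term { "+" Term }
#   Term ::= NUMBER | "(" Expr ")"
# Each rule is one method, so the grammar is visible in the parser.

import re

def tokenize(s):
    return re.findall(r"\d+|[+()]", s)

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):                      # Expr ::= Term { "+" Term }
        value = self.term()
        while self.peek() == "+":
            self.eat("+")
            value += self.term()
        return value

    def term(self):                      # Term ::= NUMBER | "(" Expr ")"
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

print(Parser(tokenize("1+(2+3)")).expr())  # prints 6
```

By contrast, a table-driven LALR parser encodes the same rules as state numbers, which is what makes the generated parser code itself unreadable, independently of how readable the grammar is.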
I may be generalizing a bit too much, what I meant is that in my opinion, LALR(1) parsers give extra freedom for hard-to-read syntactic structures that some people unfortunately tend to make use of.
So, I’m basically talking about the language that’s being parsed, not the parser generator language nor the implementation language. For instance, C is notorious for allowing very complicated expressions and especially type constructs.
So it is not related to the parser code as such; as you say, I don’t need to care about that at all.
@ftomassetti, could you do a short piece on how to parse Occam/Python-style code using ANTLR or have you already done such a piece?
My main problem here is pure laziness. I know how to write a top-down parser by hand and don’t want to spend weeks or months learning all about ANTLR just to be able to do Occam/Python-style indentation using ANTLR.
But I wouldn’t be the least bit surprised if you have already covered this ground.
So, you are talking about grammar rules that are too complex. Maybe that is because they are using disambiguating rules, which are not available with ANTLR. Or not using disambiguating rules, which tends to make expressions about 10 levels deep in the grammar. I don’t see how this is worse than using ANTLR. You can even write ambiguous grammars with ANTLR and it does not tell you about it. I guess maybe it’s a subjective thing, because you started with recursive-descent coding. I actually like the power of LALR or LR(1), and if the C language requires a more complicated grammar, then OK. I guess that’s because I started with LALR instead of recursive descent.
I have been using a number of parser generators (e.g. Yacc, LLgen, javacc, Xtext, pegjs); which one I choose depends on the circumstances.
The thing I like about parser generators is the brilliant DSL they provide. It allows me to express the syntax of whatever I want to parse concisely, using a domain-specific notation: grammar rules.
My productivity using these DSLs is at least an order of magnitude higher than doing it manually, and the resulting DSL code (i.e. the grammar) is much easier to change and maintain than any handwritten code. The syntax is literally staring me in the face.
Aren’t these the reasons why we are developing DSLs in the first place?
Very interesting viewpoint. A BNF grammar is a DSL for defining DSLs. I agree. But then there is the learning phase. Learning to use a parser generator is a pain for beginners, but not so much after you have already learned one. What I especially don’t like is to see code in a grammar.
As the grammar is a contract for many different kinds of language processing stages, I tend to optimize the shape of the grammar for readability and maintainability. So it helps me to have a modular grammar, as few non-terminals as possible, and definitely no factored rules.
LALR cannot provide any of this; in particular, modularity is hard, and so I find LALR parser generators hard to work with. If I were writing a parser just once, maybe it would do, but maintaining an evolving and growing language, or many language versions… nah!
So I always choose parser algorithms which do not care about the shape of the rules, so I can simply write stuff like
Expression ::= left Expression "+" Expression > left Expression "*" Expression > … ;
As a result, you get a lot fewer non-terminals, and parse trees which are very close in shape and size to abstract syntax trees; most of the time the mapping can be done fully automatically. This removes the need for maintaining the mapping between grammar rules and AST node definitions, and, depending on the type of technology you use, even the need for ASTs entirely.
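A rough illustration of what such priority declarations buy you (this is not Rascal or SDF2 code; the operators, priority table, and tuple-shaped AST are all invented): a single Expression rule with priority levels behaves like precedence climbing, and the result comes out as AST nodes directly, with no Term/Factor layers in between.

```python
# Sketch of priority-driven expression parsing (precedence climbing).
# One "Expression" rule plus a priority table replaces a cascade of
# Term/Factor non-terminals, and the output is already an AST.
# The operator set and priorities below are illustrative only.

import re

PRIORITY = {"+": 1, "-": 1, "*": 2, "/": 2}   # higher binds tighter

def parse(tokens, min_prio=1):
    node = tokens.pop(0)                       # operand: a number or name
    while tokens and tokens[0] in PRIORITY and PRIORITY[tokens[0]] >= min_prio:
        op = tokens.pop(0)
        right = parse(tokens, PRIORITY[op] + 1)  # +1 gives left associativity
        node = (op, node, right)               # AST node, no extra layers
    return node

ast = parse(re.findall(r"\w+|[+\-*/]", "a + b * c - d"))
# → ('-', ('+', 'a', ('*', 'b', 'c')), 'd')
```

The tree shape mirrors the priorities directly, which is the point being made above: when the generator does this for you, the parse tree and the abstract syntax tree are essentially the same thing.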
Also it helps a lot to have EBNF notation for regular expressions, like lists, separated lists, optionals and inline alternatives to keep a grammar concise. Some LALR generators do not offer regular expressions.
The learning phase depends on your background, indeed. If you have knowledge of grammars (which every educated IT person should have, of course, but which is missing from many curricula), a parser generator will be easier to understand, but yes, every DSL has a learning phase.
I don’t like code in grammars either, mainly because the tools usually don’t understand both the grammar and the code language. If they did it would be so much easier to use code in grammars.
This is where language composition comes in: I would like to combine the two languages seamlessly, with both of them having full IDE support. With tools like MPS, which use projectional editors, this can be achieved.
I’m using LRstar, and it allows regular expressions in grammar rules. It also handles the mapping from parse tree to AST automatically, because it does not waste time creating a parse tree: it maps rules directly to the AST if desired. It does not allow code in the grammar, so you don’t have to keep rerunning the parser generator when the code changes. It generates LALR, LR(1), or LR(*) parsers depending on how complex the grammar is. So the problems you mentioned are not an LALR problem.
I think you can find a grammar for Python for ANTLR4. I did not write it.
I think that ANTLR works extremely well for the common cases, and it is flexible enough to handle all cases. I think I wrote some parsers for indentation-based languages, and I definitely wrote parsers for column-based languages. In years of use I never found a case in which I could not just use ANTLR.
Totally agree. This is especially important when you are designing a new language and keep refactoring the grammar all the time, in my opinion.
I would definitely also not want to work without regular expressions and automatic tree building. It seems like LR(*) does a lot for you.
What about factoring, modularity and simplicity of the grammar? Those are my main points; not needing ASTs is a happy corollary.
I don’t know what you mean by factoring and modularity. Is that something that the parser generator does for you? LRstar offers the most concise regular-expression notation I have seen (e.g. [a|b|c]/’,’…). It can even let you generate pseudo-code as it traverses the AST, but most people probably don’t do that. There is a paper on that topic: https://dl.acm.org/doi/10.1145/1147214.1147218