I’m pretty new to writing parsers (except for a few semesters in college long ago). A colleague and I are working on a transpiler to convert a subset of the SAS language to Python.
We have put together a parser that works nicely for the use case we have been looking at. Unfortunately, I am having trouble getting the data I need in the right order to generate the Python code for SAS data steps.
I’m considering making data steps an island grammar, so I can get a more friendly parse tree out of them. Am I barking up the wrong tree?
Hi David, when I read this I thought “what a coincidence”, then I understood from your other message that we are working on the same project
If your problem is the way the information it is organized in the parse tree, we typically solved that problem with follow-up steps.
What we do is to take the parse-tree we get from ANTLR (some people call it also Concrete Syntax Tree or CST) and “polishing it” to get an Abstract Syntax Tree. We typically represent that AST using Kolasu, an open-source library we created. We may then use the AST for different goals and doing all sort of processing we need on it.
In other words our approach is to typically not try to get the parse-tree already in the form that would fit our needs, but to arrive there in steps: the parse-tree will be closer to the syntactic rules and will be written to make parsing easy. To get processing and code generation easy we will design the AST appropriately, and we will have a transformation in between.
Hope this makes sense, but I confess I have limited familiarity with SAS (I am not the one writing that parser), so I could be missing something.
Thank you for getting back to me. This has given me some ideas to look into.
I am doing my work in Python so I cannot use Kolasu, but it might be useful as as a source of ideas. I am not having much luck finding documentation. Is there a resource you could point me toward?
We need to improve the documentation indeed.
We also considered building something similar for Python and we created a subset of Kolasu in Python for some projects in the past.
The core functionality of Kolasu to represent trees, and possibly references between nodes (making trees into graphs).
In Kolasu we use the “heterogeneous API” approach. It means that we create subclasses for each kind of node. The “homogeneous API” would be using a Node class for all nodes instead.
Each node needs to know its parent, its properties, and its children. Possibly also the containment relationship under which the children are stored. For example, an if-statement would contain a condition and a body. Both the condition and the body would be children of the if-statement but I may want to differentiate among them.
References are non-containment relationships. For example, when instantiating a class Foo, new instantiation statement could have a reference to the declaration of class Foo. This is useful to calculate types and doing other checks. These references transform the tree into a graph.
Once you have these graphs, you may want to add APIs to navigate them. For example, finding all the nodes of a certain kind in a subtree. And also to modify them to perform transformations.
I have used the structure here: OscarGen/uast.py at main · kevinjmackey/OscarGen (github.com) in Python to represent an AST. Some of the concepts are borrowed from here: uast · pkg.go.dev though none of the api.
Using a Visitor I scan the concrete syntax tree produced by the parser and create an AST suited to the needs of the application being created (code generator, transpiler, whatever.)
The AST, as implemented, can persist itself as a JSON structure (calling ToString() at the root iterates over the structure successively calling ToString() on each node) and can also parse such a JSON structure to recreate the AST tree structure.
Unlike Kolasu, all the nodes are homogeneous. That has worked for me. Your mileage may vary.
Portion of the JSON representation: