Experience with Tree-sitter parsing?

Does anyone here have experience with Tree-sitter parsing and is willing to share it?

E.g.: https://tree-sitter.github.io/tree-sitter/

Thanks,
Robert


In this presentation, Max Brunsfeld from GitHub’s Atom team talks about how syntax highlighting can be improved by incremental parsing with Tree Sitter (starting at 6:11).

He says that this way of syntax highlighting is ideal, because it allows the reader to see the structure of the code just by glancing at the colors (he makes this point between 9:21 and 10:12).
I’d be interested to hear whether you agree with him.

Also, it looks like incremental parsing of text files, which communicates the structure of code via color-coding and other means, brings together textual editing and the ideas of structural/projectional editing in an interesting way, doesn’t it?


After the part about syntax highlighting, he shows code folding, selection features and refactoring features some might recognize from structural/projectional editors. Having it work like that in text files is pretty cool, I think.

To achieve speed, most editors do not do full parsing, just a form of lexing, so they are limited in syntax highlighting and in any structural operation (folding, structure-based selection). What is cool here is that they achieved very fast incremental parsing (using a GLR algorithm). If I understand correctly, the parse tree is updated in under 1 ms for common files, which is very cool, I think.
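To make the lexing-versus-parsing point concrete, here is a minimal sketch in Python. It is a toy, not tree-sitter: the token regex, the `Block` class and the function names are all made up for illustration. The flat token list is enough to color keywords, but fold ranges only fall out once the braces are turned into a nested structure.

```python
import re
from dataclasses import dataclass, field

# Hypothetical token set for a tiny brace-based language.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[{}();=]")


def lex(source):
    """Flat list of (text, offset) pairs: no nesting information at all."""
    return [(m.group(), m.start()) for m in TOKEN_RE.finditer(source)]


@dataclass
class Block:
    start: int
    end: int = -1  # -1 means the block was never closed
    children: list = field(default_factory=list)


def parse_blocks(source):
    """Build a tree of brace-delimited blocks from the token stream."""
    root = Block(0, len(source))
    stack = [root]
    for text, pos in lex(source):
        if text == "{":
            block = Block(pos)
            stack[-1].children.append(block)
            stack.append(block)
        elif text == "}" and len(stack) > 1:
            stack.pop().end = pos + 1
    return root


def fold_ranges(block):
    """Yield (start, end) offsets of every closed block, outermost first."""
    for child in block.children:
        if child.end != -1:
            yield (child.start, child.end)
        yield from fold_ranges(child)


if __name__ == "__main__":
    code = "fn a() { if x { y = 1; } }"
    print(lex(code)[:4])
    print(list(fold_ranges(parse_blocks(code))))
```

A real incremental parser would also reuse the unchanged parts of the tree after each edit; this sketch reparses everything and only shows why a tree is needed for folding.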

The tree exists even if the input is currently syntactically incorrect. If I understand correctly, error correction is implemented by investigating the different options in parallel, trying each alternative in the GLR manner. This sounds really cool.
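As a rough illustration of "the tree exists even for broken input", here is a small Python sketch. Note that it deliberately swaps in a much simpler technique than the one described above: instead of exploring alternatives in parallel as a GLR parser would, it just skips to the next `;` and wraps the skipped tokens in an `ERROR` node. The mini-language (`name = number ;`) and all names are assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    kind: str      # "assignment" or "ERROR"
    tokens: list   # the raw tokens covered by this node


def tokenize(source):
    """Split into identifiers, integers and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def parse_statements(source):
    """Parse `name = number ;` statements, never failing on bad input."""
    tokens = tokenize(source)
    nodes, i = [], 0
    while i < len(tokens):
        chunk = tokens[i:i + 4]
        well_formed = (
            len(chunk) == 4
            and chunk[0].isidentifier()
            and chunk[1] == "="
            and chunk[2].isdigit()
            and chunk[3] == ";"
        )
        if well_formed:
            nodes.append(Node("assignment", chunk))
            i += 4
        else:
            # Panic-mode recovery: swallow tokens up to and including the next ';'.
            j = i
            while j < len(tokens) and tokens[j] != ";":
                j += 1
            nodes.append(Node("ERROR", tokens[i:j + 1]))
            i = j + 1
    return nodes


if __name__ == "__main__":
    for node in parse_statements("a = 1; b = ; c = 3;"):
        print(node.kind, node.tokens)
```

The statements before and after the broken one still get proper nodes, so highlighting and structural operations keep working for the rest of the file.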

I fully agree with you. At this level it does feel like projectional editing since there is a fast sync with the parse tree behind, and the editor operations can be tree based. Incremental full parsing does open many interesting new possibilities.


I watched the presentation this morning; it is well done and interesting.

First, let me say that they assign different colors to syntax categories (for instance to types and fields) which is something I share. They don’t try to make the tree structure visible through colors in general.

The goal of recognizing syntax categories is still hard to achieve with just a parse tree and a general-purpose language.

For me, a cutting-edge solution to this problem requires at least a query language, and two additional DSLs for capturing typing and naming rules would be welcome.
Furthermore, it’s much easier if you start from an abstract syntax tree, so you also need one or more transformation languages to move back and forth between the parse tree and the AST.

In the textual vs. projectional equation, they are still on the textual side, also in the way they reason about the problem.

For instance, in a model-driven approach to editing, I prefer to replace the idea of incremental parsing with a keyboard interaction that is specific to the language and notation; the text is then just a grammar-based persistence format needed to achieve interoperability.

However, it is interesting to note how close they came to a model-driven solution in order to improve the textual experience.

Given the skills shown by the author, I hope he will want to continue in that direction.

I agree, I never claimed otherwise. As I quoted the presenter before, he says that this way of syntax highlighting “achieves the goal […] so that you get the structure of the code, just by glancing at the colors”. To me, that sounds very much like his goal is to use syntax highlighting so that the structure of an (arbitrary) concrete syntax is more explicit and thus more easily comprehensible for readers. Categorizing types and fields into different colors is the way to achieve this goal.

I don’t think I understand what you mean by that. Do you mean “automatically recognizing syntax categories” via an algorithm? Because that’s not what this technology does. Or do you mean from a human perspective, as a reader of code?
There are different implementations of tree-sitter for different languages. Each implementation allows you to define custom syntax highlighting, as far as I understand it.

Could you give or point to an example of that? I cannot follow what you mean by it.

Reasoning about what problem exactly? Could you explain where you draw the line or how you differentiate the “textual” and “structural/projectional” approach?

Thanks,
Robert

Of course, I was referring to the author you quoted faithfully, and I’m grateful to you for suggesting the video. I don’t even blame the author for anything.

I rephrased the goal of the author because a type, taken on its own, is just an identifier; you need to see at least its relation to the parent node to discern anything about its syntax category.
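A minimal Python sketch of that point, with made-up node kinds and class names (none of this is tree-sitter's API or any real grammar): the identifier node itself carries no category, and the highlighter has to look at the parent to decide on a color class.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str
    text: str = ""
    children: list = field(default_factory=list)
    # Excluded from repr/compare so the parent link does not cause recursion.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child):
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


# Hypothetical mapping from parent node kind to highlight class.
CLASS_BY_PARENT = {
    "type_annotation": "type",
    "field_access": "field",
    "call": "function",
}


def highlight_class(node):
    """Classify an identifier by the kind of its parent node."""
    if node.kind != "identifier" or node.parent is None:
        return None
    return CLASS_BY_PARENT.get(node.parent.kind, "variable")


if __name__ == "__main__":
    annotation = Node("type_annotation")
    access = Node("field_access")
    a = annotation.add(Node("identifier", "Point"))
    b = access.add(Node("identifier", "Point"))
    # Same text, different category, depending only on the parent.
    print(a.text, highlight_class(a))
    print(b.text, highlight_class(b))
```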

I agree with the proposed use of colors.

Yes, you need to define a tree-sitter grammar for each supported language, plus a specific syntax highlighting algorithm. But while the author gives you a grammar language for describing the tree, for writing the algorithm he proposes a general-purpose language such as JavaScript.

I am saying that the code you need to write with the general purpose language is still much more complicated than it could be.

It may be that, although not described in the video, there is some form of library to query the structure.
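As far as I know, current tree-sitter releases do include a query facility based on S-expression patterns with captures, although it is not shown in the video. The Python sketch below is not that API. It is a toy matcher over nested tuples, with invented node kinds, just to show how little code a structural query needs compared with hand-written tree walking for every rule.

```python
def query(tree, parent_kind, child_kind):
    """Return the text of every `child_kind` leaf directly under a `parent_kind` node.

    A tree is a tuple (kind, payload) where payload is either the leaf text
    (a str) or a list of child trees.
    """
    kind, payload = tree
    if isinstance(payload, str):
        return []  # leaves have no children to match
    hits = []
    for child in payload:
        child_kind_here, child_payload = child
        if (
            kind == parent_kind
            and child_kind_here == child_kind
            and isinstance(child_payload, str)
        ):
            hits.append(child_payload)
        hits.extend(query(child, parent_kind, child_kind))
    return hits


if __name__ == "__main__":
    tree = (
        "program",
        [
            ("field_declaration", [("type_identifier", "Point"), ("identifier", "origin")]),
            ("call", [("identifier", "print")]),
        ],
    )
    # "Which identifiers are used as the type of a field?"
    print(query(tree, "field_declaration", "type_identifier"))
    # "Which identifiers are called as functions?"
    print(query(tree, "call", "identifier"))
```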

Several language workbenches have at least some of the DSLs I have named.
The best documented, also with academic articles dedicated to the individual aspects, is probably Spoofax (mine is not documented at all :slightly_smiling_face:).

http://www.metaborg.org/en/latest/

So I understand your point boils down to: if a technology is not using external DSLs to query and manipulate a syntax tree, but uses an API and a GPL, it is not cutting-edge. That’s fair, although I cannot agree with that myself.

Not really; let me try to explain better.

For me the distinction between external and internal DSLs is misleading.

On one side, there are general-purpose languages that support guest DSLs with a syntax much better than “function calls” (see SwiftUI inside Swift, for instance).
On the other side, there are language workbenches that are able to embed guest language fragments in any host language, including a general-purpose one.

So “external” is no longer a synonym for “written in a separate source file”; it only means not understandable by the original host language compiler (the GPL one in the example) without a prior phase of guest DSL expansion performed by an additional tool (i.e. an LWB).
My point, regarding the problem the tree-sitter solution tries to solve, is that you need multiple DSLs and not just the one provided for defining the syntax trees (i.e. an internal DSL, which is good enough for me).

We’re using tree-sitter inside Semgrep, and my experience with it has been good. Come to my talk tomorrow and I will speak a little bit about tree-sitter :slight_smile:

I wanted to see the talk, but life got in the way. It seems to me that the talks have been recorded since 2023. @pad, do you know if that was the case here?

Furthermore, real quick: did anyone try to use TreeSitter compiled to WASM? Assume no special “bindings”, just the parser producing - an AST IIRC - but in the fairly “primitive” flat memory as available to any WASM code.

For example, if I wanted to parse a language from Zig, I would compile TreeSitter’s generated parser to WASM, my Zig code to WASM too, and then the question is: how complex would it be to consume the parse result? IOW, how much value is in the “SDK”/bindings that TreeSitter provides for specific languages like Rust, Swift, Python, Lua… (but not Zig, nor C#/.NET, for example)?
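To show the kind of work I mean by "consuming the parse result from flat memory", here is a Python sketch that simulates it. The record layout below is entirely hypothetical and is not tree-sitter's actual memory layout or ABI; I am only assuming that nodes live as fixed-size little-endian records in a linear buffer (WASM linear memory is little-endian) and link to each other by index.

```python
import struct

# Hypothetical fixed-size node record, little-endian, no padding:
#   kind id (u16), start byte (u32), end byte (u32),
#   first child index (i32, -1 if none), next sibling index (i32, -1 if none)
NODE = struct.Struct("<HIIii")

# Hypothetical kind table; a real parser would export its own symbol names.
KINDS = {0: "program", 1: "assignment", 2: "identifier", 3: "number"}


def encode(nodes):
    """Pack a list of node tuples into one flat bytes buffer."""
    return b"".join(NODE.pack(*node) for node in nodes)


def walk(memory, index=0, depth=0):
    """Depth-first traversal of the node table, yielding (depth, kind, start, end)."""
    kind, start, end, first_child, _ = NODE.unpack_from(memory, index * NODE.size)
    yield depth, KINDS[kind], start, end
    child = first_child
    while child != -1:
        yield from walk(memory, child, depth + 1)
        # Follow the sibling link stored in the child's own record.
        child = NODE.unpack_from(memory, child * NODE.size)[4]


if __name__ == "__main__":
    source = "x = 1"
    memory = encode([
        (0, 0, 5, 1, -1),   # program
        (1, 0, 5, 2, -1),   # assignment
        (2, 0, 1, -1, 3),   # identifier "x"
        (3, 4, 5, -1, -1),  # number "1"
    ])
    for depth, kind, start, end in walk(memory):
        print("  " * depth + kind, repr(source[start:end]))
```

If the real layout were this simple, the consuming side would be a few dozen lines in any language. What the bindings add beyond such a walk is exactly what I am trying to find out.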

Yes, the talk was recorded, but in the end we didn’t have much time to speak about tree-sitter.

Regarding WASM, all I know is that the WASM “backend” of tree-sitter is currently used by its playground
https://tree-sitter.github.io/tree-sitter/playground
so I guess it’s working.
There’s also a forum where you can ask those kinds of questions: the Discussions section of the tree-sitter/tree-sitter repository on GitHub.
People also open GitHub issues to get answers to questions about bindings.