"How to write a transpiler": new article online!

Hello everyone!

The new article “How to write a transpiler” is online!
You can read it at this link: https://tomassetti.me/how-to-write-a-transpiler/

Enjoy the read! :wink:


Excellent article!

Thanks for sharing the link (and thanks to @ftomassetti for writing it!)

Cross-language source translation has become a hot topic lately, especially for new languages which transpile to C/C++ before compiling (e.g. Nim).

Many natively compiled languages today also target JavaScript/WASM (e.g. via Emscripten), which is making people more aware of what goes on behind the compiler's scenes (so to speak).

Whereas in the past the compiler was more of a black box whose inner workings end users didn't really need to understand, these new trends of using intermediate languages (for the sake of portability) and cross-compiling to other OSes or web technologies are prompting many end users to investigate how their favorite compilers work.

Rust is also a good example, for it leverages LLVM's intermediate representation to target multiple architectures, including JavaScript/WASM.

I think that language and compiler design is gradually becoming an everyday subject of discussion among programmers, no longer restricted to the domain of academics and engineers, because a basic understanding of some concepts in this field is now a prerequisite for choosing which modern language to use: "Does it also compile to JS/WASM?" is now a common question from someone considering migrating to a new language.

Transpiling is the best approach for creating a new language (or a DSL) that needs to be portable. Nim is a good example because it can be used on any architecture with a good C compiler (currently, only GCC is officially supported and guaranteed to work, but other compilers could be used with some adaptations).
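As a toy illustration of the transpiling idea (a sketch only, nothing like how Nim's real C backend works), here is a minimal "transpiler" that reuses Python's own parser to turn a small arithmetic expression into an equivalent C function:

```python
import ast

# Toy transpiler: translate a tiny arithmetic expression into a C function.
# Illustration only; a real backend handles statements, types, memory, etc.

C_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}

def emit_c(node):
    """Recursively turn a Python AST expression into C source text."""
    if isinstance(node, ast.BinOp):
        return f"({emit_c(node.left)} {C_OPS[type(node.op)]} {emit_c(node.right)})"
    if isinstance(node, ast.Constant):
        return str(node.value)
    if isinstance(node, ast.Name):
        return node.id
    raise NotImplementedError(type(node).__name__)

def transpile(expr, name="f", args=("x",)):
    """Wrap the translated expression in a C function definition."""
    tree = ast.parse(expr, mode="eval").body
    params = ", ".join(f"double {a}" for a in args)
    return f"double {name}({params}) {{ return {emit_c(tree)}; }}"

print(transpile("x * x + 2 * x + 1"))
# double f(double x) { return (((x * x) + (2 * x)) + 1); }
```

The emitted C can then be handed to any C compiler, which is exactly where the portability of this approach comes from: the transpiler never needs to know about the target architecture.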

Intermediate Representation is the other path.
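To make the contrast concrete (with a made-up instruction set, not real LLVM IR), the same kind of expression can instead be lowered to a stack-based intermediate representation, which a backend or VM would then optimize and translate per architecture:

```python
import ast

# Toy lowering pass: arithmetic expression -> invented stack-machine IR.
# The instruction names ("load", "push", "add", ...) are made up for
# illustration and bear no resemblance to real LLVM IR.

IR_OPS = {ast.Add: "add", ast.Sub: "sub", ast.Mult: "mul", ast.Div: "div"}

def emit_ir(node, out):
    """Post-order walk: operands are emitted before their operator."""
    if isinstance(node, ast.BinOp):
        emit_ir(node.left, out)
        emit_ir(node.right, out)
        out.append(IR_OPS[type(node.op)])
    elif isinstance(node, ast.Constant):
        out.append(f"push {node.value}")
    elif isinstance(node, ast.Name):
        out.append(f"load {node.id}")
    else:
        raise NotImplementedError(type(node).__name__)

def compile_to_ir(expr):
    out = []
    emit_ir(ast.parse(expr, mode="eval").body, out)
    return out

print(compile_to_ir("x * x + 1"))
# ['load x', 'load x', 'mul', 'push 1', 'add']
```

The trade-off in a nutshell: transpiling borrows an existing compiler's optimizer and portability, while an IR keeps the whole pipeline under the language's control.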

When I started to look into Nim and Rust, I was intrigued to understand better the differences between these approaches, and the pros and cons of each, which is what ultimately prompted me to dig further into language design and compilers.


I never thought about this, but it seems like an interesting development.

I would say that many languages have used this approach in the past as well. For example, Eiffel.


Sure, Nim is by no means the first language to take that route; I just chose it as a comparison to Rust because they are two modern languages usually considered “competitors” (so to speak), a bit like Python vs. Ruby, which have similar goals but different philosophies: the former dictates “only one idiomatic way to do something”, the latter encourages “being able to do the same thing in many ways”.

So, from what I gather, today's developers are divided into different schools of thought regarding the transpiling vs. intermediate representation (LLVM IR) approaches.

Personally, I'd like to see more discussion of these topics from the SLE perspective, for there are many implications of each approach which would be interesting to explore, e.g. the bootstrapping stage of a new language that compiles to C, code optimization, etc.

In Nim's case, the compiler was originally written in Pascal and then bootstrapped in Nim itself; I guess there might be some special considerations in this regard, as opposed to a compiler that emits assembly code or some intermediate VM assembly-like code.

Ultimately, languages and syntaxes are a wide topic, and we're experiencing increasing brushes with some of its aspects in our daily work. For example, modern editors and IDEs allow users to create custom syntax highlighting definitions that are no longer a mere list of keywords but require some understanding of code context; even more so with the spread of LSP, which relies on fault-tolerant parsers able to highlight documents in real time, leveraging incremental compilation and building a real AST for the source being edited.

Therefore, even syntax highlighting now demands knowledge of how lexers and parsers work, thanks to LSP:

With the advent of IntelliSense, syntax highlighting is no longer a matter of matching listed keywords and coloring them — modern editors today demand semantics, real-time code validation, and project-wide code awareness that can be used for smart refactoring.
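A toy sketch of the difference (the `[kw]...[/kw]` markup and the string-skipping rule are invented for illustration): pure keyword matching happily "highlights" keywords inside string literals, while even a minimally token-aware pass distinguishes code from string content:

```python
import re

# Naive keyword highlighting vs. a (slightly) token-aware pass.
# Hypothetical markup: keywords get wrapped in [kw]...[/kw].

KEYWORD_RE = re.compile(r"\b(for|in|if)\b")

def naive_highlight(src):
    # Pure pattern matching: no notion of context, so keywords
    # inside string literals get marked too.
    return KEYWORD_RE.sub(r"[kw]\1[/kw]", src)

def token_highlight(src):
    # Minimally token-aware: split out string literals first, and
    # only run the keyword pass on the code between them.
    out, i = [], 0
    for m in re.finditer(r'"[^"]*"', src):
        out.append(KEYWORD_RE.sub(r"[kw]\1[/kw]", src[i:m.start()]))
        out.append(m.group())  # string literal passes through untouched
        i = m.end()
    out.append(KEYWORD_RE.sub(r"[kw]\1[/kw]", src[i:]))
    return "".join(out)

code = 'for x in "words for sale": pass'
print(naive_highlight(code))  # wrongly marks the "for" inside the string
print(token_highlight(code))  # leaves string contents alone
```

A real highlighter needs far more (comments, nesting, escapes), which is exactly why this work is drifting toward real lexers and parsers.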

With such high demand for parsing know-how, and the new challenges of real-time fault-tolerant parsers, it is reasonable to expect significant changes and paradigm shifts in the field of SLE/DSLs, to make the topic more accessible to a wider audience, and possibly a move toward new standards that would allow better cross-context design.

For example, the Rust community has invested great energy in developing an official Rust LSP package (currently a usable work in progress) so that any editor or IDE can support the language and offer a rich editing experience. So much so that language-server needs are taken into consideration in the design of the Rust compiler itself, because the Rust LSP relies on incremental compilation to build and test sources quickly.

After all, it makes sense to consider LSP at the early stages of a new language's design, because a good language server can provide editor/IDE-agnostic support for the language, and allows centralizing all the effort in a single package instead of spreading it thin across hundreds of independent editor syntax definitions (none of which would provide code refactoring or real-time code validation from the official compiler).
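For concreteness, the LSP wire format itself is quite simple: JSON-RPC 2.0 messages framed with a Content-Length header. A minimal sketch in Python (the file URI and source text are made up):

```python
import json

# Sketch of the LSP wire format: a JSON-RPC 2.0 message framed by a
# Content-Length header, per the protocol's base spec.

def frame(message: dict) -> bytes:
    """Serialize a JSON-RPC message with the LSP Content-Length framing."""
    body = json.dumps(message).encode("utf-8")
    return b"Content-Length: %d\r\n\r\n%b" % (len(body), body)

# A simplified textDocument/didOpen notification, as an editor would send
# when the user opens a file (URI and text are hypothetical examples).
notification = {
    "jsonrpc": "2.0",
    "method": "textDocument/didOpen",
    "params": {
        "textDocument": {
            "uri": "file:///tmp/example.rs",
            "languageId": "rust",
            "version": 1,
            "text": "fn main() {}",
        }
    },
}

print(frame(notification)[:40])  # header plus the start of the JSON body
```

The simplicity of the transport is part of why LSP spread so quickly: the hard work lives in the server's parser and analysis, not in the protocol plumbing.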

It seems reasonable to conclude that a huge wave of interest (from non-SLE specialists) is going to hit the language engineering community, and, quite possibly, a tsunami of demands and discussions in which engineers might be outnumbered by general-purpose developers (I mean, who doesn't use syntax highlighters?). So, ultimately, it's a matter of facing these upcoming demands and changes, and whether they will be met by new standards devised by some think-tank board, whether solutions will emerge naturally from the chaotic cauldron of collaborative open source efforts, or whether they will come from the tech industry (or a combination of the three).

LSP was conceived by Microsoft, but it's rooted in the history and needs of the open source community built around VSCode, so it's more an example of the FOSS community finding a solution with the financial support of a corporation. No “expert boards” were employed for LSP; it was a democratic decision process, conducted via the usual discussions on forums and GitHub issue threads.

LSP is a good example of an idea conceived out of the needs of a specific editor (VSCode) but with a broader vision to embrace all editors and IDEs, and of how a single solution can ultimately set the standard for the whole community (bottom-up). It's thanks to LSP that today we're witnessing so many efforts to build fault-tolerant parsers, and compiler developers are starting to take incremental, real-time compilation into consideration in their designs in order to simplify LSP support.

Indeed, there are many interesting challenges waiting for us…


I think that more and more people are understanding that tooling is key for languages. It is something to take into account already when designing a language.

I would be happy if that were the case, but I'm afraid there won't be such a huge number of requests :smiley:

The problem is that most users won't know where to channel them, but you can see an increase in questions and help requests on these topics on StackOverflow, and on repositories dealing with languages, compilers, or IDEs. For example, on Sublime Text's forum I've noticed a shift in attention from syntax definitions (via regex-based JSON/YAML files) toward how to implement a parser for LSP.

Another hot topic recently has been how to write Markdown parsers (even pandoc, which adopts a PEG parser, had to resort to manual hacks to handle edge cases), or how to manipulate pandoc/AsciiDoc document ASTs when converting to other formats.

The point is that there's currently more awareness of the needs than of where to go to ask for solutions.


Interesting. Maybe I should be more present on StackOverflow