Beginners' Tutorial: Write a C Interpreter

tajmone · March 15, 2021, 9:26am

I’ve come across this interesting tutorial on how to create a bootstrapping C interpreter with four functions:

GitHub - lotabout/write-a-C-interpreter: Write a simple interpreter of C. Inspired by c4 and largely based on it.

The tutorial was written by Jinzhou Zhang, and it’s based on Robert Swierczek’s c4 project:

GitHub - rswier/c4: C in four functions

For people like me, who don’t have a formal education in software engineering, this kind of tutorials are really helpful, because they make compiler design more accessible to a wider public.

I’ve found and read a number of good books on the topic, but most of them tend to be too academic, in the sense that they inevitably cover some theoretical topics, and usually they do so in the same exact order — which I guess has to do with the fact that they are intended to mirror the classical way the subject unfolds in university courses.

Although the above linked tutorial might not be the best example of English writing (the author is not a native English speaker, and he acknowledges not mastering the language), I see it as a good example of how a tutorial targeting non-experts should be written.

And I would love to see more tutorials like this one.

The problem is not the lack of tutorials on the topic of language/compiler design (on the contrary), but rather that most of them tend to be cut-down versions of what you’ll find in most academic books on the subject. So the problem has more to do with the way the topic is approached.

In my opinion (and the author of the tutorial shares my view) is that non-trained programmers who approach language design are most likely interested in learning how to write a parser by hand, and not how to create one with a parser generator (which is what the majority of books and tutorials do).

I probably don’t even need to mention that the vast majority of the compiler design tutorials found online deal with implementing a calculator — surely, parsing these expression teaches you the basis, and operators precedence, and the shunting-yard algorithm, and blah, blah — but does anyone really believe that someone who sets out to learn how to create his/her own language is thinking to develop a calculator? The problem is that these tutorials stop there, leaving you with 1000 lines of code on how to emulate a calculator (most likely using yacc, lex, bison or some other generator tool to achieve this). That’s unlikely something that’s going to quench the thirst for knowledge and expectations of those who want to learn how to create a language of some sort.

I’m well aware that there are good reasons why the academic approach is shaped they way it is, and that it’s the soundest approach to the subject; but it’s an approach tailored to match the formal education of colleges, on how the various topics are handled in the education system.

For an “outsider” this approach doesn’t play all that well. Theory needs to be backed up by practical examples, and possibly chunked down to a manageable size. I’ve never quite got around the fact that few books on the topic of language engineering actually focus on hand writing a parser, and that when they do it’s usually an exercise on how to write a parser that can build lexer from EBNF grammars (i.e. an example on how to create a parser generator).

Today we have many developers who don’t necessarily have a software engineering training (e.g. web designers who jumped into programming), which are showing a growing interest in language and parsers design. For example, my interest was captured when I stared to create my fist syntax highlighter definition for an unsupported language, and then started to create syntaxes for editors, etc. This is what forced me to learn about BNF, parsing, etc. (something I wouldn’t have dared to approach otherwise), and then started to realize that these were the same building blocks of language engineering, and that maybe I could (after all) explore the possibility of experimenting with real languages.

I can’t avoid noticing that tutorials like the one mentioned here are often the result of great efforts by non-specialists to break through the “academic barrier,” succeeding in some measure, and then sharing their findings with others. So many similar articles complain about “academic barriers” encountered in their attempts to break through the topic, and that once these barriers were overcome they could finally see that creating a compiler is not that hard as books on the topic make it look like.

I really hope that in the future we might see more books on the topic of language and compiler design, written by engineers for an audience of untrained programmers. And by this I mean books where the author puts aside all his/her formal training, focusing on how to deliver the topic by reducing its scope and complexity, by turning it into a practical hands-on example on how to create a proof of concept (yet working) language (not a calculator). Authors who are willing to drop all the non-essential academic jargon in favor of layman terms that can bridge the gap and make the topic more accessible.

This great community of ours, Strumenta, which welcomes both the professionals and the enthusiast amateurs, could be the ideal place to try and fill the knowledge gap on this precious topic of language engineering.

After all, when we look back at when emails were born, who would have ever thought that soon everyone would be using them on a daily basis, regardless of age and education, and carrying their emails client with them wherever they went? Back then, emails were the exclusive domain of high-tech, long-bearded wizards living (Oops, working) in some basement/lab at MIT or some other rocket-scientists facility. The same goes with many other computer related technologies, which were probably not even designed (let alone envisioned) to be used by the masses. But then came the “computer revolution”, and now everyone carries in his/her pocket a smart phone device far more powerful than the Pentagon’s super computers of those days.

Computers play such a big role in our daily lives that computer science has leaked into culture, and the boundaries between the domain of expertise and users have become blurry, since they often overlap in real world usage. Language design is no exception, and it’s definitely a domain upon which even non-engineers have set their eyes on.

Surely, there’s always going to be a limit on how far an untrained, non-engineer can venture into this field. But this is not a reason to declare it a “no go” zone, on the contrary. Limits can always be surpassed, if one is willing to catch up with studies, as it’s often the case. The problem right now seems to me that we lack a well-trodden path bridging the gap between academia and amateurs when it comes to language and compiler design … which brings us back to square one, the tutorial linked at the beginning of this post.

What makes the Write a C Interpreter tutorial so special (at least to me)?

First of all, it’s worth mentioning that its author, Jinzhou Zhang, was actually a computer engineering student who dropped out of college right before the compiler design course was about to begin. He mentions this in his tutorial, which is full of personal thoughts and insights that are just as precious as the code. So, he’s neither a total amateur nor a fully trained engineer, he stands somewhere in-between, having some grounding into academia, and at the same time being left with a thirst for knowledge which he needs to quench.

The actual code was taken from a third party project, the c4 repository by Robert Swierczek, described as “an exercise in minimalism”. Jinzhou Zhang had sufficient training to work his way through the code, and a balanced view on the topic between what are the needs of the untrained and the academic ways of presenting it, which enabled him to chunk up the original code into a multi-step approach around which he could tailor a step-by-step tutorial targeting untrained programmers. And I think that he succeeded in doing so (and the 2.5k stars on his repository confirm this).

There are hundreds of compiler design tutorials in the wild, but only few of them seem to hit mark of bridging the gap that separates academia and amateurism. Bridging this gap is the role of adventurers, and not an easy task either, for it requires striking a delicate balance between the needs and limits of those who lack the means, on the one hand, and the solid foundations upon which these sciences were erected, on the other.

There probably isn’t a single correct way to go about it, but some people will simply feature better than others at the task, and others will follow their footsteps, until a well-trodden path is established. Well trodden-paths eventually evolve into main routes, which attract services for the travelers, which lead to an economy (motels, restaurants, bars), and eventually a town is build around a main road.

My guess is that, eventually, this is how the current gap will be bridged (figuratively speaking). The clear signs of demand are out there for all to see, but supply is still short and an open challenge — no one says it’s going to be easy; on the contrary, it’s quite hard to deliver in (i.e. translate to) layman terms that which was naturally learned through the language of scientific specialism, because the whole educational path is designed to be as smooth a transition as possible — whereas the gap we’re speaking about is full of bumps and wrinkles that needs ironing out.

Probably the best (if not only) way to fill this gap is by paying attention to those who succeeded (in some degree) in these early steps; and by mutual feedback between the trained and untrained, between provider and consumer (so to speak). And I can’t think of a better place than this community, Strumenta, for this task, since it includes both professionals and amateurs, and offers a trusted (and closed) space where it’s possible to discuss the topic without starting flame-wars.

I hope that the tutorial link and my shared thoughts on this topic might inspire those who wish to open the doors of language design to non-experts, in some degree or another, e.g. by writing books, creating video tutorials, etc.

ldesantis · March 15, 2021, 6:11pm

Hi, let me do a quick comment on your post. Industry level compilers are complex objects: good tutorials can help you to jump faster in the topic, but only hard work and learning will bring you to have the job done.
Again, almost everything can be learnt by-doing, but without a strong knowledge of fundamentals you will never do the step ahead. To design an house, you have to know the science of structures, it’s a mere illusion that a good computing tool is enough.

tajmone · March 15, 2021, 8:09pm

Thanks for your reply @ldesantis,

Industry level compilers are complex objects: good tutorials can help you to jump faster in the topic, but only hard work and learning will bring you to have the job done.

The point is that I believe that many people don’t want to learn how to create industry level compilers, but simple DSL or interpreters.

Again, almost everything can be learnt by-doing, but without a strong knowledge of fundamentals you will never do the step ahead.

The problem with knowledge of fundamentals is that one needs to know what to look for and where — which is exactly what you won’t find in an academic text, because they target an audience which it’s assumed to have gone (or being undergoing) a full training where tutors will be feeding them all the knowledge they need, in a pre-planned order which has a broad spectrum of skill as final expertise. It’s not like there’s a university degree on compilers design only.

We have to assume that intelligent people who have managed to learn any hard sciences will have the ability to learn something new, provided they know where to start from. A nuclear physicist or chemical engineer will probably have no clue how computers work, but there’s no reason to think that he/she won’t be able to learn a new subject if provided the right directions where to start looking. But we can’t expect him/her to start all the education training from scratch, from kinder garden to a doctorate in computer engineering; we should instead isolate the relevant topics in a pertinent manner.

To design an house, you have to know the science of structures, it’s a mere illusion that a good computing tool is enough.

I thought that I had made that point clear in my post, i.e. that there will always be limits to the self-taught approach. But while I agree that to build a skyscraper you’ll need very strong foundation in building structures, materials resistance, weather damages, etc., it’s also true to the average man can build a garden shed with far less knowledge.

There’s no reason why a skyscraper engineer can’t teach a layman how to build a garden shed; but it will require delivering a reduced version of his/her knowledge, and adapting it to the immediate needs of the shed-builder. Often the technical jargon is one of the major barriers, since academic text books opt to call things with their technical names (sometimes a pipe is just a pipe, so we could just call it “pipe”).

But surely, promoting the idea that it’s impossible to build even an arithmetic expressions interpreter without an engineering degree is not going to motivate people to try and approach the field. On the other hand, motivating amateurs to experiment creating a simple parser could inspire them, and who knows, if they discover they really love it they can always go back to school and get their degree (at least, here in Italy we don’t have age limits preventing people to going back to college).

It’s worth remembering that computer engineering is one of the very few professions which doesn’t have a “professional order” in most (if not all) countries in the world — unlike physicians, psychologists, and many other professions — even though it’s a profession that has a direct impact on national security issues (actually, because of that). A programmer is a programmer anywhere he/she goes in the world, without necessarily needing a professional order to recognize and certify his/her talent, whereas even the best heart surgeon is not allowed to give medical advise (let alone practice) in a foreign country (it would be a crime to do so). And the industry and government agencies are willing to hire anyone who has proved his/her worth in the practical field, regardless of credentials, including self-made hackers who managed to breach high profile systems. Here is where the similitude with the house builder ends, because if a wanna-be compiler designer fails no one is going to be hurt or die, even if he/she fails very badly.

alessio.stalla · March 16, 2021, 2:18pm

It’s probably a matter of personal thinking patterns because for me personally starting from a grammar and generating a parser is far easier than coding a parser by hand. I’ve done both. Some people prefer going top-down, other people prefer going bottom-up. There’s space for all kinds of different approaches.

tajmone · March 16, 2021, 5:07pm

@alessio.stalla:

It’s probably a matter of personal thinking patterns because for me personally starting from a grammar and generating a parser is far easier than coding a parser by hand. I’ve done both. Some people prefer going top-down, other people prefer going bottom-up. There’s space for all kinds of different approaches.

This is definitely true and something important to consider, because different thinking ways tend to appreciate different exposition methods too. American psychologist Robert Sternberg researched this topic with his colleagues for a long time (600+ publications) and formulated the Triarchic Theory of Intelligence, which aims at helping teachers and writers better understand the learning patters of their students/audience:

He also wrote a number of non academic books on the topic, targeting teachers, speakers and writers, in which he explains how to create a lesson (or write a book) in a manner take takes into account the three major learning patterns, by presenting each concept in a way that will be appreciate by all learner types.

I’ve found his work very interesting because it really broadens your views on communication in the context of teaching anything. He also carried out extensive research in children schools around the world to sustain his theory by demonstrating that very often poor schools grades are not the fault of the students but the fact that teachers only deliver knowledge in a single thinking pattern, making it hard for differently thinking kids to absorb the lessons.

Computer related concepts are already quite complex in themselves; writing good teaching books requires not only a sound knowledge in the topic at hand, but also good writing skills and, possibly, a basic understanding of the different learning models. Unfortunately, too often books on computer related subjects are packed with sound knowledge but written in an awful way.

For some authors simply have the gift of writing. For example, I consider Randall Hyde (author of the well known Write Great Code book series) one of the best computer writers. His books are simply wonderful, he manages to explain complex concepts in a very neat manner, making them easy to understand. Somehow he’s able to lead the reader’s attention right where he wants it go, and the whole learning experience is seamless. His books are highly successful and have undergone many reprints, and that’s not because of a lack of material on the topic, but because he’s great at writing.

I wish every computer book I bought was as good a read, but the truth is that unreadable books are the norm rather than the exception. After all, programming tends to be a solitary experience, and if one is not coding at the keyboard he’s probably thinking about a problem and how to solve it; so it’s not exactly a profession that promotes communication skills (as opposed to jobs like public relations, lawyers, etc.). And probably that’s why academic teachers in our field tend to write better books than inventors and discoverers — or, e.g., a system admin who works and lives 24/7 in an underground super high-security data-server bunker, carved in a mountain or dug in Alaska during the cold war, or in some abandoned mine converted into a data center (not exactly a job that stimulates social and communication skills).

oscaretu · March 18, 2021, 12:51pm

For people interested in his books you can buy them al low cost. They are in offer in Humble https://www.humblebundle.com/books/learn-you-more-code-no-starch-press-books for the next 18 days (and 12 more books about Rust, Java, Shell, Powershell, C C++, Python, … are included in the bundle for less than 15 euros).

tajmone · March 18, 2021, 2:12pm

Thanks for the tip! I’ve purchased the full bundle; it’s worth it just for the eBooks I was missing, and because some of them were updated editions.

oscaretu · April 19, 2021, 8:52pm

@tajmone, there is a new Humble Bundle offer of programming books from O’Reilly’s Head First series. One of them is “Head First Kotlin”, a language that comes up a lot in our meetups…

sanyaade · April 25, 2021, 1:32pm

It will be nice to have a section for beginners and wannabe rookies like me.

We can do tutorials on lexers, parser ast, code generation, etc… from scratch, no antlr, lex, yacc, flex, bison, emf, etc…
Just pure, straight forward approach that demystify the whole shebang on DSL

Regards!

tajmone · April 26, 2021, 1:01pm

This is a good idea, after all Strumenta welcomes also non-experts (like me) in the field of language engineering, so having a dedicated section/niche for beginners would be really nice. We could share links and materials for beginners, without polluting the Forum with posts that are trivial for the experts, and we’d be also able to benefit from the advise of the expert members of Strumenta, who might help us on topics we don’t understand.

Since the main goal of Strumenta is bringing together professionals from the field, having a separate section for beginners would allow a nice separation of contents.

mpostol · May 3, 2021, 8:54am

I agree and let me stress it. A compiler is complex, so we need time to implement it It raises a question about motivation and reward.

tajmone · May 5, 2021, 1:58pm

That is also true for many other type of applications, for most projects are complex and do require time to implement them into full fledged production-ready applications. I also would assume that anyone willing to embark on any programming project is motivated (otherwise why do it in the first place?).

As for rewards, that is really something subjective (different people do similar things for different reasons), so I don’t really see how this would apply to compilers specifically — unless I’m missing out something here. What is considered rewarding for one person might just as well be considered boring for another, after all many programmers do what they as a profession, not necessary because they like everything they do.

mpostol · May 5, 2021, 7:15pm

@tajmone, I fully agree, everything depends on everything. I only try to say that sometimes, let me stress sometimes, I have a problem with my students because they try to do something extraordinary (it is subjective, but as a teacher, my duty is to recognize things like that) for a variety of reasons (motivation), because of education, just for fun, to impress someone, etc. Sometimes it works, sometimes it doesn’t. If it doesn’t the reword is just frustration instead of additional knowledge, experience, skills, etc. Therefore I like to stress: do … (something) provided you have a good reason (if you don’t like motivation). My message is just warming, but it is only my point.

Let me share my experience. I am the author of the following video curse:

Language C# in practice - video curse - external data processing (in polish)

I spend half a year preparing and hard working on it. Maybe it is a very bad course, but my average income was 10USD a month, and now it is 5USD. So next time when I was asked to prepare a curse like this I answered I am very bad in this respect - I just can’t do that. As one said you must have a gift, but maybe I haven’t. To prove it I asked my students what was wrong. The answer was very impressive (for me it is the only reward) "it is a very good curse, but we have got it for free. You are welcome to do the next one. We are waiting impatiently ". Now I am looking for motivation, but not for reward.

tajmone · May 6, 2021, 1:43am

Thanks for your reply @mpostol.

What I’m trying to understand here is why when it comes to interpreters/compilers the usual response, regarding a non-specialists approach to the topic, tends to point to it being a subject too complex, too hard, etc.

My understanding is that, unlike other programming related topics (e.g. databases, internet protocols, etc.) compiler theory is an omnipresent subject in formal computer engineering training — i.e. a subject which every student has gone through, to some degree or another.

Now, there are many books addressing specific programming topics (again, databases, Internet protocols, etc.) which are not necessarily targeting an audience with a formal degree in computer sciences, along other books on the same which might be more academic and oriented toward engineers.

So, we commonly accept that there is a computer literature for specialists and non-specialists, which seems to indicate that there are programmers who came to the profession from walks of life other than a formal degree in computer sciences.

I’m aware that designing a full fledged programming language is indeed a complex and time consuming task (much more complex than working with databases, for example), but I also believe that there are less complex use cases for simple interpreters and compilers. So I am firmly convinced that it is possible to fill the gap, to some degree, by presenting the topic in a more informal way, using a language which strips away most of the engineering jargon (or at least introduces it gently to the reader).

I’ve read various books on computer algorithms, and I’ve seen clearly the difference between books written for the formally trained and those written for a general audience — the former are basically unreadable for the untrained. Yet, both cover the same algorithms, and lead the reader toward the same implementations. So, if it’s possible for algorithms, why shouldn’t it be possible for compilers?

I have purchased a few similar books introducing the non-specialized reader to the creation of simple language interpreters (e.g. a Tiny Basic interpreter). Although the final product is not a full fledged compiler, it is nonetheless a working interpreter for a very early Basic implementation. The problem is that as of today you can’t find a full course that carries on into the subject — i.e. after completing the build your “Basic/Lisp/etc.” mini course, you’re back to square one.

Exactly because interpreters and compilers are such a broad and complex topic, a course for non-experts would have to be rather big in size (much bigger than an academic work on the topic), for it would have to present the topic step by step, starting with something simple, and then move on to a more complex example, and so on.

As you rightly mentioned, creating a course (video or printed) is a time consuming process, and covering such a broad topic as language engineering and compiler interpreters is most likely too big a task for a single person. But maybe a community effort in that direction could be more fruitful (i.e. in the open source realm). Similar efforts can already be witnessed in other fields, like algorithms, were the open source community are trying to collaborative build similar courses in algorithms theory and practical implementations.

Also, I would add, there is a big difference between knowledge of a subject and the ability to teach it in different ways (i.e. differently from the canonical way it’s being taught). As an example, I like to think of spoken languages, which in formal education take many years to learn, but for which there are alternative teaching/learning methods which are much faster and effective — and, interestingly, of the people who speak fluently more language in the world, none of the had a formal training, but each developed it’s own method for learning (and teaching) new languages.

The academic way of teaching languages is to rely on the native language as a reference, whereas all the fast learning techniques rely on not using any known language as a reference, but to focus only on the language being studied.

The fact that something hasn’t yet been done doesn’t mean it’s not doable, rather that it’s a pending challenge.

alessio.stalla · May 10, 2021, 3:30pm

It’s interesting to have beginner-oriented material, but I’m skeptical that starting “from scratch” would be the best approach for everyone. Some people learn top-down (first glossing over lots of details, then diving into the subtleties), others learn bottom-up (first combining what they already know to achieve a new result, then understanding the abstractions). I for example would prefer to start from high-level tools (e.g., textX, about which I’ve written a tutorial), and gradually deepen my understanding of what they do under the hood.

tajmone · May 10, 2021, 11:59pm

Very good observation. I also strongly believe that there different learning models, and that the pedagogy of teaching (any subject) is something that goes beyond mastery of a subject.

I’m a definitely a top-down learner. This style of learning is typically associated with people who give priority to visual information first, then auditory. It’s about getting a bird-view first (graphs, diagrams, the whole picture), and then processing the building blocks and how they fit together. They also like to jump freely from on topic to another (they love cross-links) without necessarily following a specific order.

Bottom-up learners tend to be more of the auditory type, and give more importance to correct wording, exact order, rather than pictures. They also tend to be conservative in the learning order — one block at the time, first things first.

There’s also a third basic learning type, which prioritizes on the kinesthetic aspect: learning by doing, by practical examples. It’s the hands on type that likes to explore every little thing he/she’s learned, pushing it to the limits.

Of course, there are also variations within the basic learning types, but these three types reflect which sensory channels individuals use a their primary interface with the outer world when learning. A good deal of energy has been invested into the study of learning patterns in the last 50 years, and various schools of thoughts have arisen, and various formal methods of teaching pedagogical models to address specific learning types in the best way, or on how to deliver a subject covering all learning types it turn, so no one is left out.

tajmone · May 11, 2021, 12:19am

Free Book: Basics of Compiler Design

I would like to share here a link to the Basics of Compiler Design, by Prof. Torben Mogensen, from the DIKU, University of Copenhagen.

Basics of Compiler Design

This is an academic book which he wrote for his students, and I’ve found it to be a rather exceptional book in that it uses a very clear and easy to understand language, and manages to cover all the essential points of compiler design theory and practice.

What I like most though is that it’s a language agnostic treatise (or “language neutral”, as the author puts it), for it doesn’t contain any examples in any real programming language, but relies on pseudo code instead.
The whole idea is to teach the concepts and building blocks, and let students implement them in whichever language they feel more comfortable with.

In the book’s Preface, the author explains that the compiler course was initially part of the first grade formation in the school were he taught, this being the reason why he needed to write a simpler an more accessible book on the subject. I don’t think you can make this topic any simpler than Mogensen did; he uses the bare minimum technical terms required to honour the subject, yet manages to keep the whole course within the academic expectations.

For me, this books is a great reference text to dig further into the theory behind any practical tutorial, for it allows me to understand what a piece of code is doing, and why.

Prof. Mogensen was kind enough to share freely this book over the Internet, for others to benefit from it. It’s worth noting that the book is also available commercially, at Springers:

Introduction to Compiler Design | Torben Ægidius Mogensen | Springer

This book has undergone various updates and new editions, which points to it being a successful book.