Language term definition

Hi all, I am trying to figure out what the term “language” actually means from a practical point of view.

In one article there is a definition:

A grammar is a formal description of a language that can be used to recognize its structure.

Does it mean that a language has a structure?

Can you help me find something that I can use for educational purposes when teaching my students?

Mariusz

It really depends on the practical context how these terms “language” and “grammar” are defined.

If you are studying (formal) syntax definition or parser implementation, for example using context-free grammars, then the words “language” and “grammar” have a very technical meaning. This is the context for people who read, write, or define parsers for programming languages, data formats, or other software languages.

In this context (a small C# sketch after the following list makes several of these terms concrete):

  • “sentence” or “instance” is a finite string of characters or tokens (one “utterance” in a language)
  • “language” is a (possibly infinite) set of sentences that contains all the syntactically valid utterances of a language (and nothing more)
  • “grammar” is a set of (mutually recursive) rules by which every sentence of the language (see above) may be generated.
  • “recognizer” is a boolean function (predicate) which decides if a sentence could have been generated by a given grammar, or not.
  • “parser” is a function which does what a recognizer does, but also produces a trace of the applied rules of the grammar (usually in the shape of a parse tree).
  • “parse tree” or “derivation tree” is a trace of a parser function as it has applied rules from the grammar to derive the input instance (if such a trace exists)
  • “parse forest” is a set of parse trees (some parsers may produce multiple trees if the grammar is syntactically ambiguous)
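
To make these definitions a bit more concrete, here is a minimal sketch in C# (the toy grammar and all class/method names are invented for this post, not taken from any library): a hand-written recursive-descent recognizer for a tiny expression language. A parser would apply the same rules but also build a tree while doing so.

```csharp
using System;

// Toy grammar (each method below corresponds to one rule):
//   Expr   := Term ('+' Term)*
//   Term   := Factor ('*' Factor)*
//   Factor := digit | '(' Expr ')'
// The *language* is the (infinite) set of all strings these rules can generate.
// This class is a *recognizer*: it only answers yes/no and builds no tree.
class ExprRecognizer
{
    private readonly string input;
    private int pos;

    private ExprRecognizer(string input) { this.input = input; }

    public static bool IsSentence(string candidate)
    {
        var r = new ExprRecognizer(candidate);
        return r.Expr() && r.pos == r.input.Length; // the whole string must be consumed
    }

    private bool Expr()   // Expr := Term ('+' Term)*
    {
        if (!Term()) return false;
        while (Peek() == '+') { pos++; if (!Term()) return false; }
        return true;
    }

    private bool Term()   // Term := Factor ('*' Factor)*
    {
        if (!Factor()) return false;
        while (Peek() == '*') { pos++; if (!Factor()) return false; }
        return true;
    }

    private bool Factor() // Factor := digit | '(' Expr ')'
    {
        if (char.IsDigit(Peek())) { pos++; return true; }
        if (Peek() == '(')
        {
            pos++;
            if (!Expr() || Peek() != ')') return false;
            pos++;
            return true;
        }
        return false;
    }

    private char Peek() => pos < input.Length ? input[pos] : '\0';
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(ExprRecognizer.IsSentence("1+2*3"));   // True  -> a sentence of the language
        Console.WriteLine(ExprRecognizer.IsSentence("(1+2)*3")); // True
        Console.WriteLine(ExprRecognizer.IsSentence("1+*3"));    // False -> not in the language
    }
}
```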

It’s interesting that:

  • a language may have several different grammars that generate the same set of sentences
  • grammar equivalence, in the above sense, is undecidable
  • parsers are defined for a given grammar, which generates a certain language in a specific way using its rules. So parsers are really specific to a particular interpretation of a language. The shape of the trees that are produced is also relative to a grammar. So you can say that the choice of a grammar for a language strongly influences the semantic interpretation of that language. For example, say you write a grammar for an expression language: it really depends on how you write the grammar which operator has precedence (multiplication, or addition?), and so the shape of the grammar influences the result of all computations that use both multiplication and addition (see the sketch after this list). This is just the tip of the iceberg; grammar engineering is a real topic.
  • the definition of what a sentence is, is relative to the alphabet of characters. Often a “lexer” first divides a sentence into “tokens”; this produces a new “sentence of tokens” for which a grammar is then defined. It can be confusing which “language” people are talking about then: is it the set of tokenized sentences, or the original set of sentences?
  • most practical parsers do not produce “parse trees” but rather “abstract syntax trees” (AST). An AST is an abstraction of a parse tree where details have been elided (spaces, newlines, comments) and the structure of the tree is simplified (intermediate nodes removed or folded).
  • people often use grammars which generate/accept more sentences than are strictly in the language. Such “over-approximated” grammars are often simpler and easier to understand. If a parser for such a grammar then produces a tree for a sentence which is not in the original language, then a post-parse analysis could reject such a tree (say a type-checker or name-analysis tool which analyses the trees), or, in the worst case, a compiler that generates code for such an input produces nonsensical output.
  • because “grammars” (or hand-written parser code) are the only finite and formal descriptions available for most languages, the word “grammar” is used interchangeably with “language” in day-to-day talk by language engineers. For them, the “grammar” is the definition of the language. Nevertheless it helps a lot to keep making the distinction, in particular when the language and/or grammar evolves.
  • an actual “parser” uses a “parsing algorithm” to find derivation(s). If there exist multiple derivations, or even when the algorithm hits non-deterministic choices during its computation, then parsers may make heuristic choices. Some (most) parsers produce a single tree even though the underlying grammar is ambiguous or non-deterministic (the grammar has an “LR conflict”, for example). The question then becomes how and why the parser made such a choice. Some parser generators and parsers print warnings in such a case (“shift over reduce”). The semantics of a language can be severely influenced by such “under-defined” or “heuristic” choices. And if you write a grammar for an existing language that already has a parser, it can be quite a challenge to reverse-engineer these choices. A general context-free parsing algorithm always produces all trees for a given input sentence. Examples are GLR by Tomita (fixed version by Farshi or by Scott and Johnstone), GLL by Scott and Johnstone, and Earley’s algorithm (the fixed version). The implementations of the most-used algorithms, however, such as LALR and LL, make some kind of heuristic choice. The ANTLR parser generator has an interesting solution to this issue too, which I won’t go into now.
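
To make the precedence remark above concrete, here is a tiny, hypothetical C# sketch: the same sentence 2+3*4 interpreted under the two tree shapes that two differently written grammars would produce.

```csharp
using System;

class PrecedenceDemo
{
    static void Main()
    {
        // Sentence: 2+3*4
        // Grammar A (the usual one) makes '*' bind tighter, so its tree is 2+(3*4).
        int treeA = 2 + (3 * 4);
        // Grammar B (a naive flat, left-to-right grammar) yields the tree (2+3)*4.
        int treeB = (2 + 3) * 4;

        Console.WriteLine(treeA); // 14
        Console.WriteLine(treeB); // 20
        // Same sentence, two grammars, two trees, two different meanings.
    }
}
```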

I use this “Rascalopedia” in some of my courses: http://tutor.rascal-mpl.org/Rascalopedia. It defines general software language engineering terms, independent of Rascal. But if you are a Rascal programmer, you should probably know these terms.

Other sources:

  • There are many computer science books on “formal languages”; they usually also include some “automata theory” and define some of the above terms.
  • The papers by Scott and Johnstone on parsers are usually very clear on their definitions of terms related to parsing. In particular they insist on the clear distinction between recognizers, parsers and “other things” that produce ASTs.
  • Reading Eelco Visser’s thesis on scannerless parsing gives you a clear sense of the impact of tokenizers on the meaning of the word “language”. It is also a good introduction to the issues of syntactic ambiguity.
  • My own thesis also has content related to ambiguity that might be interesting :slight_smile:
4 Likes

@jurgen.vinju / @eelcovisser Are your theses publicly available? If so, can you include the URLs?

@oscaretu : See my page with Publications by Year for a list of my publications. Most entries have a direct PDF link, but let me know if you cannot find something. My thesis is:

Syntax Definition for Language Prototyping
Eelco Visser.
PhD thesis, University of Amsterdam, 1997 [pdf, bib, researchr, abstract]

Regarding language I can also recommend:

Pure and declarative syntax definition: paradise lost and regained
Lennart C. L. Kats, Eelco Visser, Guido Wachsmuth.
Onward! 2010 [pdf, doi, bib, researchr, abstract]

1 Like

My thesis is here: https://homepages.cwi.nl/~jurgenv/papers/thesis-2005.pdf

Another paper you might like: https://ir.cwi.nl/pub/25126/25126.pdf, on using data-dependent grammars for parsing programming languages declaratively without ambiguity.

1 Like

Thanks for all the comments. I am happy to see that I am in good hands to validate my understanding. First, let me stress that we are on the opposite side of a wall called language.

Short introduction: I am a university teacher and lecturer of the following subjects:

  • Programming Technologies - external data management
  • Programming Technologies - adaptive programming
  • Programming Technologies - mobile device programming

To avoid abstract discussion and make my classes very, very practical, all the topics, tasks, and tests need implementation (coding) in a selected environment, namely C#. The language is a foundation for us. You may find the example code and more details in the GitHub repository C# in Practice - a set of C# examples targeting educational purposes (MIT license) - consider joining.

Unfortunately, we never use terms like grammar, parser, sentence, etc.

For us, a program is just text - we may use a text editor for development purposes. The text is written to a file, but the file is a stream of bits and metadata. The stream of bits becomes text if the encoding is defined. Text becomes a program if the compiler doesn’t complain. So what is the language in this context?

My answer is that a language is three sets:

  1. alphabet - set of characters we can use
  2. syntax - set of rules we apply to check the correctness of the text
  3. semantics - set of rules we use to associate the correct text with its meaning.

I am aware that it is just a simplification, but I have only 1.5 hours to answer the frequently asked question: why do we need to learn yet another language?! My answer is: forget about the language - it is only a very simple tool - just 3 sets; the most important things are paradigms, patterns, architecture, and techniques.

Do you think it is an acceptable simplification? Any advice on how to improve this way of thinking?

Thanks in advance

1 Like

Why am I so fascinated by C, while I’m so disturbed by the verbosity of C++?
Why does Python give me a sense of power and freedom while Haskell looks so threatening?
Why, after 20 years, have I still not understood the substitution rules of Tcl?
Why does Kotlin look so clean compared to its grandfather?

:wink:

Probably, to learn programming, the language is not the primary factor, but the world is not language-neutral.

2 Likes

Since the topic is very interesting, I cannot stop myself from leaving some opinions. Mine is a very rushed comment.

Language definition? Or definition of language? :innocent: I remember Ferdinand de Saussure, Noam Chomsky and some ancients: Panini, Bhartrihari.

@mpostol I understand you are more concerned with teaching a programming language than with language engineering or natural languages. Still, the studies of programming languages, formal languages and natural languages are more or less interdisciplinary; they share some overlapping ideas. You may simplify your programming language teaching by relating it to natural language and natural-language grammar. You are definitely already doing that, with alphabet, syntax, etc.

However, making students understand things from the point of view of programming language architecture, language engineering, and especially formal language definition should be even more beneficial, as @jurgen.vinju suggested above. This paper may be useful for you: “Formal language theory: refining the Chomsky hierarchy” by Gerhard Jäger and James Rogers (available at JSTOR).

3 Likes

I think the distinction between alphabet, syntax and semantics is fine, and also how you explain it. We don’t have to pour every little detail onto these poor students :slight_smile:

With respect to semantics, it is good to say that it can also be defined by rules. There are generally two styles of semantic rules (a minimal sketch of the second style follows this list):

  • a set of translation rules, that map the syntax of the source to another (virtual) machine language, thereby implicitly giving the source language meaning. It’s like a compiler.
  • a set of “state” transition rules, that define exactly for every syntactic construct how and why it modifies the state of a (virtual) machine as the program is “running”. It’s more like an interpreter.
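
A minimal sketch of the second style in C# (the three-construct toy language, the record names, and the Run method are all invented here): every syntactic construct gets one rule that says how it transforms the state of a small virtual machine, here just a variable store.

```csharp
using System;
using System.Collections.Generic;

// Abstract syntax of an invented toy language: assignment, print, and sequencing.
abstract record Stmt;
record Assign(string Var, int Value) : Stmt;
record Print(string Var) : Stmt;
record Seq(Stmt First, Stmt Second) : Stmt;

static class Interpreter
{
    // One "state transition rule" per construct: each case describes how the
    // statement changes the machine state (the variable store) when it runs.
    public static void Run(Stmt s, Dictionary<string, int> state)
    {
        switch (s)
        {
            case Assign(var name, var value): state[name] = value; break;
            case Print(var name): Console.WriteLine($"{name} = {state[name]}"); break;
            case Seq(var first, var second): Run(first, state); Run(second, state); break;
        }
    }
}

class SemanticsDemo
{
    static void Main()
    {
        var program = new Seq(new Assign("x", 41),
                      new Seq(new Assign("x", 42), new Print("x")));
        Interpreter.Run(program, new Dictionary<string, int>()); // prints: x = 42
    }
}
```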

It’s good to teach students that although they require an accurate “notion” of the syntax and semantics of a language, this notion does not need to be a model of the actual implementation of the language. For example, you do not need to explain a call stack to explain how recursive method calls work, and you do not need to explain a stack machine to explain how C# expressions are evaluated. The notion of expressions can be explained independently of how they actually work, as long as this notion indeed simulates the observable effects. I think you are already going in this direction; it is up to you to give them a notion of C# that is complete and correct. Most introductory books on programming play this game as well. For many modern languages, this is doable. For a language like C, it is very hard to explain it without going into the details of C’s memory model. For Haskell, you really can’t explain it without diving into lazy evaluation. C# is pretty easy imho.

1 Like

Programming languages are “industrial design”: designed for humans and computers simultaneously. The design trade-offs are therefore very interesting, and the design space is quite large. Contextual (historical, cultural, educational) factors also strongly influence both the design and the appreciation of a language design!

In other words: I like your “why” questions very much but I have no answers :slight_smile: Most languages are designed as an effort to improve one (or more) predecessor(s). So if you want to know some answers, the best thing to do is to study this historical “lineage” and the (hopefully documented) argumentation that the designers used to argue that their new language is “better” compared to their inspirations. Or you could start from the bottom and begin studying LISP and ALGOL. Studying LISP will help you understand Tcl better, for example.

2 Likes

@jurgen.vinju thanks for the feedback. I fully agree. Just one thing for further discussion.

You are right that we can follow the syntax rules thanks to IntelliSense™ (a very clever prompter, sometimes too clever).

The real problem is how to explain semantics as a set of rules. Your proposal, if I correctly understand your point, is “imagine what the compiler/interpreter/translator will do with the program text”. Unfortunately, in this approach I need to refer to terms like compiler and interpreter. My point is that this way it is hard to explain, for example:

  • how to implement abstraction?
  • what is a type?
  • what is the difference between complex and structural types?

In all of these cases, I am expecting answers in terms of the language; usually, any language supporting object-oriented programming.

My point is that there must be a third style of explaining semantics, namely paradigms and patterns. For example, instead of the call stack (which needs a reference to memory organization), we can use a call chain (which looks like a sequence).

Unfortunately, sometimes this approach is also impractical, for example when explaining the semantic difference between the foreach and from constructs (a small C# sketch below illustrates the distinction). The answer looks very easy: foreach is a statement, but from is an expression. My point is that this answer refers to the syntax, whereas the semantics is iteration (foreach) and selection (from). We cannot explain from by referring to the computer’s behavior, because the most important activity is executed by an external data repository engine. Do we need a fourth style, or, alternatively, is iteration/selection just a pattern?
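
Here is a minimal C# sketch of that distinction (the data is made up): from is an expression that denotes a value, a query, and by itself runs nothing; foreach is a statement that has no value and actually performs the iteration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ForeachVsFrom
{
    static void Main()
    {
        int[] numbers = { 1, 2, 3, 4, 5 };

        // 'from ... select' is an expression: it has a value (an IEnumerable<int>)
        // and can appear anywhere a value is expected; nothing is iterated yet
        // (deferred execution - a data repository engine could translate it later).
        IEnumerable<int> evens = from n in numbers
                                 where n % 2 == 0
                                 select n;

        // 'foreach' is a statement: it has no value and only performs the
        // iteration, which is what finally forces the query above to run.
        foreach (int n in evens)
        {
            Console.WriteLine(n); // 2, then 4
        }
    }
}
```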

To learn more about language-integrated query, you may check my latest lecture on Structural Data (in English).

Again, many thanks for your comments and advice. Concluding, my point is that explaining the semantic rules of a modern programming language is not only a matter of what the compiler/interpreter/translator will do with the text.

1 Like

I fully agree. A formal specification of semantics is almost never a good educational vehicle.

You might have a look at this blog by Amy Ko which reports on a Dagstuhl seminar on exactly this topic: Dagstuhl trip report: learning and teaching programming language semantics | by Amy J. Ko | Bits and Behavior | Medium

1 Like

Example: Expression Tutor, a way of explaining expressions in a programming language using visual trees as a notional machine.

1 Like

@jurgen.vinju many thanks for the advice and helpful information. I am pretty sure that this topic is covered by the Introduction to Programming classes. I try to explain why an expression is converted to an object at run time instead of being executed - see Introduction to expression tree (a small C# sketch below).
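
A minimal sketch of exactly that point in plain C# (nothing here is taken from the lecture): the same lambda text becomes either executable code or a run-time object describing the computation, depending only on the declared type.

```csharp
using System;
using System.Linq.Expressions;

class ExpressionTreeDemo
{
    static void Main()
    {
        // Compiled to executable code: calling it performs the addition.
        Func<int, int> asCode = x => x + 1;
        Console.WriteLine(asCode(41)); // 42

        // Converted to an object (an expression tree) at run time: the same
        // lambda now *describes* the computation instead of performing it,
        // so it can be inspected or translated (e.g. by LINQ to SQL).
        Expression<Func<int, int>> asData = x => x + 1;
        Console.WriteLine(asData);               // x => (x + 1)
        Console.WriteLine(asData.Body.NodeType); // Add
        Console.WriteLine(asData.Compile()(41)); // 42 - turned back into code on demand
    }
}
```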

Thanks,

1 Like

If someone is interested in learning my point of view, check out my recent lecture Programming Technologies Lesson 11: External Structural Data LINQ to SQL.

I asked ChatGPT: Is a programming language an aggregation of alphabet, syntax, and semantics?

I got the following answer:


Yes, a programming language is typically defined by its alphabet, syntax, and semantics.

The alphabet of a programming language consists of the set of symbols or characters that the language uses to construct its syntax. These symbols can include letters, digits, operators, and special characters, among others.

The syntax of a programming language defines the rules and structure for how these symbols can be combined to form valid programs. The syntax determines the correct order and placement of statements, expressions, and other elements of a program, and specifies how they should be formatted.

Finally, the semantics of a programming language define the meaning and behavior of the language constructs, such as keywords, operators, and data types. The semantics specify how a program written in the language should behave when executed, including how it manipulates data, interacts with other systems or applications, and responds to user input or other external events.

Together, the alphabet, syntax, and semantics define the fundamental characteristics of a programming language, and provide the foundation for creating, understanding, and using programs written in that language.

I have decided to return to this old topic because I want to relate it more to the usage scope of a new, emerging technology that we must be concerned with in language engineering. I decided to introduce a new player to the discussion. There is an enormous discussion related to the usage scope of AI. I try asking ChatGPT first, before asking my students the same question. Sometimes it works, but sometimes it doesn’t. Do you have any experience in this respect?