Exploring Languages
with Interpreters
and Functional Programming
Chapter 41
H. Conrad Cunningham
04 April 2022
Browser Advisory: The HTML version of this textbook requires a browser that supports the display of MathML. A good choice as of April 2022 is a recent version of Firefox from Mozilla.
Chapter 40 surveyed the overall language processing pipeline.
Beginning with this chapter, we explore language concepts processing techniques in the context of a simple case study. The case study uses a language of simple arithmetic expressions, a language we call the ELI (Exploring Languages with Interpreters) Calculator language.
Chapter 41 introduces the formal concepts related to concrete syntax. It gives two different concrete syntaxes for the ELI Calculator language.
Chapter 42 introduces the concepts of abstract syntax and language semantics. It represents both concrete syntaxes of the ELI Calculator language with the same abstract syntax encoded as a Haskell algebraic data type. It defines the semantics of the language using a Haskell function that evaluates (i.e., interprets) the abstract syntax expressions.
Chapter 43 surveys the modular design and implementation of the ELI Calculator language application.
Chapter 44 considers lexical analysis and parsing of the concrete syntaxes to generate the corresponding abstract syntax trees
Chapter 45 explores the construction of a set of parsing combinators.
Chapter 46 looks at a simple Stack Virtual Machine with an instruction set represented as another algebraic data type and how to translate (i.e., compile), how to execute the machine, and how to translate the abstract syntax trees to sequences of instructions.
We will extend the language with other features in later chapters.
TODO: Give chapter’s goals explicitly.
The goals of this chapter are to:
The ELI Calculator language can be represented as human-readable text strings in forms similar to traditional mathematical and programming notations. The structure of these textual expressions is called the concrete syntax [20] of the expressions.
In this case study, we examine two possible concrete syntaxes: a familiar infix syntax and a (probably less familiar) parenthesized prefix syntax.
But, first, let’s consider how we can describe the syntax of a language.
We usually describe the syntax of a language using a formal grammar [7,11].
Formally, a formal grammar consists of a tuple , where:
Production rules describe how the grammar transforms one sequence of symbols to another. The rules have the general form
where and are sequences of symbols from such that has length of at least one symbol.
A sentence in a language consists of any finite sequence of symbols that can be generated from the start symbol of a grammar by a finite sequence of productions from the grammar.
We call a sequence of productions that generates a sentence a derivation for that sentence.
Any intermediate sequence of symbols in a derivation is called a sentential form.
The language generated by the grammar is the set of all sentences that can be generated by the grammar.
To express the syntax of programming languages, we normally restrict ourselves to the family of context-free grammars (and its subfamilies) [7,11,12] context free. In a context-free grammar (CFG), the production rules have the form
where and is a sequence of zero or more symbols from . This means that an occurence of nonterminal can be replaced by the sequence .
We often express a grammar using a metalanguage such as the Backus-Naur Form (BNF) or extended Backus-Naur Form (BNF) [5,13,14].
For example, consider the following BNF description of a grammar for the unsigned binary integers:
<binary> ::= <digit>
<binary> ::= <digit> <binary>
<digit> ::= '0'
<digit> ::= '1'
The nonterminals are the symbols shown in angle brackets:
<binary>
and <digit>
.
The terminals are the symbols shown in single quotes:
'0'
and '1'
.
The production rules are shown with a nonterminal on the left side of
the metasymbol ::=
and its replacement sequence of
nonterminal and terminal symbols on the right side.
Unless otherwise noted, the start symbol is the nonterminal on the left side of the first production rule.
For multiple rules with the same left side, we can use the
|
metasymbol to write the alternative right sides
concisely. The four rules above can be written as follows:
<binary> ::= <digit> | <digit> <binary>
<digit> ::= '0' | '1'
We can also use the extended BNF metasymbols:
{
and }
to denote that the symbols
between the braces are repeated zero or more times
[
and ]
to denote that the symbols
between the brackets are optional (i.e., occur at most once)
Consider a derivation of the sentence 101
using the
grammar for unsigned binary numbers above.
<binary>
<digit> <binary>
<digit> <digit> <binary>
<digit> 0 <binary>
1 0 <binary>
1 0 <digit>
1 0 1
This is not the only possible derivation for 101
. Let’s
consider a second derivation of 101
.
<binary>
<digit> <binary>
1 <binary>
1 <digit> <binary>
1 0 <binary>
1 0 <digit>
1 0 1
The second derivation applies the same rules the same number of times, but it applies them in a different order. This case is called the leftmost derivation because it always replaces the leftmost nonterminal in the sentential form.
Both of the above derivations can be represented by the derivation tree (or parse tree) [7,20] shown in Figure 41.1. (The numbers below the nodes show the rules applied.)
The grammar above for binary numbers is a special case of a context-free grammar called a right-linear grammar [7,15]. In a right-linear grammar, all productions are of the forms
where and are nonterminals and is a sequence of zero or more terminals. Similarly, a left-linear grammar [7,15] must have all productions of the form:
A grammar that is either right-linear or left-linear is called a regular grammar [7,16].
(Note that all productions in a grammar must satisfy either the right- or left-linear definitions. They cannot be mixed.)
We can recognize sentences in a regular grammar with a simple “machine” (program)—a deterministic finite automaton (DFA) [7,18].
In general, we must use a more complex “machine”—a pushdown automaton (PDA)[7,19]—to recognize a context-free grammar.
We leave a more detailed study of regular and context-free grammars to courses on formal languages, automata, or compiler construction.
Now let’s consider the concrete syntaxes for the ELI Calculator language—first infix, then prefix.
An infix syntax for expressions is a syntax in which most binary operators appear between their operands as we tend to write them in mathematics and in programming languages such as Java and Haskell. For example, the following are intended to be valid infix expressions:
3
-3
x
1+1
x + 3
(x + y) * (2 + z)
For example, we can present the concrete syntax of our core Calculator language with the grammar below. Here we just consider expressions made up of decimal integer constants; variable names; binary operators for addition, subtraction, multiplication, and division; and parentheses to delimit nested expressions.
We express the upper levels of the infix expression’s syntax with the
following context-free grammar where <expression>
is
the start symbol.
<expression> ::= <term> { <addop> <term> }
<term> ::= <factor> { <mulop> <factor> }
<factor> ::= <var> | <val>
| '(' <expression> ')'
<val> ::= [ '-' ] <unsigned>
<var> ::= <id>
<addop> ::= '+' | '-'
<mulop> ::= '*' | '/'
Normally we want operators such as multiplication and division to
bind more tightly than addition and subtraction. That is, we want
expression x + y * z
to have the same meaning as
x + (y * z)
. To accomplish this in the context-free
grammar, we position <addop>
in a higher-level
grammar rule than <mulop>
.
We can express the lower (lexical) level of the expression’s grammar with the following production rules:
<id> ::= <firstid> | <firstid> <idseq>
<idseq> ::= <restid> | <restid> <idseq>
<firstid> ::= <alpha> | '_'
<restid> ::= <alpha> | '_' | <digit>
<unsigned> ::= <digit> | <digit> <unsigned>
<digit> ::= any numeric character
<alpha> ::= any alphabetic character
The variables <digit>
and
<alpha>
are essentially terminals. Thus the above is
a regular grammar. (We can also add the rules for recognition of
<addop>
and <mulop>
and rules for
recognition of the terminals (
, )
, and
-
to the regular grammar.)
We assume that identifiers and constants extend as far to the “right”
as possible. That is, an <id>
begins with an
alphabetic or underscore character and extends until it is terminated by
some character other than an alphabetic, numeric, or underscore
character (e.g., by whitespace or special character). Similarly for
<unsigned>
.
Otherwise, the language grammar ignores whitespace characters (e.g.,
blanks, tabs, and newlines). The language also supports end of line
comments, any characters on a line following a --
(double
dash).
We can use a parsing
program (i.e., a parser) to determine whether a concrete
expression (e.g., 1 + 1
) satisfies the grammar and to build
a corresponding parse tree [7,21].
Aside: In a previous section, we use the term derivation tree to refer to a tree that we construct from the root toward the leaves by applying production rules from the grammar. We usually call the same tree a parse tree if we construct it from the leaves (a sentence) toward the root.
Figure 41.2 shows the parse tree
for infix expression 1 + 1
. It has
<expression>
at its root. The children of a node in
the parse tree depend upon the grammar rule application needed to
generate the concrete expression. Thus the root
<expression>
has either one child—a
<term>
subtree—or three children—a
<term>
subtree, an <addop>
subtree, and an <expression>
subtree.
If the parsing program returns a boolean result instead of building a parse tree, we sometimes call it a recognizer program.
An alternative is to use a parenthesized prefix syntax for the expressions. This is a syntax in which expressions involving operators are of the form
( op operands )
where op
denotes some “operator” and
operands
denotes a sequence of zero or more expressions
that are the arguments of the given operator. This is a syntax similar
to the language Lisp.
In this syntax, the examples from the section on the infix syntax can be expressed something like:
3
3
x
(+ 1 1)
(+ x 3)
(* (+ x y) (+ 2 z))
We express the upper levels of a prefix expression’s syntax with the
following context-free grammar, where <expression>
is
the start symbol.
<expression> ::= <var> | <val> | <operexpr>
<var> ::= <id>
<val> ::= [ "-" ] <unsigned>
<operexpr> ::= '(' <operator> <operandseq> ')'
<operandseq> ::= { <expression> }
<operator> ::= '+' | '*' | '-' | '/' | ...
We can express the lower (lexical) level of the expression’s grammar
with basically the same regular grammar as with the infix syntax. (We
can also add the rule for recognition of <operator>
and for recognition of the terminals (
, )
, and
-
to the regular grammar
The parse tree for prefix expression (+ 1 1)
is shown in
Figure 41.3,
Because the prefix syntax expresses all operations in a fully parenthesized form, there is no need to consider the binding powers of operators. This makes parsing easier.
The prefix also makes extending the language to other operators—and keywords—much easier. Thus we will primarily use the prefix syntax in this and other cases studies.
We return to the problem of parsing expressions in a later chapter.
This chapter (41) introduced the formal concepts related to a language’s concrete syntax. It also introduced the ELI (Exploring Languages with Interpreters) Calculator language, which is the simple language we use in the following five chapters.
Chapter 42 examines the concepts of abstract syntax and evaluation, using the ELI Calculator language as an example.
TODO if needed
TODO
Chapters 41-46 of this book explore the ELI Calculator language and general concepts and techniques for language processing. I initially developed the ELI Calculator language (then called the Expression Language) case study for the Haskell-based offering of CSci 556, Multiparadigm Programming, in Spring 2017. I continued this work during Summer and Fall 2017 for the Fall 2017 offering of CSci 450, Organization of Programming Languages. I based the ELI Calculator language case study on ideas drawn, in part, from the following:
the 2016 version of my Scala-based Expression Tree Calculator case study from my Notes on Scala for Java Programmers [4] (which was itself adapted from the the tutorial [8])
the Lua-based Expression Language 1 and Imperative Core interpreters I developed for the Fall 2016 CSci 450 course
chapters 1, 2, and 4 of Samuel Kamin’s textbook Programming Languages: An Interpreter-Based Approach [6] and my work to implement three (Core, Lisp, and Scheme) of Kamin’s interpeters in Lua in 2013
sections 8.3 and 9.6 of the classic Richard Bird and Philip Wadler’s textbook Introduction to Functional Programming [2]
sections 14.2, 16.1, 17.5, and 18.3 of Simon Thompson’s textbook [10]
chapters 1-4 and 8 of Peter Sestoff’s textbook Programming Language Concepts [9]
chapters 21 (Recursive Descent Parser) and 22 (Parser Combinator) of Martin Fowler and Parsons’s book Domain-Specific Languages [5].
section 3.2 (Predictive Parsing) of Andrew W. Appel’s textbook Modern Compiler Implementation in ML [1].
chapters 6 (Purely Functional State) and 9 (Parser Combinators) from Paul Chiusano and Runar Bjarnason’s Functional Programming in Scala [3].
sections 1.2, 3.3, and 5.1 of Peter Linz’s textbook Formal Languages and Automata [7]
the Wikipedia articles on Formal Grammar [11], Linear Grammar {[15]], Regular Grammar [16], Context-Free Grammar [12], Backus-Naur Form [13], Extended Backus-Naur Form [14], Parsing [21], Parse Tree [20], Recursive Descent Parser [23], LL Parser [22], Lexical Analysis [26], Finite-state Machine [17], Deterministic Finite Automaton [18], Pushdown Automaton [19], Abstract Syntax [24], Abstract Syntax Tree [25], Stack Machine [27], Reverse Polish Notation [28], Association List [29], and Associative Array [30].
For the 2017 textbook, I organized this work into three chapters:
Expression Language Syntax and Semantics
Expression Language Parsing
Expression Language Compilation (a partial chapter)
In Summer 2018, I divided the previous Expression Language Syntax and Semantics chapter into three new chapters in the 2018 version of the textbook, now titled Exploring Languages with Interpreters and Functional Programming.
Previous section 10.2 became new Chapter 41, Calculator Concrete Syntax.
Previous sections 10.3-5 and 10.7-8 became new Chapter 42, Calculator Abstract Syntax and Evaluation.
Previous sections 10.6 and 10.9 became new Chapter 43, Calculator Modular Structure, and were expanded.
In Fall 2018, I divided the 2017 Expression Language Parsing chapter into two new chapters in the 2018 version of the textbook, now titled Exploring Languages with Interpreters and Functional Programming.
Previous sections 11.1-11.4 became new Chapter 44, Calculator Parsing.
Previous sections 11.6-11.7 became new Chapter 45, Parsing Combinators.
Previous section 11.5 was merged into new Chapter 43, Calculator Modular Structure.
In Fall 2018, I also renumbered previous chapter 12 to become new Chapter 46.
I retired from the full-time faculty in May 2019. As one of my post-retirement projects, I am continuing work on the ELIFP textbook. In January 2022, I began refining the existing content, integrating additional separately developed materials, reformatting the document (e.g., using CSS), constructing a unified bibliography (e.g., using citeproc), and improving the build workflow and use of Pandoc.
I maintain this chapter as text in Pandoc’s dialect of Markdown using embedded LaTeX markup for the mathematical formulas and then translate the document to HTML, PDF, and other forms as needed.
Syntax, concrete syntax, formal grammar (variable and terminal symbols, alphabet, start or goal symbol), production rule, sentence, sentential form, language, context-free grammar, Backus-Naur Form (BNF), derivation, leftmost derivation, derivation tree, right-lean and right-linear grammar, regular grammar, deterministic finite automaton (DFA), pushdown automaton (PDA), infix and prefix syntaxes, lexical level, parsing, parser, parse tree, infix and prefix syntax.