Copyright (C) 2015, H. Conrad Cunningham
Acknowledgements: MS student Clay McLeod assisted in preparation of these notes. These lecture notes are for use with Chapter 5 of the textbook: Peter Linz. Introduction to Formal Languages and Automata, Fifth Edition, Jones and Bartlett Learning, 2012. The terminology and notation used in these notes are similar to those used in the Linz textbook. This document uses several figures from the Linz textbook.
Advisory: The HTML version of this document requires use of a browser that supports the display of MathML. A good choice as of December 2015 seems to be a recent version of Firefox from Mozilla.
In Linz Section 4.3, we saw that not all languages are regular. We examined the Pumping Lemma for Regular Languages as a means to prove that a specific language is not regular.
In Linz Example 4.6, we proved that the language L = { a^n b^n : n ≥ 0 } is not regular.
If we let a = "(" and b = ")", then L becomes a language of nested parentheses.
This language is in a larger family of languages called the context-free languages.
Context-free languages are very important because many practical aspects of programming languages are in this family.
In this chapter, we explore the context-free languages, beginning with context-free grammars.
One key process we explore is parsing, which is the process of determining the grammatical structure of a sentence generated from a grammar.
Remember the restrictions we placed on regular grammars in Linz Section 3.3: the left side of a production must be a single variable, and the right side may contain at most one variable, which must occur at the end (for right-linear grammars) or at the beginning (for left-linear grammars).
To create more powerful grammars (i.e., that describe larger families of languages), we relax these restrictions.
For context-free grammars, we maintain the left-side restriction but relax the restriction on the right side.
Linz Definition 5.1 (Context-Free Grammar): A grammar G = (V, T, S, P) is context-free if all productions in P have the form A → x
where A ∈ V and x ∈ (V ∪ T)*. A language L is context-free if and only if there is a context-free grammar G such that L = L(G).
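To make Definition 5.1 concrete, a context-free grammar can be represented directly in code. The sketch below (my own representation, not from the textbook) encodes a grammar as a mapping from each variable to its alternative right sides, and checks the defining context-free property: every left side is a single variable.

```python
# A context-free grammar as a dict: each key is a variable (the single
# symbol allowed on a production's left side); each value lists the
# alternative right sides (strings over variables and terminals).
# Names here are illustrative, not from the Linz textbook.

def is_context_free(productions, variables):
    """Every production must have a single variable as its left side."""
    return all(lhs in variables for lhs in productions)

# Example: the grammar S -> aSa | bSb | lambda
# (the empty string "" plays the role of lambda).
grammar = {"S": ["aSa", "bSb", ""]}
variables = {"S"}

print(is_context_free(grammar, variables))  # True
```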
The family of regular languages is a subset of the family of context-free languages!
Thus, context-free grammars enable the right side of a production to be substituted for a variable on the left side at any time in a sentential form, with no dependencies on other symbols in the sentential form.
Consider the grammar G = ({S}, {a, b}, S, P) with productions:

S → aSa | bSb | λ
Note that this grammar satisfies the definition of context-free.
A possible derivation using this grammar is as follows:

S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabbaa
From this derivation, we see that L(G) = { w w^R : w ∈ {a, b}* }.
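We can check the claim L(G) = { w w^R : w ∈ {a, b}* } experimentally. The sketch below (a brute-force enumeration of my own, not from the textbook) derives all sentences of bounded length from the productions S → aSa | bSb | λ and confirms each has the form w followed by its reverse.

```python
from collections import deque

def derive_sentences(max_len):
    """Breadth-first expansion of sentential forms for S -> aSa | bSb | lambda.
    Each form contains at most one S, so replace() targets it directly."""
    rhs = ["aSa", "bSb", ""]
    sentences, queue = set(), deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:
            sentences.add(form)
            continue
        for r in rhs:
            new = form.replace("S", r, 1)
            if len(new.replace("S", "")) <= max_len:  # bound terminal length
                queue.append(new)
    return sentences

for s in derive_sentences(6):
    w = s[:len(s) // 2]
    assert s == w + w[::-1]   # every sentence is w followed by its reverse
print("all derived sentences have the form w w^R")
```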
The language { w w^R : w ∈ {a, b}* } is context-free, but, as we demonstrated in Linz Example 4.8, it is not regular.
This grammar is linear because it has at most one variable on the right.
Consider the grammar G = ({S, A, B}, {a, b}, S, P) with productions:

S → abB
A → aaBb
B → bbAa
A → λ
Note that this grammar also satisfies the definition of context-free.
A possible derivation using this grammar is:

S ⇒ abB ⇒ abbbAa ⇒ abbbaaBba ⇒ abbbaabbAaba ⇒ abbbaabbaba
We can see that L(G) = { ab(bbaa)^n bba(ba)^n : n ≥ 0 }.
This grammar is also linear (as defined in Linz Section 3.3). Although linear grammars are context-free, not all context-free grammars are linear.
Consider the language

L = { a^n b^m : n ≠ m }.
This language is context-free. To show that this is the case, we must construct a context-free grammar that generates the language L.
First, consider the n = m case. This can be generated by the productions:

S → aSb | λ
Now, consider the n > m case. We can modify the above to generate extra a's on the left:

S → AS1
S1 → aS1b | λ
A → aA | a
Finally, consider the n < m case. We can further modify the grammar to generate extra b's on the right:

S → S1B
B → bB | b
This grammar is context-free, but it is not linear because the productions with S on the left (S → AS1 and S → S1B) have more than one variable on the right side and thus are not in the required form.
Although this grammar is not linear, there exist other grammars for this language that are linear.
Consider the grammar G = ({S}, {a, b}, S, P) with productions:

S → aSb | SS | λ
This grammar is also context-free but not linear.
Example strings in L(G) include λ, ab, and aabb. Note that:
The a's and b's are always generated in pairs.
Each a precedes the matching b.
A prefix of a string may contain several more a's than b's.
We can see that L(G) is
{ w ∈ {a, b}* : n_a(w) = n_b(w) and n_a(v) ≥ n_b(v) for any prefix v of w }.
What is a programming language connection for this language?
Let a = "(" and b = ")".
This gives us a language of properly nested parentheses.
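The membership condition for this language — equal numbers of a's and b's, with no prefix containing more b's than a's — is exactly the rule for checking balanced parentheses, and it translates directly into code. Below is a small membership check (my own sketch, not from the textbook):

```python
def in_language(s):
    """True iff n_a(s) == n_b(s) and n_a(v) >= n_b(v) for every prefix v.
    Assumes s contains only the symbols 'a' and 'b'."""
    depth = 0
    for ch in s:
        depth += 1 if ch == "a" else -1
        if depth < 0:          # some prefix has more b's than a's
            return False
    return depth == 0          # equal numbers of a's and b's overall

print(in_language("aabb"))     # True
print(in_language("abab"))     # True
print(in_language("abba"))     # False: the prefix "abb" has more b's than a's
```

Reading a as "(" and b as ")", this is the familiar depth-counter test for properly nested parentheses.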
Consider grammar G with productions:

S → AB
A → aaA | λ
B → Bb | λ

This grammar generates the language L(G) = { a^(2n) b^m : n ≥ 0, m ≥ 0 }.
Now consider the two derivations of the sentence aab:

S ⇒ AB ⇒ aaAB ⇒ aaB ⇒ aaBb ⇒ aab
S ⇒ AB ⇒ ABb ⇒ Ab ⇒ aaAb ⇒ aab
These derivations yield the same sentence using exactly the same productions. However, the productions are applied in different orders.
To eliminate such incidental factors, we often require that the variables be replaced in a specific order.
Linz Definition 5.2 (Leftmost and Rightmost Derivations): A derivation is leftmost if, in each step, the leftmost variable in the sentential form is replaced. If, in each step, the rightmost variable is replaced, then the derivation is rightmost.
Consider the grammar G = ({S, A, B}, {a, b}, S, P) with productions:

S → aAB
A → bBb
B → A | λ

A leftmost derivation of the string abbbb is:

S ⇒ aAB ⇒ abBbB ⇒ abAbB ⇒ abbBbbB ⇒ abbbbB ⇒ abbbb

Similarly, a rightmost derivation of the string abbbb is:

S ⇒ aAB ⇒ aA ⇒ abBb ⇒ abAb ⇒ abbBbb ⇒ abbbb
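The leftmost-replacement discipline is easy to mechanize. The helper below (illustrative code of my own, not from the textbook) applies a chosen production to the leftmost variable of a sentential form and replays a leftmost derivation of abbbb for the grammar S → aAB, A → bBb, B → A | λ.

```python
def leftmost_step(form, variables, lhs, rhs):
    """Replace the leftmost variable of `form` (which must be `lhs`) by `rhs`."""
    idx = min(i for i, ch in enumerate(form) if ch in variables)
    assert form[idx] == lhs, "production does not apply to the leftmost variable"
    return form[:idx] + rhs + form[idx + 1:]

variables = {"S", "A", "B"}
# Replay a leftmost derivation with S -> aAB, A -> bBb, B -> A | lambda
# ("" plays the role of lambda).
steps = [("S", "aAB"), ("A", "bBb"), ("B", "A"),
         ("A", "bBb"), ("B", ""), ("B", "")]
form = "S"
for lhs, rhs in steps:
    form = leftmost_step(form, variables, lhs, rhs)
    print(form)   # the successive sentential forms, ending with abbbb
```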
We can also use a derivation tree to show derivations in a manner that is independent of the order in which productions are applied.
A derivation tree is an ordered tree in which each node is labeled with the left side of a production and the children of the node represent the symbols of the corresponding right side.
The production A → abABc is shown as a derivation tree in Linz Figure 5.1.
Linz Definition 5.3 (Derivation Tree): Let G = (V, T, S, P) be a context-free grammar. An ordered tree is a derivation tree for G if and only if it has the following properties:
The root is labeled S.
Every leaf has a label from T ∪ {λ}.
Every interior vertex (i.e., a vertex that is not a leaf) has a label from V.
If a vertex has a label A ∈ V, and its children are labeled (from left to right) a1, a2, …, an, then P must contain a production of the form A → a1a2⋯an.
A leaf labeled λ has no siblings; that is, a vertex with a child labeled λ can have no other children.
If properties 3, 4, and 5 hold, and property 2 is modified so that every leaf has a label from V ∪ T ∪ {λ}, then the tree is a partial derivation tree.
If we read the leaves of a tree from left to right, omitting any λ's encountered, we obtain a string called the yield of the tree.
The phrase "from left to right" means that we traverse the tree in a depth-first manner, always exploring the leftmost unexplored branch first. The yield is the string of terminals encountered in this traversal.
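The yield can be computed with exactly this left-to-right, depth-first traversal. The sketch below (my own representation, not from the textbook) stores a derivation tree as nested tuples and collects the terminal leaves, skipping λ (represented as the empty string).

```python
def tree_yield(node):
    """Depth-first, left-to-right traversal collecting terminal leaves.
    A node is (label, children); a leaf is (label, ())."""
    label, children = node
    if not children:
        return label            # lambda is "" here, so it contributes nothing
    return "".join(tree_yield(child) for child in children)

# Derivation tree for S => aSb => ab, using S -> aSb and S -> lambda
tree = ("S", (("a", ()), ("S", (("", ()),)), ("b", ())))
print(tree_yield(tree))         # ab
```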
Consider the grammar G with productions:

S → aAB
A → bBb
B → A | λ

Linz Figure 5.2 shows a partial derivation tree for G with the yield abBbB. This is a sentential form of the grammar G.
Linz Figure 5.3 shows a derivation tree for G with the yield abbbb. This is a sentence of L(G).
Derivation trees give explicit (and visualizable) descriptions of derivations. They enable us to reason about context-free languages much as transition graphs enable us to reason about regular languages.
Linz Theorem 5.1 (Connection between Derivations and Derivation Trees): Let G = (V, T, S, P) be a context-free grammar. Then the following properties hold:
For every w ∈ L(G), there exists a derivation tree of G whose yield is w.
The yield of any derivation tree of G is in L(G).
If tG is any partial derivation tree for G whose root is labeled S, then the yield of tG is a sentential form of G.
Proof: See the proof in the Linz textbook.
Derivation trees:
show which productions are used to generate a sentence
abstract out the order in which individual productions are applied
enable the construction of either a leftmost or rightmost derivation
The previous section concerns the generative aspects of grammars: using a grammar to generate strings in the language.
This section concerns the analytical aspects of grammars: processing strings from the language to determine their derivations. This process is called parsing.
For example, a compiler for a programming language must parse a program (i.e., a sentence in the language) to determine the derivation tree for the program.
This verifies that the program is indeed in the language (syntactically).
Construction of the derivation tree is needed to execute the program (e.g., to generate the machine-level code corresponding to the program).
Given a string w of terminal symbols, we can parse w with respect to grammar G by:
systematically constructing all derivations
determining whether any derivation matches w
This is called exhaustive search parsing or brute force parsing. A more complete description of the algorithm is below.
This is a form of top-down parsing in which a derivation tree is constructed from the root downward.
Note: An alternative approach is bottom-up parsing in which the derivation tree is constructed from leaves upward. Bottom-up parsing techniques often have limitations in terms of the grammars supported but often give more efficient algorithms.
Exhaustive Search Parsing Algorithm
-- Add root and 1st level of all derivation trees
Q := { x : S → x is in P }
while w ∉ Q and Q ≠ ∅ do
    Q' := ∅
    -- Add next level of all derivation trees
    for all q ∈ Q do
        if q can generate w then
            A := leftmost variable of q
            for all productions A → y in P do
                Q' := Q' ∪ { q' } where q' is q with the leftmost A replaced by y
    Q := Q'
Note: The above algorithm determines whether a string is in . It can be modified to build the actual derivation or derivation tree.
Note: The presentation here uses the algorithm above, rather than the approach in the Linz textbook.
Consider the grammar G = ({S}, {a, b}, S, P) with productions:

S → SS | aSb | bSa | λ

Parse the string w = aabb.
After initialization: Q = { SS, aSb, bSa, λ } (from the right-hand sides of the grammar's four productions with S on the left).
First iteration: The loop test is true because Q is nonempty and aabb is not present in Q.
The algorithm does not need to consider the sentential forms bSa and λ in Q because neither can generate aabb.
The inner loop thus adds SSS, aSbS, bSaS, and S from the leftmost derivations from sentential form SS and also adds aSSb, aaSbb, abSab, and ab from the leftmost derivations from sentential form aSb.
Thus Q = { SSS, aSbS, bSaS, S, aSSb, aaSbb, abSab, ab } at the end of the first iteration.
Second iteration: The algorithm enters the loop a second time because Q is nonempty and does not contain aabb.
The algorithm does not need to consider any sentential form beginning with ab or b, thus eliminating bSaS, abSab, and ab and leaving SSS, aSbS, S, aSSb, and aaSbb of interest.
This iteration generates 20 new sentential forms from applying each of the 4 productions to each of the 5 remaining sentential forms.
In particular, note that sentential form aaSbb yields the target string aabb when production S → λ is applied.
Third iteration: The loop terminates because aabb is present in Q.
Thus we can conclude aabb ∈ L(G).
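The exhaustive search algorithm can be implemented directly. Below is an illustrative Python version (my own sketch, not from the textbook): it prunes sentential forms whose terminal prefix does not match the target or that already contain too many terminals, and it caps the number of rounds because the λ-production could otherwise cause nontermination.

```python
def can_generate(form, w, variables):
    """Pruning test: the terminals before the first variable must be a prefix
    of w, and the form must not already have more terminals than w."""
    prefix = ""
    for ch in form:
        if ch in variables:
            break
        prefix += ch
    n_terminals = sum(1 for ch in form if ch not in variables)
    return w.startswith(prefix) and n_terminals <= len(w)

def exhaustive_parse(productions, variables, w, max_rounds=12):
    """Top-down exhaustive (brute-force) search parsing.  max_rounds guards
    against nontermination when the grammar has lambda- or unit-productions."""
    q = set(productions["S"])                 # root and first level of all trees
    for _ in range(max_rounds):
        if w in q or not q:
            break
        level = set()
        for form in q:
            if not can_generate(form, w, variables):
                continue                      # this form cannot generate w
            positions = [i for i, ch in enumerate(form) if ch in variables]
            if not positions:
                continue                      # all-terminal form that is not w
            i = positions[0]                  # leftmost variable
            for y in productions[form[i]]:
                level.add(form[:i] + y + form[i + 1:])
        q = level
    return w in q

# The grammar of the example above: S -> SS | aSb | bSa | lambda ("" = lambda)
g = {"S": ["SS", "aSb", "bSa", ""]}
print(exhaustive_parse(g, {"S"}, "aabb"))    # True
```

Because of the round cap, a rejection is only reliable up to the chosen bound; the restrictions discussed next remove the need for such a cap.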
Exhaustive search parsing has serious flaws:
It is tedious and inefficient.
It might not terminate when w ∉ L(G).
For example, if we choose a string not in the language of the previous example, the algorithm goes into an infinite loop.
The fix for nontermination is to ensure sentential forms increase in length for each production. That is, we eliminate productions of the forms:

A → λ
A → B
Chapter 6 of the Linz textbook (which we will not cover this semester) shows that this does not reduce the power of the grammar.
Consider the grammar G = ({S}, {a, b}, S, P) with productions:

S → SS | aSb | bSa | ab | ba

This grammar generates the same language (except for λ) as the one in Linz Example 5.7 above, but it satisfies the restrictions given in the previous subsection.
Given any nonempty string w, exhaustive search will terminate in no more than 2|w| rounds for such grammars.
Linz Theorem 5.2 (Exhaustive Search Parsing): Suppose that G = (V, T, S, P) is a context-free grammar that does not have any rules of one of the forms

A → λ
A → B

where A, B ∈ V. Then the exhaustive search parsing method can be formulated as an algorithm which, for any w ∈ T*, either parses w or tells us that parsing is impossible.
Proof outline
Each production must increase either the length of the sentential form or its number of terminals.
The maximum length of a sentential form is |w|, which is also the maximum number of terminal symbols.
Thus, for any w ∈ L(G), the number of loop iterations is at most 2|w|.
But exhaustive search is still inefficient. The number of sentential forms to be generated is on the order of |P|^(2|w|).
That is, it grows exponentially with the length of the string.
Linz Theorem 5.3 (Efficient Parsing): For every context-free grammar there exists an algorithm that parses any w ∈ L(G) in a number of steps proportional to |w|^3.
Construction of more efficient context-free parsing methods is left to compiler courses.
Cubic time (|w|^3) is still inefficient.
We would prefer linear (|w|) parsing.
Again we must restrict the grammar in our search for more efficient parsing. The next subsection illustrates one such grammar.
Linz Definition 5.4 (Simple Grammar): A context-free grammar G = (V, T, S, P) is said to be a simple grammar or s-grammar if all its productions are of the form

A → ax

where A ∈ V, a ∈ T, x ∈ V*, and any pair (A, a) occurs at most once in P.
The grammar with productions

S → aS | bSS | c

is an s-grammar.
The grammar with productions

S → aS | bSS | aSS | c

is not an s-grammar because the pair (S, a) occurs twice.
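The defining condition — each right side begins with a terminal followed only by variables, and each (variable, leading terminal) pair occurs at most once — is easy to check mechanically. Below is a small validator (my own sketch, not textbook code), demonstrated on the two grammars just discussed:

```python
def is_s_grammar(productions, variables, terminals):
    """Check the s-grammar conditions: every right side is a terminal
    followed by zero or more variables, and each (variable, leading
    terminal) pair occurs at most once in the grammar."""
    seen = set()
    for lhs, alternatives in productions.items():
        for rhs in alternatives:
            if not rhs or rhs[0] not in terminals:
                return False            # right side must begin with a terminal
            if any(ch not in variables for ch in rhs[1:]):
                return False            # the rest must be variables only
            if (lhs, rhs[0]) in seen:
                return False            # the pair (A, a) occurs twice
            seen.add((lhs, rhs[0]))
    return True

V, T = {"S"}, {"a", "b", "c"}
print(is_s_grammar({"S": ["aS", "bSS", "c"]}, V, T))         # True
print(is_s_grammar({"S": ["aS", "bSS", "aSS", "c"]}, V, T))  # False: (S, a) twice
```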
Although s-grammars are quite restrictive, many features of programming languages can be described with s-grammars (e.g., grammars for arithmetic expressions).
If G is an s-grammar, then any string w in L(G) can be parsed in time proportional to |w|.
To see this, consider a string w = a1a2⋯an and use the exhaustive search parsing algorithm.
The s-grammar has at most one production with S on the left whose right side begins with a1, say S → a1x. Choose it!
Then the s-grammar has at most one production for the leftmost variable of the resulting sentential form whose right side begins with a2. Choose it!
And so forth up to the nth terminal.
The number of steps is proportional to |w| because each step consumes one symbol of w.
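This argument translates directly into a parser. The sketch below (illustrative code, not from the textbook) keeps a stack of variables still to be satisfied; at each step the next input symbol selects the unique applicable production, so the work is proportional to |w|. The grammar S → aS | bSS | c is encoded by its unique (variable, leading terminal) pairs.

```python
def s_grammar_parse(productions, w, start="S"):
    """Deterministic linear-time parse for an s-grammar.
    `productions` maps (variable, terminal) -> the string of variables
    that follows the terminal on the production's right side."""
    stack = [start]                       # variables still to be expanded
    for ch in w:
        if not stack:
            return False                  # input remains but nothing to derive
        rest = productions.get((stack.pop(), ch))
        if rest is None:
            return False                  # no production matches: reject
        stack.extend(reversed(rest))      # keep the leftmost variable on top
    return not stack                      # accept iff all variables satisfied

# s-grammar S -> aS | bSS | c, keyed by its unique (variable, terminal) pairs
g = {("S", "a"): "S", ("S", "b"): "SS", ("S", "c"): ""}
print(s_grammar_parse(g, "abcc"))   # True:  S => aS => abSS => abcS => abcc
print(s_grammar_parse(g, "abc"))    # False: one S is left unsatisfied
```

Each input symbol is examined exactly once and each step does constant work, so the running time is linear in |w|.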
A derivation tree for some string generated by a context-free grammar may not be unique.
Linz Definition 5.5 (Ambiguity): A context-free grammar G is said to be ambiguous if there exists some w ∈ L(G) that has at least two distinct derivation trees. Alternatively, ambiguity implies the existence of two or more leftmost or rightmost derivations.
Again consider the grammar in Linz Example 5.4. Its productions are

S → aSb | SS | λ.

The string aabb has two derivation trees as shown in Linz Figure 5.4.
The left tree corresponds to the leftmost derivation S ⇒ aSb ⇒ aaSbb ⇒ aabb.
The right tree corresponds to the leftmost derivation S ⇒ SS ⇒ aSbS ⇒ aaSbbS ⇒ aabbS ⇒ aabb.
Thus the grammar is ambiguous.
Consider the grammar G = (V, T, E, P) with

V = {E, I} and T = {a, b, c, +, *, (, )}

and including the productions:

E → I
E → E + E
E → E * E
E → (E)
I → a | b | c
This grammar generates a subset of the arithmetic expressions for a language like C or Java. It contains strings such as (a + b) * c and a * b + c.
Linz Figure 5.5 shows two derivation trees for the string a + b * c. Thus this grammar is ambiguous.
Why is ambiguity a problem?
Remember that the semantics (meaning) of the expression is also associated with the structure of the expression. The structure determines how the (machine language) code is generated to carry out the computation.
How do real programming languages resolve this ambiguity?
Often, they add precedence rules that give priority to "*" over "+". That is, the multiplication operator binds more tightly than addition.
This solution is totally outside the world of the context-free grammar. It is, in some sense, a hack.
A better solution is to rewrite the grammar (or sometimes redesign the language) to eliminate the ambiguity.
To rewrite the grammar in Linz Example 5.11, we introduce new variables T and F, making the set of variables V = {E, T, F, I}, and replace the productions with the following:

E → T
T → F
F → I
E → E + T
T → T * F
F → (E)
I → a | b | c
Linz Figure 5.6 shows the only derivation tree for string a + b * c in this revised grammar for arithmetic expressions.
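Because the rewritten grammar is unambiguous, it parses deterministically. Below is a recursive-descent sketch for it (my own illustration, not from the textbook); the functions mirror the variables E, T, and F, only the identifiers a, b, c are handled, and the left-recursive rules E → E + T and T → T * F are rendered as loops, a standard transformation.

```python
def parse_expression(tokens):
    """E -> T { + T } ; the left recursion E -> E + T becomes a loop."""
    node, rest = parse_term(tokens)
    while rest and rest[0] == "+":
        right, rest = parse_term(rest[1:])
        node = ("+", node, right)
    return node, rest

def parse_term(tokens):
    """T -> F { * F }"""
    node, rest = parse_factor(tokens)
    while rest and rest[0] == "*":
        right, rest = parse_factor(rest[1:])
        node = ("*", node, right)
    return node, rest

def parse_factor(tokens):
    """F -> I | ( E )"""
    if tokens[0] == "(":
        node, rest = parse_expression(tokens[1:])
        assert rest and rest[0] == ")", "unbalanced parentheses"
        return node, rest[1:]
    assert tokens[0] in "abc", "expected an identifier a, b, or c"
    return tokens[0], tokens[1:]

tree, rest = parse_expression(list("a+b*c"))
assert not rest
print(tree)    # ('+', 'a', ('*', 'b', 'c')) -- '*' binds tighter than '+'
```

The single tree produced for a + b * c groups b * c under the addition, matching the precedence the grammar now encodes structurally.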
Linz Definition 5.6: If is a context-free language for which there exists an unambiguous grammar, then is said to be unambiguous. If every grammar that generates is ambiguous, then language is called inherently ambiguous.
It is difficult to demonstrate that a grammar is inherently ambiguous. Often the best we can do is to give examples and argue informally that all grammars must be ambiguous.
The language

L = { a^n b^n c^m } ∪ { a^n b^m c^m },

with n and m non-negative, is an inherently ambiguous context-free language.
Note that L = L1 ∪ L2, where L1 = { a^n b^n c^m } and L2 = { a^n b^m c^m }.
We can generate L1 with the context-free grammar:

S1 → S1c | A
A → aAb | λ
Similarly, we can generate L2 with the context-free grammar:

S2 → aS2 | B
B → bBc | λ
We can thus construct the union of these two sublanguages by adding a new starting production:

S → S1 | S2
Thus this is a context-free language.
But consider a string of the form a^n b^n c^n (i.e., one with n = m). It has two derivations, one starting with

S ⇒ S1

and another starting with

S ⇒ S2.
Thus the grammar is ambiguous.
L1 and L2 have conflicting requirements. L1 places restrictions on the numbers of a's and b's, while L2 places restrictions on the numbers of b's and c's. It is impossible to find production rules that handle the n = m case uniquely.
The syntax of practical programming languages is usually expressed with context-free grammars. Compilers and interpreters must parse programs in these languages to execute them.
The grammar for programming languages is often expressed using the Backus-Naur Form (BNF) to express productions.
For example, the language for arithmetic expressions in Linz Example 5.12 can be written in BNF as:
<expression> ::= <term> | <expression> + <term>
<term> ::= <factor> | <term> * <factor>
The items in angle brackets are variables, symbols such as "+" and "*" are terminals, the "|" denotes alternatives, and "::=" separates the left and right sides of the productions.
Programming languages often use restricted grammars to get linear parsing: e.g., regular grammars, s-grammars, LL grammars, and LR grammars.
The aspects of programming languages that can be modeled by context-free grammars are called the syntax.
Aspects such as type-checking are not context-free. Such issues are sometimes considered (incorrectly in your instructor's view) as part of the semantics of the language.
These are really still syntax, but they must be expressed in ways that are not context-free.