Copyright (C) 2015, H. Conrad Cunningham

Acknowledgements: MS student Clay McLeod assisted in preparation of these notes. These lecture notes are for use with Chapter 5 of the textbook: Peter Linz. Introduction to Formal Languages and Automata, Fifth Edition, Jones and Bartlett Learning, 2012. The terminology and notation used in these notes are similar to those used in the Linz textbook. This document uses several figures from the Linz textbook.

Advisory: The HTML version of this document requires use of a browser that supports the display of MathML. A good choice as of December 2015 seems to be a recent version of Firefox from Mozilla.

Introduction

In Linz Section 4.3, we saw that not all languages are regular. We examined the Pumping Lemma for Regular Languages as a means to prove that a specific language is not regular.

In Linz Example 4.6, we proved that

L = \{a^{n}b^{n} : n \geq 0\}

is not regular.

If we let a = "(" and b = ")", then L becomes a language of nested parentheses.

This language is in a larger family of languages called the context-free languages.

Context-free languages are very important because many practical aspects of programming languages are in this family.

In this chapter, we explore the context-free languages, beginning with context-free grammars.

One key process we explore is parsing, which is the process of determining the grammatical structure of a sentence generated from a grammar.

5.1 Context-Free Grammars

5.1.1 Definition of Context-Free Grammars

Remember the restrictions we placed on regular grammars in Linz Section 3.3: the left side of every production must be a single variable, and the right side may contain at most one variable, which must appear consistently at the right end (right-linear) or the left end (left-linear) of the string.

To create more powerful grammars (i.e., that describe larger families of languages), we relax these restrictions.

For context-free grammars, we maintain the left-side restriction but relax the restriction on the right side.

Linz Definition 5.1 (Context-Free Grammar): A grammar G = (V,T,S,P) is context-free if all productions in P have the form

A \rightarrow x

where A \in V and x \in (V \cup T)^{*}. A language L is context-free if and only if there is a context-free grammar G such that L = L(G).
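To make the definition concrete, here is a minimal Python sketch (not from the Linz textbook; the function name and data encoding are illustrative choices) that represents a grammar as a 4-tuple of values and checks the context-free condition of Definition 5.1. The example productions generate the language \{a^{n}b^{n} : n \geq 0\} from the Introduction.

    # A grammar G = (V, T, S, P); P is a list of (left side, right side) pairs.
    # These productions generate {a^n b^n : n >= 0}; "" plays the role of lambda.
    V = {"S"}
    T = {"a", "b"}
    S = "S"
    P = [("S", "aSb"), ("S", "")]

    def is_context_free(V, T, P):
        """Linz Definition 5.1: every production A -> x has a single variable A
        on the left and a string x over (V union T)* on the right."""
        symbols = V | T
        return all(A in V and all(c in symbols for c in x) for A, x in P)

    print(is_context_free(V, T, P))   # True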

The family of regular languages is a subset of the family of context-free languages!

Thus, context-free grammars can describe every regular language, as well as languages (such as L above) that are not regular.

5.1.2 Linz Example 5.1

Consider the grammar G = (\{S\}, \{a, b\}, S, P) with productions:

S \rightarrow aSa
S \rightarrow bSb
S \rightarrow \lambda

Note that this grammar satisfies the definition of context-free.

A possible derivation using this grammar is as follows:

S \Rightarrow aSa \Rightarrow aaSaa \Rightarrow aabSbaa \Rightarrow aabbaa

From this derivation, we see that

L(G) = \{ww^{R} : w \in \{a,b\}^{*}\}.

The language is context-free, but, as we demonstrated in Linz Example 4.8, it is not regular.
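A quick observation, with a small check added here (not from the textbook): strings of the form ww^{R} over \{a,b\} are exactly the even-length palindromes, so membership in L(G) is easy to test directly.

    def in_L(s):
        """True if s = w w^R for some w over {a, b}, i.e., s is an even-length palindrome."""
        return len(s) % 2 == 0 and s == s[::-1] and set(s) <= {"a", "b"}

    print(in_L("aabbaa"))  # True: derived above as S => aSa => aaSaa => aabSbaa => aabbaa
    print(in_L("aab"))     # False: odd length, so it cannot have the form w w^R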

This grammar is linear because it has at most one variable on the right.

5.1.3 Linz Example 5.2

Consider the grammar G with productions:

S \rightarrow abB
A \rightarrow aaBb
B \rightarrow bbAa
A \rightarrow \lambda

Note that this grammar also satisfies the definition of context free.

A possible derivation using this grammar is:

S \Rightarrow abB \Rightarrow abbbAa \Rightarrow abbbaaBba \Rightarrow abbbaabbAaba
   \Rightarrow abbbaabbaaBbaba \Rightarrow abbbaabbaabbAababa \Rightarrow abbbaabbaabbababa

We can see that:

L(G) = \{ab(bbaa)^{n}bba(ba)^{n} : n \geq 0\}

This grammar is also linear (as defined in Linz Section 3.3). Although linear grammars are context free, not all context free grammars are linear.

5.1.4 Linz Example 5.3

Consider the language

L = \{a^{n}b^{m} : n \neq m\}.

This language is context-free. To show that this is the case, we must construct a context-free grammar that generates the language.

First, consider the n = m case. This can be generated by the productions:

S \rightarrow aSb\ | \ \lambda

Now, consider the n > m case. We can modify the above to generate extra a's on the left.

S \rightarrow AS_{1}
S_{1} \rightarrow aS_{1}b\ | \ \lambda
A \rightarrow aA\ | \ a

Finally, consider the n < m case. We can further modify the grammar to generate extra b's on the right.

S \rightarrow AS_{1}\ | \ S_{1}B
S_{1} \rightarrow aS_{1}b\ | \ \lambda
A \rightarrow aA\ | \ a
B \rightarrow bB\ | \ b

This grammar is context free, but it is not linear because the productions with S on the left are not in the required form.

Although this grammar is not linear, there exist other grammars for this language that are linear.
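To gain confidence in this construction, here is a small enumeration sketch added to these notes (not from the textbook; the encoding, in which the character "1" stands for the variable S_{1}, is an illustrative choice). It performs leftmost derivations from the final grammar above, collects every terminal string whose sentential forms stay within a length bound, and checks that each has unequal numbers of a's and b's.

    from collections import deque

    # Productions of the final grammar; uppercase letters and "1" (for S1) are variables.
    P = {"S": ["A1", "1B"], "1": ["a1b", ""], "A": ["aA", "a"], "B": ["bB", "b"]}

    def terminal_strings(limit=6):
        """Collect terminal strings derivable from S with sentential forms of length <= limit."""
        seen, results, queue = set(), set(), deque(["S"])
        while queue:
            x = queue.popleft()
            if len(x) > limit or x in seen:
                continue
            seen.add(x)
            variables = [i for i, c in enumerate(x) if c in P]
            if not variables:
                results.add(x)              # no variables left: a string of the language
                continue
            i = variables[0]                # expand the leftmost variable
            for rhs in P[x[i]]:
                queue.append(x[:i] + rhs + x[i + 1:])
        return results

    for s in sorted(terminal_strings(), key=len):
        assert s.count("a") != s.count("b")   # every generated string satisfies n != m
    print(sorted(terminal_strings(), key=len))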

5.1.5 Linz Example 5.4

Consider the grammar with productions:

S \rightarrow aSb\ | \ SS\ | \ \lambda

This grammar is also context-free but not linear.

Example strings in L(G) include abaabb, aababb, and ababab. Note that, in each string, the number of a's equals the number of b's, and no prefix contains more b's than a's.

We can see that L(G) is

\{ w \in \{a, b\}^{*} : n_{a}(w) = n_{b}(w) and n_{a}(v) \geq n_{b}(v) for any prefix v of w \}.

What is a programming language connection for this language?

5.1.6 Leftmost and Rightmost Derivations

Consider grammar G = (\{A, B, S\}, \{a, b\}, S, P) with productions:

S \rightarrow AB
A \rightarrow aaA
A \rightarrow \lambda
B \rightarrow Bb
B \rightarrow \lambda

This grammar generates the language L(G) = \{a^{2n}b^{m} : n \geq 0, m \geq 0\}.

Now consider the two derivations:

S \Rightarrow AB \Rightarrow aaAB \Rightarrow aaB \Rightarrow aaBb \Rightarrow aab
S \Rightarrow AB \Rightarrow ABb \Rightarrow aaABb \Rightarrow aaAb \Rightarrow aab

These derivations yield the same sentence using exactly the same productions. However, the productions are applied in different orders.

To eliminate such incidental factors, we often require that the variables be replaced in a specific order.

Linz Definition 5.2 (Leftmost and Rightmost Derivations): A derivation is leftmost if, in each step, the leftmost variable in the sentential form is replaced. If, in each step, the rightmost variable is replaced, then the derivation is rightmost.

5.1.7 Linz Example 5.5

Consider the grammar with productions:

S \rightarrow aAB
A \rightarrow bBb
B \rightarrow A\ | \ \lambda

A leftmost derivation of the string abbbb is:

S \Rightarrow aAB \Rightarrow abBbB \Rightarrow abAbB \Rightarrow abbBbbB \Rightarrow abbbbB \Rightarrow abbbb

Similarly, a rightmost derivation of the string abbbb is:

S \Rightarrow aAB \Rightarrow aA \Rightarrow abBb \Rightarrow abAb \Rightarrow abbBbb \Rightarrow abbbb
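Derivation order is easy to mechanize. Here is a small helper added to these notes (not from the textbook; the name and the choice-index encoding are illustrative) that always rewrites the leftmost variable and reproduces the leftmost derivation above.

    def leftmost_derivation(P, start, choices):
        """Apply, at each step, the production P[leftmost variable][choice] and print each form."""
        form = start
        print(form)
        for choice in choices:
            i = next(k for k, c in enumerate(form) if c in P)   # position of leftmost variable
            form = form[:i] + P[form[i]][choice] + form[i + 1:]
            print(form)
        return form

    # Grammar of Linz Example 5.5: S -> aAB, A -> bBb, B -> A | lambda ("")
    P = {"S": ["aAB"], "A": ["bBb"], "B": ["A", ""]}
    # Reproduces: S => aAB => abBbB => abAbB => abbBbbB => abbbbB => abbbb
    leftmost_derivation(P, "S", [0, 0, 0, 0, 1, 1])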

5.1.8 Derivation Trees

We can also use a derivation tree to show derivations in a manner that is independent of the order in which productions are applied.

A derivation tree is an ordered tree in which each interior node is labeled with the left side of a production and the node's children are labeled with the symbols of the corresponding right side.

The production

A \rightarrow abABc

is shown as a derivation tree in Linz Figure 5.1.

Linz Fig. 5.1: Derivation Tree for Production A \rightarrow abABc

Linz Definition 5.3 (Derivation Tree): Let G = (V, T, S, P) be a context-free grammar. An ordered tree is a derivation tree for G if and only if it has the following properties:

  1. The root is labeled S.

  2. Every leaf has a label from T \cup \{\lambda\}.

  3. Every interior vertex (i.e., a vertex that is not a leaf) has a label from V.

  4. If a vertex has a label A \in V, and its children are labeled (from left to right) a_{1}, a_{2}, \cdots, a_{n}, then P must contain a production of the form A \rightarrow a_{1} a_{2} \cdots a_{n}.

  5. A leaf labeled \lambda has no siblings, that is, a vertex with a child labeled \lambda can have no other children.

If properties 3, 4, and 5 and modified property 2 (below) hold for a tree, then it is a partial derivation tree.

  2. (modified) Every leaf has a label from V \cup T \cup \{\lambda\}.

If we read the leaves of a tree from left to right, omitting any \lambda's encountered, we obtain a string called the yield of the tree.

The descriptive term from left to right means that we traverse the tree in a depth-first manner, always exploring the leftmost unexplored branch first. The yield is the string of terminals encountered in this traversal.

5.1.9 Linz Example 5.6

Consider the grammar G with productions:

S \rightarrow aAB
A \rightarrow bBb
B \rightarrow A\ | \ \lambda

Linz Figure 5.2 shows a partial derivation tree for G with the yield abBbB. This is a sentential form of the grammar G.

Linz Fig. 5.2: Partial Derivation Tree

Linz Figure 5.3 shows a derivation tree for G with the yield abbbb. This is a sentence of L(G).

Linz Fig. 5.3: Derivation Tree
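The yield can be computed by exactly the left-to-right traversal described above. Below is a minimal sketch added to these notes (the nested-tuple encoding is an assumption, not the textbook's notation); the sample tree is the one built by the leftmost derivation of Linz Example 5.5, whose yield abbbb matches the sentence of Figure 5.3.

    # A node is (label, children); leaves have an empty child list.
    def tree_yield(node):
        label, children = node
        if not children:                      # a leaf: a terminal or lambda
            return "" if label == "λ" else label
        return "".join(tree_yield(child) for child in children)

    # Tree built by the leftmost derivation of Example 5.5 (grammar S -> aAB, A -> bBb, B -> A | λ):
    tree = ("S", [("a", []),
                  ("A", [("b", []),
                         ("B", [("A", [("b", []),
                                       ("B", [("λ", [])]),
                                       ("b", [])])]),
                         ("b", [])]),
                  ("B", [("λ", [])])])

    print(tree_yield(tree))   # abbbb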

5.1.10 Relation Between Sentential Forms and Derivation Trees

Derivation trees give explicit (and visualizable) descriptions of derivations. They enable us to reason about context-free languages much as transition graphs enable us to reason about regular languages.

Linz Theorem 5.1 (Connection between Derivations and Derivation Trees): Let G = (V, T, S, P) be a context-free grammar. Then the following properties hold:

  1. For every w \in L(G), there exists a derivation tree of G whose yield is w.

  2. Conversely, the yield of any derivation tree of G is in L(G).

  3. If t is any partial derivation tree for G whose root is labeled S, then the yield of t is a sentential form of G.

Proof: See the proof in the Linz textbook.

Derivation trees show which productions are applied in deriving a sentence, but they abstract away the order in which the productions are applied.

5.2 Parsing and Ambiguity

5.2.1 Generation versus Parsing

The previous section concerns the generative aspects of grammars--using a grammar to generate strings in the language.

This section concerns the analytical aspects of grammars--processing strings from the language to determine their derivations. This process is called parsing.

For example, a compiler for a programming language must parse a program (i.e., a sentence in the language) to determine the derivation tree for the program.

5.2.2 Exhaustive Search Parsing

Given some w \in L(G), we can parse w with respect to grammar G by systematically constructing all possible derivations from S and checking whether any of them yields w.

This is called exhaustive search parsing or brute force parsing. A more complete description of the algorithm is below.

This is a form of top-down parsing in which a derivation tree is constructed from the root downward.

Note: An alternative approach is bottom-up parsing in which the derivation tree is constructed from leaves upward. Bottom-up parsing techniques often have limitations in terms of the grammars supported but often give more efficient algorithms.

Exhaustive Search Parsing Algorithm

    -- Add root and 1st level of all derivation trees
    F ← {x : S → x in P of G}
    while F ≠ ∅ and w ∉ F do
        F' ← ∅
        -- Add next level of all derivation trees
        for all x ∈ F do
            if x can generate w then
                V ← leftmost variable of x
                for all productions V → y in G do
                    F' ← F' ∪ {x'} where x' = x with V replaced by y
        F ← F'

Note: The above algorithm determines whether a string w is in L(G). It can be modified to build the actual derivation or derivation tree.
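Here is a direct transcription of the algorithm into Python (a sketch added to these notes, not code from the textbook; the function names are illustrative). Productions are stored as a mapping from each variable to its right-hand sides. The round limit is an extra safeguard added here so that a λ-production, as in the example below, cannot cause an infinite loop.

    def exhaustive_parse(P, start, w, max_rounds=20):
        """Return True if w can be derived from `start` using productions P."""

        def might_generate(x):
            # Cheap pruning: the terminals fixed at the left end of x must be a
            # prefix of w, and x cannot already contain more terminals than w.
            i = next((k for k, c in enumerate(x) if c in P), len(x))
            terminals = sum(1 for c in x if c not in P)
            return w.startswith(x[:i]) and terminals <= len(w)

        F = set(P[start])                      # root and first level of all trees
        for _ in range(max_rounds):
            if w in F:
                return True
            F_next = set()
            for x in F:
                if not might_generate(x):
                    continue
                i = next((k for k, c in enumerate(x) if c in P), None)
                if i is None:
                    continue                   # all terminals but not equal to w
                for y in P[x[i]]:              # expand the leftmost variable
                    F_next.add(x[:i] + y + x[i + 1:])
            F = F_next
        return w in F

    # Linz Example 5.7: S -> SS | aSb | bSa | lambda (""), parsing w = aabb
    P = {"S": ["SS", "aSb", "bSa", ""]}
    print(exhaustive_parse(P, "S", "aabb"))    # True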

5.2.3 Linz Example 5.7

Note: The presentation here uses the algorithm above, rather than the approach in the Linz textbook.

Consider the grammar GG with productions:

S \rightarrow SS\ | \ aSb\ | \ bSa\ | \ \lambda

Parse the string w = aabb.

After initialization: F = \{ SS, aSb, bSa, \lambda \} (from the righthand sides of the grammar's four productions with S on the left).

First iteration: The loop test is true because F is nonempty and w is not present.

The algorithm does not need to consider the sentential forms bSa and \lambda in F because neither can generate w.

The inner loop thus adds \{ SSS, aSbS, bSaS, S \} from the leftmost derivations from sentential form SS and also adds \{ aSSb, aaSbb, abSab, ab \} from the leftmost derivations from sentential form aSb.

Thus F = \{ SSS, aSbS, bSaS, S, aSSb, aaSbb, abSab, ab \} at the end of the first iteration.

Second iteration: The algorithm enters the loop a second time because F is nonempty and does not contain w.

The algorithm does not need to consider any sentential form beginning with b or ab, thus eliminating \{ bSaS, abSab, ab \} and leaving \{ SSS, aSbS, S, aSSb, aaSbb \} of interest.

This iteration generates 20 new sentential forms from applying each of the 4 productions to each of the 5 remaining sentential forms.

In particular, note that the sentential form aaSbb yields the target string aabb when production S \rightarrow \lambda is applied.

Third iteration: The loop terminates because w is present in F.

Thus we can conclude w \in L(G).

5.2.4 Flaws in Exhaustive Search Parsing

Exhaustive search parsing has serious flaws: it is inefficient, and, if w \notin L(G), it may never terminate.

For example, if we choose w = abb in the previous example, the algorithm goes into an infinite loop.

The fix for nontermination is to ensure that every production application makes progress, either lengthening the sentential form or adding a terminal symbol. That is, we eliminate productions of the forms:

A \rightarrow \lambda
A \rightarrow B

Chapter 6 of the Linz textbook (which we will not cover this semester) shows that this does not reduce the power of the grammar.

5.2.5 Linz Example 5.8

Consider the grammar with productions:

S \rightarrow SS\ | \ aSb\ | \ bSa\ | \ ab\ | \ ba

This grammar generates the same language as the one in Linz Example 5.7 above, but it satisfies the restrictions given in the previous subsection.

Given any nonempty string w, exhaustive search will terminate in no more than |w| rounds for such grammars.

5.2.6 Toward Better Parsing Algorithms

Linz Theorem 5.2 (Exhaustive Search Parsing): Suppose that G = (V, T, S, P) is a context-free grammar that does not have any rules of one of the forms

A \rightarrow \lambda
A \rightarrow B

where A, B \in V. Then the exhaustive search parsing method can be formulated as an algorithm which, for any w \in T^{*}, either parses w or tells us that parsing is impossible.

Proof outline: With no \lambda-productions or unit productions, each derivation step either increases the length of the sentential form or increases its number of terminal symbols. Both quantities are bounded by |w|, so the search must stop after at most 2|w| rounds.

But exhaustive search is still inefficient. The number of sentential forms to be generated is bounded by

\sum_{i=1}^{2|w|} |P|^{i}.

That is, it grows exponentially with the length of the string.
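For example, for the four-production grammar of Linz Example 5.7 and the four-symbol string w = aabb, this bound is \sum_{i=1}^{8} 4^{i} = 87380 sentential forms; the hand trace above examined far fewer only because of its pruning and early termination.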

Linz Theorem 5.3 (Efficient Parsing): For every context-free grammar there exists an algorithm that parses any w \in L(G) in a number of steps proportional to |w|^{3}.

5.2.7 Simple Grammar Definition

Linz Definition 5.4 (Simple Grammar): A context-free grammar G = (V,T,S,P) is said to be a simple grammar or s-grammar if all its productions are of the form

A \rightarrow ax

where A \in V, a \in T, x \in V^{*}, and any pair (A, a) occurs at most once in P.

5.2.8 Linz Example 5.9

The grammar

S \rightarrow aS\ | \ bSS\ | \ c

is an s-grammar.

The grammar

S \rightarrow aS\ | \ bSS\ | \ aSS\ | \ c

is not an s-grammar because (S, a) occurs twice.

5.2.9 Parsing Simple Grammars

Although s-grammars are quite restrictive, many features of programming languages can be described with s-grammars (e.g., grammars for arithmetic expressions).

If G is an s-grammar, then w \in L(G) can be parsed in linear time.

To see this, consider string w = a_{1}a_{2} \cdots a_{n} and use the exhaustive search parsing algorithm.

  1. The s-grammar has at most one rule for S whose right side begins with a_{1}: S \rightarrow a_{1}A_{1}A_{2} \cdots. Choose it!

  2. Then the s-grammar has at most one rule for A_{1} whose right side begins with a_{2}: A_{1} \rightarrow a_{2}B_{1}B_{2} \cdots. Choose it!

  3. And so forth up to the nnth terminal.

The number of steps is proportional to |w| because each step consumes one symbol of w.
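This idea takes only a few lines of code. Here is a minimal sketch added to these notes (not from the textbook; the table-of-pairs representation and function name are illustrative choices): because each (variable, terminal) pair selects at most one production, each input symbol is handled by a single table lookup.

    def s_grammar_parse(P, start, w):
        """P maps (variable, first terminal) to the string of variables that follows,
        e.g. the production S -> bSS is stored as P[("S", "b")] = "SS"."""
        pending = [start]                     # variables still to rewrite, leftmost first
        for symbol in w:
            if not pending:
                return False                  # input remains but nothing left to derive
            A = pending.pop(0)
            rhs = P.get((A, symbol))
            if rhs is None:
                return False                  # no production A -> symbol...: reject
            pending = list(rhs) + pending     # new variables now lead the sentential form
        return not pending                    # accept only if no variables remain

    # Linz Example 5.9: S -> aS | bSS | c (an s-grammar)
    P = {("S", "a"): "S", ("S", "b"): "SS", ("S", "c"): ""}
    print(s_grammar_parse(P, "S", "abcc"))   # True:  S => aS => abSS => abcS => abcc
    print(s_grammar_parse(P, "S", "abc"))    # False: a variable is left over at the end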

5.2.10 Ambiguity in Grammars and Languages

A derivation tree for some string generated by a context-free grammar may not be unique.

Linz Definition 5.5 (Ambiguity): A context-free grammar G is said to be ambiguous if there exists some w \in L(G) that has at least two distinct derivation trees. Alternatively, ambiguity implies the existence of two or more leftmost or rightmost derivations.

5.2.11 Linz Example 5.10

Again consider the grammar in Linz Example 5.4. Its productions are

S \rightarrow aSb\ | \ SS\ | \ \lambda.

The string w = aabb has two derivation trees as shown in Linz Figure 5.4.

Linz Fig. 5.4: Two Derivation Trees for aabb

The left tree corresponds to the leftmost derivation S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb.

The right tree corresponds to the leftmost derivation S \Rightarrow SS \Rightarrow \lambda S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb.

Thus the grammar is ambiguous.

5.2.12 Linz Example 5.11

Consider the grammar G = (V, T, E, P) with

V = \{E, I\}
T = \{a, b, c, +, *, (, )\}

and P including the productions:

E \rightarrow I
E \rightarrow E + E
E \rightarrow E * E
E \rightarrow (E)
I \rightarrow a\ | \ b\ | \ c

This grammar generates a subset of the arithmetic expressions for a language like C or Java. It contains strings such as (a+b)*c and a*b+c.

Linz Figure 5.5 shows two derivation trees for the string a+b*c. Thus this grammar is ambiguous.

Linz Fig. 5.5: Two Derivation Trees for a+b*c

Why is ambiguity a problem?

Remember that the semantics (meaning) of the expression is also associated with the structure of the expression. The structure determines how the (machine language) code is generated to carry out the computation.

How do real programming languages resolve this ambiguity?

Often, they add precedence rules that give priority to "*" over "+". That is, the multiplication operator binds more tightly than addition.

This solution is totally outside the world of the context-free grammar. It is, in some sense, a hack.

A better solution is to rewrite the grammar (or sometimes redesign the language) to eliminate the ambiguity.

5.2.13 Linz Example 5.12

To rewrite the grammar in Linz Example 5.11, we introduce new variables, making V the set \{E, T, F, I\}, and replacing the productions with the following:

E \rightarrow T
T \rightarrow F
F \rightarrow I
E \rightarrow E + T
T \rightarrow T * F
F \rightarrow (E)
I \rightarrow a\ | \ b\ | \ c

Linz Figure 5.6 shows the only derivation tree for string a+b*c in this revised grammar for arithmetic expressions.

Linz Fig. 5.6: Derivation Tree for a+b*c in Revised Grammar
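For comparison with Linz Example 5.11, the leftmost derivation of a+b*c in the revised grammar (the one corresponding to the tree in Linz Figure 5.6) is:

E \Rightarrow E + T \Rightarrow T + T \Rightarrow F + T \Rightarrow I + T \Rightarrow a + T \Rightarrow a + T * F \Rightarrow a + F * F \Rightarrow a + I * F \Rightarrow a + b * F \Rightarrow a + b * I \Rightarrow a + b * c

Note how the multiplication is forced to occur inside a term, lower in the tree than the addition, which is exactly what gives "*" higher precedence than "+".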

5.2.14 Inherently Ambiguous

Linz Definition 5.6: If L is a context-free language for which there exists an unambiguous grammar, then L is said to be unambiguous. If every grammar that generates L is ambiguous, then the language is called inherently ambiguous.

It is difficult to demonstrate that a language is inherently ambiguous. Often the best we can do is to give examples and argue informally that all grammars for the language must be ambiguous.

5.2.15 Linz Example 5.13

The language

L = \{a^{n}b^{n}c^{m}\} \cup \{a^{n}b^{m}c^{m}\},

with n and m non-negative, is an inherently ambiguous context-free language.

Note that L = L_{1} \cup L_{2}, where L_{1} = \{a^{n}b^{n}c^{m}\} and L_{2} = \{a^{n}b^{m}c^{m}\}.

We can generate L_{1} with the context-free grammar:

S_{1} \rightarrow S_{1}c\ | \ A
A \rightarrow aAb\ | \ \lambda

Similarly, we can generate L_{2} with the context-free grammar:

S_{2} \rightarrow aS_{2}\ | \ B
B \rightarrow bBc\ | \ \lambda

We can thus generate the union of these two sublanguages by adding a new start variable S and the production:

S \rightarrow S_{1}\ | \ S_{2}

Thus L is a context-free language.

But consider a string of the form a^{n}b^{n}c^{n} (i.e., n = m). It has two derivations, one starting with

S \Rightarrow S_{1}

and another starting with

S \Rightarrow S_{2}.

Thus the grammar is ambiguous.

L_{1} and L_{2} have conflicting requirements: L_{1} places restrictions on the number of a's and b's, while L_{2} places restrictions on the number of b's and c's. It is impossible to find production rules that satisfy the n = m case uniquely.

5.3 Context-Free Grammars and Programming Languages

The syntax of practical programming languages is usually expressed with context-free grammars. Compilers and interpreters must parse programs in these languages in order to execute them.

The grammar for a programming language is often written in Backus-Naur Form (BNF).

For example, the grammar for arithmetic expressions in Linz Example 5.12 can be written in BNF as:

    <expression> ::= <term>   | <expression> + <term>
          <term> ::= <factor> | <term> * <factor>

The items in angle brackets are variables, the symbols such as "+" and "*" are terminals, the "|" denotes alternatives, and "::=" separates the left and right sides of the productions.
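The remaining productions of the Example 5.12 grammar can be transcribed in the same way (the name <identifier> for the variable I is a naming choice made here, not Linz's):

            <factor> ::= <identifier> | ( <expression> )
        <identifier> ::= a | b | c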

Programming languages often use restricted grammars to get linear parsing: e.g., regular grammars, s-grammars, LL grammars, and LR grammars.

The aspects of programming languages that can be modeled by context-free grammars are called the syntax.

Aspects such as type-checking are not context-free. Such issues are sometimes considered (incorrectly in your instructor's view) as part of the semantics of the language.

These are really still syntax, but they must be expressed in ways that are not context free.