Copyright (C) 2015, H. Conrad Cunningham

Acknowledgements: These lecture notes are for use with Chapter 1 of the textbook: Peter Linz. Introduction to Formal Languages and Automata, Fifth Edition, Jones and Bartlett Learning, 2012. The terminology and notation used in these notes are similar to those used in the Linz textbook. This document uses several figures from the Linz textbook.

Advisory: The HTML version of this document requires use of a browser that supports the display of MathML. A good choice as of December 2015 seems to be a recent version of Firefox from Mozilla.

Introduction

Why study theory?

  1. To understand the concepts and principles underlying the fundamental nature of computing
  2. To learn to apply the theory to practical areas of computing
  3. To have fun!

In this course, we study the following models:

  1. automaton (automata)
  2. formal language
  3. algorithm

1.1 Mathematical Preliminaries and Notation

The mathematical concepts used in the Linz textbook include:

1.1.1 Sets

Students in this course should be familiar with the following set concepts from previous study in mathematics:

Laws for operations on the empty set:

DeMorgan's Laws:

1.1.2 Functions

Function f:DRf: D \rightarrow R means that

Function ff is a

1.1.3 Relations

A relation on XX and YY is any subset of X×YX \times Y.

An equivalence relation \equiv is a generalization of equality. It is:

  1. reflexive: xxxx \equiv x \forall x
  2. symmetric: if xyx \equiv y then yxy \equiv x
  3. transitive: if xyx \equiv y and yzy \equiv z then xzx \equiv z

1.1.4 Graphs

A graphgraph V,E\langle V, E \rangle is a mathematical entity where

A directed graph (or digraph) is a graph in which each edge ei=(vj,vk)e_{i} = ( v_{j}, v_{k} ) has a direction from vertex vjv_{j} to vertex vkv_{k}.

Edge ei=(vj,vk)e_{i} = ( v_{j}, v_{k} ) on a digraph is an outgoing edge from vertex vjv_{j} and an incoming edge to vertex vkv_{k}.

If there are no directions associated with the edges, then the graph is undirected.

Graphs may be labeled by assigning names or other information to vertices or edges.

We can visualize a graph with a diagram in which the vertices are shown as circles and edges as lines connecting a pair of vertices. For directed graphs, the direction of an edge is shown by an arrow.

Linz Figure 1.1 shows a digraph V,E\langle V, E \rangle where V={v1,v2,v3}V = \{ v_{1}, v_{2}, v_{3} \} and edges E={(v1v3),(v3,v1),(v3,v2),(v3,v3)}E = \{ (v_{1} v_{3}),(v_{3},v_{1}),(v_{3},v_{2}),(v_{3},v_{3}) \}.

Linz Fig. 1.1: Diagram of a Digraph

Linz Fig. 1.1: Diagram of a Digraph

A sequence of edges (vi,vj),(vj,vk),,(vm,vn)(v_{i},v{j}), (v_{j},v{k}), \ldots, (v_{m},v_{n}) is a walk from vjv_{j} to vnv_{n}. The length of the walk is the total number of edges traversed.

A path is a walk with no edge repeated. A path is simple if no vertex is repeated.

A walk from some vertex viv_{i} to itself is a cycle with base viv_{i}. If no vertex other than the base is repeated, then the cycle is simple.

In Linz Figure 1.1:

If the edges of a graph are labelled, then the label of a walk (or path) is the sequence of edges encountered on a traversal.

1.1.5 Trees

A tree is a directed graph with no cycles and a distinct root vertex such that there is exactly one path from the root to every other vertex.

The root of a tree has no incoming edges.

A vertex of a tree without any outgoing edges is a leaf of the tree.

If there is an edge from viv_{i} to vjv_{j} in a tree, then:

The level associated with each vertex is the number of edges in the path from the root to the vertex.

The height of a tree is the largest level number of any vertex.

If we associated an ordering with the vertices at each level, then the tree is an ordered tree.

The above terminology is illustrated in Linz Figure 1.2.

Linz Fig. 1.2: A Tree

Linz Fig. 1.2: A Tree

1.1.6 Proof Techniques

Students in this course should be familiar with the following proof techniques from previous study in mathematics:

We will see an example of an inductive proof in the next section.

1.2 Three Basic Concepts

Three fundamental ideas are the major themes of this course:

  1. languages
  2. grammars
  3. automata

1.2.1 Languages

Our concept of language is an abstraction of the concept of a natural language.

1.2.1.1 Language Concepts

Linz Definition (Alphabet): An alphabet, denoted by Σ\Sigma, is a finite, nonempty set of symbols.

By convention, we use lowercase letters near the beginning of the English alphabet a,b,c,a, b, c, \ \cdots to represent elements of Σ\Sigma.

For example, if Σ={a,b}\Sigma = \{ a, b \}, then the alphabet has two unique symbols denoted by aa and bb.

Linz Definition (String): A string is a finite sequence of symbols from the alphabet.

By convention, we use lowercase letters near the end of the English alphabet u,v,w,x,y,z\cdots\ u, v, w, x, y, z to represent strings. We write strings left to right. That is, symbols appearing to the left are before those appearing to the right.

For example, w=baabaaw = baabaa is a string from the above alphabet. The string begins with a bb and ends with an aa.

Linz Definition (Concatenation): The concatenation of strings uu and vv means appending the symbols of vv to the right end (i.e., after) the symbols of uu, denoted by uv.

If u=a1a2a3u = a_{1} a_{2} a_{3} and v=b1b2b3v = b_{1} b_{2} b_{3}, then uv=a1a2a3b1b2b3uv = a_{1} a_{2} a_{3} b_{1} b_{2} b_{3}.

Definition (Associativity): Operation \oplus is associative on set SS if, for all xx, yy, and zz in SS, (xy)z=x(yz)(x \oplus y) \oplus z = x \oplus (y \oplus z). We often write associative expressions without explicit parentheses, e.g., xyzx \oplus y \oplus z.

String concatenation is associative, that is, (uv)w=u(vw)(u v) w = u (v w).

Thus we normally just write uvwuvw without parentheses.

Definition (Commutativity): Operation \oplus is commutative on set SS if, for all xx and yy in SS, xy=yxx \oplus y = y \oplus x.

String concatenation is not commutative. That is, uvvuuv \neq vu.

Linz Definition (Reverse): The reverse of a string ww, denoted by wRw^{R}, is the string with same symbols, but with the order reversed.

If w=a1a2a3w = a_{1}a_{2}a_{3}, then wR=a3a2a1w^{R} = a_{3}a_{2}a_{1}.

Linz Definition (Length): The length of a string w, denoted by |w||w|, is the number of symbols in string ww.

Linz Definition (Empty String): The empty string, denoted by λ\lambda, is the string with no symbols, that is, |λ|=0|\lambda| = 0.

Definition (Identity Element): An operation \oplus has an identity element ee on set SS if, for all xSx \in S, xe=x=exx \oplus e = x = e \oplus x.

The empty string λ\lambda is the identity element for concatenation. That is, λw=wλ=w\lambda w = w \lambda = w.

1.2.1.2 Formal Interlude: Inductive Definitions and Induction

We can define the length of a string with the following inductive definition:

  1. |λ|=0|\lambda| = 0 (base case)
  2. |wa|=|w|+1|wa| = |w| + 1 (inductive case)

Note: This inductive defintion and proof differs from the textbook. Here we begin with the empty string.

Using the fact that λ\lambda is the identity element and the above definition, we see that

|a|=|λa|=|λ|+1=0+1=1|a| = |\lambda a| = |\lambda| + 1 = 0 + 1 = 1.

Prove |uv|=|u|+|v||uv|\ =\ |u| + |v|.

Noting the definition of length above, we choose to do an induction over string v (or, if you prefer, over the length of vv, basing induction at 0).

Base case v=λv\ =\ \lambda (that is, length is 0)

|uλ|| u \lambda |
= { identity for concatenation } \longleftarrow justification for step in braces
|u|| u |
= { identity for + }
|u|+0| u | + 0
= { definition of length }
|u|+|λ|| u | + | \lambda |

Inductive case v=wav=wa (that is, length is greater than 0)
Induction hypothesis: |uw|=|u|+|w||uw| = |u| + |w|

|u(wa)|| u (w a) |
= { associativity of concatenation }
|(uw)a|| (uw)a |
= { definition of length }
|uw|+1| uw | + 1
= { induction hypothesis }
(|u|+|w|)+1(|u| + |w|) + 1
= { associativity of + }
|u|+(|w|+1)|u| + (|w| + 1)
= { definition of length (right to left) }
|u|+(|wa|)|u| + (|w a|)

Thus we have proved |uv|=|u|+|v||uv| = |u| + |v|. QED.

1.2.1.3 More Language Concepts

Linz Definition (Substring): A substring of a string ww is any string of consecutive symbols in ww.

If w=a1a2a3w = a_{1}a_{2}a_{3}, then the substrings are λ,a1,a1a2,a1a2a3,a2,a2a3,a3\lambda, a_{1}, a_{1}a_{2}, a_{1}a_{2}a_{3}, a_{2}, a_{2}a_{3}, a_{3}.

Linz Definition (Prefix, Suffix): If w=vuw = vu, then vv is a prefix of ww and uu is a suffix.

If w=a1a2a3w = a_{1}a_{2}a_{3}, the prefixes are λ,a1,a1a2,a1a2a3\lambda, a_{1}, a_{1}a_{2}, a_{1}a_{2}a_{3}.

Linz Definition (wnw^{n}): wnw^{n}, for any string w and n0n \geq 0 denotes nn repetitions of string (or symbol) ww. We further define w0=λw^{0} = \lambda.

Linz Definition (Star-Closure): Σ*\Sigma^{*}, for alphabet Σ\Sigma, is the set of all strings obtained by concatenating zero or more symbols from the alphabet.

Note: An alphabet must be a finite set.

Linz Definition (Positive Closure): Σ+=Σ*λ\Sigma^{+} = \Sigma^{*} - \lambda

Although Σ\Sigma is finite, Σ*\Sigma^{*} and Σ+\Sigma^{+} are infinite.

For a string ww, we also write w*w^{*} and w+w^{+} to denote zero or more repetitions of the string and one or more repetitions of the string, respectively.

Linz Definition (Language): A language, for some alphabet Σ\Sigma, is a subset of Σ*\Sigma^{*}.

Linz Definition (Sentence): A sentence of some language LL is any string from LL (i.e., from Σ*\Sigma^{*}).

1.2.1.4 Linz Example 1.9: Example Languages

Let Σ={a,b}\Sigma = \{ a, b \}.

Since the language has a finite number of sentences, it is a finite language.

Sentences aabbaabb and aaaabbbbaaaabbbb are in LL, but aaabbaaabb is not.

As with most interesting languages, LL is an infinite language.

1.2.1.5 Operations on Languages

Languages are represented as sets. Operations on languages can be defined in terms of set operations.

Linz Definition (Union, Intersection, and Difference): Language union, intersection, and difference are defined directly as the corresponding operations on sets.

Linz Definition (Concatenation): Language complementation with respect to Σ*\Sigma^{*} is defined such that L=Σ*L\bar{L} = \Sigma^{*} - L.

Linz Definition (Reverse): Language reverse is defined such that LR={wR:wL}L^{R} = \{ w^{R} : w \in L \}. (That is, reverse all strings.)

Linz Definition (Concatenation): Language concatenation is defined such that L1L2={xy:xL1,yL2}L_{1} L_{2} = \{ x y : x \in L_{1}, y \in L_{2} \}.

Linz Definition (LnL^{n}): LnL^{n} means LL concatenated with itself nn times.

L0={λ}L^{0} = \{ \lambda \} and Ln+1=LnLL^{n+1} = L^{n}L

Definition (Star-Closure): Star-closure (Kleene star) is defined such that L*=L0L1L2L^{*} = L^{0} \cup L^{1} \cup L^{2} \cup \cdots.

Definition (Positive Closure): Positive closure is defined such that L+=L1L2L^{+} = L^{1} \cup L^{2} \cup \cdots.

1.2.1.6 Language Operation Examples

Let L={anbn:n0}L = \{ a^{n} b^{n} : n \geq 0 \}.

How would we express in L\bar{L} and L*L^{*}?

Although set notation is useful, it is not a convenient notation for expressing complicated languages.

1.2.2 Grammars

1.2.2.1 Grammar Concepts

Linz Definition 1.1 (Grammar): A grammar GG is a quadruple G=(V,T,S,P)G = (V, T, S, P) where

VV is a finite set of objects called variables.
TT is a finite set of objects called terminal symbols.
SVS \in V is a special symbol called the start symbol.
PP is a finite set of productions.
VV and TT are nonempty and disjoint.

Linz Definition (Productions): Productions have form xyx \rightarrow y where:

x(VT)+x \in (V \cup T)^{+}, i.e., xx is some non-null string of terminals and variables
y(VT)*y \in (V \cup T)^{*}, i.e., yy is some, possibly null, string of terminals and variables

Consider application of productions, given w=uxvw = uxv:

Linz Definition (Derives): w1*wnw_{1} \overset{*}{\Rightarrow} w_{n} means that w1w_{1} derives wnw_{n} in zero or more production steps.

Linz Definition (Language Generated): Let G=(V,T,S,P)G = (V, T, S, P) be a grammar. Then L(G)={wT*:S*w}L(G) = \{ w \in T^{*} : S \overset{*}{\Rightarrow} w \} is the language generated by GG.

That is, L(G)L(G) is the set of all strings that can be generated from the start symbol SS using the productions PP.

Linz Definition (Derivation): A derivation of some sentence wL(G)w \in L(G) is a sequence Sw1w2w3wnwS\Rightarrow w_{1} \Rightarrow w_{2} \Rightarrow w_{3} \Rightarrow \cdots \Rightarrow w_{n} \Rightarrow w.

The strings S,w1,,wnS, w_{1}, \cdots, w_{n} above are sentential forms of the derivation of sentence ww.

1.2.2.2 Linz Example 1.11 (Grammar)

Consider G=({S},{a,b},S,P)G = (\{S\},\{a,b\},S,P) where P is the set of productions

Consider SaSbaaSbbaabbS \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aabb. Hence, S*aabbS \overset{*}{\Rightarrow} aabb.

aabbaabb is a sentence of the language; the other strings in the derivation are sentential forms.

Conjecture: The language formed by this grammar, L(G)L(G), is {anbn:n0}\{ a^{n} b^{n} : n \geq 0 \}.

Usually, however, it is difficult to construct an explicit set definition for a language generated by a grammar.

Now prove the conjecture.

First, prove that all sentential forms of the language have the structure wi=aiSbiw_{i} = a^{i}Sb^{i} for i0i \geq 0 by induction on i.

Basis step:

Clearly, w0=Sw_{0} = S is a sentential form, the start symbol.

Inductive step:

Assume wm=amSbmw_{m} = a^{m}Sb^{m} is a sentential form, show that wm+1=am+1Sbm+1w_{m+1} = a^{m+1}Sb^{m+1}.

Case 1: If we begin with the assumption and apply production SaSbS \rightarrow aSb, we get sentential form wm+1=am+1Sbm+1w_{m+1} = a^{m+1}Sb^{m+1}.

Case 2: If we begin with the assumption and apply production SλS \rightarrow \lambda, we get the sentence ambma^{m}b^{m} rather than a sentential form.

Hence, all sentential forms have the form aiSbia^{i}Sb^{i}.

Given that SλS \rightarrow \lambda is the only production with terminals on the right side, we must apply it to derive any sentence. As we noted in case 2 above, application of the production to any sentential form gives a sentence of the form ambma^{m}b^{m}. QED.

1.2.2.3 Linz Example 1.12: Finding a Grammar for a Language

Given L={anbn+1:n0}L = \{a^{n}b^{n+1} : n \geq 0 \}.

A slightly different grammar might introduce nonterminal A as follows:

1.2.2.4 More Grammar Concepts

To show that a language LL is generated by a grammar GG, we must prove:

  1. For every wLw \in L, there is a derivation using GG.

  2. Every string derived from GG is in LL.

Linz Definition (Equivalence): Two grammars are equivalent if they generate the same language.

For example, the two grammars given above for the language L={anbn+1:n0}L = \{a^{n}b^{n+1} : n \geq 0 \} are equivalent.

1.2.2.5 Linz Example 1.13

Let Σ={a,b}\Sigma = \{a,b\} and let na(w)n_{a}(w) and nb(w)n_{b}(w) denote the number of aa's and bb's in the string ww.

Let grammar GG have productions

SSSS \rightarrow SS
SλS \rightarrow \lambda
SaSbS \rightarrow aSb
SbSaS \rightarrow bSa

Let L={w:na(w)=nb(w)}L = \{w : n_{a}(w) = n_{b}(w) \}.

Prove L(G)=LL(G) = L.

Informal argument follows. Actual proof would be an induction over length of ww.

Consider cases for ww.

  1. Case wL(G)w \in L(G). Show wLw \in L.

    Any production adding an a also adds a b. Thus there is the same number of a's and b's.

  2. Case wLw \in L. Show wL(G)w \in L(G).

1.2.3 Automata

An automaton is an abstract model of a compute

Linz Fig. 1.4: Schematic Representation of a General Automaton

Linz Fig. 1.4: Schematic Representation of a General Automaton

As shown in Linz Figure 1.4, a computer:

  1. reads input from an input file -- one symbol from each cell -- left to right

  2. produces output

  3. may use storage -- unlimited number of cells (may be different alphabet)

  4. has a control unit

A configuration is a state of the control unit, input, and storage.

A move is a transition from one state configuration to another.

Automata can be categorized based on control:

Automata can also be categorized based on output:

Various models differ in

1.3 Applications

1.3.1 Linz Example 1.15: C Identifiers

The syntax rules for identifiers in the language C are as follows:

Formally, we can describe these rules with the grammar:

    <id>       -> <letter><rest>   | <underscr><rest>
    <rest>     -> <letter><rest>   | <digit><rest>    |
                  <underscr><rest> | <lambda>
    <letter>   -> a|b|c|...|z|A|B|C|...|Z
    <digit>    -> 0|1|2|...|9
    <underscr> -> _

Above <lambda> represents the symbol λ\lambda,-> is the \rightarrow for productions, and | denotes alternative right-hand-sides of the productions.

The variables are <id>, <letter>, <digit>, <underscr>, and <rest>. The other alphanumeric symbols are literals.

Linz Figure 1.6 shows a drawing of an automaton that accepts all legal C identifiers as defined above.

Linz Fig. 1.7: Automaton to Accept C Identifiers

Linz Fig. 1.7: Automaton to Accept C Identifiers

We can interpret the automaton in Linz Figure 1.6 as follows:

1.3.2 Linz Example 1.17: Binary Adder

Let x=a0a1a0anx = a_{0} a_{1} a_{0} \cdots a_{n} where aia_{i} are bits.

Then value(x)=i=0nai2ivalue(x) = \sum_{i=0}^{n} a_{i} 2^{i}.

This is the usual binary representation in reverse.

A serial adder process two such numbers x and y, bit by bit, starting at the left end. Each bit addition creates a digit for the sum and a carry bit as shown in Linz Figure 1.7.

Linz Fig. 1.7: Binary Addition Table

Linz Fig. 1.7: Binary Addition Table

A block diagram for the machine is shown in Linz Figure 1.8.

Linz Fig. 1.8: Binary Adder Block Diagram

Linz Fig. 1.8: Binary Adder Block Diagram

A transducer automaton to carry out the addition of two numbers is shown in Linz Figure 1.9.

The pair on the edges represents the two inputs. The value following the slash is the output.

Linz Fig. 1.9: Binary Adder Transducer Automaton

Linz Fig. 1.9: Binary Adder Transducer Automaton