Copyright (C) 2015, H. Conrad Cunningham
Acknowledgements: These lecture notes are for use with Chapter 1 of the textbook: Peter Linz. Introduction to Formal Languages and Automata, Fifth Edition, Jones and Bartlett Learning, 2012. The terminology and notation used in these notes are similar to those used in the Linz textbook. This document uses several figures from the Linz textbook.
Advisory: The HTML version of this document requires use of a browser that supports the display of MathML. A good choice as of December 2015 seems to be a recent version of Firefox from Mozilla.
Why study theory?
In this course, we study the following models:
The mathematical concepts used in the Linz textbook include:
Students in this course should be familiar with the following set concepts from previous study in mathematics:
Laws for operations on the empty set:
DeMorgan's Laws:
Function means that
Function is a
A relation on and is any subset of .
An equivalence relation is a generalization of equality. It is:
A is a mathematical entity where
is a finite set of vertices (or nodes)
is a finite set of edges (or arcs)
each edge is a pair of vertices
A directed graph (or digraph) is a graph in which each edge has a direction from vertex to vertex .
Edge on a digraph is an outgoing edge from vertex and an incoming edge to vertex .
If there are no directions associated with the edges, then the graph is undirected.
Graphs may be labeled by assigning names or other information to vertices or edges.
We can visualize a graph with a diagram in which the vertices are shown as circles and edges as lines connecting a pair of vertices. For directed graphs, the direction of an edge is shown by an arrow.
Linz Figure 1.1 shows a digraph where and edges .
A sequence of edges is a walk from to . The length of the walk is the total number of edges traversed.
A path is a walk with no edge repeated. A path is simple if no vertex is repeated.
A walk from some vertex to itself is a cycle with base . If no vertex other than the base is repeated, then the cycle is simple.
In Linz Figure 1.1:
If the edges of a graph are labelled, then the label of a walk (or path) is the sequence of edges encountered on a traversal.
A tree is a directed graph with no cycles and a distinct root vertex such that there is exactly one path from the root to every other vertex.
The root of a tree has no incoming edges.
A vertex of a tree without any outgoing edges is a leaf of the tree.
If there is an edge from to in a tree, then:
is the parent of
is a child of
The level associated with each vertex is the number of edges in the path from the root to the vertex.
The height of a tree is the largest level number of any vertex.
If we associated an ordering with the vertices at each level, then the tree is an ordered tree.
The above terminology is illustrated in Linz Figure 1.2.
Students in this course should be familiar with the following proof techniques from previous study in mathematics:
We will see an example of an inductive proof in the next section.
Three fundamental ideas are the major themes of this course:
Our concept of language is an abstraction of the concept of a natural language.
Linz Definition (Alphabet): An alphabet, denoted by , is a finite, nonempty set of symbols.
By convention, we use lowercase letters near the beginning of the English alphabet to represent elements of .
For example, if , then the alphabet has two unique symbols denoted by and .
Linz Definition (String): A string is a finite sequence of symbols from the alphabet.
By convention, we use lowercase letters near the end of the English alphabet to represent strings. We write strings left to right. That is, symbols appearing to the left are before those appearing to the right.
For example, is a string from the above alphabet. The string begins with a and ends with an .
Linz Definition (Concatenation): The concatenation of strings and means appending the symbols of to the right end (i.e., after) the symbols of , denoted by uv.
If and , then .
Definition (Associativity): Operation is associative on set if, for all , , and in , . We often write associative expressions without explicit parentheses, e.g., .
String concatenation is associative, that is, .
Thus we normally just write without parentheses.
Definition (Commutativity): Operation is commutative on set if, for all and in , .
String concatenation is not commutative. That is, .
Linz Definition (Reverse): The reverse of a string , denoted by , is the string with same symbols, but with the order reversed.
If , then .
Linz Definition (Length): The length of a string w, denoted by , is the number of symbols in string .
Linz Definition (Empty String): The empty string, denoted by , is the string with no symbols, that is, .
Definition (Identity Element): An operation has an identity element on set if, for all , .
The empty string is the identity element for concatenation. That is, .
We can define the length of a string with the following inductive definition:
Note: This inductive defintion and proof differs from the textbook. Here we begin with the empty string.
Using the fact that is the identity element and the above definition, we see that
.
Prove .
Noting the definition of length above, we choose to do an induction over string v (or, if you prefer, over the length of , basing induction at 0).
Base case (that is, length is 0)
= | { identity for concatenation } justification for step in braces |
= | { identity for + } |
= | { definition of length } |
Inductive case (that is, length is greater than 0)
Induction hypothesis:
= | { associativity of concatenation } |
= | { definition of length } |
= | { induction hypothesis } |
= | { associativity of + } |
= | { definition of length (right to left) } |
Thus we have proved . QED.
Linz Definition (Substring): A substring of a string is any string of consecutive symbols in .
If , then the substrings are .
Linz Definition (Prefix, Suffix): If , then is a prefix of and is a suffix.
If , the prefixes are .
Linz Definition (): , for any string w and denotes repetitions of string (or symbol) . We further define .
Linz Definition (Star-Closure): , for alphabet , is the set of all strings obtained by concatenating zero or more symbols from the alphabet.
Note: An alphabet must be a finite set.
Linz Definition (Positive Closure):
Although is finite, and are infinite.
For a string , we also write and to denote zero or more repetitions of the string and one or more repetitions of the string, respectively.
Linz Definition (Language): A language, for some alphabet , is a subset of .
Linz Definition (Sentence): A sentence of some language is any string from (i.e., from ).
Let .
.
is a language on .
Since the language has a finite number of sentences, it is a finite language.
Sentences and are in , but is not.
As with most interesting languages, is an infinite language.
Languages are represented as sets. Operations on languages can be defined in terms of set operations.
Linz Definition (Union, Intersection, and Difference): Language union, intersection, and difference are defined directly as the corresponding operations on sets.
Linz Definition (Concatenation): Language complementation with respect to is defined such that .
Linz Definition (Reverse): Language reverse is defined such that . (That is, reverse all strings.)
Linz Definition (Concatenation): Language concatenation is defined such that .
Linz Definition (): means concatenated with itself times.
and
Definition (Star-Closure): Star-closure (Kleene star) is defined such that .
Definition (Positive Closure): Positive closure is defined such that .
Let .
(where n and m are unrelated).
.
How would we express in and ?
Although set notation is useful, it is not a convenient notation for expressing complicated languages.
Linz Definition 1.1 (Grammar): A grammar is a quadruple where
is a finite set of objects called variables.
is a finite set of objects called terminal symbols.
is a special symbol called the start symbol.
is a finite set of productions.
and are nonempty and disjoint.
Linz Definition (Productions): Productions have form where:
, i.e., is some non-null string of terminals and variables
, i.e., is some, possibly null, string of terminals and variables
Consider application of productions, given :
is applicable to string .
To use the production, substitute for .
Thus the new string is .
We say derives , written .
Continue by applying any applicable productions in arbitrary order.
.
Linz Definition (Derives): means that derives in zero or more production steps.
Linz Definition (Language Generated): Let be a grammar. Then is the language generated by .
That is, is the set of all strings that can be generated from the start symbol using the productions .
Linz Definition (Derivation): A derivation of some sentence is a sequence .
The strings above are sentential forms of the derivation of sentence .
Consider where P is the set of productions
Consider . Hence, .
is a sentence of the language; the other strings in the derivation are sentential forms.
Conjecture: The language formed by this grammar, , is .
Usually, however, it is difficult to construct an explicit set definition for a language generated by a grammar.
Now prove the conjecture.
First, prove that all sentential forms of the language have the structure for by induction on i.
Clearly, is a sentential form, the start symbol.
Assume is a sentential form, show that .
Case 1: If we begin with the assumption and apply production , we get sentential form .
Case 2: If we begin with the assumption and apply production , we get the sentence rather than a sentential form.
Hence, all sentential forms have the form .
Given that is the only production with terminals on the right side, we must apply it to derive any sentence. As we noted in case 2 above, application of the production to any sentential form gives a sentence of the form . QED.
Given .
Recursive production would generate sentential forms .
Need production(s) to add the extra b to the final sentence.
Suggest .
A slightly different grammar might introduce nonterminal A as follows:
To show that a language is generated by a grammar , we must prove:
For every , there is a derivation using .
Every string derived from is in .
Linz Definition (Equivalence): Two grammars are equivalent if they generate the same language.
For example, the two grammars given above for the language are equivalent.
Let and let and denote the number of 's and 's in the string .
Let grammar have productions
Let .
Prove .
Informal argument follows. Actual proof would be an induction over length of .
Consider cases for .
Case . Show .
Any production adding an a also adds a b. Thus there is the same number of a's and b's.
Case . Show .
Consider or for some .
String was generated by either or in the first step.
Thus .
Consider (or ) for some .
Examine the symbols of w from the left -- add 1 for each a, subtract 1 for each b.
Since sum must be 0 at right, there must be a point where the sum crosses 0.
Break at that point into form where .
First production is .
Thus .
An automaton is an abstract model of a compute
As shown in Linz Figure 1.4, a computer:
reads input from an input file -- one symbol from each cell -- left to right
produces output
may use storage -- unlimited number of cells (may be different alphabet)
A configuration is a state of the control unit, input, and storage.
A move is a transition from one state configuration to another.
Automata can be categorized based on control:
A deterministic automaton has a unique next state from the current configuration.
A nondeterministic automaton has several possible next states.
Automata can also be categorized based on output:
An accepter has only yes/no output.
A transducer has strings or symbols for output,
Various models differ in
how the output is produced
the nature of temporary storage
The syntax rules for identifiers in the language C are as follows:
An identifier is a sequence of letters, digits, and underscores.
An identifier must start with a letter or underscore.
Identifiers allow both uppercase and lowercase letters.
Formally, we can describe these rules with the grammar:
<id> -> <letter><rest> | <underscr><rest>
<rest> -> <letter><rest> | <digit><rest> |
<underscr><rest> | <lambda>
<letter> -> a|b|c|...|z|A|B|C|...|Z
<digit> -> 0|1|2|...|9
<underscr> -> _
Above <lambda>
represents the symbol ,->
is the for productions, and |
denotes alternative right-hand-sides of the productions.
The variables are <id>
, <letter>
, <digit>
, <underscr>
, and <rest>
. The other alphanumeric symbols are literals.
Linz Figure 1.6 shows a drawing of an automaton that accepts all legal C identifiers as defined above.
We can interpret the automaton in Linz Figure 1.6 as follows:
<digit>
then the machine moves to state 3. The machine stops reading with answer No (non-accepting).<letter>
or <underscr>
then it moves to state 2. The machine continues.<letter>
, <underscr>
, or <digit>
, then the machine reads the input and remains in state 2.Let where are bits.
Then .
This is the usual binary representation in reverse.
A serial adder process two such numbers x and y, bit by bit, starting at the left end. Each bit addition creates a digit for the sum and a carry bit as shown in Linz Figure 1.7.
A block diagram for the machine is shown in Linz Figure 1.8.
A transducer automaton to carry out the addition of two numbers is shown in Linz Figure 1.9.
The pair on the edges represents the two inputs. The value following the slash is the output.