Exploring Languages
with Interpreters
and Functional Programming
Chapter 27
H. Conrad Cunningham
04 April 2022
Browser Advisory: The HTML version of this textbook requires a browser that supports the display of MathML. A good choice as of April 2022 is a recent version of Firefox from Mozilla.
Chapter 26 illustrates how to synthesize function definitions from their specifications.
This chapter (27) applies these program synthesis techniques to a larger set of examples on text processing.
In this section we develop a text processing package similar to the one in Section 4.3 of the Bird and Wadler textbook [3]. The text processing package in the Haskell standard Prelude is slightly different in its treatment of newline characters.
A textual document can be viewed in many different ways. At the lowest level, we can view it as just a character string and define a type synonym as follows:
type Text = String
However, for other purposes, we may want to consider the document as having more structure (i.e., view it as a sequence of words, lines, paragraphs, pages, etc). We sometimes want to convert the text from one view to another.
Consider the problem of converting a Text
document
to the corresponding sequence of lines. Suppose that in the
Text
{.haskell document, the newline characters
'\n'
{.haskell serve as separators of lines, not
themselves part of the lines. Because each line is a sequence of
characters, we define a type synonym Line
as
follows:
type Line = String
We want a function lines'
that will take a Text
document
and return the corresponding sequence of lines in the document. The
function has the type signature:
lines' :: Text -> [Line]
For example, the Haskell expression
"This has\nthree\nlines" lines'
yields:
"This has", "three ", "lines"] [
Writing function lines'
is not trivial. However, its inverse unlines'
is quite easy. Function
unlines'
takes a list of
Line
s,
inserts a newline character between each pair of adjacent lines, and
returns the Text
document
resulting from the concatenation.
unlines' :: [Line] -> Text
Let’s see if we can develop lines'
from unlines'
.
The basic computational pattern for function unlines'
is a folding operation.
Because we are dealing with the construction of a list and the list
constructors are nonstrict in their right arguments, a foldr
operation seems more appropriate than a foldl
operation.
To use foldr
, we need
a binary operation that will append two lines with a newline character
inserted between them. The following, a bit more general, operation
insert'
will do that for us.
The first argument is the element that is to be inserted between the two
list arguments.
insert' :: a -> [a] -> [a] -> [a]
= xs ++ [a] ++ ys -- insert.1 insert' a xs ys
Informally, it is easy to see that (insert' a)
is an associative
operation but that it has no right (or left) identity element.
Given that (insert' a)
has no identity element, there is no obvious “seed” value to use with
fold
. Thus we will need to find
a different way to express unlines'
.
If we restrict the domain of unlines'
to non-nil lists of
lines, then we can use foldr1
, a right-folding operation
defined over non-empty lists (in the Prelude). This function does not
require an identity element for the operation. Function foldr1
can be
defined as follows:
foldr1 :: (a -> a -> a) -> [a] -> a
foldr1 f [x] = x
foldr1 f (x:xs) = f x (foldr1 f xs)
Note: There is a similar function (in the Prelude), foldl1
that
takes a non-nil list and does a left-folding operation.
Thus we can now define unlines'
as follows:
unlines' :: [Line] -> Text
= foldr1 (insert' '\n') xss unlines' xss
Given the definition of unlines'
, we can now specify what
we want lines'
to do. It
must satisfy the following specification for any non-nil xss
of type [Line]
:
= xss lines' (unlines' xss)
That is, lines'
is the
inverse of unlines'
for all
non-nil arguments.
The first step in the synthesis of lines'
is to guess at a possible
structure for the lines'
function definition. Then we will attempt to calculate the unknown
pieces of the definition.
Because unlines'
uses a
right-folding operation, it is reasonable to guess that its inverse will
also use a right-folding operation. Thus we speculate that lines'
can be defined as follows,
given an appropriately defined operation op
and “seed value” a
.
lines' :: Text -> [Line]
= foldr op a lines'
Because of the definition of foldr
and type
signature of lines'
,
function op
must have the type
signature
op :: Char -> [Line] -> [Line]
and a
must be the right
identity of op
and hence have type [Line]
.
The task now is to find appropriate definitions for op
and a
.
From what we know about unlines'
, foldr1
, lines'
, and foldr
, we see
that the following identities hold. (These can be proved, but we do not
do so here.)
= xs -- unlines.1
unlines' [xs] ++xss) =
unlines' ([xs]'\n' xs (unlines' xss) -- unlines.2
insert'
= a -- lines.1
lines' [] ++xs) = op x (lines' xs) -- lines.2 lines' ([x]
Note the names we give each of the above identities (e.g., unlines.1
).
We use these equations to justify our steps in the calculations
below.
Next, let us calculate the unknown identity element a
. The strategy is to transform a
by use of the definition and derived
properties for unlines'
and
the specification and derived properties for lines'
until we arrive at a
constant.
a
{
lines.1
(right to left) }
lines' []
{
unlines'.1
(right to left) with xs = []
}
lines' (unlines' [[]])
{
specification of lines'
(left to right)
}
[[]]
Therefore we define a
to be [[]]
. Note that because of lines.1
,
we have also defined lines'
in the case where its argument is []
.
Now we proceed to calculate a definition for op
. Remember that we assume xss /= []
.
As above, the strategy is to use what we know about unlines'
and what we have assumed
about lines'
to calculate
appropriate definitions for the unknown parts of the definition of lines'
. We first expand our
expression to bring in unlines'
.
op x xss
{
specification for lines'
(right to left)
}
op x (lines' (unlines' xss))
{
lines.2
(right to left) }
++ unlines' xss) lines' ([x]
Because there seems no other way to proceed with our calculation, we
distinguish between cases for the variable x
. In particular, we consider the case
where x
is the line separator
and the case where it is not, i.e., x == '\n'
and x /= '\n'
.
Case x == '\n'
:
Our strategy is to absorb the '\n'
into the unlines'
, then
apply the specification of lines'
.
"\n" ++ unlines' xss) lines' (
{
[]
is the identity for ++
}
++ "\n" ++ unlines' xss) lines' ([]
{
insert.1
(right to left) with a == '\n'
}
'\n' [] (unlines' xss)) lines' (insert'
{
unlines.2
(right to left) }
++ xss)) lines' (unlines' ([[]]
{
specification of lines'
(left to right)
}
++ xss [[]]
Thus op '\n' xss = [[]] ++ xss
.
Case x /= '\n'
:
Our strategy is to absorb the [x]
into the unlines'
, then apply the
specification of lines
.
++ unlines' xss) lines' ([x]
{
Assumption xss /= []
, let
xss = [ys] ++ yss
}
++ unlines' ([ys] ++ yss)) lines' ([x]
{
unlines.2
(left to right) with a = '\n'
}
++ insert' '\n' ys (unlines' yss)) lines' ([x]
{
insert.1
(left to right) }
++ (ys ++ "\n" ++ unlines' yss)) lines' ([x]
{
++
associativity }
++ ys) ++ "\n" ++ unlines' yss) lines' (([x]
{
insert.1
(right to left) }
'\n' ([x]++ys) (unlines' yss)) lines' (insert'
{
unlines.2
(right to left) }
++ys] ++ yss)) lines' (unlines' ([[x]
{
specification of lines'
(left to right)
}
++ys] ++ yss [[x]
Thus, for x /= '\n'
and xss /= []
:
= [[x] ++ head xss] ++ (tail xss) op x xss
To generalize op
like we did
insert'
and give it a more
appropriate name, we define op
to be breakOn '\n'
as follows:
breakOn :: Eq a => a -> a -> [[a]] -> [[a]]
= error "breakOn applied to nil"
breakOn a x [] | a == x = [[]] ++ xss
breakOn a x xss | otherwise = [[x] ++ ys] ++ yss
where (ys:yss) = xss
Thus, we get the following definition for lines'
:
lines' :: Text -> [Line]
= foldr (breakOn '\n') [[]] xs lines' xs
Let’s review what we have done in this example. We have synthesized
lines'
from its
specification and the definition for unlines'
, its inverse. Starting
from a precise, but non-executable specification, and using only
equational reasoning, we have derived an executable definition of the
required function.
The technique used is a familiar one in many areas of mathematics:
We guessed at a form for the solution.
We then calculated the unknowns.
Note: The definition of lines
and
unlines
in the standard Prelude treat newlines as line terminators
instead of line separators. Their definitions follow.
lines :: String -> [String]
lines "" = []
lines s = l : (if null s' then [] else lines (tail s'))
where (l, s') = break ('\n'==) s
unlines :: [String] -> String
unlines = concat . map (\l -> l ++ "\n")
Let’s continue the text processing example from the previous subsection a bit further. We want to synthesize a function to break a text into a sequence of words.
For the purposes here, we define a word as any nonempty sequence of characters not containing a space or newline character. That is, a group of one or more spaces and newlines separate words. We introduce a type synonym for words.
type Word = String
We want a function words'
that breaks a line up into a sequence of words. Function words'
thus has the following type
signature:
words' :: Line -> [Word]
For example, expression
"Hi there" words'
yields:
"Hi", "there"] [
As in the synthesis of lines'
, we proceed by defining the
“inverse” function first, then we calculate the definition for words'
.
All unwords'
needs to do
is to insert a space character between adjacent elements of the sequence
of words and return the concatenated result. Following the development
in the previous subsection, we can thus define unwords'
as follows.
unwords' :: [Word] -> Line
= foldr1 (insert' ' ') xs unwords' xs
Using calculations similar to those for lines'
, we derive the inverse of
unwords'
to be the following function:
foldr (breakOn' ') [[]]
However, this identifies zero-length words where there are adjacent spaces. We need to filter those out.
words' :: Line -> [Word]
= filter (/= []) . foldr (breakOn' ') [[]] words'
Note that
= xss words' (unwords' xss)
for all xss
of type [Word]
, but
that
= xs unwords' (words' xs)
for some xs
of type Line
. The
latter is undefined when words' {.haskell} xs
returns []
. Where it is defined, adjacent
spaces in xs
are replaced by a
single space in unwords' (words' xs)
.
Note: The functions words
and
unwords
in the standard Prelude differ in that unwords [] = []
,
which is more complete.
Let’s continue the text processing example one step further and synthesize a function to break a sequence of lines into paragraphs.
For the purposes here, we define a paragraph as any nonempty sequence of nonempty lines. That is, a group of one or more empty lines separate paragraphs. As above, we introduce an appropriate type synonym:
type Para = [Line]
We want a function paras'
that breaks a sequence of lines into a sequence of paragraphs:
paras' :: [Line] -> [Para]
For example, expression
"Line 1.1","Line 1.2","","Line 2.1"] paras' [
yields:
"Line 1.1","Line 1.2"],["Line 2.1"]] [[
As in the synthesis of lines'
and words'
, we can start with the
inverse and calculate the definition of paras'
. The inverse function unparas'
takes a sequence of
paragraphs and returns the corresponding sequence of lines with an empty
line inserted between adjacent paragraphs.
unparas' :: [Para] -> [Line]
= foldr1 (insert' []) unparas'
Using calculations similar to those for lines'
and words'
, we can derive the
following definitions:
paras' :: [Line] -> [Para]
= filter (/= []) . foldr (breakOn []) [[]] paras'
The filter (/= [])
operation removes all “empty paragraphs” corresponding to two or more
adjacent empty lines.
Note: There are no equivalents of paras'
and `unparas'
in the standard prelude.
As with unwords
, unparas'
should be redefined so
that unparas' [] = []
,
which is more complete.
Using the six functions in our text processing package, we can build other useful functions.
Count the lines in a text.
countLines :: Text -> Int
= length . lines' countLines
Count the words in a text.
countWords :: Text -> Int
= length . concat . (map words') . lines' countWords
An alternative using a list comprehension is:
=
countWords xs length [ w | l <- lines' xs, w <- words' l]
Count the paragraphs in a text.
countParas :: Text -> Int
= length . paras' . lines' countParas
Normalize text by removing redundant empty lines and spaces.
The following functions take advantage of the fact that paras'
and words'
discard empty paragraphs
and words, respectively.
normalize :: Text -> Text
= unparse . parse
normalize
parse :: Text -> [[[Word]]]
= (map (map words')) . paras' . lines'
parse
unparse :: [[[Word]]] -> Text
= unlines' . unparas' . map (map unwords') unparse
We can also state parse
and
unparse
in terms of list
comprehensions.
=
parse xs | l <- p] | p <- paras' (lines' xs) ]
[ [words' l
=
unparse xssss | l<-p] | p<-xssss]) unlines' (unparas' [ [unwords' l
Section 4.3.5 of the Bird and Wadler textbook [3] goes on to build functions to fill and left-justify lines of text.
Chapter 26 illustrates how to synthesize (i.e., derive or calculate) function definitions from their specifications. This chapter (27) applies these program synthesis techniques to larger set of examples on text processing.
No subsequent chapter depends explicitly upon the program synthesis content from these chapters. However, if practiced regularly, the techniques explored in this chapter can enhance a programmer’s ability to solve problems and construct correct functional programming solutions.
TODO
In Summer 2018, I adapted and revised this chapter and the next from Chapter 12 of my Notes on Functional Programming with Haskell [9].
These previous notes drew on the presentations in the first edition of the classic Bird and Wadler textbook [3] and other functional programming sources [1,2,15,17,18]. They were also influenced by my research, study, and teaching related to program specification, verification, derivation, and semantics [[4]; [5]; [6]; [7]; [8]; [10]; [11]; [12]; [13]; [14]; [16]; vanGesteren1990].
I incorporated this work as new Chapter 26, Program Synthesis, and new Chapter 27, Text Processing (this chapter), in the 2018 version of the textbook Exploring Languages with Interpreters and Functional Programming and continue to revise it.
I retired from the full-time faculty in May 2019. As one of my post-retirement projects, I am continuing work on this textbook. In January 2022, I began refining the existing content, integrating additional separately developed materials, reformatting the document (e.g., using CSS), constructing a bibliography (e.g., using citeproc), and improving the build workflow and use of Pandoc.
I maintain this chapter as text in Pandoc’s dialect of Markdown using embedded LaTeX markup for the mathematical formulas and then translate the document to HTML, PDF, and other forms as needed.
Program synthesis, synthesizing a function from its inverse, text processing, line, word, paragraph, terminator, separator.