CSci 450, Org. of Programming Languages

Chapter 11
Expression Language Parsing

H. Conrad Cunningham

29 November 2017

Copyright (C) 2017, H. Conrad Cunningham

Advisory: The HTML version of this document may require use of a browser that supports the display of MathML. A good choice as of November 2017 is a recent version of Firefox from Mozilla.

TODO:

11.1 Chapter Introduction

TODO: Update for breakout of Parsing chapter

This case study examines how we can represent and process simple arithmetic expressions using Haskell. We call the language used in this case study the Expression Language.

The case study examines two different concrete syntaxes for expressions written as text and an abstract syntax represented as an algebraic data type. It shows a hand-coded lexical analyzer and two hand-coded recursive descent parsers for these syntaxes.

TODO: Add outcomes

11.2 Parsing

A programming language processor uses a parser to determine whether a program satisfies the grammar for the language’s concrete syntax. The parser typically constructs some kind of internal representation of the program to enable further processing.

A common approach to parsing is to divide it into at least two phases:

- lexical analysis, which breaks the stream of characters into the smallest meaningful syntactic units (tokens)
- syntactic analysis, which assembles the tokens according to the context-free grammar and typically builds an internal representation of the program

If the language has aspects that cannot be described with a context-free grammar, then additional phases may be needed to handle issues such as checking the types of variables and expressions and ensuring that variables are declared before they are used.

Of course, regular grammars are context-free grammars, so a separate lexical analyzer is not required. But use of a separate lexical analyzer often leads to a simpler parser and better performance.

However, some approaches to parsing, such as the use of parser combinators, can conveniently handle lexical issues as a part of the parser.

In this chapter, we use the two-stage approach to parse the Expression Language. We define a lexical analyzer and parsers constructed using a technique called recursive descent parsing. The parsers construct abstract syntax trees using the algebraic data type defined in the Abstract Syntax section.

TODO: Discuss parser combinator if that subsection added.

11.3 Lexical Analysis

In computing science, lexical analysis is typically the process of reading a sequence of characters from a language text and assembling the characters into a sequence of lexemes, the smallest meaningful syntactic units. In a natural language like English, the lexemes are typically the words of the language.

The output of lexical analysis is a sequence of lexical tokens (usually just called tokens). A token associates a syntactic category with a lexeme. In a natural language, the syntactic category may be the word’s part of speech (noun, verb, etc.).

We call the program that carries out the lexical analysis a lexical analyzer, lexer, tokenizer, or scanner. (However, the latter term actually refers to one phase of the overall process.)

In a programming language, the syntactic categories of tokens consist of entities such as identifiers, integer literals, and operators.

The “whitespace” characters such as blanks, tabs, and newlines are usually not tokens themselves. Instead, they are delimiters which define the boundaries of the other lexemes. However, in some programming languages, the end of a line or the indentation at the beginning of a line have implicit structural meaning in the language.

Consider the Expression language infix syntax. The character sequence

    30   +   ( x1 * 2)

includes seven tokens:

- the integer literal 30
- the operator +
- the left parenthesis (
- the identifier x1
- the operator *
- the integer literal 2
- the right parenthesis )

Tokenization has two stages: a scanner and an evaluator.

A scanner processes the character sequence and breaks it into lexeme strings. It usually recognizes a language corresponding to a regular grammar, one of the simplest classes of grammars, and is, hence, based on a finite state machine. However, in some cases, a scanner may require more complex grammars and processors.

A token evaluator determines the syntactic category of the lexeme string and tags the token with this syntactic information.

Sometimes a lexical analyzer program combines the two stages into the same algorithm.

11.3.1 Prefix syntax

Now let’s consider a lexical analyzer for the prefix syntax for the Expression language.

File LexExpr.hs gives an example Haskell module that implements a lexical analyzer for this concrete syntax.

The Expression Language’s prefix syntax includes the following syntactic categories: identifiers, keywords, integer literals, operators, left parenthesis, and right parenthesis.

The left and right parenthesis characters are the only lexemes in those two syntactic categories, respectively.

An identifier is the name for a variable or other entity. We define an identifier to begin with an alphabetic or underscore character and include all contiguous alphabetic, numeric, or underscore characters that follow. It is delimited by a whitespace or another character not allowed in an identifier.

As a sequence of characters, a keyword is just an identifier in this language, so the scanner does not distinguish between the two categories. The lexical analyzer subsequently separates out keywords by checking each identifier against the list of keywords.

An integer literal begins with a numeric character and includes all contiguous numeric characters that follow. It is delimited by a whitespace or nonnumeric character.

We plan to extend this language with additional operators. To enable flexible use of the scanner, we design it to collect all contiguous characters from a list of supported operator characters. Of course, we exclude alphabetic, numeric, underscore, parentheses, and similar characters from the list for the prefix Expression language.

The lexer subsequently compares each scanned operator against a list of valid operators to remove invalid operators.

The language uses keywords in similar ways to operators, so the lexer also subsequently tags keywords as operators. The current lexical analyzer does not use the TokKey token category.

The LexExpr module defines a Token algebraic data type, defined below, to represent the lexical tokens. The constructors identify the various syntactic categories.

import Values ( NumType, Name, toNumType )
             -- e.g., NumType = Int, Name = String

data Token = TokLeft           -- left parenthesis
           | TokRight          -- right parenthesis
           | TokNum NumType    -- unsigned integer literal
           | TokId Name        -- names of variables, etc.
           | TokOp Name        -- names of primitive functions
           | TokKey Name       -- keywords (no use currently)
           | TokOther String   -- other characters
             deriving (Show, Eq)

The function lexx, shown below, incorporates the scanner and most of the lexeme evaluator functionality. It takes a string and returns a list of tokens.

import Data.Char ( isSpace, isDigit, isAlpha, isAlphaNum )

lexx :: String -> [Token]
lexx [] = []
lexx xs@(x:xs')
    | isSpace x   = lexx xs'
    | x == ';'    = lexx (dropWhile (/='\n') xs')  -- ';' skips rest of line
    | x == '('    = TokLeft  : lexx xs'
    | x == ')'    = TokRight : lexx xs'
    | isDigit x   = let (num,rest) = span isDigit xs
                    in  TokNum (toNumType num) : lexx rest
    | isFirstId x = let (id,rest) = span isRestId xs
                    in  TokId id : lexx rest
    | isOpChar x  = let (op,rest) = span isOpChar xs
                    in  TokOp op : lexx rest
    | otherwise   = TokOther [x] : lexx xs'
    where
        isFirstId c = isAlpha c || c == '_'
        isRestId  c = isAlphaNum c || c == '_'
        isOpChar  c = elem c opchars
        opchars     = "+-*/~<=>!&|@#$%^?:"  -- not " ' ` ( ) [ ] { } , . ;

Function lexx pattern matches on the first character of the string and then collects any additional characters of the token using the higher-order Prelude function span. Function span breaks the string into two parts: the prefix consisting of all contiguous characters that satisfy its predicate and the suffix beginning with the first character that does not.
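For example, span isDigit breaks a string at the first nondigit (a small sketch):

span isDigit "2017rest"   -- yields ("2017","rest")
span isDigit "rest"       -- yields ("","rest")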

Boolean function isOpChar returns True for characters potentially allowed in operator symbols. These are defined in the string opchars, which makes this aspect of the scanner relatively easy to modify.

Function lexer, shown below, calls lexx and then carries out the following transformations on the list of tokens:

- It retags any identifier that appears in the list of keywords as an operator (TokOp) token.
- It retags any scanned operator that is not in the list of valid primitive operators as a TokOther token.

The lexer does not generate error messages. Instead it tags characters that do not fit in any lexeme as a TokOther token. The parser can use these as needed (e.g., to generate error messages).

lexer :: String -> [Token]
lexer xs = markSpecials (lexx xs)

markSpecials :: [Token] -> [Token]
markSpecials ts = map xformTok ts

xformTok :: Token -> Token
xformTok t@(TokId id)
    | elem id keywords = TokOp id
    | otherwise        = t
xformTok t@(TokOp op)
    | elem op primitives = t
    | otherwise          = TokOther op
xformTok t = t

keywords   = []  -- none defined currently
primitives = ["+","-","*","/"]

In the above code, the function xformTok transforms any identifier that is a defined keyword into an operator token, leaves other identifiers and defined primitive operators alone, and marks everything else with the token type TokOther.
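For example, applying lexer to the character sequence from the beginning of this section yields the seven tokens listed there (a sketch, assuming NumType = Int):

lexer "30   +   ( x1 * 2)"
-- yields [TokNum 30, TokOp "+", TokLeft, TokId "x1",
--         TokOp "*", TokNum 2, TokRight]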

11.3.2 Infix syntax

The lexer for the prefix syntax given in the previous subsection can also be used for the simple infix syntax. However, future extensions of the language may require differences in the lexers.

11.4 Recursive Descent Parsing

Recursive descent parsing is an approach to parsing languages that have relatively simple grammars.

A recursive descent parser is a top-down parser, a type of parser that begins with the start symbol of the grammar and works down the levels of the parse tree toward the program (i.e., the sentence).

By contrast, a bottom-up parser first recognizes the low-level syntactic units of the grammar and builds the parse tree upward from these leaves toward the root (i.e., start symbol). Bottom-up parsers support a wider range of grammars and tend to be more efficient for production compilers. However, their development tends to be less intuitive and more complex. We leave discussion of these parsers to courses on compiler construction.

A recursive descent parser consists of a set of mutually recursive functions. It typically includes one hand-coded function for each nonterminal of the grammar and one clause for each production for that nonterminal.

The recursive descent approach works well when the grammar can be transformed into an LL(k) (especially LL(1)) grammar. Discussion of these techniques is left to courses on compiler construction.

For an LL(1) grammar, we can write recursive descent parsers that can avoid backtracking to an earlier point in the parse to start down another path.

For example, consider a simple grammar with the rules:

    S ::= A | B 
    A ::= C D 
    B ::= { E }  -- zero or more occurrence of E 
    C ::= [ F ]  -- zero or one occurrence of F 
    D ::= '1' | '@' S 
    E ::= '3'
    F ::= '2'

Consider the nonterminal S, which has alternatives A and B.

From the grammar, an A must begin with the terminal 2 (from the optional F within C) or, when the C is empty, with 1 or @ (the first symbols of D). A B is a sequence of zero or more Es, so it either begins with 3 or is empty. These sets of first symbols are disjoint, so the parser can distinguish among the alternatives based on the first terminal symbol. (Hence, the grammar is backtrack free.)

11.4.1 Constructing recursive descent parsers

A simple recognizer for the grammar above could include functions similar to those shown below. We consider the five different situations for nonterminals S, A, B, C, and E.

In the Haskell code, a parsing function takes a String with the text of the expression to be processed and returns a tuple (Bool,String) where the first component indicates whether or not the parser succeeded (i.e., the output of the parse) and the second component gives the new state of the input.

If the first component is True, then the second component holds the input remaining after the parse. If the first component is False, then the second component is the remaining part of the input to be processed after the parser failed.

Of course, instead of strings, the parser could work on lists of tokens or other symbols.

  1. Alternatives: S ::= A | B

    parseS :: String -> (Bool,String)        -- A | B
    parseS xs =
        case parseA xs of                    -- try A
            (True,  ys) -> (True, ys)        -- A succeeds
            (False, _ ) ->
                case parseB xs of                -- else try B
                    (True,  ys) -> (True, ys)    -- B succeeds
                    (False, _ ) -> (False, xs)   -- both A & B fail

    Function parseS succeeds whenever any alternative succeeds. Otherwise, it continues to check subsequent alternatives. It fails if the final alternative fails.

    If there are more than two alternatives, we can nest each additional alternative more deeply within the conditional structure. (That is, we replace the parseB failure case value with a case expression for the third option. Etc.)

  2. Sequencing: A ::= C D

    parseA :: String -> (Bool,String)        -- C D
    parseA xs =
        case parseC xs of                    -- try C
            (True,  ys) ->
                case parseD ys of                -- then try D
                    (True,  zs) -> (True, zs)    -- C D succeeds
                    (False, _ ) -> (False, xs)   -- D fails
            (False, _ ) -> (False, xs)       -- C fails

    Function parseA fails whenever any component fails. Otherwise, it continues to check subsequent components. It succeeds when the final component succeeds.

    If there are more than two components in sequence, we nest each additional component more deeply within the conditional structure. (That is, we replace parseD xs with case parseD xs of ....)

  3. Repetition zero or more times: B ::= { E }

    parseB :: String -> (Bool,String)    -- { E }
    parseB xs =
        case parseE xs of                -- try E
            (True,  ys) -> parseB ys     -- one E, try again
            (False, _ ) -> (True, xs)    -- stop, succeeds

    Function parseB always succeeds if parseE terminates. However, it may succeed for zero occurrences of E or for some positive
    number of occurrences.

  4. Optional elements: C ::= [ F ]

    parseC :: String -> (Bool,String)    -- [ F ]
    parseC xs =
        case parseF xs of                -- try F
            (True,  ys) -> (True, ys)
            (False, _ ) -> (True, xs)

    Function parseC always succeeds if parseF terminates. However, it may succeed for at most one occurrence of F.

  5. Base cases to parse low-level syntactic elements: E ::= '3'

    parseE :: String -> (Bool,String)    -- '3'
    parseE (x:xs') = (x == '3', xs')
    parseE xs      = (False, xs)

On success in any of these cases, the new input state is the string remaining after the successful alternative.

On failure, the input state should be left unchanged by any of the functions.

To use the above templates, it may sometimes be necessary to refactor the rules that involve more than one of the above cases. For example, consider the rule

    D ::= '1' | '@' S 

which consists of two alternatives, the second of which is itself a sequence. To see how to apply the templates straightforwardly, we can refactor D to be the two rules:

    D  ::= '1' | DS
    DS ::= '@' S 
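Following the templates, we can sketch parsing functions for the refactored rules as shown below. (This sketch is consistent with the templates above; the code in file ParserS03.hs may differ in detail.)

parseD :: String -> (Bool,String)     -- '1' | DS
parseD ('1':xs') = (True, xs')        -- first alternative: '1'
parseD xs =
    case parseDS xs of                -- else try DS
        (True,  ys) -> (True, ys)
        (False, _ ) -> (False, xs)

parseDS :: String -> (Bool,String)    -- '@' S
parseDS xs@('@':xs') =
    case parseS xs' of                -- '@' then S
        (True,  ys) -> (True, ys)
        (False, _ ) -> (False, xs)    -- input unchanged on failure
parseDS xs = (False, xs)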

In addition to the above parsers for the various rules, we might have a function parse that calls the top-level parser (parseS) and ensures that all the input is parsed.

parse :: String -> Bool
parse xs =
    case parseS xs of
        (True, []) -> True
        (_,    _ ) -> False

See file ParserS03.hs for experimental Haskell code for this example recursive descent parser.

To have a useful parser, the above prototype functions likely need to be modified to build the intermediate representation and to return appropriate error messages for unsuccessful parses.

The above prototype functions use Haskell, but a similar technique can be used with any language that supports recursive function calls.

11.4.2 Prefix syntax

This subsection describes an example recursive descent parser for the Expression Language’s prefix syntax. The complete code for the ParsePrefixExpr module is given in the file ParsePrefixExpr.hs.

As given in a previous subsection, the prefix parser embodies the following grammar:

    <expression> ::=  <var> | <val> | <operexpr>
    <var>        ::=  <id>
    <val>        ::=  [ '-' ] <unsigned> 
    <operexpr>   ::=  '(' <operator> <operandseq> ')'
    <operandseq> ::=  { <expression> }
    <operator>   ::=  '+'  |  '*'  |  '-'  |  '/' | ...

The ParsePrefixExpr module imports and uses the LexExpr module for lexical analysis. In particular, it uses the algebraic data type Token, the types NumType and Name, and the function lexer.

import Values ( NumType, Name, toNumType )

data Token = TokLeft           -- left parenthesis
           | TokRight          -- right parenthesis
           | TokNum NumType    -- unsigned integer literal
           | TokId Name        -- names of variables, etc.
           | TokOp Name        -- names of primitive functions
           | TokKey Name       -- keywords
           | TokOther String   -- other characters
             deriving (Show, Eq)

lexer :: String -> [Token]

For the prefix grammar above, the nonterminals <id> and <unsigned> and the terminals are parsed into their corresponding tokens by the lexical analyzer.

TODO: Update this code and reference. The incomplete module TestPrefixExpr (in file TestPrefixExpr.hs) provides some testing of the prefix parser.

The output of the parser is an abstract syntax tree constructed with the algebraic data type Expr defined in the previous chapter. This is in the Abstract Syntax Tree module.

import Values ( ValType, Name )

data Expr = Add Expr Expr
          | Sub Expr Expr
          | Mul Expr Expr
          | Div Expr Expr
          | Var Name
          | Val ValType

11.4.2.1 Parse <expression>

Now let’s build a recursive descent parser using the method described in the previous subsection. We begin with the start symbol <expression>.

The parsing function parseExpression, shown below, implements the following BNF rule:

    <expression> ::= <var> | <val> | <operexpr>

It uses the recursive descent template #1 with three alternatives.

type ParErr = String

parseExpression :: [Token] -> (Either ParErr Expr, [Token])
parseExpression xs =
    case parseVar xs of
        r@(Right _, _) -> r                  -- <var>
        _ ->
            case parseVal xs of
                r@(Right _, _) -> r          -- <val>
                _ ->
                    case parseOperExpr xs of
                        r@(Right _, _) -> r  -- <operexpr>
                        (Left m, ts)   -> (missingExpr m ts, ts)

missingExpr m ts =
    Left ("Missing expression at " ++ showTokens (pref ts)
          ++ "..\n..Nested error { " ++ m ++ " }")

Function parseExpression takes a Token list and attempts to parse an <expression>. If the parse succeeds, the function returns a pair consisting of the corresponding Expr abstract syntax tree (wrapped in the Right value of an Either) and the list of input Tokens remaining after the Expr. If the parse fails, then the function returns a pair consisting of an error message (wrapped in the Left value of the Either) and the unchanged list of input Tokens.

We define an auxiliary function missingExpr to generate an appropriate error message.

The function parse, shown below, is the primary entry point for the ParsePrefixExpr module. It first calls the lexical analysis function lexer (from the module LexExpr) on the input list of characters and then calls the parsing function parseExpression with the corresponding list of tokens.

If a parsing error occurs or if there are leftover tokens, then the function returns an appropriate error message.

parse :: String -> Either ParErr Expr
parse xs =
    case lexer xs of
        [] -> incompleteExpr xs
        ts ->
            case parseExpression ts of
                (ex@(Right _), []) -> ex
                (ex@(Left  _), _ ) -> ex
                (ex,           ss) -> extraAtEnd ex ss

incompleteExpr xs =
    Left ("Incomplete expression: " ++ xs)

extraAtEnd ex xs =
    Left ("Nonspace token(s) \"" ++ showTokens xs ++
          "\" at end of the expression \"" ++ show ex ++ "\"")

11.4.2.2 Parse <var>

Function parseVar implements the BNF rule:

    <var> ::= <id>

Variable <id> denotes an identifier token recognized by the lexer. So we implement function parseVar as a base case of the recursive descent parser (i.e., template #5).

parseVar :: [Token] -> (Either ParErr Expr, [Token])
parseVar ((TokId id):ts) = (Right (Var id), ts)
parseVar ts              = (missingVar ts, ts)

missingVar ts =
    Left ("Missing variable at " ++ showTokens (pref ts))

Function parseVar has the same type signature as parseExpression. It attempts to match an identifier token at the front of the token sequence. If it finds an identifier, it transforms the token to a Var expression and returns it with the remaining token list. Otherwise, it returns an error message and the unchanged token list.

11.4.2.3 Parse <val>

Function parseVal implements the BNF rule:

    <val> ::= [ '-' ] <unsigned>

To implement this rule, we can refactor it into two rules that correspond to the recursive descent template functions:

    <val>      ::= <optminus> <unsigned>
    <optminus> ::= [ '-' ]

Then <val> can be implemented using the sequencing (#2) prototype, <optminus> using the optional element (#4) prototype, and <unsigned> and - using base case (#5) prototypes.

However, <unsigned> denotes a numeric token and - denotes a single operator token. Thus we can easily implement parseVal as a base case of the recursive descent parser.

parseVal :: [Token] -> (Either ParErr Expr, [Token])
parseVal ((TokNum i):ts)             = (Right (Val i), ts)
parseVal ((TokOp "-"):(TokNum i):ts) = (Right (Val (-i)), ts)
parseVal ts                          = (missingVal ts, ts)

missingVal ts =
    Left ("Missing value at " ++ showTokens (pref ts))

Function parseVal has the same type signature as parseExpression. It attempts to match a numeric token, which is optionally preceded by a negative sign, at the front of the token sequence. If it finds this, it transforms the tokens to a Val expression and returns the expression and the remaining token list. Otherwise, it returns an error message and the unchanged token list.
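For example (a sketch, assuming NumType = Int):

parseVal [TokNum 42, TokRight]    -- yields (Right (Val 42), [TokRight])
parseVal [TokOp "-", TokNum 42]   -- yields (Right (Val (-42)), [])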

11.4.2.4 Parse <operexpr>

Function parseOperExpr implements the following BNF rule:

    <operexpr>   ::=  "(" <operator> <operandseq> ")"

It uses a modified version of recursive descent template #2 for sequences of terms.

parseOperExpr :: [Token] -> (Either ParErr Expr, [Token])
parseOperExpr xs@(TokLeft:(TokOp op):ys) =       -- ( <operator>
    case parseOperandSeq ys of                   -- <operandseq>
        (args, zs) ->
            case zs of                           -- )
                (TokRight:zs') -> (makeExpr op args, zs')
                _              -> (missingRParen zs, xs)
-- ill-formed <operexpr>s
parseOperExpr (TokLeft:ts)      = (missingOp ts, ts)
parseOperExpr (TokRight:ts)     = (invalidOpExpr ")", ts)
parseOperExpr ((TokOther s):ts) = (invalidOpExpr s, ts)
parseOperExpr ((TokOp op):ts)   = (invalidOpExpr op, ts)
parseOperExpr ((TokId s):ts)    = (invalidOpExpr s, ts)
parseOperExpr ((TokNum i):ts)   = (invalidOpExpr (show i), ts)
parseOperExpr []                = (incompleteExpr, [])

missingRParen ts =
    Left ("Missing `)` at " ++ show (take 3 ts))
missingOp ts =
    Left ("Missing operator at " ++ show (take 3 ts))
invalidOpExpr s =
    Left ("Invalid operation expression beginning with " ++ s)
incompleteExpr = Left "Incomplete expression"

Function parseOperExpr has the same type signature as parseExpression. It directly matches against the first two tokens to see whether they are a left parenthesis and an operator, respectively, rather than calling separate functions to parse each. If successful, it then parses zero or more operands and examines the last token to see whether it is a right parenthesis.

If the operator expression is ill-formed, the function returns an appropriate error message.

The function parseOperExpr delegates the construction of the corresponding Expr (i.e., abstract syntax tree) to function makeExpr, which we discuss later in the subsection.

The values yielded by the components of <operexpr> must be handled differently from those of the expressions we examined previously: the operator name and the operand list are not themselves Expr values, so makeExpr must combine them into one.
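For example, parsing a well-formed operator expression yields the Expr built by makeExpr (a sketch, assuming NumType = Int):

parseOperExpr (lexer "(+ x 1)")
-- yields (Right (Add (Var "x") (Val 1)), [])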

11.4.2.5 Parse <operandseq>

Function parseOperandSeq implements the BNF rule:

    <operandseq> ::=  { <expression> }

It uses the recursive descent template #3 for repeated symbols.

parseOperandSeq :: [Token] -> ([Expr],[Token])
parseOperandSeq xs =
    case parseExpression xs of
        (Left _,   _ ) -> ([], xs)
        (Right ex, ys) ->
            let (exs,zs) = parseOperandSeq ys
            in  (ex:exs, zs)

The function parseOperandSeq takes a token list and collects a list of zero or more operand Exprs. An empty list means that no operands were found.

11.4.2.6 AST construction (makeExpr)

Operators in the current abstract syntax take a fixed number of operands. Add and Mul each take two operands, but a negation operator would take one operand and a conditional “if” operation would take three.

However, the current concrete prefix syntax does not distinguish among the different operators and the number of operands they require. It allows any operator in an <operexpr> to have any finite number of operands.

We could, of course, define a grammar that distinguishes among the operators, but we choose to keep the grammar flexible, thus enabling easy extension. We handle the operator-operand matching in the makeExpr function using data structures to define the mapping.

Thus, function makeExpr takes the operator string and a list of operand Exprs and constructs an appropriate Expr. It uses function arity to determine the number of operands required for the operator and then calls the appropriate opConsN function to construct the Expr.

makeExpr :: String -> [Expr] -> Either ParErr Expr
makeExpr op exs =
    case arity op of
        0 -> opCons0 op exs   -- not implemented
        1 -> opCons1 op exs
        2 -> opCons2 op exs
        3 -> opCons3 op exs
        4 -> opCons4 op exs   -- not implemented
        5 -> opCons5 op exs   -- not implemented
        _ -> opConsX op exs   -- not implemented

Function arity takes an operator symbol and returns the number of operands that operator requires. It uses the arityMap association list to map the operator symbols to the number of arguments expected.

import Data.Maybe ( fromMaybe )

arityMap = [ ("+",2), ("-",2), ("*",2), ("/",2) ]
    -- add (operator,arity) pairs as needed

arity :: String -> Int
arity op = fromMaybe (-1) (lookup op arityMap)
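For example (a sketch):

arity "*"   -- yields 2
arity "%"   -- yields -1 ("%" is not in arityMap)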

Function opCons2 takes a binary operator string and an operand list with two elements and returns the corresponding Expr structure wrapped in a Right. An error is denoted by passing back an error message wrapped in a Left.

assocOpCons2 = [ ("+",Add), ("-",Sub), ("*",Mul), ("/",Div) ]
    -- add new pairs as needed

opCons2 :: String -> [Expr] -> Either ParErr Expr
opCons2 op exs =
    case length exs of
        2 -> case lookup op assocOpCons2 of
                 Just c  -> Right (c (exs!!0) (exs!!1))
                 Nothing -> invalidOp op
        n -> arityErr op n

invalidOp op =
    Left ("Invalid operator '" ++ op ++ "'")
arityErr op n =
    Left ("Operator '" ++ op ++ "' incorrectly called with "
          ++ show n ++ " operand(s)")

Currently, the only supported operators are the binary operators +, -, *, and /. These map to the binary Expr constructors Add, Sub, Mul, and Div. (These are two-argument functions.)

If we extend the supported operators, then we must extend the definitions of arityMap and assocOpCons2 and add new definitions for opConsN and assocOpConsN for other arities N. (We may also need to modify the LexExpr module and the definition of Expr.)
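For example, a unary negation operator might be added along the following lines. This is a hypothetical sketch: it assumes a new constructor Neg Expr added to the Expr data type, a pair ("neg",1) added to arityMap, and the name neg made available by the lexer. (The module's existing opCons1 likely follows the same pattern as opCons2.)

assocOpCons1 = [ ("neg", Neg) ]   -- hypothetical: assumes Neg Expr in Expr

opCons1 :: String -> [Expr] -> Either ParErr Expr
opCons1 op exs =
    case length exs of
        1 -> case lookup op assocOpCons1 of
                 Just c  -> Right (c (exs!!0))
                 Nothing -> invalidOp op
        n -> arityErr op n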

For now, we respond to unknown operators using function opConsX and return an appropriate error message. (In the future, this function may be redefined to support operators with variable numbers of operands.)

opConsX :: String -> [Expr] -> Either ParErr Expr
opConsX op exs = unsupportedOp op

unsupportedOp op = Left ("Unsupported operator '" ++ op ++ "'")

11.4.3 Infix syntax (UNFINISHED)

TODO: Update the parser to reflect the grammar change and recursive descent explanation.

TODO: Describe the recursive descent infix parser in module ParseInfixExpr.hs. An incomplete module that does some testing is TestInfixExpr.hs.

11.5 Source Code and Module Dependencies

As we saw in the previous chapter, an Expression Language interpreter consists of 7 modules with the module dependencies shown in Figure 1.

Figure 1: Expression Language module dependencies

The Haskell source code for the Expression Language modules discussed in this chapter is linked below:

- LexExpr.hs
- ParsePrefixExpr.hs and TestPrefixExpr.hs
- ParseInfixExpr.hs and TestInfixExpr.hs
- ParserS03.hs
- ParserComb.hs and TestParserComb.hs

The Haskell source code for the other Expression Language modules was discussed in the previous chapter.

The main entry points for use of the interpreters are the REPLs.

11.6 Parsing Combinators

In a previous section, we examined a set of prototype parsing functions and then used them as patterns for hand-coding recursive descent parsing functions. We can benefit by generalizing these functions and collecting them into a library.

11.6.1 State actions and combinators

Consider parseS, one of the prototype parsing functions from a previous section. It parses the grammar rule S ::= A | B, which has two alternatives.

parseS :: String -> (Bool,String)
parseS xs =
    case parseA xs of                    -- try A
        (True,  ys) -> (True, ys)        -- A succeeds
        (False, _ ) ->
            case parseB xs of                -- else try B
                (True,  ys) -> (True, ys)    -- B succeeds
                (False, _ ) -> (False, xs)   -- both A & B fail

Note that parseS and the other prototype parsing functions have the type:

String -> (Bool,String)

The occurrence of type String in the argument of the function represents the state of the input before evaluation of the function; the second occurrence of String represents the state after evaluation. The type Bool represents the result of the evaluation.

In an imperative program, the state is often left implicit and only the result type is returned. However, in a purely functional program, we must make the state change explicit as well.

Functions that have a type similar to parseS are called state actions or state transitions. We can generalize this parsing state transition as a function type:

type Parser a b = a -> (b,a)

In the case of parseS, we specialize this to:

Parser String Bool

In the richer case studies for the prefix and infix parsers, we specialize this type as:

Parser [Token] (Either ParErr Expr)

Given the Parser type, we can define a set of combinators that allow us to combine simpler parsers to construct more complex parsers. These combinators can pass along the state implicitly, avoiding some tedious and repetitive work.

We can define a combinator parseAlt that generalizes the parseS prototype function above. It implements a recognizer, so we fix type b to Bool, but leave type argument a general.

parseAlt :: Parser a Bool -> Parser a Bool -> Parser a Bool
parseAlt p1 p2 =
    \xs ->
        case p1 xs of
            (True,  ys) -> (True, ys)
            (False, _ ) ->
                case p2 xs of
                    (True,  ys) -> (True, ys)
                    (False, _ ) -> (False, xs)

Note the use of the anonymous function in the body. Function parseAlt takes two Parser values and returns a Parser value. The returned function binds in the two component parsers; when it is applied to the parser input (the argument of the anonymous function), it applies the two component parsers as needed.

We can easily redefine parseS in terms of the parseAlt combinator and simpler parsers parseA and parseB.

parseS = parseAlt parseA parseB

Given parsing input inp, we can invoke the parser with the expression:

parseS inp

Note that this formulation enables us to handle the passing of state among the component parsers implicitly, much as we can in an imperative computation. But it still preserves the nature of purely functional computation.

11.6.2 Completing a combinator library

Now consider the parseA prototype, which implements a two-component sequencing rule A ::= C D.

parseA xs =
    case parseC xs of                    -- try C
        (True,  ys) ->
            case parseD ys of                -- then try D
                (True,  zs) -> (True, zs)    -- C D succeeds
                (False, _ ) -> (False, xs)   -- D fails
        (False, _ ) -> (False, xs)       -- C fails

As with parseS, we can generalize parseA as a combinator parseSeq.

parseSeq :: Parser a Bool -> Parser a Bool -> Parser a Bool
parseSeq p1 p2 =
    \xs ->
        case p1 xs of
            (True,  ys) ->
                case p2 ys of
                    t@(True, zs) -> t
                    (False, _ )  -> (False, xs)
            (False, _ ) -> (False, xs)

Thus we can redefine parseA in terms of the parseSeq combinator and simpler parsers parseC and parseD.

parseA = parseSeq parseC parseD

Similarly, we consider the parseB prototype, which implements a repetition rule B ::= { E }.

parseB xs =
    case parseE xs of              -- try E
        (True,  ys) -> parseB ys   -- one E, try again
        (False, _ ) -> (True, xs)  -- stop

As above, we generalize this as combinator parseStar.

parseStar :: Parser a Bool -> Parser a Bool
parseStar p1 =
    \xs ->
        case p1 xs of
            (True,  ys) -> parseStar p1 ys
            (False, _ ) -> (True, xs)

We can redefine parseB in terms of combinator parseStar and simpler parser parseE.

parseB = parseStar parseE

Finally, consider parsing prototype parseC, which implements an optional rule C ::= [ F ].

parseC xs =
    case parseF xs of              -- try F
        (True,  ys) -> (True, ys)
        (False, _ ) -> (True, xs)

We generalize this pattern as parseOpt, as follows.

parseOpt :: Parser a Bool -> Parser a Bool
parseOpt p1 =
    \xs ->
        case p1 xs of
            (True,  ys) -> (True, ys)
            (False, _ ) -> (True, xs)

We can thus redefine parseC in terms of simpler parser parseF and combinator parseOpt.

parseC = parseOpt parseF

In this simple example grammar, function parseD is a simple instance of a sequence, and parseE and parseF are simple parsers for symbols. These can be directly implemented as basic parsers, as before. However, the technique still works when these are more complex parsers built up from combinators, as the sketch below suggests.
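For example, given a hypothetical base parser parseSym for a single character, the entire example grammar can be assembled from the combinators (a sketch; the primed names avoid clashing with the hand-coded versions):

parseSym :: Char -> Parser String Bool   -- hypothetical base parser
parseSym c = \xs ->
    case xs of
        (y:ys) | y == c -> (True, ys)
        _               -> (False, xs)

parseS', parseA', parseB', parseC', parseD' :: Parser String Bool
parseS' = parseAlt parseA' parseB'                    -- S ::= A | B
parseA' = parseSeq parseC' parseD'                    -- A ::= C D
parseB' = parseStar (parseSym '3')                    -- B ::= { E }
parseC' = parseOpt (parseSym '2')                     -- C ::= [ F ]
parseD' = parseAlt (parseSym '1')                     -- D ::= '1' | '@' S
                   (parseSeq (parseSym '@') parseS')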

For convenience and completeness, we include extended alternative and sequencing combinators and parsers that always fail or always succeed.

parseAltList :: [Parser a Bool] -> Parser a Bool
parseSeqList :: [Parser a Bool] -> Parser a Bool
parseFail, parseSucceed :: Parser a Bool
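One plausible way to define these is sketched below (consistent with the recognizer Parser type above; the definitions in ParserComb.hs may differ):

parseFail    = \xs -> (False, xs)            -- always fails
parseSucceed = \xs -> (True,  xs)            -- always succeeds

parseAltList = foldr parseAlt parseFail      -- succeeds on the first success
parseSeqList = foldr parseSeq parseSucceed   -- requires all in sequence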

The combinators in this library are in the Haskell module ParserComb.hs. A module that does some testing is TestParserComb.hs.

TODO: Update and document the Parser Combinator library code.

11.6.3 Adding parse tree generations (UNFINISHED)

TODO: Expand this library to allow returns of “parse trees” and error messages.

11.7 Standard libraries for parsing (UNFINISHED)

There are a number of relatively standard parsing combinator libraries, such as the Parsec library. Readers who wish to develop other parsers may want to study that library.

11.8 Exercises

TODO

11.9 Acknowledgements

I initially developed this case study for the Haskell-based offering of CSci 556 (Multiparadigm Programming) in Spring 2017. I continued work on it during Summer and Fall 2017. I based this work, in part, on ideas from the sources listed in the References section below.

I maintain these notes as text in Pandoc’s dialect of Markdown using embedded LaTeX markup for the mathematical formulas and then translate the notes to HTML, PDF, and other forms as needed. The HTML version of this document may require use of a browser that supports the display of MathML.

11.10 References

TODO: Edit this

[Abelson-Sussman 1996]
Harold Abelson and Gerald Jay Sussman. Structure and Interpretation of Computer Programs (SICP), Second Edition, MIT Press, 1996.
[Appel 1998]
Andrew W. Appel. Modern Compiler Implementation in ML, Cambridge, 1998. (Especially section 3.2 “Predictive Parsing”)
[Chiusano-Bjarnason 2015]
Paul Chiusano and Runar Bjarnason. Functional Programming in Scala, Manning, 2015. (Especially chapters 6 “Purely Functional State” and 9 “Parser Combinators”)
[Fowler-Parsons 2011]
Martin Fowler and Rebecca Parsons. Domain-Specific Languages, Addison Wesley, 2011. (Especially chapter 21 “Recursive Descent Parser”)
[Kamin 1990]
Samuel N. Kamin. Programming Languages: An Interpreter-Based Approach, Addison-Wesley, 1990.
[Linz 2017]
Peter Linz. An Introduction to Formal Languages and Automata, Fifth Edition, Jones and Bartlett, 2017. (Especially sections 1.2, 3.3, and 5.1)
[Schinz-Haller 2017]
Michel Schinz and Philipp Haller. A Scala Tutorial for Java Programmers, accessed February 2016.
[Sestoft 2012]
Peter Sestoft. Programming Language Concepts, Springer, 2012. (Especially sections 1.3, 2.5, and 2.7 and chapter 8)
[Wikipedia 2017]
Wikipedia articles “Regular Grammar”, “Context-Free Grammar”, “Backus-Naur Form”, “Lexical Analysis”, “Parsing”, “LL Parser”, “Recursive Descent Parser”, and “Abstract Syntax”.

11.11 Terms and Concepts

TODO