Acknowledgements: I originally created these slides in Fall 2017 to accompany what is now Chapter 44, Calculator: Parsing, in the 2018 version of the textbook Exploring Languages with Interpreters and Functional Programming.

Browser Advisory: The HTML version of this document may require use of a browser that supports the display of MathML. A good choice as of November 2018 is a recent version of Firefox from Mozilla.

Lecture Goals

Introduce basic parsing and lexical analysis concepts
Explore lexical analysis using ELI Calculator language lexer
Introduce recursive descent parsing and templates
Explore recursive descent parsing using ELI Calculator language parser
Reinforce Haskell programming concepts for larger programs

ELI Calculator Modules

Parsing

Parser (syntactic analyzer)
- determines whether program satisfies grammar for concrete syntax
- constructs intermediate representation of program
Approach in ELI Calculator language interpreter
- lexical analyzer converts character stream into lexical token (lexeme) stream — regular language
- parser converts token stream into abstract syntax tree — context-free language
Hand-coded recursive descent parser

Lexical Analysis

Informally called lexer, scanner, or tokenizer

Phases (often combined)

Scanner — breaks character stream into lexemes (words)
Evaluator — determines syntactic category of lexeme (not Evaluator module)

Tokenization

Infix expression with whitespace
```
    30   +   ( x1 * 2)
```
1. integer literal 30
2. addition operator +
3. left parenthesis symbol (
4. identifier x1
5. multiplication operator *
6. integer literal 2
7. right parenthesis symbol )

Infix Grammar (Ch. 41)

-- Context-free grammar (CFG)
<expression> ::= <term> { <addop> <term> }
<term>       ::= <factor> { <mulop> <factor> }
<factor>     ::= <var> | <val>
               | '(' <expression> ')'
<val>        ::= [ '-' ] <unsigned>
<var>        ::= <id>
<addop>      ::=  '+'  |  '-' 
<mulop>      ::=  '*'  |  '/' 
-- Regular grammar (same as prefix grammar)
<id>         ::=  <firstid>  |  <firstid> <idseq>
<idseq>      ::=  <restid>   |  <restid> <idseq>
<firstid>    ::=  <alpha>| '_'
<restid>     ::=  <alpha>| '_'  | <digit>
<unsigned>   ::=  <digit>|  <digit> <unsigned> 
<digit>      ::=  any numeric character
<alpha>      ::=  any alphabetic character

Prefix Grammar (Ch. 41)

-- Context-free Grammar (CFG)
<expression> ::= <var> | <val> | <operexpr>
<var>        ::=  <id>
<val>        ::=  [ "-" ] <unsigned> 
<operexpr>   ::=  '(' <operator> <operandseq> ')'
<operandseq> ::=  { <expression> }
<operator>   ::=  '+'  |  '*'  |  '-'  |  '/' | ...
-- Reglar grammar (same as infix grammar)
<id>         ::=  <firstid>  |  <firstid> <idseq>
<idseq>      ::=  <restid>   |  <restid> <idseq>
<firstid>    ::=  <alpha>| '_'
<restid>     ::=  <alpha>| '_'  | <digit>
<unsigned>   ::=  <digit>|  <digit> <unsigned> 
<digit>      ::=  any numeric character
<alpha>      ::=  any alphabetic character

Lexical Analyzer

Same for infix and prefix syntax
Module LexCalc
Data type Token
Function lexx does scanning and most of lexeme evaluation
Function lexer separates any keywords from identifiers, validates operator symbols

Data Type `Token`

import Values ( NumType, Name, toNumType ) 
-- e.g., NumType = Int , Name = String 

data Token = TokLeft          -- left parenthesis 
           | TokRight         -- right parenthesis 
           | TokNum NumType   -- unsigned integer literal 
           | TokId Name       -- names of variables, etc. 
           | TokOp Name       -- names of primitive functions 
           | TokKey Name      -- keywords (no use currently)
           | TokOther String  -- other characters 
             deriving (Show, Eq)

Function `lexx`

import Data.Char ( isSpace, isDigit, isAlpha, isAlphaNum )

lexx :: String -> [Token]  -- scan, most of lexeme eval
lexx []  = []
lexx xs@(x:xs') 
    | isSpace x   = lexx xs' 
    | x == ';'    = lexx (dropWhile (/='\n') xs')
    | x == '('    = TokLeft  : lexx xs'
    | x == ')'    = TokRight : lexx xs'
    | isDigit x   = let (num,rest) = span isDigit xs
                    in (TokNum (convertNumType num)) : lexx rest
    | isFirstId x = let (id,rest) = span isRestId xs
                    in  (TokId id) : lexx rest
    | isOpChar x  = let (op,rest) = span isOpChar xs
                    in (TokOp op) : lexx rest
    | otherwise   = (TokOther [x]) : lexx xs'
    where 
        isFirstId c = isAlpha c    || c == '_' 
        isRestId  c = isAlphaNum c || c == '_' 
        isOpChar  c = elem c opchars
opchars = "+-*/~<=>!&|@#$%^?:"  -- not " ' ` ( ) [ ] { } , . ;

Function `lexer`

lexer :: String -> [Token] -- rest of lexeme eval
lexer xs = markSpecials (lexx xs)

markSpecials :: [Token] -> [Token]
markSpecials ts = map xformTok ts

xformTok :: Token -> Token
xformTok t@(TokId id) -- is keyword?
    | elem id keywords   = TokOp id
    | otherwise          = t
xformTok t@(TokOp op) -- is valid operator?
    | elem op primitives = t
    | otherwise          = TokOther op
xformTok t               = t

keywords   = []               -- none defined currently
primitives = ["+","-","*","/"]

Recursive Descent Parsing

Top-down parser—intuitive—begin with CFG start symbol, work down levels toward program
- contrast to bottom-up — works from program toward start
Mutually recursive functions, one for each nonterminal
Good technique for relatively simple CFGs — e.g. transformable to LL(k)

Example CFG

S ::= A | B  -- (1) alternatives
A ::= C D    -- (2) sequencing
B ::= { E }  -- (3) zero or more occurrences of E 
C ::= [ F ]  -- (4) zero or one occurrence of F 
D ::= '1' | '@' S 
E ::= '3'
F ::= '2'    -- (5) base case

Alternative A begins with terminals 1, 2, or @
Alternaive B empty or begins with terminal 3
Choose A or B based on first terminal, no backtracking

Recursive Descent Recognizer

Recognize syntactically correct strings with yes/no result
Return (Bool,String) – (success?, remaining-input)
Base on 5 patterns seen on previous slide (alternatives, sequencing, repetition, optional elements, base cases)
Can use with any recursive language such as Java
See code details for ParserS module

(1) Alternatives: `S ::= A | B`

parseS :: String -> (Bool,String)  -- A | B
parseS xs = 
    case parseA xs of              -- try A
        (True,  ys) -> (True, ys)  -- A succeeds
        (False, _ ) ->
            case parseB xs of      -- else try B 
                (True, ys) -> (True,  ys) -- B succeeds
                (False, _) -> (False, xs) -- both A & B fail
                                   -- nest more alternatives?

(2) Sequencing: `A ::= C D`

parseA :: String -> (Bool,String) -- C D 
parseA xs =
    case parseC xs of             -- try C 
        (True,  ys) -> 
            case parseD ys of     -- then try D 
                (True, zs) -> (True,  zs) -- C D succeeds
                (False, _) -> (False, xs) -- D fails
                                  -- nest more sequentially? 
        (False, _) -> (False,xs)  -- C fails (ERR in handout)

(3) Repetition: `B ::= { E }`

parseB :: String -> (Bool,String)  -- { E }
parseB xs =
    case parseE xs of              -- try E
        (True,  ys) -> parseB ys   -- one E, try again
        (False, _)  -> (True,xs)   -- stop, succeeds

(4) Optional Element: `C ::= [ F ]`

parseC :: String -> (Bool,String)  -- [ F ]
parseC xs =
    case parseF xs of              -- try F
        (True,  ys) -> (True,ys) 
        (False, _ ) -> (True,xs)

(5) Base Case: `E ::= '3'`

parseE :: String -> (Bool,String)
parseE (x:xs') = (x == '3', xs')
parseE xs      = (False,    xs )

Refactor: `D ::= '1' | '@' S`

Has two alternatives, with second a sequence
Refactor into two rules
```
D  ::= '1' | DS 
DS ::= '@' S 
```

parseD :: String -> (Bool,String)  -- '1' | '@' S
parseD ('1':xs) = (True, xs)       -- try '1' (shortcut)
parseD ('@':xs) =                  -- try '@' on DS (shortcut)
    case parseS xs of              -- try S
        (True,  ys) -> (True, ys)
        (False, _ ) -> (False,xs)  -- 
parseD xs      = (False, xs)

Outer Level Recognizer

parse :: String -> Bool
parse xs =
    case parseS xs of
        (True,  []) -> True
        (_,     _ ) -> False

Calculator Language Prefix CFG

<expression> ::=  <var> | <val> | <operexpr>
<var>        ::=  <id>
<val>        ::=  [ "-" ] <unsigned> 
<operexpr>   ::=  '(' <operator> <operandseq> ')'
<operandseq> ::=  { <expression> }
<operator>   ::=  '+'  |  '*'  |  '-'  |  '/' | ...

See code details for ParsePrefixCalc module

Parse `<expression>`

type ParErr = String

-- <expression> ::= <var> | <val> | <operexpr>
parseExpression :: [Token] -> (Either ParErr Expr, [Token])
parseExpression xs =         -- use template (1)
    case parseVar xs of
        r@(Right _, _) -> r  -- <var>
        _ ->
          case parseVal xs of
              r@(Right _, _) -> r  -- <val>
              _ ->
                  case parseOperExpr xs of
                      r@(Right _, _) -> r  -- <operexpr>
                      (Left m, ts)  -> (missingExpr m ts, ts)

missingExpr m ts  =
    Left ("Missing expression at  " ++ (showTokens (pref ts))
           ++ "..\n..Nested error { " ++ m ++ " }")

Outer-Level Parser

parse :: String -> Either ParErr Expr
parse xs =
    case lexer xs of  -- do lexical analysis
        [] -> incompleteExpr xs
        ts ->
            case parseExpression ts of
               (ex@(Right _), []) -> ex
               (ex@(Left  _), _ ) -> ex
               (ex, ss)           -> extraAtEnd ex ss

incompleteExpr xs = Left ("Incomplete expression: " ++ xs)
extraAtEnd ex xs =
   Left ("Nonspace token(s) \"" ++ (showTokens xs) ++ 
         "\" at end of the expression \"" ++ (show ex) ++ "\"")

Parse `<var>`

-- <var> ::= <id>    -- use template (5), constructs Var node
parseVar :: [Token] -> (Either ParErr Expr, [Token])
parseVar ((TokId id):ts) = (Right (Var id),ts) -- Var node
parseVar ts              = (missingVar ts, ts)

missingVar ts =
    Left ("Missing variable at " ++ (showTokens (pref ts)))

Parse `<val>`

--     <val> ::= [ '-' ] <unsigned>
-- REFACTOR AS
--     <val>      ::= <optminus> <unsigned>  -- template (2)
--     <optminus> ::= [ '-' ]                -- template (4)
-- Implement as special base case, constructs Val

parseVal :: [Token] -> (Either ParErr Expr, [Token]) -- Val node
parseVal ((TokNum i):ts)             = (Right (Val i), ts)
parseVal ((TokOp "-"):(TokNum i):ts) = (Right (Val (-i)), ts)
parseVal ts                          = (missingVal ts, ts)

missingVal ts = 
    Left ("Missing value at " ++ (showTokens (pref ts)))

Parse `<operexpr>`

-- <operexpr>   ::=  "(" <operator> <operandseq> ")"
-- use modified version of template (2)
parseOperExpr :: [Token] -> (Either ErrMsg Expr, [Token])
parseOperExpr xs@(TokLeft:(TokOp op):ys) =  -- ( <operator>
    case parseOperandSeq ys of              -- <operandseq>
        (args, zs) -> case zs of                      -- )
                (TokRight:zs') -> (makeExpr op args, zs') -- **
                zs'            -> (missingRParen zs, xs)
-- ill-formed <operexpr>s
parseOperExpr (TokLeft:ts)      = (missingOp ts, ts)
parseOperExpr (TokRight:ts)     = (invalidOpExpr ")", ts)
parseOperExpr ((TokOther s):ts) = (invalidOpExpr s, ts)
parseOperExpr ((TokOp op):ts)   = (invalidOpExpr op, ts)
parseOperExpr ((TokId s):ts)    = (invalidOpExpr s, ts)
parseOperExpr ((TokNum i):ts)   = (invalidOpExpr (show i), ts)
parseOperExpr []                = (incompleteExpr, [])

missingRParen ts = Left ("Missing `)` at " ++ (show (take 3 ts))) 
missingOp ts     = Left ("Missing operator at " ++ (show (take 3 ts))) 
invalidOpExpr s  = Left ("Invalid operation expression beginning with " ++ s) 
incompleteExpr   = Left "Incomplete expression"

Parse `<operandseq>`

-- <operandseq> ::=  { <expression> }
-- uses template (3), has special type signature

parseOperandSeq :: [Token] -> ([Expr],[Token])
parseOperandSeq xs =
    case parseExpression xs of 
        (Left  _,  _ ) -> ([],xs)
        (Right ex, ys) ->
            let (exs,zs) = parseOperandSeq ys
            in  (ex:exs,zs)

AST Construction (`makeExpr`)

import Data.Maybe 

makeExpr :: String -> [Expr] -> Either ErrMsg Expr
makeExpr op exs =
    case arity op of
        0 -> opCons0 op exs  -- not implemented
        1 -> opCons1 op exs
        2 -> opCons2 op exs
        3 -> opCons3 op exs
        4 -> opCons4 op exs  -- not implemented
        5 -> opCons5 op exs  -- not implemented
        _ -> opConsX op exs  -- not implemented

arityMap = [ ("+",2), ("-",2), ("*",2), ("/",2) ]
              -- add (operator,arity) pairs as needed
arity :: String -> Int
arity op = fromMaybe (-1) (lookup op arityMap)

AST Construction (`opCons2`)

assocOpCons2 = [ ("+",Add), ("-",Sub), ("*",Mul), ("/",Div) ]
                  -- add new pairs as needed

opCons2 :: String -> [Expr] -> Either ParErr Expr
opCons2 op exs =
    case length exs of
        2 -> case lookup op assocOpCons2 of -- construct op
                Just c  -> Right (c (exs!!0) (exs!!1))
                Nothing -> invalidOp op
        n -> arityErr op n

invalidOp op =` Left ("Invalid operator '" ++ op ++ "'")
arityErr op n = 
    Left ("Operator '" ++ op ++ "' incorrectly called with " 
          ++ (show n) ++ " operand(s)")

Calculator Language Infix CFG

<expression> ::= <term> { <addop> <term> }
<term>       ::= <factor> { <mulop> <factor> }
<factor>     ::= <var> | <val>
               | '(' <expression> ')'
<val>        ::= [ '-' ] <unsigned>
<var>        ::= <id>
<addop>      ::=  '+'  |  '-' 
<mulop>      ::=  '*'  |  '/'

See code details for ParseInfixCalc module
No explanation yet in Chapter 44

Other Parsing Approaches

Instructor’s partial library of recursive descent combinators
- Chapter 45, Parsing Combinators
- Partial ParserComb module (prototype recognizers, but no ability yet to construct ASTs, little testing)
Standard combinator libraries such as Parsec
Parser generators like Happy and Alex

Key Ideas

Lexical analysis
Parsing
Recursive descent parsing and templates
Lexer and parser for ELI Calculator language interpreter

Exploring Languages with Interpreters and Functional Programming Chapter 44 Calculator: Parsing

Calculator: Parsing

Lecture Goals

ELI Calculator Modules

Parsing

Lexical Analysis

Tokenization

Infix Grammar (Ch. 41)

Prefix Grammar (Ch. 41)

Lexical Analyzer

Data Type Token

Function lexx

Function lexer

Recursive Descent Parsing

Example CFG

Recursive Descent Recognizer

(1) Alternatives: S ::= A | B

(2) Sequencing: A ::= C D

(3) Repetition: B ::= { E }

(4) Optional Element: C ::= [ F ]

(5) Base Case: E ::= '3'

Refactor: D ::= '1' | '@' S