H. Conrad Cunningham
26 November 2018
Copyright (C) 2017, 2018, H. Conrad Cunningham
Acknowledgements: I originally created these slides in Fall 2017 to accompany what is now Chapter 44, Calculator: Parsing, in the 2018 version of the textbook Exploring Languages with Interpreters and Functional Programming.
Browser Advisory: The HTML version of this document may require use of a browser that supports the display of MathML. A good choice as of November 2018 is a recent version of Firefox from Mozilla.
Introduce basic parsing and lexical analysis concepts
Explore lexical analysis using ELI Calculator language lexer
Introduce recursive descent parsing and templates
Explore recursive descent parsing using ELI Calculator language parser
Reinforce Haskell programming concepts for larger programs
Parser (syntactic analyzer)
Approach in ELI Calculator language interpreter
lexical analyzer converts character stream into lexical token (lexeme) stream — regular language
parser converts token stream into abstract syntax tree — context-free language
Hand-coded recursive descent parser
Informally called lexer, scanner, or tokenizer
Phases (often combined)
Scanner — breaks character stream into lexemes (words)
Evaluator — determines syntactic category of lexeme (not Evaluator module)
Infix expression with whitespace
30 + ( x1 * 2)
30
+
(
x1
*
2
)
-- Context-free grammar (CFG)
<expression> ::= <term> { <addop> <term> }
<term> ::= <factor> { <mulop> <factor> }
<factor> ::= <var> | <val>
| '(' <expression> ')'
<val> ::= [ '-' ] <unsigned>
<var> ::= <id>
<addop> ::= '+' | '-'
<mulop> ::= '*' | '/'
-- Regular grammar (same as prefix grammar)
<id> ::= <firstid> | <firstid> <idseq>
<idseq> ::= <restid> | <restid> <idseq>
<firstid> ::= <alpha>| '_'
<restid> ::= <alpha>| '_' | <digit>
<unsigned> ::= <digit>| <digit> <unsigned>
<digit> ::= any numeric character
<alpha> ::= any alphabetic character
-- Context-free Grammar (CFG)
<expression> ::= <var> | <val> | <operexpr>
<var> ::= <id>
<val> ::= [ "-" ] <unsigned>
<operexpr> ::= '(' <operator> <operandseq> ')'
<operandseq> ::= { <expression> }
<operator> ::= '+' | '*' | '-' | '/' | ...
-- Reglar grammar (same as infix grammar)
<id> ::= <firstid> | <firstid> <idseq>
<idseq> ::= <restid> | <restid> <idseq>
<firstid> ::= <alpha>| '_'
<restid> ::= <alpha>| '_' | <digit>
<unsigned> ::= <digit>| <digit> <unsigned>
<digit> ::= any numeric character
<alpha> ::= any alphabetic character
Same for infix and prefix syntax
Data type Token
Function lexx
does
scanning and most of lexeme evaluation
Function lexer
separates
any keywords from identifiers, validates operator symbols
Token
import Values ( NumType, Name, toNumType )
-- e.g., NumType = Int , Name = String
data Token = TokLeft -- left parenthesis
| TokRight -- right parenthesis
| TokNum NumType -- unsigned integer literal
| TokId Name -- names of variables, etc.
| TokOp Name -- names of primitive functions
| TokKey Name -- keywords (no use currently)
| TokOther String -- other characters
deriving (Show, Eq)
lexx
import Data.Char ( isSpace, isDigit, isAlpha, isAlphaNum )
lexx :: String -> [Token] -- scan, most of lexeme eval
lexx [] = []
lexx xs@(x:xs')
| isSpace x = lexx xs'
| x == ';' = lexx (dropWhile (/='\n') xs')
| x == '(' = TokLeft : lexx xs'
| x == ')' = TokRight : lexx xs'
| isDigit x = let (num,rest) = span isDigit xs
in (TokNum (convertNumType num)) : lexx rest
| isFirstId x = let (id,rest) = span isRestId xs
in (TokId id) : lexx rest
| isOpChar x = let (op,rest) = span isOpChar xs
in (TokOp op) : lexx rest
| otherwise = (TokOther [x]) : lexx xs'
where
isFirstId c = isAlpha c || c == '_'
isRestId c = isAlphaNum c || c == '_'
isOpChar c = elem c opchars
opchars = "+-*/~<=>!&|@#$%^?:" -- not " ' ` ( ) [ ] { } , . ;
lexer
lexer :: String -> [Token] -- rest of lexeme eval
lexer xs = markSpecials (lexx xs)
markSpecials :: [Token] -> [Token]
markSpecials ts = map xformTok ts
xformTok :: Token -> Token
xformTok t@(TokId id) -- is keyword?
| elem id keywords = TokOp id
| otherwise = t
xformTok t@(TokOp op) -- is valid operator?
| elem op primitives = t
| otherwise = TokOther op
xformTok t = t
keywords = [] -- none defined currently
primitives = ["+","-","*","/"]
Top-down parser—intuitive—begin with CFG start symbol, work down levels toward program
Mutually recursive functions, one for each nonterminal
Good technique for relatively simple CFGs — e.g. transformable to LL(k)
S ::= A | B -- (1) alternatives
A ::= C D -- (2) sequencing
B ::= { E } -- (3) zero or more occurrences of E
C ::= [ F ] -- (4) zero or one occurrence of F
D ::= '1' | '@' S
E ::= '3'
F ::= '2' -- (5) base case
Alternative A
begins with terminals 1
,
2
, or @
Alternaive B
empty or begins with terminal
3
Choose A
or B
based on first terminal,
no backtracking
Recognize syntactically correct strings with yes/no result
Return (Bool,String)
– (success?, remaining-input)
Base on 5 patterns seen on previous slide (alternatives, sequencing, repetition, optional elements, base cases)
Can use with any recursive language such as Java
See code details for ParserS
module
S ::= A | B
A ::= C D
B ::= { E }
C ::= [ F ]
E ::= '3'
D ::= '1' | '@' S
Has two alternatives, with second a sequence
Refactor into two rules
D ::= '1' | DS
DS ::= '@' S
<expression> ::= <var> | <val> | <operexpr>
<var> ::= <id>
<val> ::= [ "-" ] <unsigned>
<operexpr> ::= '(' <operator> <operandseq> ')'
<operandseq> ::= { <expression> }
<operator> ::= '+' | '*' | '-' | '/' | ...
ParsePrefixCalc
module<expression>
type ParErr = String
-- <expression> ::= <var> | <val> | <operexpr>
parseExpression :: [Token] -> (Either ParErr Expr, [Token])
parseExpression xs = -- use template (1)
case parseVar xs of
r@(Right _, _) -> r -- <var>
_ ->
case parseVal xs of
r@(Right _, _) -> r -- <val>
_ ->
case parseOperExpr xs of
r@(Right _, _) -> r -- <operexpr>
(Left m, ts) -> (missingExpr m ts, ts)
missingExpr m ts =
Left ("Missing expression at " ++ (showTokens (pref ts))
++ "..\n..Nested error { " ++ m ++ " }")
parse :: String -> Either ParErr Expr
parse xs =
case lexer xs of -- do lexical analysis
[] -> incompleteExpr xs
ts ->
case parseExpression ts of
(ex@(Right _), []) -> ex
(ex@(Left _), _ ) -> ex
(ex, ss) -> extraAtEnd ex ss
incompleteExpr xs = Left ("Incomplete expression: " ++ xs)
extraAtEnd ex xs =
Left ("Nonspace token(s) \"" ++ (showTokens xs) ++
"\" at end of the expression \"" ++ (show ex) ++ "\"")
<var>
<val>
-- <val> ::= [ '-' ] <unsigned>
-- REFACTOR AS
-- <val> ::= <optminus> <unsigned> -- template (2)
-- <optminus> ::= [ '-' ] -- template (4)
-- Implement as special base case, constructs Val
parseVal :: [Token] -> (Either ParErr Expr, [Token]) -- Val node
parseVal ((TokNum i):ts) = (Right (Val i), ts)
parseVal ((TokOp "-"):(TokNum i):ts) = (Right (Val (-i)), ts)
parseVal ts = (missingVal ts, ts)
missingVal ts =
Left ("Missing value at " ++ (showTokens (pref ts)))
<operexpr>
-- <operexpr> ::= "(" <operator> <operandseq> ")"
-- use modified version of template (2)
parseOperExpr :: [Token] -> (Either ErrMsg Expr, [Token])
parseOperExpr xs@(TokLeft:(TokOp op):ys) = -- ( <operator>
case parseOperandSeq ys of -- <operandseq>
(args, zs) -> case zs of -- )
(TokRight:zs') -> (makeExpr op args, zs') -- **
zs' -> (missingRParen zs, xs)
-- ill-formed <operexpr>s
parseOperExpr (TokLeft:ts) = (missingOp ts, ts)
parseOperExpr (TokRight:ts) = (invalidOpExpr ")", ts)
parseOperExpr ((TokOther s):ts) = (invalidOpExpr s, ts)
parseOperExpr ((TokOp op):ts) = (invalidOpExpr op, ts)
parseOperExpr ((TokId s):ts) = (invalidOpExpr s, ts)
parseOperExpr ((TokNum i):ts) = (invalidOpExpr (show i), ts)
parseOperExpr [] = (incompleteExpr, [])
missingRParen ts = Left ("Missing `)` at " ++ (show (take 3 ts)))
missingOp ts = Left ("Missing operator at " ++ (show (take 3 ts)))
invalidOpExpr s = Left ("Invalid operation expression beginning with " ++ s)
incompleteExpr = Left "Incomplete expression"
<operandseq>
makeExpr
)import Data.Maybe
makeExpr :: String -> [Expr] -> Either ErrMsg Expr
makeExpr op exs =
case arity op of
0 -> opCons0 op exs -- not implemented
1 -> opCons1 op exs
2 -> opCons2 op exs
3 -> opCons3 op exs
4 -> opCons4 op exs -- not implemented
5 -> opCons5 op exs -- not implemented
_ -> opConsX op exs -- not implemented
arityMap = [ ("+",2), ("-",2), ("*",2), ("/",2) ]
-- add (operator,arity) pairs as needed
arity :: String -> Int
arity op = fromMaybe (-1) (lookup op arityMap)
opCons2
)assocOpCons2 = [ ("+",Add), ("-",Sub), ("*",Mul), ("/",Div) ]
-- add new pairs as needed
opCons2 :: String -> [Expr] -> Either ParErr Expr
opCons2 op exs =
case length exs of
2 -> case lookup op assocOpCons2 of -- construct op
Just c -> Right (c (exs!!0) (exs!!1))
Nothing -> invalidOp op
n -> arityErr op n
invalidOp op =` Left ("Invalid operator '" ++ op ++ "'")
arityErr op n =
Left ("Operator '" ++ op ++ "' incorrectly called with "
++ (show n) ++ " operand(s)")
<expression> ::= <term> { <addop> <term> }
<term> ::= <factor> { <mulop> <factor> }
<factor> ::= <var> | <val>
| '(' <expression> ')'
<val> ::= [ '-' ] <unsigned>
<var> ::= <id>
<addop> ::= '+' | '-'
<mulop> ::= '*' | '/'
See code details for ParseInfixCalc
module
No explanation yet in Chapter 44
Instructor’s partial library of recursive descent combinators
Chapter 45, Parsing Combinators
Partial ParserComb
module (prototype
recognizers, but no ability yet to construct ASTs, little
testing)
Standard combinator libraries such as Parsec
Parser generators like Happy and Alex
Lexical analysis
Parsing
Recursive descent parsing and templates
Lexer and parser for ELI Calculator language interpreter