Wednesday, December 11, 2013

An introduction to Functional Programming in F# -- Part 2: Lexers and Parsers

I hadn't meant this to be a weekly series, but that seems to be the tempo at which I'm transcribing things. So, without further ado, Part Two!

Topics covered in the series

  • Introduction to Functional Programming via circuits
  • Lexing and Parsing (this part)
  • Interpreters
  • Simple ‘Functional’ Distributed/Concurrent Programming
  • Non-Functional bits - Assignment, Exceptions, Objects, Libraries, etc.
  • Advanced stuff - Continuations, Monads, Generative Programming (Language Based programming), etc.

Our example

To illustrate the functional approach to the common problem of interpreting streams of data, this post will cover the lexing (or tokenizing) and parsing (rendering into syntactic form) of a language with a simple grammar. And there are no prizes for guessing that this simple language is LISP, with expressions looking like

  • (mult (add x 5) factor)
  • (bind ((x . 1) (y . 2) (z . 3)) (add x y z))
  • (defun append (x y)
            (if (null x) y
                 (cons (head x) (append (tail x) y))))

Schematically, what we want to build is a pipeline taking characters to lexemes to a syntax tree.


Lexers

The lexer for our simple grammar is a state machine. It operates over a stream of characters, emitting tokens on the transitions marked with "⇒" in the state diagram.

To start coding, we assume that the input is supplied to us as a string. Our first requirement is to turn this into a char list, which we can do by following the approach of recursive decomposition introduced in the previous part:

let rec toList (i: string) =
    if i.Length > 0
    then (i.Chars(0)) :: (toList (i.Substring(1)))
    else [] ;;
Digression - purity of concept vs pragmatism

We are, in this post, working from first principles throughout, rather than going to library code -- the availability of mature facilities such as fslex/fsyacc or FParsec would reduce this example to just the incantations needed to initialise them.

This toList implementation is intended as a reminder of the technique, without diverting into implementation detail; we can see that the repeated creation of temporary strings, for example, would be a performance issue in real use. Indeed, what one might write in practice would be more like

let toList s = (s :> char seq) |> Seq.toList;;

but objects and modules come later in the series.


The basic tools


To describe our tokens, we introduce an algebraic type which, as well as having purely symbolic cases (like our Bits type in the previous part), can also carry data

type lexical = BRA | KET | DOT | SYMATOM of char list | NUMATOM of int ;;

-- this defines our lexemes. Then, to interpret the input data, we define the list of separators (referred to as stopcases in the state diagram above)

let stopcases = toList "().\n\t " ;;

some of which are meaningful tokens in their own right, and digits

let digits = toList "1234567890" ;;

We can then define predicates to identify the type of an input character. We build a function to test for membership in a list, and use that to define the tests for each type.

let rec mem x l =
    match l with
    | [] -> false
    | (y :: ys) -> if x = y then true else mem x ys ;;

let stopcase x = mem x stopcases ;;
let digit x = mem x digits ;;
let alpha x = not (mem x (digits @ stopcases)) ;;
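In the pragmatic spirit of the toList digression, mem is just the library's List.exists in disguise; the primed names below are introduced here for illustration only, and are not used in the rest of the post:

```fsharp
// Library-flavoured equivalents of mem and digit; the primed names are
// illustrative only and are not used elsewhere in this post.
let mem' x l = List.exists (fun y -> x = y) l ;;
let digit' (x: char) = mem' x (List.ofSeq "1234567890") ;;

printfn "%b" (digit' '7')   // true
printfn "%b" (digit' 'x')   // false
```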

We can also take digit streams and turn them into numeric values

let digitToint i =
    match i with
    | '0' -> 0
    | '1' -> 1
    | '2' -> 2
    | '3' -> 3
    | '4' -> 4
    | '5' -> 5
    | '6' -> 6
    | '7' -> 7
    | '8' -> 8
    | '9' -> 9
    | _ -> 0 ;;

let mkint s =
    let rec mkintAux s a =
        match s with
        | [] -> a
        | (x :: xs) -> mkintAux xs ((a * 10) + (digitToint x))
    mkintAux s 0 ;;
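Again in the pragmatic vein: mkintAux is a left fold over the digit list, so the library version is a one-liner (mkint' is a name introduced here for illustration, not part of the code above):

```fsharp
// mkint as a left fold over the digits; mkint' is illustrative only.
let mkint' (s: char list) = List.fold (fun a c -> a * 10 + (int c - int '0')) 0 s ;;

printfn "%d" (mkint' ['4'; '0'; '5'])   // 405
```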

The component lexers

We can now write the code representing the individual state transitions for symbol and number lexing

let rec lexSym s l =
    match l with
    | [] -> (SYMATOM s, [])
    | (x :: xs) ->
        if stopcase x then (SYMATOM s, x :: xs)
        else lexSym (s @ [x]) xs ;;

which takes a list, and returns the lexeme and the rest of the list. The case for numbers is more complicated, since we have to deal with the case where an embedded letter makes the whole thing a symbol

let rec lexNumOrSym s l =
    match l with
    | [] -> (NUMATOM (mkint s), [])
    | (x :: xs) ->
        if (stopcase x) then (NUMATOM (mkint s), x :: xs)
        else if (alpha x) then lexSym (s @ [x]) xs
        else lexNumOrSym (s @ [x]) xs ;;

All that remains is to assemble the pieces:

let rec lexer l =
    match l with
    | [] -> []
    | '\n' :: xs -> lexer xs
    | ' ' :: xs -> lexer xs
    | '\t' :: xs -> lexer xs
    | '(' :: xs -> BRA :: (lexer xs)
    | ')' :: xs -> KET :: (lexer xs)
    | '.' :: xs -> DOT :: (lexer xs)
    | x :: xs ->
        let (l, r) = if alpha x then (lexSym [x] xs) else (lexNumOrSym [x] xs)
        l :: lexer r ;;
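To see the lexer end to end, here is the pipeline condensed into one self-contained snippet, using the library shortcuts from the digression above in place of the longhand versions:

```fsharp
// The whole lexer, condensed and self-contained, using library shortcuts.
type lexical = BRA | KET | DOT | SYMATOM of char list | NUMATOM of int

let toList (s: string) = List.ofSeq s
let stopcases = toList "().\n\t "
let digits = toList "1234567890"
let mem x l = List.exists (fun y -> x = y) l
let stopcase x = mem x stopcases
let alpha x = not (mem x (digits @ stopcases))
let mkint (s: char list) = List.fold (fun a c -> a * 10 + (int c - int '0')) 0 s

let rec lexSym s l =
    match l with
    | [] -> (SYMATOM s, [])
    | x :: xs -> if stopcase x then (SYMATOM s, x :: xs) else lexSym (s @ [x]) xs

let rec lexNumOrSym s l =
    match l with
    | [] -> (NUMATOM (mkint s), [])
    | x :: xs ->
        if stopcase x then (NUMATOM (mkint s), x :: xs)
        elif alpha x then lexSym (s @ [x]) xs
        else lexNumOrSym (s @ [x]) xs

let rec lexer l =
    match l with
    | [] -> []
    | ('\n' | ' ' | '\t') :: xs -> lexer xs
    | '(' :: xs -> BRA :: lexer xs
    | ')' :: xs -> KET :: lexer xs
    | '.' :: xs -> DOT :: lexer xs
    | x :: xs ->
        let (t, r) = if alpha x then lexSym [x] xs else lexNumOrSym [x] xs
        t :: lexer r

printfn "%A" (lexer (toList "(add x 5)"))
// [BRA; SYMATOM ['a'; 'd'; 'd']; SYMATOM ['x']; NUMATOM 5; KET]
```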

Exercise: Write a lexer for .ini files (see http://en.wikipedia.org/wiki/INI_file).


The component parsers

The syntax we want to develop to distinguish our LISP from arbitrary streams of lexemes looks like this

  • SEXPR = ATOM |
            BRA SEXPR DOT SEXPR KET |
            BRA SEXPR* KET
  • ATOM =  SYMBOL | NUMBER 

The approach we take is to translate the grammar directly into a program by creating elementary parsers for bra, ket, dot, symbols, atoms etc. and combine them by operations of sequencing, alternation and iteration.


In the same way that the lexer worked in terms of a list of lexemes and the input yet to process, we define our parse state as the tuple (remaining input to be parsed, list of results so far); a parsing operation will either transform one parse state into another (moving data from input to output) or explicitly fail.

Consider the case (mult 1 2) and assume BRA has been read already. The input parse state looks like ( [SYMATOM mult ; NUMATOM 1; NUMATOM 2; KET] , _)
where we use the _ as a conventional wildcard, and we want our symbol parser to return ( [NUMATOM 1; NUMATOM 2; KET ] , [SYMATOM mult] ) on success.

Rendering this as code

type 'a Maybe = Return of 'a | Fail;;
view raw gistfile1.fs hosted with ❤ by GitHub

creates a type that captures the idea of success with some meaningful value, or failure with none. Then we define the parser to handle the state where we expect the next lexeme to be a symbol as

let parseSym s =
    match s with
    | Return ((SYMATOM y) :: ys, _) -> Return (ys, [SYMATOM y])
    | _ -> Fail ;;

which we can see does the job we specified above (modulo the wrapping in a Return).
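We can check this directly in F# Interactive; the snippet below repeats the two type definitions so that it stands alone:

```fsharp
// parseSym in action; the types are repeated so this snippet stands alone.
type lexical = BRA | KET | DOT | SYMATOM of char list | NUMATOM of int
type 'a Maybe = Return of 'a | Fail

let parseSym s =
    match s with
    | Return ((SYMATOM y) :: ys, _) -> Return (ys, [SYMATOM y])
    | _ -> Fail

printfn "%A" (parseSym (Return ([SYMATOM ['m']; NUMATOM 1; KET], ([] : lexical list))))
// Return ([NUMATOM 1; KET], [SYMATOM ['m']])
printfn "%A" (parseSym (Return ([NUMATOM 1; KET], ([] : lexical list))))
// Fail
```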


The parser for a number will be much the same

let parseNum s =
    match s with
    | Return ((NUMATOM y) :: ys, a) -> Return (ys, [NUMATOM y])
    | _ -> Fail ;;

and for all the simple tokens we can write a general parser

let parseToken x s =
    match s with
    | Return (y :: ys, a) -> if x = y then Return (ys, []) else Fail
    | _ -> Fail ;;

from which we produce the specific parsers

let bra x = parseToken BRA x;;
let ket x = parseToken KET x;;
let dot x = parseToken DOT x;;

Combining the simple parsers

To express the notion that there may be more than one valid lexeme at a particular point, we define

let (||) f g x =
    let result = f x
    if result = Fail then (g x) else result ;;

to express the notion of "Try A if it Fails Try B".

Similarly, if we expect two particular results in succession we can define

let (>>) f g x =
    let u = f x
    match u with
    | Return (rest1, result1) ->
        let v = g u
        match v with
        | Return (rest2, result2) -> Return (rest2, result1 @ result2)
        | Fail -> Fail
    | Fail -> Fail ;;

in words, "Try A; if it fails, the whole thing fails. If it succeeds, try B on the new parse state; if B fails, the whole thing fails."

Then, when we expect some number of repetitions of the same thing in sequence

let rec ITERATE f x =
    let u = f x
    match u with
    | Fail -> x
    | Return (rest2, result2) ->
        match x with
        | Fail -> Fail
        | Return (rest1, result1) ->
            ITERATE f (Return (rest2, result1 @ result2)) ;;

"Try A to match the longest possible sequence of A’s, including a zero length sequence."


Exercise: Write the operator OPT such that OPT A matches zero or one occurrences of A (note A is a parser!); i.e. A is OPTIONAL.


Exercise: Write the operator PLUS such that PLUS A matches one or more occurrences of A.


Transforms

So far our parsers only check the structure of an expression -- we also want them to build an abstract structure corresponding to the string, with the general form


  • SEXPR = ATOM ⇒ ATOM-ACTION |
            BRA SEXPR DOT SEXPR KET ⇒ PAIR-ACTION |
            BRA SEXPR* KET ⇒ LIST-ACTION 
  • ATOM =  SYMBOL ⇒ SYMBOL-ACTION |
            NUMBER  ⇒ NUMBER-ACTION 

Transforms take a parse state and build a parse state of a different type. Where our old parse state was (remaining input to be parsed, list of lexical items so far), we want to construct a new parse state of the form (remaining input to be parsed, list of Lisp trees, i.e. items built from ATOMs, NIL and CONS).

In code, we define a representation of an S-expression as

type SEXPR = SYMBOL of char list | NUMBER of int | NIL | CONS of SEXPR * SEXPR ;;

and the sort of syntax tree structure we want to build in terms of this type will look like


The general shape of a transform is

let (>:) f h s =
    let u = f s
    match u with
    | Return (rest, result) -> Return (rest, h result)
    | Fail -> Fail ;;

which parses input according to f then transforms the parse state according to h.


The grammar we need to make the shape of the figure above is

  • SEXPR = ATOM ⇒ ATOM-ACTION |
            BRA SEXPR DOT SEXPR KET ⇒ CONS-PAIR |
            BRA SEXPR* KET ⇒ CONS-LIST 
  • ATOM =  SYMBOL ⇒ ATOM-ACTION |
            NUMBER  ⇒ ATOM-ACTION 

where the operations are

  • CONSPAIR – produce a singleton list containing a CONS of the items
  • CONSLIST – produce an explicit CONS cell list containing the items
  • ATOMACTION – translate base cases between data types.

or, in code:

let CONSPAIR [x; y] = [CONS (x, y)] ;;

let CONSLIST l =
    let rec conslist y =
        match y with
        | x :: xs -> CONS (x, conslist xs)
        | [] -> NIL
    [conslist l] ;;

let ATOMACTION v =
    match v with
    | [SYMATOM u] -> [SYMBOL u]
    | [NUMATOM i] -> [NUMBER i]
    | _ -> [] ;;
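To see a transform in isolation, here is CONSLIST turning a flat list of results into a chain of CONS cells (self-contained, repeating just the SEXPR type):

```fsharp
// CONSLIST in isolation; SEXPR is repeated so this snippet stands alone.
type SEXPR = SYMBOL of char list | NUMBER of int | NIL | CONS of SEXPR * SEXPR

let CONSLIST l =
    let rec conslist y =
        match y with
        | x :: xs -> CONS (x, conslist xs)
        | [] -> NIL
    [conslist l]

printfn "%A" (CONSLIST [NUMBER 1; NUMBER 2])
// [CONS (NUMBER 1, CONS (NUMBER 2, NIL))]
```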

The Final Parser

After all the hard work, putting it together is rather anticlimactic. We define a helper

let parseAtom s = ((parseSym >: ATOMACTION) || (parseNum >: ATOMACTION)) s ;;

Finally, putting it all into one bundle, of type string -> SEXPR

let parse str =
    let r = parseSEXPR (Return (lexer (toList str), []))
    match r with
    | Return (_, result :: _) -> result
    | Return (_, []) -> NIL
    | Fail -> NIL ;;
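The one definition not shown above is parseSEXPR itself, on which parse depends: it is simply the SEXPR grammar transcribed into our combinators. For reference, here is the parser half condensed into a single self-contained listing, ending with a plausible reconstruction of that definition (the reconstruction is an assumption on my part; note that the dotted-pair rule must be tried before the list rule, since both begin with BRA):

```fsharp
// The parser half, condensed and self-contained. parseSEXPR at the end is
// a reconstruction, translating the SEXPR grammar directly.
type lexical = BRA | KET | DOT | SYMATOM of char list | NUMATOM of int
type 'a Maybe = Return of 'a | Fail
type SEXPR = SYMBOL of char list | NUMBER of int | NIL | CONS of SEXPR * SEXPR

let parseSym s =
    match s with
    | Return ((SYMATOM y) :: ys, _) -> Return (ys, [SYMATOM y])
    | _ -> Fail

let parseNum s =
    match s with
    | Return ((NUMATOM y) :: ys, _) -> Return (ys, [NUMATOM y])
    | _ -> Fail

let parseToken x s =
    match s with
    | Return (y :: ys, _) -> if x = y then Return (ys, []) else Fail
    | _ -> Fail

let bra x = parseToken BRA x
let ket x = parseToken KET x
let dot x = parseToken DOT x

// NB: these shadow the built-in boolean || and function composition >>
let (||) f g x = match f x with Fail -> g x | result -> result

let (>>) f g x =
    match f x with
    | Return (rest1, result1) ->
        match g (Return (rest1, result1)) with
        | Return (rest2, result2) -> Return (rest2, result1 @ result2)
        | Fail -> Fail
    | Fail -> Fail

let rec ITERATE f x =
    match f x with
    | Fail -> x
    | Return (rest2, result2) ->
        match x with
        | Fail -> Fail
        | Return (_, result1) -> ITERATE f (Return (rest2, result1 @ result2))

let (>:) f h s =
    match f s with
    | Return (rest, result) -> Return (rest, h result)
    | Fail -> Fail

let CONSPAIR l = match l with [x; y] -> [CONS (x, y)] | _ -> []
let CONSLIST l =
    let rec conslist y =
        match y with
        | x :: xs -> CONS (x, conslist xs)
        | [] -> NIL
    [conslist l]
let ATOMACTION v =
    match v with
    | [SYMATOM u] -> [SYMBOL u]
    | [NUMATOM i] -> [NUMBER i]
    | _ -> []

let parseAtom s = ((parseSym >: ATOMACTION) || (parseNum >: ATOMACTION)) s

// The reconstruction: atom, else dotted pair, else list
let rec parseSEXPR s =
    (parseAtom
     || ((bra >> parseSEXPR >> dot >> parseSEXPR >> ket) >: CONSPAIR)
     || ((bra >> ITERATE parseSEXPR >> ket) >: CONSLIST)) s

// Applied to the lexemes of "(add 1 2)":
printfn "%A" (parseSEXPR (Return ([BRA; SYMATOM ['a'; 'd'; 'd']; NUMATOM 1; NUMATOM 2; KET], [])))
// Return ([], [CONS (SYMBOL ['a'; 'd'; 'd'], CONS (NUMBER 1, CONS (NUMBER 2, NIL)))])
```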

Exercise: Write a parser for an .ini file

Here's the sort of grammar to use:


  • File = Line*
  • Line = WHITESPACE CMD 'NL'
  • WHITESPACE = ('NL' | 'TAB' | SPACE)*
  • CMD = SectionHeader | Binding | Comment
  • SectionHeader = '[' HeaderString ']'
  • Binding = BinderString '=' BinderString
  • HeaderString = anything but '[' or ']' or newline
  • BinderString = anything but '=' or newline | Quote
  • Quote = '"' anything but quotes '"'
  • Comment = ';' ANY 'NL'

Exercise: What real-world performance implications are there in the use of this sort of parser? Re-implement one of the parsers to avoid the limitations of the strictly evaluated list datatype.


