Wednesday, December 11, 2013

An introduction to Functional Programming in F# -- Part 2: Lexers and Parsers

I hadn't meant this to be a weekly series, but that seems to be the tempo at which I'm transcribing things. So, without further ado, Part Two!

Topics covered in the series

  • Introduction to Functional Programming via circuits
  • Lexing and Parsing (this part)
  • Interpreters
  • Simple ‘Functional’ Distributed/Concurrent Programming
  • Non-Functional bits - Assignment, Exceptions, Objects, Libraries, etc.
  • Advanced stuff - Continuations, Monads, Generative Programming (Language Based programming), etc.

Our example

To illustrate the functional approach to the common problem of interpreting streams of data, this post will cover the lexing (or tokenizing) and parsing (rendering into syntactic form) of a language with a simple grammar. And there are no prizes for guessing that this simple language is LISP, with expressions looking like

  • (mult (add x 5) factor)
  • (bind ((x . 1) (y . 2) (z . 3)) (add x y z))
  • (defun append (x y)
            (if (null x) y
                 (cons (head x) (append (tail x) y))))

Schematically, what we want to build is a pipeline taking characters to lexemes to a syntax tree.


Lexers

The lexer for our simple grammar is a state machine. It operates over a stream of characters, emitting tokens on the transitions between states (marked with "⇒" in the state diagram).

To start coding, we assume that the input is supplied to us as a string; our first requirement is to turn this into a char list, which we can do by following the approach of recursive decomposition introduced in the previous part:
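A minimal sketch of such a function (the name toList is an assumption):

    // Recursive decomposition: peel off the head character and recurse
    // on the remainder of the string.
    let rec toList (s : string) =
        if s = "" then []
        else s.[0] :: toList (s.Substring(1))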

Digression - purity of concept vs pragmatism

We are, in this post, working from first principles throughout, rather than reaching for library code -- the availability of mature facilities such as fslex/fsyacc or FParsec would reduce this example to just the incantations needed to initialise them.

This toList implementation is intended as a reminder of the technique used, without diverting into implementation detail. We can see that the repeated creation of temporary strings, for example, would be a potential performance issue in practice. Indeed, what one might write in practice would be more like
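perhaps something along these lines (one plausible form):

    // Let the string object and the collection library do the work
    // in one pass, with no intermediate strings.
    let toList (s : string) = s.ToCharArray() |> List.ofArray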

but objects and modules come later in the series.


The basic tools


To describe our tokens, we introduce an algebraic type which, as well as having purely symbolic cases (like our Bits type before), also carries data
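Something like the following will serve (constructor names chosen to match their uses later in this post):

    // Lexemes: punctuation carries no data; symbols and numbers do.
    type Token =
        | BRA                  // (
        | KET                  // )
        | DOT                  // .
        | SYMATOM of string    // a symbol, e.g. mult
        | NUMATOM of int       // an integer literal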


-- this defines our lexemes. Then to interpret the input data, define the list of separators (referred to as stopcases in the state diagram above)
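for example (the exact set chosen is an assumption):

    // Characters that terminate a symbol or number in progress;
    // these are the stopcases of the state diagram.
    let stopcases = ['('; ')'; '.'; ' '; '\t'; '\n'; '\r']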


some of which are meaningful tokens in their own right, and digits
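which are just the ten obvious characters:

    let digits = ['0' .. '9']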


We can then define predicates to identify the type of an input character: we build a function to test for membership in a list, and use that to define the tests for each type.
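A sketch:

    // Linear membership test over a list.
    let rec mem x = function
        | [] -> false
        | h :: t -> h = x || mem x t

    let isStopcase c = mem c stopcases
    let isDigit c = mem c digits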

We can also take digit streams and turn them into numeric values
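by folding the digits up, most significant first (a sketch):

    // e.g. ['4'; '2'] -> 42
    let toNumber ds =
        List.fold (fun acc d -> acc * 10 + (int d - int '0')) 0 ds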


The component lexers

We can now write the code representing the individual state transitions for symbol and number lexing
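For symbols, we accumulate characters until a stopcase appears (a sketch; the accumulator holds the characters read so far, in reverse):

    let rec lexSymbol acc input =
        match input with
        | c :: rest when not (isStopcase c) -> lexSymbol (c :: acc) rest
        | _ -> SYMATOM (System.String(List.rev acc |> List.toArray)), input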

which takes a list and returns the lexeme together with the rest of the list. The case for numbers is more complicated, since we have to handle the possibility that an embedded letter makes the whole thing a symbol
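so we fall back to the symbol lexer when that happens (a sketch):

    let rec lexNumber acc input =
        match input with
        | c :: rest when isDigit c -> lexNumber (c :: acc) rest
        // a non-digit that is not a stopcase demotes the whole
        // run to a symbol
        | c :: _ when not (isStopcase c) -> lexSymbol acc input
        | _ -> NUMATOM (toNumber (List.rev acc)), input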

all that remains is to assemble the pieces
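into a single function from characters to tokens (a sketch; the whitespace handling is an assumption):

    let rec lex input =
        match input with
        | [] -> []
        | c :: rest when mem c [' '; '\t'; '\n'; '\r'] -> lex rest
        | '(' :: rest -> BRA :: lex rest
        | ')' :: rest -> KET :: lex rest
        | '.' :: rest -> DOT :: lex rest
        | c :: _ when isDigit c ->
            let token, rest = lexNumber [] input
            token :: lex rest
        | _ ->
            let token, rest = lexSymbol [] input
            token :: lex rest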


Exercise: Write a lexer for .ini files (see http://en.wikipedia.org/wiki/INI_file).


The component parsers

The syntax we want to develop to distinguish our LISP from arbitrary streams of lexemes looks like this

  • SEXPR = ATOM |
            BRA SEXPR DOT SEXPR KET |
            BRA SEXPR* KET
  • ATOM =  SYMBOL | NUMBER 

The approach we take is to translate the grammar directly into a program by creating elementary parsers for bra, ket, dot, symbols, atoms etc., and combining them by sequencing, alternation and iteration.


In the same way that the lexer worked in terms of a list of lexemes and the input yet to process, we define our parse state as the tuple (remaining input to be parsed, list of results so far); a parsing operation will either transform one parse state into another (moving data from input to output) or explicitly fail

Consider the case (mult 1 2) and assume BRA has been read already. The input parse state looks like ([SYMATOM mult; NUMATOM 1; NUMATOM 2; KET], _), where we use the _ as a conventional wildcard, and we want our symbol parser to return ([NUMATOM 1; NUMATOM 2; KET], [SYMATOM mult]) on success.

Rendering this as code
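A minimal sketch (the type name Outcome is an assumption):

    // Either a successful result carrying a value, or failure.
    type Outcome<'a> =
        | Return of 'a
        | Fail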

This creates a type that captures the idea of success with some meaningful value, or failure with none. Then we define the parser for the state where we expect the next lexeme to be a symbol as
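(a sketch, under the convention that each parser appends what it produces to the results accumulated so far):

    let parseSymbol (input, output) =
        match input with
        // on an empty accumulator this yields exactly
        // (rest, [SYMATOM s]) as described above
        | SYMATOM s :: rest -> Return (rest, output @ [SYMATOM s])
        | _ -> Fail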


which we can see will do the job we specified above (modulo the wrapping in a Return).


The parser for a number will be much the same
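(the same sketch, matching a number instead):

    let parseNumber (input, output) =
        match input with
        | NUMATOM n :: rest -> Return (rest, output @ [NUMATOM n])
        | _ -> Fail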


and for all the simple tokens we can write a general parser
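one that accepts a given token and consumes it without producing an output item (a sketch):

    let parseToken token (input, output) =
        match input with
        | t :: rest when t = token -> Return (rest, output)
        | _ -> Fail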


from which we produce the specific parsers
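(written eta-expanded over the parse state, which keeps them generic in the result type without falling foul of F#'s value restriction):

    let parseBra state = parseToken BRA state
    let parseKet state = parseToken KET state
    let parseDot state = parseToken DOT state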


Combining the simple parsers

To express the notion that there is more than one valid lexeme at a particular point define
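the alternation combinator (the capitalised operator names are an assumption):

    let ORELSE a b state =
        match a state with
        | Fail -> b state
        | success -> success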

in words, "Try A; if it fails, try B".

Similarly, if we expect two particular results in succession we can define
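the sequencing combinator (a sketch in the same style):

    let THEN a b state =
        match a state with
        | Fail -> Fail
        | Return state' -> b state'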

in words, "Try A; if it fails, the whole thing fails; if it succeeds, then try B on the new parse state; if B fails, the whole thing fails."

Then, when we expect a number of occurrences of the same thing in sequence
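we can define iteration (a sketch; note it assumes a consumes input whenever it succeeds, or it would loop forever):

    let rec STAR a state =
        match a state with
        | Fail -> Return state
        | Return state' -> STAR a state'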

"Try A to match the longest possible sequence of A’s, including a zero length sequence."


Exercise: Write the operator OPT such that OPT A matches zero or one occurrences of A (note A is a parser!); i.e. A is OPTIONAL.


Exercise: Write the operator PLUS such that PLUS A matches one or more occurrences of A.


Transforms

So far our parsers only check the structure of an expression – we also want them to build an abstract structure corresponding to the string, with the general form


  • SEXPR = ATOM ⇒ ATOM-ACTION |
            BRA SEXPR DOT SEXPR KET ⇒ PAIR-ACTION |
            BRA SEXPR* KET ⇒ LIST-ACTION 
  • ATOM =  SYMBOL ⇒ SYMBOL-ACTION |
            NUMBER  ⇒ NUMBER-ACTION 

Transforms take a parse state and build a parse state of a different type. Where our old Parse-State was (remaining input to be parsed, list of lexical items so far), we want to construct a new Parse-State of the form (remaining input to be parsed, list of Lisp Trees, i.e. items built from ATOMS, NIL and CONS).

In code, we define a representation of an S-expression as
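(a sketch, reusing the Token type from the lexer for the leaves):

    type SEXPR =
        | NIL
        | ATOM of Token
        | CONS of SEXPR * SEXPR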

and the sort of syntax tree structure we want to build in terms of this type will look like
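for instance, for the running example (mult 1 2):

    let example =    // a chain of CONS cells terminated by NIL
        CONS (ATOM (SYMATOM "mult"),
              CONS (ATOM (NUMATOM 1),
                    CONS (ATOM (NUMATOM 2), NIL)))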


The general shape of a transform is
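(a sketch; here the sub-parse f is run on a fresh result list, so that h sees exactly the items f produced, and h's output is appended to the results accumulated so far):

    let TRANSFORM f h (input, output) =
        match f (input, []) with
        | Fail -> Fail
        | Return (rest, produced) -> Return (rest, output @ h produced)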

which parses input according to f and then transforms the resulting parse state according to h.


The grammar we need to produce the tree shape shown above is

  • SEXPR = ATOM ⇒ ATOM-ACTION |
            BRA SEXPR DOT SEXPR KET ⇒ CONS-PAIR |
            BRA SEXPR* KET ⇒ CONS-LIST 
  • ATOM =  SYMBOL ⇒ ATOM-ACTION |
            NUMBER  ⇒ ATOM-ACTION 

where the operations are

  • CONSPAIR – produce a singleton list containing a CONS of the items
  • CONSLIST – produce an explicit CONS cell list containing the items
  • ATOMACTION – translate base cases between data types.

or, in code:
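(sketches; each action rewrites the list of items produced by the sub-parse it is attached to):

    // ATOMACTION: each lexical atom becomes a leaf of the tree.
    let ATOMACTION = List.map ATOM

    // CONSPAIR: two parsed items become a single dotted pair.
    let CONSPAIR = function
        | [a; b] -> [CONS (a, b)]
        | _ -> failwith "CONSPAIR expects exactly two items"

    // CONSLIST: a run of parsed items becomes an explicit CONS-cell list.
    let CONSLIST items =
        [ List.foldBack (fun item rest -> CONS (item, rest)) items NIL ]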


The Final Parser

After all the hard work, putting it together is rather anticlimactic. We define a helper
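perhaps one to unwrap a successful parse, complaining otherwise (an assumption):

    // Expect the input to be fully consumed, yielding exactly one tree.
    let unwrap outcome =
        match outcome with
        | Return ([], [tree]) -> tree
        | _ -> failwith "parse error"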

Finally, putting it all into one bundle, of type string -> SEXPR
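(a sketch: the grammar rendered as a group of mutually recursive parsers, then the whole pipeline):

    let rec parseSexpr state =
        ORELSE parseAtom (ORELSE parsePair parseList) state
    and parseAtom state =
        TRANSFORM (ORELSE parseSymbol parseNumber) ATOMACTION state
    and parsePair state =
        TRANSFORM
            (THEN parseBra
                  (THEN parseSexpr (THEN parseDot (THEN parseSexpr parseKet))))
            CONSPAIR state
    and parseList state =
        TRANSFORM (THEN parseBra (THEN (STAR parseSexpr) parseKet)) CONSLIST state

    // characters -> lexemes -> syntax tree
    let parse (s : string) : SEXPR =
        unwrap (parseSexpr (lex (toList s), []))

Applied to the first example above, parse "(mult (add x 5) factor)" yields the corresponding tree of CONS cells.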


Exercise: Write a parser for an .ini file

Here's the sort of grammar to use:


  • File = Line*
  • Line = WHITESPACE CMD 'NL'
  • WHITESPACE = ('NL' | 'TAB' | 'SPACE')*
  • CMD = SectionHeader | Binding | Comment
  • SectionHeader = '[' HeaderString ']'
  • Binding = BinderString '=' BinderString
  • HeaderString = anything but '[' or ']' or newline
  • BinderString = anything but '=' or newline | Quote
  • Quote = '"' anything but quotes '"'
  • Comment = ';' ANY 'NL'

Exercise: What real-world performance implications are there in the use of this sort of parser? Re-implement one of the parsers to avoid the limitations of the strictly evaluated list datatype.


