An introduction to Functional Programming in F# -- Part 2: Lexers and Parsers
I hadn't meant this to be a weekly series, but that seems to be the tempo at which I'm transcribing things. So, without further ado, Part Two!
Topics covered in the series
- Introduction to Functional Programming via circuits
- Lexing and Parsing (this part)
- Interpreters
- Simple ‘Functional’ Distributed/Concurrent Programming
- Non-Functional bits - Assignment, Exceptions, Objects, Libraries, etc.
- Advanced stuff - Continuations, Monads, Generative Programming (language-based programming), etc.
Our example
To illustrate the functional approach to the common problem of interpreting streams of data, this post will cover the lexing (or tokenizing) and parsing (rendering into syntactic form) of a language with a simple grammar. And there are no prizes for guessing that this simple language is LISP, with expressions looking like
```
(mult (add x 5) factor)
(bind ((x . 1) (y . 2) (z . 3)) (add x y z))
(defun append (x y)
  (if (null x) y
      (cons (head x) (append (tail x) y))))
```
Schematically, what we want to build is a pipeline taking characters to lexemes to syntax tree.
Lexers
The lexer for our simple grammar is a state machine: it operates over a stream of characters, emitting tokens on the transitions marked with "⇒" in its state diagram.
To start coding, we assume that the input is supplied to us as a string; our first requirement is to turn this into a `char list`, which we can do following the approach of recursive decomposition introduced in the previous part -- something like:
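```fsharp
// A sketch of such a toList: peel the head character off the string
// and recurse on the (freshly allocated) remainder.
let rec toList (s: string) =
    if s.Length = 0 then []
    else s.[0] :: toList (s.Substring(1))
```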
Digression - purity of concept vs pragmatism
We are, in this post, working from first principles throughout, rather than going to library code -- the availability of mature facilities such as `fslex`/`fsyacc` or FParsec would reduce this example to just the incantations needed to initialise them.
This `toList` implementation is intended as a reminder of the technique used, without diverting into implementation detail. We can see that the repeated creation of temporary strings, for example, would be a potential performance issue in practice. Indeed, what one might write in practice would be more like the following library-based one-liner:
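```fsharp
// Exploiting the string object's method plus a List-module conversion
// (a plausible version -- the original listing may differ)
let toList (s: string) = s.ToCharArray() |> List.ofArray
```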
but objects and modules come later in the series.
The basic tools
To describe our tokens, we introduce an algebraic type which, as well as being symbolic (as our `Bits` type before), also includes data; a sketch:
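```fsharp
// Constructor names taken from the parse states shown later in this
// post; the payload types are assumptions.
type Token =
    | BRA                // '('
    | KET                // ')'
    | DOT                // '.'
    | SYMATOM of string  // a symbol such as mult
    | NUMATOM of int     // a numeric literal
```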
This defines our lexemes. Then, to interpret the input data, we define the list of separators (referred to as stopcases in the state diagram above), some of which are meaningful tokens in their own right, and the digits -- plausibly something like:
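```fsharp
// The exact set of whitespace separators is an assumption.
let stopcases = ['('; ')'; '.'; ' '; '\t'; '\r'; '\n']
let digits    = ['0' .. '9']
```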
We can then define predicates to identify the type of an input character: we build a function to test for membership in a list, and use that to define the tests for each type, along these lines (the names `isIn`, `isStopcase` and `isDigit` are assumptions):
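```fsharp
// Membership test, built from first principles rather than List.contains
let isIn items x = List.exists (fun y -> y = x) items

let isStopcase c = isIn stopcases c
let isDigit c    = isIn digits c
```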
We can also take digit streams and turn them into numeric values, sketched here as a fold (the name `toNumber` is an assumption):
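```fsharp
// Fold a list of digit characters into the integer they spell,
// e.g. ['4'; '2'] -> 42
let toNumber ds =
    List.fold (fun acc d -> 10 * acc + (int d - int '0')) 0 ds
```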
The component lexers
We can now write the code representing the individual state transitions for symbol and number lexing; first, a sketch of the symbol case (the accumulator handling is an assumption):
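```fsharp
// Symbol state: consume characters until a stopcase, then emit a SYMATOM.
let rec lexSymbol acc input =
    match input with
    | c :: rest when not (isStopcase c) -> lexSymbol (acc @ [c]) rest
    | _ -> (SYMATOM (System.String(List.toArray acc)), input)
```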
which takes a list, and returns the lexeme and the rest of the list. The case for numbers is more complicated, since we have to deal with the case where an embedded letter makes the whole thing a symbol:
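```fsharp
// Number state: consume digits; if a non-stopcase, non-digit character
// turns up, the whole token is really a symbol, so fall back to lexSymbol.
let rec lexNumber acc input =
    match input with
    | c :: rest when isDigit c -> lexNumber (acc @ [c]) rest
    | c :: _ when not (isStopcase c) -> lexSymbol acc input
    | _ -> (NUMATOM (toNumber acc), input)
```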
All that remains is to assemble the pieces:
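```fsharp
// The assembled lexer (a sketch): dispatch on the current character.
let rec lex input =
    match input with
    | [] -> []
    | '(' :: rest -> BRA :: lex rest
    | ')' :: rest -> KET :: lex rest
    | '.' :: rest -> DOT :: lex rest
    | c :: rest when isStopcase c -> lex rest   // remaining stopcases are whitespace
    | c :: _ when isDigit c ->
        let token, rest = lexNumber [] input
        token :: lex rest
    | _ ->
        let token, rest = lexSymbol [] input
        token :: lex rest
```

so that, for example, `lex (toList "(add 1 2)")` yields `[BRA; SYMATOM "add"; NUMATOM 1; NUMATOM 2; KET]`.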
Exercise: Write a lexer for .ini files (see http://en.wikipedia.org/wiki/INI_file).
The component parsers
The syntax we want to define, to distinguish our LISP from arbitrary streams of lexemes, looks like this:
SEXPR = ATOM | BRA SEXPR DOT SEXPR KET | BRA SEXPR* KET
ATOM = SYMBOL | NUMBER
The approach we take is to translate the grammar directly into a program, by creating elementary parsers for bra, ket, dot, symbols, atoms, etc., and combining them by operations of sequencing, alternation and iteration.
In the same way that the lexer worked in terms of a list of lexemes and the input yet to process, we define our parse state as the tuple (remaining input to be parsed, list of results so far); a parsing operation will either transform one parse state into another (moving data from input to output) or explicitly fail.
Consider the case `(mult 1 2)` and assume `BRA` has been read already. The input parse state looks like `([SYMATOM mult; NUMATOM 1; NUMATOM 2; KET], _)` where we use the `_` as a conventional wildcard, and we want our symbol parser to return `([NUMATOM 1; NUMATOM 2; KET], [SYMATOM mult])` on success.
Rendering this as code, we first need a type that captures the idea of a success with some meaningful value, or a failure with none; a minimal sketch (making it generic is an assumption):
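```fsharp
// A parse either succeeds, yielding a meaningful value,
// or explicitly fails, yielding nothing.
type Return<'a> =
    | Success of 'a
    | Fail
```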
Then we define the parser to handle the state where we expect the next lexeme to be a symbol -- sketched here under the convention (an assumption) that a parser returns only the output it itself produces:
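```fsharp
// On a leading symbol lexeme, move it from input to output; otherwise fail.
let parseSymbol (input, _) =
    match input with
    | SYMATOM s :: rest -> Success (rest, [SYMATOM s])
    | _ -> Fail
```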
which we see will do the job (modulo the wrapping in a `Return`) that we specified above.
The parser for a number will be much the same:
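```fsharp
let parseNumber (input, _) =
    match input with
    | NUMATOM n :: rest -> Success (rest, [NUMATOM n])
    | _ -> Fail
```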
and for all the simple tokens we can write a general parser. One plausible shape -- here the structural tokens are checked off but contribute nothing to the output, a design choice that simplifies the transform stage below:
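```fsharp
// Recognise one specific expected token; structural tokens such as
// BRA add nothing to the output list.
let parseToken expected (input, _) =
    match input with
    | t :: rest when t = expected -> Success (rest, [])
    | _ -> Fail
```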
from which we produce the specific parsers:
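```fsharp
// Eta-expanded so that each parser stays generic in the output type
// it threads through (sidestepping the F# value restriction).
let parseBra state = parseToken BRA state
let parseKet state = parseToken KET state
let parseDot state = parseToken DOT state
```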
Combining the simple parsers
To express the notion that there is more than one valid lexeme at a particular point, define something like
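```fsharp
// ALT: try parser a; if it fails, try parser b on the same state.
let ALT a b state =
    match a state with
    | Fail -> b state
    | result -> result
```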
to express the notion of "try A; if it fails, try B".
Similarly, if we expect two particular results in succession we can define something like
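```fsharp
// SEQ: run a, then run b on the resulting state, concatenating
// their outputs; fail if either fails.
let SEQ a b state =
    match a state with
    | Fail -> Fail
    | Success (rest, outA) ->
        match b (rest, outA) with
        | Fail -> Fail
        | Success (rest', outB) -> Success (rest', outA @ outB)
```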
in words: "try A; if it fails, the whole thing fails; if it succeeds, then try B on the new parse state; if B fails, the whole thing fails."
Then, when we expect a number of the same thing in sequence, something like
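```fsharp
// STAR: match a zero or more times, concatenating the outputs
// (the name STAR is an assumption).
let rec STAR a ((input, _) as state) =
    match a state with
    | Fail -> Success (input, [])          // zero matches is still a success
    | Success (rest, out) ->
        match STAR a (rest, out) with
        | Success (rest', out') -> Success (rest', out @ out')
        | Fail -> Success (rest, out)      // unreachable: STAR never fails
```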
"Try A to match the longest possible sequence of A’s, including a zero length sequence."
Exercise: Write the operator OPT such that OPT A matches zero or one occurrences of A (note A is a parser!); i.e. A is OPTIONAL.
Exercise: Write the operator PLUS such that PLUS A matches one or more occurrences of A.
Transforms
So far our parsers only check the structure of an expression -- we also want them to build an abstract structure corresponding to the string, with the general form
SEXPR = ATOM ⇒ ATOM-ACTION | BRA SEXPR DOT SEXPR KET ⇒ PAIR-ACTION | BRA SEXPR* KET ⇒ LIST-ACTION
ATOM = SYMBOL ⇒ SYMBOL-ACTION | NUMBER ⇒ NUMBER-ACTION
Transforms take a parse state and build a parse state of a different type. Where our old parse state was (remaining input to be parsed, list of lexical items so far), we want to construct a new parse state of the form (remaining input to be parsed, list of Lisp trees, i.e. items built from ATOM, NIL and CONS).
In code, we define a representation of an S-expression as something like
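```fsharp
// Classic LISP: an S-expression is nil, an atom, or a pair.
// Wrapping the lexical Token as the atom payload is an assumption.
type SEXPR =
    | NIL
    | ATOM of Token
    | CONS of SEXPR * SEXPR
```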
and the sort of syntax tree structure we want to build in terms of this type will look like this for the first example expression above:
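```fsharp
// (mult (add x 5) factor) rendered as explicit CONS cells
let example =
    CONS (ATOM (SYMATOM "mult"),
          CONS (CONS (ATOM (SYMATOM "add"),
                      CONS (ATOM (SYMATOM "x"),
                            CONS (ATOM (NUMATOM 5), NIL))),
                CONS (ATOM (SYMATOM "factor"), NIL)))
```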
The general shape of a transform is something like
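```fsharp
// Parse input according to f, then rewrite a successful output with h;
// restarting f with an empty output list lets the output type change.
let TRANSFORM f h (input, _) =
    match f (input, []) with
    | Fail -> Fail
    | Success (rest, out) -> Success (rest, h out)
```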
which parses input according to `f`, then transforms the resulting parse state according to `h`.
The grammar we need to make the shape of the figure above is
SEXPR = ATOM ⇒ ATOM-ACTION | BRA SEXPR DOT SEXPR KET ⇒ CONS-PAIR | BRA SEXPR* KET ⇒ CONS-LIST
ATOM = SYMBOL ⇒ ATOM-ACTION | NUMBER ⇒ ATOM-ACTION
where the operations are
- `CONSPAIR` – produce a singleton list containing a CONS of the items
- `CONSLIST` – produce an explicit CONS-cell list containing the items
- `ATOMACTION` – translate base cases between data types.
or, sketched in code:
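```fsharp
// CONSPAIR: [a; b] -> [CONS (a, b)]
let CONSPAIR = function
    | [a; b] -> [CONS (a, b)]
    | _ -> failwith "CONSPAIR expects exactly two items"

// CONSLIST: [a; b; c] -> [CONS (a, CONS (b, CONS (c, NIL)))]
let CONSLIST items =
    [ List.foldBack (fun x rest -> CONS (x, rest)) items NIL ]

// ATOMACTION: wrap each lexeme as an atom of the new type
let ATOMACTION = List.map ATOM
```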
The Final Parser
After all the hard work, putting it together is rather anticlimactic. We define a helper to run a parser over the whole token stream and unwrap the result (one plausible formulation):
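```fsharp
// Run a parser over a full token list, demanding that all input is
// consumed and exactly one tree results (name and shape are assumptions).
let runParser parser tokens =
    match parser (tokens, []) with
    | Success ([], [sexpr]) -> sexpr
    | _ -> failwith "Syntax error"
```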
Finally, putting it all into one bundle, of type `string -> SEXPR`:
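```fsharp
// ATOM = SYMBOL | NUMBER
let parseAtom state = ALT parseSymbol parseNumber state

// SEXPR = ATOM => ATOM-ACTION
//       | BRA SEXPR DOT SEXPR KET => CONS-PAIR
//       | BRA SEXPR* KET => CONS-LIST
// Written eta-expanded over the parse state because it is recursive.
let rec parseSexpr state =
    ALT (TRANSFORM parseAtom ATOMACTION)
        (ALT (TRANSFORM (SEQ parseBra
                             (SEQ parseSexpr
                                  (SEQ parseDot
                                       (SEQ parseSexpr parseKet))))
                        CONSPAIR)
             (TRANSFORM (SEQ parseBra
                             (SEQ (STAR parseSexpr) parseKet))
                        CONSLIST))
        state

// The bundle: characters -> lexemes -> syntax tree
let parse text = text |> toList |> lex |> runParser parseSexpr
```

so that, for example, `parse "(mult (add x 5) factor)"` produces the CONS-cell tree sketched earlier.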
Exercise: Write a parser for an .ini file.
Here's the sort of grammar to use:
- File = Line*
- Line = WHITESPACE CMD 'NL'
- WHITESPACE = ('NL' | 'TAB' | SPACE)*
- CMD = SectionHeader | Binding | Comment
- SectionHeader = '[' HeaderString ']'
- Binding = BinderString '=' BinderString
- HeaderString = anything but [ or ] or newline
- BinderString = anything but = or newline | Quote
- Quote = '"' anything but quotes '"'
- Comment = ';' ANY 'NL'
Exercise: What real-world performance implications are there in the use of this sort of parser? Re-implement one of the parsers to avoid the limitations of the strictly evaluated list datatype.