Tuesday, January 18, 2011

Monadic Parsers in C#

Updated 17-Jan -- because I realised I'd gotten Input and Output the wrong way around; too easy to do when you're using the same type in each case.

Updated 18-Jan -- In which we bug-fix the heck out of the previous two iterations, and pretty much completely rewrite this post. Rule 1 of monadic parsers is -- if you return Monad.Zero, you're probably doing it wrong... It's likely that you mean monadic return of an empty list.

Following on from Mike Hadlow's recent series of posts on monads in C#, and the "Aha!" moment that SelectMany actually means Bind, in the same way that Select means map, I thought I'd take it for a spin on something more than just the usual Identity, Maybe and State monads.

For a long while now there have been things like illustrative F# ports of a Haskell monadic parser -- so why not, I thought, try to emulate that in C#?

So we have a type which looks like this (abusing KeyValuePair as a 2-tuple for .net 3.5 compatibility) --

public delegate List<KeyValuePair<TOutput, IEnumerable<TInput>>> Operation<TInput, TOutput>(IEnumerable<TInput> seq);
public struct Parser<TInput, TOutput>
{
public Operation<TInput, TOutput> Op;
}
view raw gistfile1.cs hosted with ❤ by GitHub

which is a monad upon the return type; the input stream parametrization being orthogonal to this.

The monad itself (and here I've reverted to using Bind for the simple case analogous to the F# case) looks like

public static class ParserMonad
{
public static Parser<TInput, TOutput> ToParser<TInput, TOutput>(this Operation<TInput, TOutput> value)
{
return new Parser<TInput, TOutput> { Op = value };
}
public static Parser<TInput, UOutput> Bind<TInput, TOutput, UOutput>(
this Parser<TInput, TOutput> value,
Func<TOutput, Parser<TInput, UOutput>> operation)
{
Operation<TInput, UOutput> bound = cs =>
{
var r = value.Op(cs);
var rr = r.Select(kv => operation(kv.Key).Op(kv.Value));
return rr.SelectMany(x => x).ToList();
};
return bound.ToParser();
}
public static Parser<TInput, VOutput> SelectMany<TInput, TOutput, UOutput, VOutput>(this Parser<TInput, TOutput> value,
Func<TOutput, Parser<TInput, UOutput>> operation,
Func<TOutput, UOutput, VOutput> select)
{
return value.Bind(t =>
operation(t).Bind(u =>
ParserMonad.Return<TInput, VOutput>(select(t, u))
));
}
public static Parser<TInput, TOutput> Return<TInput, TOutput>(TOutput value)
{
Operation<TInput, TOutput> bound = cs =>
{
var tmp = new KeyValuePair<TOutput, IEnumerable<TInput>>(value, cs);
return new List<KeyValuePair<TOutput, IEnumerable<TInput>>> { tmp };
};
return bound.ToParser();
}
public static Parser<TInput, TOutput> Zero<TInput, TOutput>()
{
Operation<TInput, TOutput> bound = cs =>
{
return new List<KeyValuePair<TOutput, IEnumerable<TInput>>>();
};
return bound.ToParser();
}
}
view raw gistfile1.cs hosted with ❤ by GitHub

Here ToParser is the convenience function-to-monad constructor; and Return is what lifts an output value into the monad -- creating a parser that places that output value into the output stream.

The combinators assemble all the parsers from the atomic units -- some here are more specialized than others.

public static class Combinators
{
/// Non-deterministic choice -- The (++) operator
/// is a (non-deterministic) choice operator for parsers. The parser p ++ q applies
/// both parsers p and q to the argument string, and appends their list of results.
public static Parser<TInput, TOutput> Opp<TInput, TOutput>(Parser<TInput, TOutput> p, Parser<TInput, TOutput> q)
{
Operation<TInput, TOutput> bound = cs =>
{
return p.Op(cs).Concat(q.Op(cs)).ToList();
};
return bound.ToParser();
}
/// a deterministic)choice operator (+++) that has
/// the same behaviour as (++), except that at most one result is returned:
public static Parser<TInput, TOutput> Oppp<TInput, TOutput>(Parser<TInput, TOutput> p, Parser<TInput, TOutput> q)
{
Operation<TInput, TOutput> bound = cs =>
{
var sum = Opp(p, q).Op(cs);
if (sum.Any())
{
return new List<KeyValuePair<TOutput, IEnumerable<TInput>>> { sum.First() };
}
return new List<KeyValuePair<TOutput, IEnumerable<TInput>>>();
};
return bound.ToParser();
}
/// Matches 1 or more applications of parser p
/// f# prototype is parser['a, 'b] -> parser['a list, 'b]
public static Parser<TInput, List<TOutput>> Many1<TInput, TOutput>(Parser<TInput, TOutput> p)
{
var parser = p.Bind(x =>
{
var manyp = Many(p);
var outer = manyp.Bind(xs =>
{
var tmp = new List<TOutput> { x };
var result = tmp.Concat(xs).ToList();
return ParserMonad.Return<TInput, List<TOutput>>(result);
});
return outer;
});
return parser;
}
/// Matches 0 or more applications of parser p
/// f# prototype is parser['a, 'b] -> parser['a list, 'b]
public static Parser<TInput, List<TOutput>> Many<TInput, TOutput>(Parser<TInput, TOutput> p)
{
var arg1 = Many1(p);
var arg2 = ParserMonad.Return<TInput, List<TOutput>>(new List<TOutput>());
return Combinators.Oppp(arg1, arg2);
}
/// Apply a parser
public static Parser<TInput, TOutput> Apply<TInput, TOutput>(Parser<TInput, TOutput> p)
{
var parser = p.Bind(x => ParserMonad.Return<TInput, TOutput>(x));
return parser;
}
/// A parser which successfully consumes the first element
/// of the input stream if it is is non-empty, and fails otherwise.
/// f# prototype is (unit ->) parser[char, string]
public static Parser<TInput, TOutput> Item<TInput, TOutput>(Func<TInput, TOutput> select)
{
Operation<TInput, TOutput> bound =
cs =>
{
if (cs.Any())
{
return new List<KeyValuePair<TOutput, IEnumerable<TInput>>>
{
new KeyValuePair<TOutput, IEnumerable<TInput>>(select(cs.First()), cs.Skip(1).ToList())
};
}
return new List<KeyValuePair<TOutput, IEnumerable<TInput>>>();
};
return bound.ToParser();
}
/// A parser that consumes a single token if it matches the predicate
/// or fails otherwise
/// f# prototype is (char->bool) -> parser[char, string]
public static Parser<TInput, TOutput> Sat<TInput, TOutput>(Predicate<TInput> p, Func<TInput, TInput, TOutput> select)
{
var parser = Item<TInput, TInput>(x => x).SelectMany(c =>
{
if (p(c))
{
return ParserMonad.Return<TInput, TInput>(c);
}
else
{
return ParserMonad.Zero<TInput, TInput>();
}
}, select);
return parser;
}
/// A parser for specific characters
/// f# prototype is char -> parser[char, string]
public static Parser<char, TOutput> Char<TOutput>(char c, Func<char, char, TOutput> select)
{
return Sat(s => s == c, select);
}
/// Parse a specific string
/// f# prototype is string -> parser[char list, string]
public static Parser<char, List<TOutput>> StringP<TOutput>(string input, Func<char, char, TOutput> select)
{
if (input.Any())
{
var parser = Char<TOutput>(input.ToCharArray()[0], select).Bind(c =>
{
var p2 = StringP(input.Substring(1), select);
var inner = p2.Bind(cs =>
{
var tmp = new List<TOutput> { c };
var result = tmp.Concat(cs).ToList();
return ParserMonad.Return<char, List<TOutput>>(result);
});
return inner;
});
return parser;
}
else
{
return ParserMonad.Return<char, List<TOutput>>(new List<TOutput>());
}
}
/// Parse a string of spaces.
/// f# prototype is (unit ->) parser[char list, string]
public static Parser<char, List<TOutput>> Space<TOutput>(Func<char, char, TOutput> select)
{
return Many(Sat(System.Char.IsWhiteSpace, select));
}
/// Parse a token using a parser p, throwing away any trailing space.
/// f# prototype is parser['a, string] -> parser['a, string]
public static Parser<char, TOutput> Token<TOutput>(Parser<char, TOutput> p, Func<char, char, TOutput> select)
{
var parser = p.Bind(x =>
Space(select).Bind(xs =>
{
return ParserMonad.Return<char, TOutput>(x);
}
));
return parser;
}
/// Parse a symbolic token:
/// f# prototype is string -> parser[char list, string]
public static Parser<char, List<TOutput>> Symbol<TOutput>(string symbol, Func<char, char, TOutput> select)
{
Parser<char, List<TOutput>> inner = StringP<TOutput>(symbol, select);
return Token<List<TOutput>>(inner, (x, y) => new List<TOutput> { select(x,y) } );
}
/// Parse repeated applications of a parser p, separated by applications of a parser
/// op whose result value is an operator that is assumed to associate to the left,
/// and which is used to combine the results from the p parsers.
public static Parser<TInput, TOutput> Chain1<TInput, TOutput> (
Parser<TInput, TOutput> p,
Parser<TInput, Func<TOutput, TOutput, TOutput>> op)
{
return p.Bind(x => Chain1Rest(x, p, op));
}
/// Helper method for Chain1
public static Parser<TInput, TOutput> Chain1Rest<TInput, TOutput> (
TOutput a,
Parser<TInput, TOutput> p,
Parser<TInput, Func<TOutput, TOutput, TOutput>> op)
{
var parser = op.Bind(f =>
p.Bind(b =>
Chain1Rest(f(a, b), p, op)
));
var stop = ParserMonad.Return<TInput, TOutput>(a);
return Oppp(parser, stop);
}
//F# Monadic Parser - Calculator Example :)))
// Grammar
//expr ::= expr addop term | term
//term ::= term mulop factor | factor
//factor ::= digit | ( expr )
//digit ::= 0 | 1 | : : : | 9
//addop ::= + | -
//mulop ::= * | /
public static Parser<char, Func<double, double, double>> AddOp()
{
var plus = Symbol<char>("+", (x, y) => x).Bind(x =>
ParserMonad.Return<char, Func<double, double, double>>(
(a, b) => a + b));
var minus = Symbol<char>("-", (x, y) => x).Bind(x =>
ParserMonad.Return<char, Func<double, double, double>>(
(a, b) => a - b));
return Oppp(plus, minus);
}
public static Parser<char, Func<double, double, double>> MulOp()
{
var plus = Symbol<char>("*", (x, y) => x).Bind(x =>
ParserMonad.Return<char, Func<double, double, double>>(
(a, b) => a * b));
var minus = Symbol<char>("/", (x, y) => x).Bind(x =>
ParserMonad.Return<char, Func<double, double, double>>(
(a, b) => a / b));
return Oppp(plus, minus);
}
public static Parser<char, double> Digit()
{
var sat = Sat<char, char>(System.Char.IsDigit, (x, y) => x);
return Token<char> (sat, (x,y) => x).Bind<char, char, double>
(x => ParserMonad.Return<char, double>((double)(x - '0')));
}
public static Parser<char, double> Expr()
{
return Chain1<char, double>(Term(), AddOp());
}
public static Parser<char, double> Term()
{
return Chain1<char, double>(Factor(), MulOp());
}
public static Parser<char, double> Factor()
{
var paren = Symbol("(", (x, y) => x).
Bind(x => Expr().
Bind(n => Symbol(")", (x2, y2) => x2).
Bind(z => ParserMonad.Return<char, double>(n))));
return Oppp(Digit(), paren);
}
}
view raw gistfile1.cs hosted with ❤ by GitHub

And we can top this off with an example

public static void Main()
{
var restToken = Combinators.Sat<string, int>(x =>
{
return x.Length > 2;
},
(x,y) => {
return Math.Max(x.Length, y.Length);
});
var tail = Combinators.Many(restToken);
var data = new List<string> { "the", "quick", "brown", "fox" };
var results = Combinators.Apply(tail).Op(data);
if (results.Any())
{
results.ForEach(x => Console.WriteLine(
"key={0}; value={1}",
"[" + string.Join("; ", x.Key.Select(n=>n.ToString()).ToArray()) + "]",
"[" + string.Join("; ", x.Value.ToArray()) + "]"));
}
else
{
Console.WriteLine("empty");
}
var calc = "5 * (3 + 4)";
var summed = Combinators.Apply(Combinators.Expr()).Op(calc.ToCharArray());
if (summed.Any())
{
summed.ForEach(x => Console.WriteLine(
"key={0}; value={1}",
"[" + x.Key + "]",
"[" + string.Join("; ", x.Value.Select(z => z.ToString()).ToArray()) + "]"));
}
else
{
Console.WriteLine("empty");
}
}
view raw gistfile1.cs hosted with ❤ by GitHub

which produces a successfully parsed output

key=[3; 5; 5; 3]; value=[]
key=[35]; value=[]

So what lessons do I draw from this exercise?

The big one is that old saying about needing to be twice as clever to debug code as to write it; and code that works with continuation passing can get pretty darn clever. Or, as Eric Lippert put it (in the context of writing asynchronous code in this style in C#):

Remember what I was saying about the pros and cons of CPS?

  • PRO: Arbitrarily complex and interesting control flows can be built out of simple parts – check.
  • CON: The reification of control flow via continuations is hard to read and hard to reason about – check.
  • CON: The code that represents the mechanisms of control flow completely overwhelms the meaning of the code – check.
  • CON: The transformation of ordinary code control flow into CPS is the kind of thing that compilers are good at, and almost no one else is – check.

The parts are indeed individually simple enough; but all the rest is sadly true. We can see how clever, what the actual power-to-weight ratio is, or how much what actually gets executed differs from what it superficially looks like we've written, by replacing the initialization in ToParser with

{ Op = x => {
max = Math.Max(max, new System.Diagnostics.StackTrace().GetFrames().Length);
return value(x);
} }
view raw gistfile1.cs hosted with ❤ by GitHub

for a static integer value max in the class. Running the test example, we see that the stack depths achieved by these simple parsers are 64 and 95 respectively -- causes and effects are remote from one another (with the side effect that the only code not covered in this belongs to the subtract and divide operator parsers, and the failure handlers in the main program.

Coupled with this is the difficulty of doing anything meaningful in the way of fine-grained TDD in the bootstrap phase -- the individual pieces are not parsers, are comparatively trivial, and even those return functions operating on functions; it's in combination that they do their magic, and there's a lot of code needed just to read one token out of a stream.

Having the F# equivalent to work with -- to provide guidance of what types all the individual pieces were, and then in essence manually de-sugaring the computation expression notation -- was invaluable. But if you have the F# version, there is little need for the transposition except as a five-finger exercise like this.

On the positive side, though, once you have the toolkit like this and it handles even the simple cases, that probably means that it is robust; and then it can pretty much be taken as a black box, for those times when you absolutely have to do it in C# rather than F#.

1 comment :

Mike Hadlow said...

Thanks for the namecheck, I'm about to tackle a Monadic parser myself, so this was very cool.