Monday, May 03, 2021

String processing as a fold

I recently needed to ensure that XML/HTML text containing non-ASCII (high-bit set) characters, but no control codes apart from line breaks, presented those characters as character references. The obvious algorithm in C#, using a StringBuilder, sb, was

  foreach( char ch in text )
    if ( ch < 127 )
      { sb.Append(ch); }
    else
      { sb.AppendFormat( "&#x{0:X4};", (int) ch ); }

In F#, though, the obvious direct Seq.iter translation ends up needing to |> ignore the results of the append operations. Since this is really an accumulation into the StringBuilder, the better functional representation is more like

  let sb = Seq.fold (fun (b:StringBuilder)
                         (c:char) -> let ic = int c
                                     if ic >= 127
                                     then b.AppendFormat( "&#x{0:X4};", ic )
                                     else b.Append(c))
              (StringBuilder(text.Length + extra)) // estimate the expansion up front
              text                                 // the sequence of chars being folded

which lets the StringBuilder flow naturally through the process, rather than closing over it and having to discard the value of the if expression. This could be done in C#, too, along the lines of

  var sb = text.Aggregate(new StringBuilder(), (b, c) =>
                                     {
                                       if (c >= 127)
                                         { return b.AppendFormat("&#x{0:X4};", (int) c); } // cast so X4 formats the code unit
                                       else
                                         { return b.Append(c); }
                                     });

only here the returns have to be explicit.
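
For completeness, the F# fold can be wrapped into a self-contained function, roughly as sketched below. The name escapeNonAscii and the fixed allowance of 16 extra characters are assumptions made for illustration; neither comes from the code above.

  open System.Text

  // Hypothetical helper: replaces every char at or above 127 with a
  // hexadecimal character reference, leaving the rest untouched.
  let escapeNonAscii (text: string) =
      let extra = 16 // assumed allowance for expansion, not a measured value
      let sb =
          Seq.fold (fun (b: StringBuilder) (c: char) ->
                        let ic = int c
                        if ic >= 127
                        then b.AppendFormat("&#x{0:X4};", ic)
                        else b.Append(c))
                   (StringBuilder(text.Length + extra))
                   text
      sb.ToString()

  printfn "%s" (escapeNonAscii "café")

Under those assumptions, the last line should print caf&#x00E9; since é is U+00E9.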

