Haskell: Explaining Arrows through XML Transformations

There’s been some discussion on the programming subreddit today about arrows in Haskell, which was kicked off by dons and continued by DRMacIver, who showed how arrow operators can be used with pure functions. But this left some people wondering about the practical applications of arrows, beyond merely combining functions.

I’ve been doing a bit of work recently with HXT, the Haskell XML Toolkit, which uses an arrows-based API that I think provides a good illustration. So I’ll try to explain arrows through their use in HXT. I hasten to add that none of the following is my original work: I am plagiarizing shamelessly from the HXT website and Wiki.

Lets think for a moment about XML transformations. Suppose we want to transform an XML fragment that looks like this:

<cd artist="Luciano Pavarotti" title="Pavarotti Gala Concert">
    <company>Decca</company>
    <price>9.90</price>
</cd>

into a fragment of XHTML that looks like this:

<tr><td>Artist:</td><td>Luciano Pavarotti</td></tr>
<tr><td>Title:</td><td>Pavarotti Gala Concert</td></tr>

In general, we want to process each node of an input XML tree into zero or more transformed nodes in the output XML tree. To achieve this we can define an XmlFilter type: a function that takes an XmlTree and returns zero or more output XmlTrees. We define XmlFilter as follows:

type XmlFilter = XmlTree -> [XmlTree]

which we can generalise as:

type Filter a b = a -> [b]
type XmlFilter = Filter XmlTree XmlTree

where a and b are type parameters, or stand-ins for any type.

First lets look at how to define some basic building blocks. A frequent requirement in XML transformation is to include or exclude fragments based on a boolean predicate, such as “published by Decca” or “price less than ten pounds”. A predicate is just a function a -> Bool so we can define the function isA which turns one of these functions into a Filter:

isA :: (a -> Bool) -> Filter a a
isA p x
  | p x       = [x]
  | otherwise = []

Also we may want to combine two Filters to create a new composite Filter, for example the <+> operator could be defined to concatenate the results of applying two Filters to the same item:

(<+>) :: Filter a b -> Filter a b -> Filter a b
(f <+> g) t = f t ++ g t

Now suppose we have a filter getChildren that returns all the child trees of an XmlTree… I won’t bother to define it in full here because it’s incidental to the main point. We can combine getChildren with <+> to give us a new filter called multi, which applies a filter to an XML node and all descendents of that node:

multi :: XmlFilter -> XmlFilter
multi f = f <+> (getChildren >>> multi f)

Sorry for not defining the >>> operator either: it simply chains two filters together by taking all the outputs from the first and passing them through the second.

Okay that’s enough building blocks. Our XML transformation API is coming along nicely… but, it has a big limitation: so far, all our filters are pure functions, which means no side effects. But what if we want to write some tracing information to a log? That requires I/O, which is a side effect. To solve this problem we define the type XmlIOFilter as follows:

type XmlIOFilter = XmlTree -> IO [XmlTree]

But then what if we want to propagate some state through the filter? There’s a solution to this as well, we simply define the type XmlStateFilter:

type XmlStateFilter s = s -> XmlTree -> (s, [XmlTree])

Hmm but what if we want both I/O and state propagation? Well, we have to define XmlIOStateFilter:

type XmlIOStateFilter s = s -> XmlTree -> IO (s, [XmlTree])

Here’s the problem: looking back at our multi filter, it seems we now have to redefine multi for each of these different kinds of filter. In other words we have to write all of the following functions:

multi     :: XmlFilter -> XmlFilter
multiIO   :: XmlIOFilter -> XmlIOFilter
multiST   :: XmlStateFilter s -> XmlStateFilter s
multiIOST :: XmlIOStateFilter s -> XmlIOStateFilter s

And we have to do this for all our other building blocks as well, such as <+> and isA and so on! What a nightmare.

Fortunately (come on, you knew there was going to be a “fortunately”) Haskell’s type classes come to the rescue. All of the filter types above are “function-like”: they all take inputs and produce outputs. However they have extra “stuff” as well: some have state, some have I/O, some have both. So we need a type class that encapsulates the function-like behaviour, while allowing for the extra stuff to happen.

We call that type class “Arrow”. Anything that belongs to the Arrow type class takes input and produces and output, and optionally does other stuff. We can see from the above that pure functions belong to the Arrow type class, therefore our original XmlFilter type is an arrow. But the I/O and state infected versions of XmlFilter are also arrows, so we can write all of our building blocks (multi, <+>, isA, etc.) just once. And this is exactly how HXT works.

What does HXT code look like in practice? Well, we simply build up processing pipelines, or “filter chains”. Just like the function composition operator “.” (dot) can be used to chain together pure functions in a point-free style, we can chain together filters using the standard arrow operators such as:

  • >>>, which creates a composite filter by piping the output of one filter into the input of another
  • &&&, which creates a filter that forks the input pipeline into two, applies different filters to each fork and then outputs a tuple containing the two outputs
  • ***, which creates a filter that works on a pipeline that has been forked into a tuple (e.g. using &&&), applies a filter to each side of the tuple, and outputs a new tuple with the two results.
  • Many more based on a series of more specific type classes that extend Arrow.

As DRMacIver showed, these operations can be useful for composing together pure functions, but their usefulness goes far beyond that. When building these pipelines I tend to imagine them in the form of a circuit diagram, with XML nodes being the electrons flowing through it. So >>> is a simple wire from one processing unit to another; &&& is a splitter, and so on. Helpfully, I can use whatever mix I like of pure functions, stateful filters and I/O-producing filters, although of course the composite filter will always be infected with I/O if any one of its components uses I/O, and similarly for state.

For example, suppose we want to get a list of artists and titles in the format “Pavarotti Gala Concert performed by Luciano Pavarotti”. First we define the pure function itemSummary as follows:

itemSummary :: String -> String -> String
itemSummary a t = t ++ " performed by " ++ a

Now we can build it into a pipeline:

summaries :: (ArrowXml a) => a XmlTree String
summaries = deep (hasName "cd")
            >>>
            (getAttrValue "artist") &&& (getAttrValue "title")
            >>>
            (arr . uncurry) itemSummary

This works by first finding all the elements called “cd” in the XML document, using the deep filter, which is similar to multi. The next stage in the pipeline splits the data into a tuple, and applies two filters to the left and right hand side. The tuple we get out of our example fragment above will be ("Luciano Pavarotti", "Pavarotti Gala Concert"). The final stage takes the function itemSummary and uncurries it (i.e. turns it into a function that takes a single tuple parameter instead of two parameters) and then turns it into an arrow using the arr function.

I think this is a beautiful way to process XML — far nicer than XSLT, certainly. In fact, short of using a pictorial language with boxes and lines, I don’t think there’s any better way to turn the mental model of a transformation into executable code.

In fact this notation can be used for many kinds of data processing pipeline, not just XML transformations. For example, you could use it as a DSL for business process descriptions. A replacement for BPEL, anybody?