Recently, I learned about improved support for *semi-explicit parallelism* in the Glasgow Haskell Compiler (GHC):

> [Programmers] merely annotate subcomputations that might be evaluated in parallel, leaving the choice of whether to actually do so to the run-time system. [...] programmers should be able to take existing Haskell programs, and with a little high-level knowledge of how the program should parallelize, make some small modifications to the program using existing well-known techniques, and thereby achieve decent speedup on today's parallel hardware.

One of my favourite research topics is non-deterministic programming. Let's see whether semi-explicit parallelism helps to enumerate the search space of such programs faster. This post is generated from a literate Haskell file in case you want to reproduce my results.

```haskell
import Control.Monad
import Control.Parallel
import System ( getArgs ) -- System.Environment on current GHC versions
```

As a running example I will use permutation sort, one of the least efficient sorting algorithms ever invented:

```haskell
sort l = permute l `suchthat` isSorted
```

Admittedly, this is a toy example, and sorting is usually not expressed as a search problem. But it is a representative program in the so-called *generate-and-test* style, and it generates a huge search space quickly. It should be possible to explore that space in parallel, which leaves us with a promising case study.
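To get a feeling for how quickly: a list of n distinct elements has n! permutations, so the input of length 11 used in the benchmark at the end of this post already yields nearly 40 million candidates. A quick sanity check:

```haskell
-- A list of n distinct elements has n! permutations; for distinct
-- elements, exactly one of them is sorted.
searchSpace :: Integer -> Integer
searchSpace n = product [1 .. n]

main :: IO ()
main = print (searchSpace 11)  -- 39916800
```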

We can express non-determinism in Haskell using *lists of successes*. The functions used to define `sort` could be defined using *list comprehensions*:

```haskell
suchthat :: [a] -> (a -> Bool) -> [a]
suchthat xs p = [ x | x <- xs, p x ]

permute :: [a] -> [[a]]
permute [] = [[]]
permute (x:xs) = [ zs | ys <- permute xs, zs <- insert x ys ]

insert :: a -> [a] -> [[a]]
insert x [] = [[x]]
insert x (y:ys) = [x:y:ys] ++ [ y:zs | zs <- insert x ys ]
```

`suchthat` is a convenient renaming for `flip filter`, defined using a list comprehension, and the functions `permute` and `insert` return a list of possible results instead of only one. For example, the call `permute [1,2]` yields `[[1,2],[2,1]]`. The predicate `isSorted` checks whether a list of numbers is sorted:

```haskell
isSorted :: [Int] -> Bool
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
isSorted _ = True
```

With these definitions, the call `sort [3,2,1]` is evaluated to `[[1,2,3]]` because `[1,2,3]` is the only sorted permutation of `[3,2,1]`.
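For easy experimentation, here are the list-based definitions collected into one self-contained program (the `main` wrapper is mine, not part of the original post):

```haskell
suchthat :: [a] -> (a -> Bool) -> [a]
suchthat xs p = [ x | x <- xs, p x ]

permute :: [a] -> [[a]]
permute [] = [[]]
permute (x:xs) = [ zs | ys <- permute xs, zs <- insert x ys ]

insert :: a -> [a] -> [[a]]
insert x [] = [[x]]
insert x (y:ys) = [x:y:ys] ++ [ y:zs | zs <- insert x ys ]

isSorted :: [Int] -> Bool
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
isSorted _ = True

sort :: [Int] -> [[Int]]
sort l = permute l `suchthat` isSorted

main :: IO ()
main = do
  print (permute [1,2])  -- [[1,2],[2,1]]
  print (sort [3,2,1])   -- [[1,2,3]]
```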

Lazy lists are great for representing the results of enumerating the search space of a non-deterministic computation. But they fix the search strategy to sequential depth-first search. I want to define a different search strategy (viz., parallel depth-first search), so this is too restrictive. We can generalize the above definitions using the `MonadPlus` type class, of which the list type constructor `[]` is an instance. So let's step back for an interlude that explains monads for non-determinism.

A monad for non-determinism is a type constructor `m` that supports the following four functions:

```haskell
return :: a -> m a
(>>=)  :: m a -> (a -> m b) -> m b
mzero  :: m a
mplus  :: m a -> m a -> m a
```

The first two functions belong to the type class `Monad`, the last two to the type class `MonadPlus`, which is a subclass of `Monad`. A value of type `m a` can be interpreted as a non-deterministic computation that yields results of type `a`. The intuition behind these four operations is as follows:

- `return` constructs a non-deterministic computation with a single result, viz., the one given as argument,
- `>>=` applies a non-deterministic function to every result of a non-deterministic computation and merges the results,
- `mzero` constructs a non-deterministic computation without results, i.e., a failing computation, and
- `mplus` merges the results of two non-deterministic computations.
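For the list instance, these operations have the standard realizations: `return x` is `[x]`, `>>=` is a flipped `concatMap`, `mzero` is the empty list `[]`, and `mplus` is list concatenation `(++)`. A quick check:

```haskell
import Control.Monad (mzero, mplus)

main :: IO ()
main = do
  print (return 1 :: [Int])          -- [1]
  print ([1,2] >>= \x -> [x, x*10])  -- [1,10,2,20]
  print (mzero :: [Int])             -- []
  print ([1,2] `mplus` [3])          -- [1,2,3]
```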

We can rewrite the non-deterministic functions for sorting using monadic combinators:

```haskell
suchthat :: MonadPlus m => m a -> (a -> Bool) -> m a
suchthat xs p = do x <- xs
                   guard (p x) -- predefined in Control.Monad
                   return x

permute :: MonadPlus m => [a] -> m [a]
permute [] = return []
permute (x:xs) = do ys <- permute xs
                    zs <- insert x ys
                    return zs

insert :: MonadPlus m => a -> [a] -> m [a]
insert x [] = return [x]
insert x (y:ys) = return (x:y:ys) `mplus` do zs <- insert x ys
                                             return (y:zs)
```

Now, our `sort` function can return results in any monad for non-determinism. In particular, it can still yield a list of results.
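Because `sort` is now overloaded, a call site must fix the monad, e.g. with a type annotation. The following self-contained sketch (the signature for `sort` is my addition) instantiates it at the list type and reproduces the earlier result:

```haskell
import Control.Monad

suchthat :: MonadPlus m => m a -> (a -> Bool) -> m a
suchthat xs p = do
  x <- xs
  guard (p x)
  return x

permute :: MonadPlus m => [a] -> m [a]
permute [] = return []
permute (x:xs) = do
  ys <- permute xs
  zs <- insert x ys
  return zs

insert :: MonadPlus m => a -> [a] -> m [a]
insert x [] = return [x]
insert x (y:ys) = return (x:y:ys) `mplus` do
  zs <- insert x ys
  return (y:zs)

isSorted :: [Int] -> Bool
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
isSorted _ = True

sort :: MonadPlus m => [Int] -> m [Int]
sort l = permute l `suchthat` isSorted

main :: IO ()
main = print (sort [3,2,1] :: [[Int]])  -- the annotation picks the list instance
```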

In order to define parallel search algorithms, we define a different instance of the `MonadPlus` type class that represents the search space as a tree structure¹. The idea is to simply replace calls to monadic combinators with constructors. Only `>>=` is implemented as a function:

```haskell
data MPlus a = MZero | Return a | MPlus (MPlus a) (MPlus a)

instance Monad MPlus
 where
  return = Return

  MZero     >>= _ = MZero
  Return x  >>= f = f x
  MPlus l r >>= f = MPlus (l >>= f) (r >>= f)

instance MonadPlus MPlus
 where
  mzero = MZero
  mplus = MPlus
```

The implementation of `>>=` is derived from the laws for `Monad` and `MonadPlus`.
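Concretely, the three defining equations mirror the left-identity law of `Monad` and the left-zero and left-distribution laws of `MonadPlus`, which the list instance satisfies; a quick check (a small sketch of mine, not from the original post):

```haskell
import Control.Monad (mzero, mplus)

main :: IO ()
main = do
  let f x = [x, x + 1]
  print ((return 1 >>= f) == f 1)             -- left identity: True
  print (((mzero :: [Int]) >>= f) == mzero)   -- left zero: True
  print ((([1] `mplus` [2]) >>= f)
          == (([1] >>= f) `mplus` ([2] >>= f)))  -- left distribution: True
```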

We can now define different search strategies as functions of type `MPlus a -> [a]`. For example, sequential depth-first search looks as follows:

```haskell
search :: MPlus a -> [a]
search MZero = []
search (Return x) = [x]
search (MPlus l r) = search l ++ search r
```

This function is inspired by the `MonadPlus` instance for `[]`.
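To see the tree instance in action end to end, here is a self-contained sketch. Note that on current GHC versions (7.10 and later), `Monad` requires `Functor` and `Applicative` superclass instances and `MonadPlus` requires `Alternative`, so the sketch adds those; the original post targeted GHC 6.10, which did not need them:

```haskell
import Control.Applicative (Alternative (..))
import Control.Monad (MonadPlus (..))

data MPlus a = MZero | Return a | MPlus (MPlus a) (MPlus a)

instance Functor MPlus where
  fmap _ MZero       = MZero
  fmap f (Return x)  = Return (f x)
  fmap f (MPlus l r) = MPlus (fmap f l) (fmap f r)

instance Applicative MPlus where
  pure = Return
  mf <*> mx = mf >>= \f -> fmap f mx

instance Monad MPlus where
  MZero     >>= _ = MZero
  Return x  >>= f = f x
  MPlus l r >>= f = MPlus (l >>= f) (r >>= f)

instance Alternative MPlus where
  empty = MZero
  (<|>) = MPlus

instance MonadPlus MPlus  -- mzero/mplus default to empty/(<|>)

search :: MPlus a -> [a]
search MZero       = []
search (Return x)  = [x]
search (MPlus l r) = search l ++ search r

insert :: MonadPlus m => a -> [a] -> m [a]
insert x [] = return [x]
insert x (y:ys) = return (x:y:ys) `mplus` do
  zs <- insert x ys
  return (y:zs)

permute :: MonadPlus m => [a] -> m [a]
permute [] = return []
permute (x:xs) = do
  ys <- permute xs
  zs <- insert x ys
  return zs

main :: IO ()
main = print (search (permute [1,2,3] :: MPlus [Int]))  -- six permutations
```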

Let's count on the promise of semi-explicit parallelism: in the last rule defining `search`, the results of the right branch `r` of the search space will probably be needed later, and it might be beneficial to compute them in parallel. We can express this using the combinator `par` from the module `Control.Parallel` and define a function for parallel depth-first search:

```haskell
parSearch :: MPlus a -> [a]
parSearch MZero = []
parSearch (Return x) = [x]
parSearch (MPlus l r) = rs `par` (parSearch l ++ rs)
 where rs = parSearch r
```

This use of `par` forks a so-called *spark*, i.e., a computation that might be chosen by the run-time system for parallel execution. We do not need to fork a spark for searching the left subtree because `parSearch l` is evaluated immediately by the call to `++`.
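As an aside for readers unfamiliar with `Control.Parallel` (from the parallel package, which the code in this post already uses): `par` merely sparks its first argument for possible parallel evaluation and returns its second, while `pseq` forces its first argument before its second. A minimal sketch, unrelated to the search example:

```haskell
import Control.Parallel (par, pseq)

main :: IO ()
main = do
  let x = sum [1 .. 1000000 :: Int]                  -- sparked for parallel evaluation
      y = length (filter even [1 .. 1000000 :: Int]) -- evaluated on the current thread
  -- Spark x, force y sequentially, then combine both results.
  print (x `par` (y `pseq` x + y))
```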

Does this tweak speed up the search? Here is a simple benchmark to try it out.

```haskell
main = do n <- liftM (read.head) getArgs
          mapM_ print (parSearch (sort [1..n]))
```

We compile this benchmark with GHC's `-threaded` option in order to get code that runs on multiple cores:

```
$ ghc --make -threaded -O2 -o permsort-par speedup-search-with-parallelism.lhs
```

With a single core, the run-time of `permsort-par 11` is about 20 seconds. Using two cores, the run-time is reduced to 11 seconds: twice the cores gives almost half the run-time. Using three and four cores is again a little bit faster.

I have measured the run-times using GHC 6.10.1 on an Intel Core2 Quad CPU at 2.83 GHz, providing 1 GB of initial heap. Using a current snapshot of GHC 6.11 did not show significant differences on a MacBook with an Intel Core2 Duo CPU. This may well be due to the specific nature of the problem being solved: there is a huge search space with a single solution, so most of the search space consists of expensive failing computations.

Feel free to try other examples; the `parSearch` function is available on Hackage.

For feedback, contact me or comment on reddit.