Speed up search with parallelism!

Recently, I learned about improved support for semi-explicit parallelism in the Glasgow Haskell Compiler (GHC):

[Programmers] merely annotate subcomputations that might be evaluated in parallel, leaving the choice of whether to actually do so to the run-time system. [...] programmers should be able to take existing Haskell programs, and with a little high-level knowledge of how the program should parallelize, make some small modifications to the program using existing well-known techniques, and thereby achieve decent speedup on today's parallel hardware.

One of my favourite research topics is non-deterministic programming. Let's see whether semi-explicit parallelism helps to enumerate the search space of such programs faster. This post is generated from a literate Haskell file in case you want to reproduce my results.

Non-deterministic programming in Haskell

import Control.Monad
import Control.Parallel

import System ( getArgs )

As a running example I will use permutation sort, one of the least efficient sorting algorithms ever invented:

sort l = permute l `suchthat` isSorted


But it is a representative of the so-called generate-and-test style, and it quickly generates a huge search space. It should be possible to explore that search space in parallel, which makes permutation sort a promising case study.

We can express non-determinism in Haskell using lists of successes. The functions used to define sort could be defined using list comprehensions:

suchthat :: [a] -> (a -> Bool) -> [a]
suchthat xs p = [ x | x <- xs, p x ]

permute :: [a] -> [[a]]
permute [] = [[]]
permute (x:xs) = [ zs | ys <- permute xs, zs <- insert x ys ]

insert :: a -> [a] -> [[a]]
insert x [] = [[x]]
insert x (y:ys) = [x:y:ys] ++ [ y:zs | zs <- insert x ys ]

suchthat is a convenient renaming for flip filter defined using a list comprehension, and the functions permute and insert return a list of possible results instead of only one. For example, the call permute [1,2] yields [[1,2],[2,1]]. The predicate isSorted checks whether a list of numbers is sorted:

isSorted :: [Int] -> Bool
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
isSorted _ = True

With these definitions, the call sort [3,2,1] is evaluated to [[1,2,3]] because [1,2,3] is the only sorted permutation of [3,2,1].
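Putting these pieces together, the list-based definitions can be tried out directly. Here is a self-contained sketch that repeats the definitions above so it compiles on its own:

```haskell
-- Self-contained version of the list-based definitions above,
-- exercised on small inputs.
suchthat :: [a] -> (a -> Bool) -> [a]
suchthat xs p = [ x | x <- xs, p x ]

permute :: [a] -> [[a]]
permute []     = [[]]
permute (x:xs) = [ zs | ys <- permute xs, zs <- insert x ys ]

insert :: a -> [a] -> [[a]]
insert x []     = [[x]]
insert x (y:ys) = [x:y:ys] ++ [ y:zs | zs <- insert x ys ]

isSorted :: [Int] -> Bool
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
isSorted _        = True

sort :: [Int] -> [[Int]]
sort l = permute l `suchthat` isSorted

main :: IO ()
main = do
  print (permute [1,2 :: Int])  -- prints [[1,2],[2,1]]
  print (sort [3,2,1])          -- prints [[1,2,3]]
```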

Lazy lists are great for representing the results of enumerating the search space of a non-deterministic computation. But they fix the search strategy to sequential depth-first search, which is too restrictive: I want to define a different strategy, viz., parallel depth-first search. We can generalize the above definitions using the MonadPlus type class, of which the list type constructor [] is an instance. So let's step back for an interlude that explains monads for non-determinism.

Interlude that explains monads for non-determinism

A monad for non-determinism is a type constructor m that supports the following four functions:

return :: a -> m a
(>>=) :: m a -> (a -> m b) -> m b
mzero :: m a
mplus :: m a -> m a -> m a

The first two functions belong to the type class Monad, the last two to the type class MonadPlus, which is a subclass of Monad. A value of type m a can be interpreted as a non-deterministic computation that yields results of type a. The intuition behind these four operations is as follows:

- return x is a deterministic computation that yields x as its only result,
- a >>= f applies the non-deterministic function f to every result of a and yields all results of all these applications,
- mzero is a failing computation that yields no results, and
- a `mplus` b is a non-deterministic choice that yields all results of a and all results of b.
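For the list instance, return x is [x], mzero is the empty list, mplus is list concatenation, and >>= collects the results of applying its second argument to every element. A minimal sketch:

```haskell
import Control.Monad (mzero, mplus)

main :: IO ()
main = do
  print (return 1 :: [Int])                 -- one result: [1]
  print ([1,2] >>= \x -> [x, x*10 :: Int])  -- all results: [1,10,2,20]
  print (mzero :: [Int])                    -- failure: []
  print ([1] `mplus` [2,3 :: Int])          -- choice: [1,2,3]
```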

We can rewrite the non-deterministic functions for sorting using monadic combinators:

suchthat :: MonadPlus m => m a -> (a -> Bool) -> m a
suchthat xs p = do x <- xs
                   guard (p x) -- predefined in Control.Monad
                   return x

permute :: MonadPlus m => [a] -> m [a]
permute [] = return []
permute (x:xs) = do ys <- permute xs
                    zs <- insert x ys
                    return zs

insert :: MonadPlus m => a -> [a] -> m [a]
insert x [] = return [x]
insert x (y:ys) = return (x:y:ys) `mplus` do zs <- insert x ys
                                             return (y:zs)

Now, our sort function can return its results in any monad for non-determinism. In particular, it can still yield a list of results.
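Instantiating the result monad at the list type recovers the behaviour of the original definitions. The following self-contained sketch collects the monadic definitions above and runs sort at type [[Int]]:

```haskell
import Control.Monad

suchthat :: MonadPlus m => m a -> (a -> Bool) -> m a
suchthat xs p = do x <- xs
                   guard (p x)
                   return x

permute :: MonadPlus m => [a] -> m [a]
permute []     = return []
permute (x:xs) = do ys <- permute xs
                    zs <- insert x ys
                    return zs

insert :: MonadPlus m => a -> [a] -> m [a]
insert x []     = return [x]
insert x (y:ys) = return (x:y:ys) `mplus` do zs <- insert x ys
                                             return (y:zs)

isSorted :: [Int] -> Bool
isSorted (x:y:ys) = x <= y && isSorted (y:ys)
isSorted _        = True

sort :: MonadPlus m => [Int] -> m [Int]
sort l = permute l `suchthat` isSorted

main :: IO ()
main = print (sort [3,2,1] :: [[Int]])  -- prints [[1,2,3]]
```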

Parallel search with a dumb instance of MonadPlus

In order to define parallel search algorithms, we define a different instance of the MonadPlus type class that represents the search space as a tree structure1. The idea is to simply replace calls to monadic combinators with constructors. Only >>= is implemented as a function:

data MPlus a = MZero | Return a | MPlus (MPlus a) (MPlus a)

instance Monad MPlus where
  return = Return

  MZero     >>= _ = MZero
  Return x  >>= f = f x
  MPlus l r >>= f = MPlus (l >>= f) (r >>= f)

instance MonadPlus MPlus where
  mzero = MZero
  mplus = MPlus

The implementation of >>= is determined by the laws for Monad and MonadPlus: return x >>= f = f x, mzero >>= f = mzero, and (a `mplus` b) >>= f = (a >>= f) `mplus` (b >>= f).

We can now define different search strategies as functions of type MPlus a -> [a]. For example, sequential depth-first search looks as follows:

search :: MPlus a -> [a]
search MZero = []
search (Return x) = [x]
search (MPlus l r) = search l ++ search r

This function is inspired by the MonadPlus instance for [].
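Note that on GHC 7.10 and later, Monad has Functor and Applicative as superclasses and MonadPlus has Alternative, so compiling the MPlus instances on a current GHC requires a few additional instances. A self-contained sketch:

```haskell
import Control.Applicative (Alternative (..))
import Control.Monad (MonadPlus (..), ap)

data MPlus a = MZero | Return a | MPlus (MPlus a) (MPlus a)

instance Functor MPlus where
  fmap _ MZero       = MZero
  fmap f (Return x)  = Return (f x)
  fmap f (MPlus l r) = MPlus (fmap f l) (fmap f r)

instance Applicative MPlus where
  pure  = Return
  (<*>) = ap  -- derived from the Monad instance

instance Monad MPlus where
  MZero     >>= _ = MZero
  Return x  >>= f = f x
  MPlus l r >>= f = MPlus (l >>= f) (r >>= f)

instance Alternative MPlus where
  empty = MZero
  (<|>) = MPlus

instance MonadPlus MPlus where
  mzero = MZero
  mplus = MPlus

-- Sequential depth-first search, as above.
search :: MPlus a -> [a]
search MZero       = []
search (Return x)  = [x]
search (MPlus l r) = search l ++ search r

main :: IO ()
main = print (search (return 1 `mplus` (mzero `mplus` return (2 :: Int))))
  -- prints [1,2]
```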

Let's count on the promise of semi-explicit parallelism: in the last rule defining search, the results of the right branch r of the search space will probably be needed later, so it might be beneficial to compute them in parallel.

We can express this using the combinator par from the module Control.Parallel and define a function for parallel depth-first search:

parSearch :: MPlus a -> [a]
parSearch MZero = []
parSearch (Return x) = [x]
parSearch (MPlus l r) = rs `par` (parSearch l ++ rs)
  where rs = parSearch r

This use of par forks a so-called spark, i.e., a computation that might be chosen by the run-time system for parallel execution. We do not need to fork a spark for searching the left subtree because parSearch l is immediately demanded by the call to ++.
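For context, par (from the parallel package's Control.Parallel module) is usually combined with pseq in the following classic idiom, independent of our search trees; a minimal sketch:

```haskell
import Control.Parallel (par, pseq)

-- Spark the evaluation of a while the current thread evaluates b;
-- pseq makes sure b is evaluated before the addition demands a.
parSum :: Int -> Int -> Int
parSum a b = a `par` (b `pseq` (a + b))

main :: IO ()
main = print (parSum (sum [1..1000]) (sum [1..1000]))  -- prints 1001000
```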

Does this tweak speed up the search? Here is a simple benchmark to try it out.

main = do n <- liftM (read.head) getArgs
          mapM_ print (parSearch (sort [1..n]))

We compile this benchmark with GHC's -threaded option in order to get code that can run on multiple cores:

$ ghc --make -threaded -O2 -o permsort-par speedup-search-with-parallelism.lhs

With a single core the run-time of permsort-par 11 is about 20 seconds.

Multi-core Permutation Sort

Using two cores (by passing +RTS -N2 to the program), the run-time is reduced to 11 seconds: twice the cores gives almost half the run-time. Using three and four cores is again a little bit faster.

I measured the run-times using GHC 6.10.1 on an Intel Core2 Quad CPU at 2.83 GHz, providing 1 GB of initial heap. Using a current snapshot of GHC 6.11 did not show significant differences on a MacBook with an Intel Core2 Duo CPU. This may well be due to the specific nature of the problem being solved: there is a huge search space with a single solution, so most of the search is just an expensive failing computation.

Feel free to try other examples; the parSearch function is on Hackage.

For feedback, contact me or comment on reddit.

  1. A similar tree representation of non-deterministic search is available on Hackage.