Ran into another interesting performance problem while converting a small test program over to Haskell. Let’s say that you want to walk through every line of a text file, collate character frequencies, and return anything that maps to a particular frequency. For purposes of explanation we’ll do something really silly like look for lines with 10 capital ‘A’s.
{-# OPTIONS -XBangPatterns #-} import IO import System import Data.Word import Data.Array.Unboxed import Control.Monad.ST import Data.Array.ST import qualified Data.ByteString as B import qualified Data.ByteString.Internal as BI import qualified Data.ByteString.Char8 as C counts' !line = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) -- collate character counts here return arr counts !line = runSTUArray (counts' line) hit !line = let freq = counts line in freq!65 == 10 main = do args <- getArgs f <- openFile "wordlist" ReadMode text <- B.hGetContents f print $ length $ filter hit (C.lines text)This is fast (0.07s on 300K test file, compiled with -O2 on a Macbook Pro), but there’s no actual collation going on. It suddenly becomes two magnitudes slower as soon as you start to do anything based on the line:
-- version 1 : 9.5 seconds counts' !line = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) readArray arr (B.head line) >>= \v -> writeArray arr (B.head line) (v+1) return arr counts !line = runSTUArray (counts' line).. equals 9.54 seconds just when you collate the first character of the string, and led me to believe that ByteStrings were slow, especially since a constant change to the array was fast (0.07 seconds):
-- version 2 : 0.07 seconds counts' !line = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) readArray arr 0 >>= \v -> writeArray arr 0 (v+1) return arr(Here I’ll elide the hours of me messing around with profiling and whatnot to try and figure out why B.head was so slow, or how to make B.foldl’ update a state variable, or sprinkling strictness bangs all over the place, or the other tons of false trails that I went down.)
It turns out that the slow portion in all of this is the newArray, not the character collation itself, which can be seen in the profile if the array creation is moved into its own routine and we look at the profile:
-- version 3 -- compile with: ghc -package bytestring stuarray.hs -prof -auto-all -O2 -o stuarray.out -- run with: ./stuarray.out +RTS -p initial_array = do arr <- newArray (0,255) 0 :: ST s (STUArray s Word8 Int) return arr counts' !line = do arr <- initial_array let ix = B.head line readArray arr ix >>= \v -> writeArray arr ix (v+1) return arr counts !line = runSTUArray (counts' line)… and the relevant part of the profile:
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
main Main 240 1 0.7 0.3 100.0 100.0
hit Main 242 335075 0.0 0.0 99.3 99.7
counts Main 243 335075 0.0 0.2 99.3 99.7
counts' Main 244 335075 0.0 0.0 99.3 99.5
initial_array Main 245 335075 99.3 99.5 99.3 99.5
Which brings up two interesting points:
The Ocaml version took about 0.720 seconds (Ocaml version below), compared to Haskell time of about 8.5 seconds. Reducing the array size to 128 in both cases reduced it to 0.112 seconds for Ocaml and 4.3 seconds in Haskell.
Array size | Ocaml | Haskell |
---|---|---|
128 | 0.112 | 4.3 |
255 | 0.184 | 8.5 |
256 | 0.720 | 8.5 |
Note the discontinuity in Ocaml between 255 and 256 elements, which I find interesting. The nice people in #haskell suggested that I stop constructing/destructing the array and instead just null it out between each line, and it turned out that the best way to do that was just to thaw a starter array.
-- version 4: 0.11 seconds initial_array = listArray (0,255) (repeat 0) :: UArray Word8 Int counts' !line = do arr <- thaw initial_array let ix = B.head line readArray arr ix >>= \v -> writeArray arr ix (v+1) return arr counts !line = runSTUArray (counts' line)This is still kind of strange to me (because an object is still getting
constructed/destructed – thaw guarantees a copy) but you can’t argue
with a performance increase. (Interestingly, using Array.copy or
Array.fill in Ocaml is slower than just using Array.make). Final
times? 0.11 seconds in Haskell vs. the best of 0.720 second UPDATED:
0.12 seconds in Ocaml… and the Haskell version is just 50% slower than
a C implementation with a gratuitous calloc instead of a memset.
UPDATE: An pointed out that zeroing the Ocaml array with a for loop is much faster than Array.copy or Array.fill, bringing it to about the same speed as Haskell.
Lessons learned?
Comments are moderated whenever I remember that I have a blog.