2
votes

I have a little script to read in, parse and derive some kind of interesting (not really) statistics from an apache log file. So far I've made two simple options, the total number of bytes sent in all requests in the log file, and a top 10 of the most common IP adresses.

The first "mode" is just a simple sum of all the parsed bytes. The second one is a fold over a map (Data.Map), using insertWith (+) 1' to count the occurrences.

The first one runs as I expected, most of the time spent parsing, in constant space.

42,359,709,344 bytes allocated in the heap 72,405,840 bytes copied during GC 113,712 bytes maximum residency (1553 sample(s)) 145,872 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 76311 collections,
0 parallel, 0.89s, 0.99s elapsed
Generation 1: 1553 collections, 0 parallel, 0.21s, 0.22s elapsed

INIT time 0.00s ( 0.00s elapsed) MUT time 21.76s ( 24.82s elapsed) GC time 1.10s ( 1.20s elapsed) EXIT time
0.00s ( 0.00s elapsed) Total time 22.87s ( 26.02s elapsed)

%GC time 4.8% (4.6% elapsed)

Alloc rate 1,946,258,962 bytes per MUT second

Productivity 95.2% of total user, 83.6% of total elapsed

However, the second one does not!

49,398,834,152 bytes allocated in the heap 580,579,208 bytes copied during GC 718,385,088 bytes maximum residency (15 sample(s)) 134,532,128 bytes maximum slop 1393 MB total memory in use (172 MB lost due to fragmentation)

Generation 0: 91275 collections,
0 parallel, 252.65s, 254.46s elapsed
Generation 1: 15 collections, 0 parallel, 0.12s, 0.12s elapsed

INIT time 0.00s ( 0.00s elapsed) MUT time 41.11s ( 48.87s elapsed) GC time 252.77s (254.58s elapsed) EXIT time
0.00s ( 0.01s elapsed) Total time 293.88s (303.45s elapsed)

%GC time 86.0% (83.9% elapsed)

Alloc rate 1,201,635,385 bytes per MUT second

Productivity 14.0% of total user, 13.5% of total elapsed

And here is the code.

{-# LANGUAGE OverloadedStrings #-}

module Main where

import qualified Data.Attoparsec.Lazy as AL
import Data.Attoparsec.Char8 hiding (space, take)
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad (liftM)
import System.Environment (getArgs)
import Prelude hiding (takeWhile)
import qualified Data.Map as M
import Data.List (foldl', sortBy)
import Text.Printf (printf)
import Data.Maybe (fromMaybe)

type Command = String

data LogLine = LogLine {
    getIP     :: S.ByteString,
    getIdent  :: S.ByteString,
    getUser   :: S.ByteString,
    getDate   :: S.ByteString,
    getReq    :: S.ByteString,
    getStatus :: S.ByteString,
    getBytes  :: S.ByteString,
    getPath   :: S.ByteString,
    getUA     :: S.ByteString
} deriving (Ord, Show, Eq)

quote, lbrack, rbrack, space :: Parser Char
quote  = satisfy (== '\"')
lbrack = satisfy (== '[')
rbrack = satisfy (== ']')
space  = satisfy (== ' ')

quotedVal :: Parser S.ByteString
quotedVal = do
    quote
    res <- takeTill (== '\"')
    quote
    return res

bracketedVal :: Parser S.ByteString
bracketedVal = do
    lbrack
    res <- takeTill (== ']')
    rbrack
    return res

val :: Parser S.ByteString
val = takeTill (== ' ')

line :: Parser LogLine
l    ine = do
    ip <- val
    space
    identity <- val
    space
    user <- val
    space
    date <- bracketedVal
    space
    req <- quotedVal
    space
    status <- val
    space
    bytes <- val
    (path,ua) <- option ("","") combined
    return $ LogLine ip identity user date req status bytes path ua

combined :: Parser (S.ByteString,S.ByteString)
combined = do
    space
    path <- quotedVal
    space
    ua <- quotedVal
    return (path,ua)

countBytes :: [L.ByteString] -> Int
countBytes = foldl' count 0
    where
        count acc l = case AL.maybeResult $ AL.parse line l of
            Just x  -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
            Nothing -> acc

countIPs :: [L.ByteString] -> M.Map S.ByteString Int
countIPs = foldl' count M.empty
    where
        count acc l = case AL.maybeResult $ AL.parse line l of
            Just x -> M.insertWith' (+) (getIP x) 1 acc
            Nothing -> acc

---------------------------------------------------------------------------------

main :: IO ()
main = do
  [cmd,path] <- getArgs
  dispatch cmd path

pretty :: Show a => Int -> (a, Int) -> String
pretty i (bs, n) = printf "%d: %s, %d" i (show bs) n

dispatch :: Command -> FilePath -> IO ()
dispatch cmd path = action path
    where
        action = fromMaybe err (lookup cmd actions)
        err    = printf "Error: %s is not a valid command." cmd

actions :: [(Command, FilePath -> IO ())]
actions = [("bytes", countTotalBytes)
          ,("ips",  topListIP)]

countTotalBytes :: FilePath -> IO ()
countTotalBytes path = print . countBytes . L.lines =<< L.readFile path

topListIP :: FilePath -> IO ()
topListIP path = do
    f <- liftM L.lines $ L.readFile path
    let mostPopular (_,a) (_,b) = compare b a
        m = countIPs f
    mapM_ putStrLn . zipWith pretty [1..] . take 10 . sortBy mostPopular . M.toList $ m

Edit:

Adding +RTS -A16M reduced GC to 20%. Memory use of course unchanged.

2
Not a solution, but foldl' over an accumulating map is a waste. Just use a regular foldl.John L
@John L, you're quite right, a quick test shows no difference in speed, GC or memory use, between foldl and foldl' in this case.Johanna Larsson

2 Answers

3
votes

I suggest making the following changes to the code:

@@ -1,4 +1,4 @@
-{-# LANGUAGE OverloadedStrings #-}
+{-# LANGUAGE BangPatterns, OverloadedStrings #-}

 module Main where

@@ -9,7 +9,7 @@
 import Control.Monad (liftM)
 import System.Environment (getArgs)
 import Prelude hiding (takeWhile)
-import qualified Data.Map as M
+import qualified Data.HashMap.Strict as M
 import Data.List (foldl', sortBy)
 import Text.Printf (printf)
 import Data.Maybe (fromMaybe)
@@ -17,15 +17,15 @@
 type Command = String

 data LogLine = LogLine {
-    getIP     :: S.ByteString,
-    getIdent  :: S.ByteString,
-    getUser   :: S.ByteString,
-    getDate   :: S.ByteString,
-    getReq    :: S.ByteString,
-    getStatus :: S.ByteString,
-    getBytes  :: S.ByteString,
-    getPath   :: S.ByteString,
-    getUA     :: S.ByteString
+    getIP     :: !S.ByteString,
+    getIdent  :: !S.ByteString,
+    getUser   :: !S.ByteString,
+    getDate   :: !S.ByteString,
+    getReq    :: !S.ByteString,
+    getStatus :: !S.ByteString,
+    getBytes  :: !S.ByteString,
+    getPath   :: !S.ByteString,
+    getUA     :: !S.ByteString
 } deriving (Ord, Show, Eq)

 quote, lbrack, rbrack, space :: Parser Char
@@ -39,14 +39,14 @@
     quote
     res <- takeTill (== '\"')
     quote
-    return res
+    return $! res

 bracketedVal :: Parser S.ByteString
 bracketedVal = do
     lbrack
     res <- takeTill (== ']')
     rbrack
-    return res
+    return $! res

 val :: Parser S.ByteString
 val = takeTill (== ' ')
@@ -67,14 +67,14 @@
     space
     bytes <- val
     (path,ua) <- option ("","") combined
-    return $ LogLine ip identity user date req status bytes path ua
+    return $! LogLine ip identity user date req status bytes path ua

 combined :: Parser (S.ByteString,S.ByteString)
 combined = do
     space
-    path <- quotedVal
+    !path <- quotedVal
     space
-    ua <- quotedVal
+    !ua <- quotedVal
     return (path,ua)

 countBytes :: [L.ByteString] -> Int
@@ -84,11 +84,11 @@
             Just x  -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
             Nothing -> acc

-countIPs :: [L.ByteString] -> M.Map S.ByteString Int
+countIPs :: [L.ByteString] -> M.HashMap S.ByteString Int
 countIPs = foldl' count M.empty
     where
         count acc l = case AL.maybeResult $ AL.parse line l of
-            Just x -> M.insertWith' (+) (getIP x) 1 acc
+            Just x -> M.insertWith (+) (getIP x) 1 acc
             Nothing -> acc

 ---------------------------------------------------------------------------------

I made the fields of LogLine strict to avoid them containing thunks referring to expressions related to parsing. It's good practice to make fields strict, unless you really need them to be lazy.

I made sure that the parse result is created as soon as possible (that's the $! part of the change), also to avoid delaying the parsing until you actually inspect the individual fields of LogLine.

Finally I switched to a better data structure, HashMap from the unordered-containers package. Note that all functions in Data.HashMap.Strict are value strict, which means we can use the plain insertWith variant.

Note that taking a sub-string of a ByteString forces the original string to be retained in memory, due to sharing the underlying storage (this is the same as for Java's String). If you want to make sure that no extra memory is retained, use the copy function from the bytestring package. You can try to call copy on the result of (getIP x) and see if that makes any difference. The trade-off here is using some extra computation to copy the string in return for lower space usage.

Note that using -A<high number> tend to improve performance of short running programs (i.e. benchmarks) but not necessarily on real programs. Same goes for -H. At least a higher -H value (e.g. 1G) doesn't hurt the performance of your program.

0
votes

The most obvious point is that your first script can throw away data as soon as it's seen it, whereas the second one must hold on to everything it's seen. Hence, you'd expect the second script to take at least O(N) memory whereas the first can run in constant space.

Have you tried running with heap profiling turned on? I could make some stabs at where the excess allocations are likely to be happening in your code, but there's no substitute for hard data.

I'd be eying the Data.Map.insertWith' calls with suspicion myself, since each one renders a chunk of the extant Map surplus to requirements and requires copying and rebalancing but that's pure guesswork on my part. If the insertWith' calls are to blame, then since you don't need interstitial Map entries, it might be faster to build the entire map in one pass (without any increments to count IPs) and then do a second pass to do the counts. That way you won't waste time rebalancing the Map. You could also take advantage of the fact that your key datatype fits into an Int (well, it does if it's an IPv4 address at least) and use a Data.IntMap instead, which has much lower memory overhead.