I have a little script that reads, parses and derives some (not really) interesting statistics from an Apache log file. So far I've made two simple modes: the total number of bytes sent across all requests in the log file, and a top 10 of the most common IP addresses.
The first "mode" is just a simple sum of all the parsed bytes. The second one is a fold over a map (Data.Map), using insertWith' (+) 1 to count the occurrences.
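To show the counting step on its own, here is a minimal sketch, separate from the script. It assumes a recent containers package, where `Data.Map.Strict.insertWith` plays the role of the older `insertWith'`; the name `countOccurrences` is hypothetical and only used for illustration.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Hypothetical helper (not part of the script): count how often each key occurs.
countOccurrences :: Ord k => [k] -> M.Map k Int
countOccurrences = foldl' step M.empty
  where
    -- Data.Map.Strict.insertWith evaluates the combined value to WHNF,
    -- so no chain of (+) thunks accumulates behind each key.
    step acc k = M.insertWith (+) k 1 acc
```

In GHCi, `M.toList (countOccurrences ["a", "b", "a"])` should give `[("a",2),("b",1)]`. Note that this only keeps the counts strict; it says nothing about how much memory the keys themselves retain.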
The first one runs as I expected: in constant space, with most of the time spent parsing.
  42,359,709,344 bytes allocated in the heap
      72,405,840 bytes copied during GC
         113,712 bytes maximum residency (1553 sample(s))
         145,872 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 76311 collections,     0 parallel,  0.89s,  0.99s elapsed
  Generation 1:  1553 collections,     0 parallel,  0.21s,  0.22s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   21.76s  ( 24.82s elapsed)
  GC    time    1.10s  (  1.20s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   22.87s  ( 26.02s elapsed)

  %GC time       4.8%  (4.6% elapsed)

  Alloc rate    1,946,258,962 bytes per MUT second

  Productivity  95.2% of total user, 83.6% of total elapsed
However, the second one does not!
  49,398,834,152 bytes allocated in the heap
     580,579,208 bytes copied during GC
     718,385,088 bytes maximum residency (15 sample(s))
     134,532,128 bytes maximum slop
            1393 MB total memory in use (172 MB lost due to fragmentation)

  Generation 0: 91275 collections,     0 parallel, 252.65s, 254.46s elapsed
  Generation 1:    15 collections,     0 parallel,  0.12s,  0.12s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   41.11s  ( 48.87s elapsed)
  GC    time  252.77s  (254.58s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time  293.88s  (303.45s elapsed)

  %GC time      86.0%  (83.9% elapsed)

  Alloc rate    1,201,635,385 bytes per MUT second

  Productivity  14.0% of total user, 13.5% of total elapsed
And here is the code.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import qualified Data.Attoparsec.Lazy as AL
import Data.Attoparsec.Char8 hiding (space, take)
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L
import Control.Monad (liftM)
import System.Environment (getArgs)
import Prelude hiding (takeWhile)
import qualified Data.Map as M
import Data.List (foldl', sortBy)
import Text.Printf (printf)
import Data.Maybe (fromMaybe)
type Command = String
data LogLine = LogLine {
    getIP     :: S.ByteString,
    getIdent  :: S.ByteString,
    getUser   :: S.ByteString,
    getDate   :: S.ByteString,
    getReq    :: S.ByteString,
    getStatus :: S.ByteString,
    getBytes  :: S.ByteString,
    getPath   :: S.ByteString,
    getUA     :: S.ByteString
} deriving (Ord, Show, Eq)
quote, lbrack, rbrack, space :: Parser Char
quote = satisfy (== '\"')
lbrack = satisfy (== '[')
rbrack = satisfy (== ']')
space = satisfy (== ' ')
quotedVal :: Parser S.ByteString
quotedVal = do
    quote
    res <- takeTill (== '\"')
    quote
    return res

bracketedVal :: Parser S.ByteString
bracketedVal = do
    lbrack
    res <- takeTill (== ']')
    rbrack
    return res
val :: Parser S.ByteString
val = takeTill (== ' ')
line :: Parser LogLine
line = do
    ip <- val
    space
    identity <- val
    space
    user <- val
    space
    date <- bracketedVal
    space
    req <- quotedVal
    space
    status <- val
    space
    bytes <- val
    (path,ua) <- option ("","") combined
    return $ LogLine ip identity user date req status bytes path ua

combined :: Parser (S.ByteString,S.ByteString)
combined = do
    space
    path <- quotedVal
    space
    ua <- quotedVal
    return (path,ua)
countBytes :: [L.ByteString] -> Int
countBytes = foldl' count 0
    where
        count acc l = case AL.maybeResult $ AL.parse line l of
            Just x  -> (acc +) . maybe 0 fst . S.readInt . getBytes $ x
            Nothing -> acc

countIPs :: [L.ByteString] -> M.Map S.ByteString Int
countIPs = foldl' count M.empty
    where
        count acc l = case AL.maybeResult $ AL.parse line l of
            Just x  -> M.insertWith' (+) (getIP x) 1 acc
            Nothing -> acc
---------------------------------------------------------------------------------
main :: IO ()
main = do
    [cmd,path] <- getArgs
    dispatch cmd path

pretty :: Show a => Int -> (a, Int) -> String
pretty i (bs, n) = printf "%d: %s, %d" i (show bs) n

dispatch :: Command -> FilePath -> IO ()
dispatch cmd path = action path
    where
        action = fromMaybe err (lookup cmd actions)
        err    = printf "Error: %s is not a valid command." cmd

actions :: [(Command, FilePath -> IO ())]
actions = [("bytes", countTotalBytes)
          ,("ips",  topListIP)]

countTotalBytes :: FilePath -> IO ()
countTotalBytes path = print . countBytes . L.lines =<< L.readFile path

topListIP :: FilePath -> IO ()
topListIP path = do
    f <- liftM L.lines $ L.readFile path
    let mostPopular (_,a) (_,b) = compare b a
        m = countIPs f
    mapM_ putStrLn . zipWith pretty [1..] . take 10 . sortBy mostPopular . M.toList $ m
Edit:
Adding +RTS -A16M reduced GC time to 20%. Memory use is, of course, unchanged.
foldl' over an accumulating map is a waste. Just use a regular foldl. – John L