4 votes

I'm trying to figure out the "right" way to parse a particular text file in Haskell.

In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.

The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:

                                                    MY COMPANY'S NAME
                                                     PROGRAM LISTING
                                             STATE:  OK     PRODUCT: ProductName
                                                 (DESCRIPTION OF REPORT)
                                                    DATE:   11/03/2013

  This is the first line of a two-line description of the contents of this report. The description, as noted,
  spans two lines. This is more text. I'm running out of things to write. Blah.

          DIVISION CODE: 3     XYZ CODE: FAA3   AGENT CODE: 0007                                       PAGE NO:  1

 AGENT    TARGET NAME                      ST   UD   TARGET#   XYZ#   X-DATE       YEAR    CO          ENCODING
 -----    ------------------------------   --   --   -------   ----   ----------   ----    ----------  ----------

 0007     SMITH, JOHN                      43   3    1234567   001    12/06/2013   2004    ABC         SIZE XL
 0007     SMITH, JANE                      43   3    2345678   001    12/07/2013   2005    ACME        YELLOW
 0007     DOE, JOHN                        43   3    3456789   004    12/09/2013   2008    MICROSOFT   GREEN
 0007     DOE, JANE                        43   3    4567890   002    12/09/2013   2007    MICROSOFT   BLUE
 0007     BORGES, JORGE LUIS               43   3    5678901   001    12/09/2013   2008    DUFEMSCHM   Y1500
 0007     DEWEY, JOHN &                    43   3    6789012   003    12/11/2013   2013    ERTZEVILI   X1500
 0007     NIETZSCHE, FRIEDRICH             43   3    7890123   004    12/11/2013   2006    NCORPORAT   X7

I first built the parser to test each line to see if it were a record. Were it a record, I just cut up the line based on character position with my home-grown substring function. This works just fine.
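Roughly, by "cut up the line based on character position" I mean something like the following sketch (the column offsets and the record test here are made up for illustration; they're not my actual code):

import Data.Char (isDigit)

-- Illustrative only: slice a fixed-width field out of a line.
slice :: Int -> Int -> String -> String
slice start len = take len . drop start

-- Illustrative only: treat a line as a record if its first word is an
-- all-digit agent code.
isRecord :: String -> Bool
isRecord line = case words line of
  (agent : _ : _) -> not (null agent) && all isDigit agent
  _               -> False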

Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.

Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.

So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.

Thanks!

P.S. The output is just a colon-separated file with a line of field names at the top, followed by the records, so it can be imported into Excel for the end user.
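For that output step I have in mind something like this rough sketch, with the field names and each record's fields represented as plain strings (purely illustrative):

import Data.List (intercalate)

-- Sketch only: field names on the first line, then one colon-separated
-- line per record.
toColonSeparated :: [String] -> [[String]] -> String
toColonSeparated fieldNames records =
  unlines (map (intercalate ":") (fieldNames : records))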

Edit:

Thank you all so much for the great comments and answers!

Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.

My Haskell system currently is limited to Parsec (it doesn't have attoparsec) and Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional Regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!

3
I would definitely go with Parsec or, better, attoparsec. Do you have any particular problems with it? – Niklas B.
In regards to your regex rejections, have you tried both Text.Regex and Text.Regex.PCRE? Text.Regex is a shadow package of Text.Regex.Posix, which likely doesn't support features you're used to using. PCRE is perl-esque regex and has a much expanded feature offering. – Elliot Robinson
For a comparison of regex libraries see haskell.org/haskellwiki/Regular_expressions – Pedro Rodrigues
Is the input header a fixed size? Can you just ignore the first few lines like drop 14 . lines? Is it fair to say that the fields are "double space" delimited? – J. Abrahamson

3 Answers

4 votes

This is definitely a job worthy of a parsing library. My primary goal is normally (i.e., for anything I intend to use more than once or twice) to get the data into a non-textual form ASAP, something like

module ReportParser where

import Prelude hiding (takeWhile)
import Data.Text hiding (takeWhile)

import Control.Applicative
import Data.Attoparsec.Text

data ReportHeaderData = Company Text
                      | Program Text
                      | State Text
--                    ...
                      | FieldNames [Text]

data ReportData = ReportData Int Text Int Int Int Int Date Int Text Text

data Date = Date Int Int Int

and we can say, for the sake of argument, that a report is

data Report = Report [ReportHeaderData] [ReportData]

Now, I generally create a parser which is a function of the same name as the data type

-- Ending condition for a field
doubleSpace :: Parser Char
doubleSpace = space >> space

-- Clears leading spaces
clearSpaces :: Parser Text
clearSpaces = takeWhile (== ' ') -- Naively assumes no tabs

-- Throws away everything up to and including a newline character (naively assumes unix line endings)
clearNewline :: Parser ()
clearNewline = (anyChar `manyTill` char '\n') *> pure ()

-- Parse a date
date :: Parser Date
date = Date <$> decimal <*> (char '/' *> decimal) <*> (char '/' *> decimal)

-- Parse a report
reportData :: Parser ReportData
reportData = let f1 = decimal <* clearSpaces
                 f2 = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f3 = decimal <* clearSpaces
                 f4 = decimal <* clearSpaces
                 f5 = decimal <* clearSpaces
                 f6 = decimal <* clearSpaces
                 f7 = date <* clearSpaces
                 f8 = decimal <* clearSpaces
                 f9 = (pack <$> manyTill anyChar doubleSpace) <* clearSpaces
                 f10 = (pack <$> manyTill anyChar doubleSpace) <* clearNewline
             in ReportData <$> f1 <*> f2 <*> f3 <*> f4 <*> f5 <*> f6 <*> f7 <*> f8 <*> f9 <*> f10

By properly running one of the parse functions and using one of the combinators (such as many), and possibly feed if you end up with a Partial result, you should end up with a list of ReportDatas. You can then convert them to CSV with some function you've created.
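For instance, a rough sketch of running it over input from which the page headers have already been stripped might look like this (parseRecords is my name, not part of attoparsec; parseOnly sidesteps Partial results entirely, whereas with parse you would feed the returned continuation more input, or an empty chunk to finish it):

-- Rough sketch: parse as many records as possible from pre-cleaned input.
parseRecords :: Text -> Either String [ReportData]
parseRecords = parseOnly (many (clearSpaces *> reportData))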

Note that I didn't deal with the header. It should be relatively trivial to write code to parse it, and build a Report with e.g.

-- Not tested
parseReport = Report <$> (many reportHeader) <*> (many reportData)
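A rough, untested sketch of one such header parser, just to show the shape (it picks out the STATE line and falls back to treating any other header line as a Company entry, which is only a placeholder):

reportHeader :: Parser ReportHeaderData
reportHeader =
      (State . pack <$> (clearSpaces *> string (pack "STATE:") *> clearSpaces
                           *> manyTill anyChar doubleSpace <* clearNewline))
  <|> (Company . pack <$> (clearSpaces *> manyTill anyChar (char '\n')))

In real code you would also need something to stop the header parser at the column-heading line so many reportHeader doesn't eat the records.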

Note that I prefer the Applicative form, but it's also possible to use the monadic form if you prefer (I did in doubleSpace). The Alternative class (also from Control.Applicative) is useful too, for reasons implied by the name.

For playing with this, I highly recommend GHCi and the parseTest function. GHCi is just overall handy and a good way to test individual parsers, while parseTest takes a parser and an input string and outputs the status of the run, the parsed value, and any remaining input not parsed. Very useful when you're not quite sure what's going on.
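For instance, a session might look like this (assuming you add deriving Show to Date, and with OverloadedStrings on so the literal is read as Text):

>>> :set -XOverloadedStrings
>>> parseTest date "11/03/2013"

parseTest then prints the resulting Done/Fail/Partial value, so you can see at a glance whether the parser consumed what you expected.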

2 votes

There are very few languages in which I would recommend using a parser library for something this simple (I've parsed many a file like this using regular expressions in the past), but Parsec makes it so easy:

import Text.Parsec

parseLine = do
  first <- count 4 anyChar
  second <- count 4 anyChar
  return (first, second)

parseFile = endBy parseLine (char '\n')

main = interact $ show . parse parseFile "-" 

The function "parseLine" creates a parser for an individual line by chaining together two fixed-length fields (4 characters each; any character will do).

The function "parseFile" then chains these together as a list of lines.

Of course you will still have to add more fields and cut off the header in your data, but all of this is easy in Parsec.
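For instance, cutting off a fixed-size header could look something like this (an untested sketch; 14 is just the header length from the example, and skipHeader/parseFile' are made-up names):

skipHeader = count 14 (manyTill anyChar (char '\n'))

parseFile' = skipHeader >> parseFile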

This is arguably much easier to read than regexps....

1 vote

Assuming a few things—that the header is fixed and the fields of each line are "double space" delimited—it's really quite easy to implement a parser in Haskell for this file. The end result is probably going to be longer than your regexp (and there are regexp libraries in Haskell if that fits your desire), but it's far more testable and readable. I'll demonstrate some of that while I outline how to build one for this file format.

I'll use Attoparsec. We'll also need the ByteString data type (and the OverloadedStrings pragma, which lets Haskell interpret string literals as ByteString as well as String) and some combinators from Control.Applicative and Control.Monad.

{-# LANGUAGE OverloadedStrings #-}

import           Data.Attoparsec.Char8
import           Control.Applicative
import           Control.Monad
import qualified Data.ByteString.Char8         as S

First, we'll build a data type representing each record.

data YearMonthDay =
  YearMonthDay { ymdYear  :: Int
               , ymdMonth :: Int
               , ymdDay   :: Int
               }
    deriving ( Show )

data Line =
  Line { agent     :: Int
       , name      :: S.ByteString
       , st        :: Int
       , ud        :: Int
       , targetNum :: Int
       , xyz       :: Int
       , xDate     :: YearMonthDay
       , year      :: Int
       , co        :: S.ByteString
       , encoding  :: S.ByteString
       }
    deriving ( Show )

You could fill in more descriptive types for each field if desired, but this isn't a bad start. Since each line can be parsed independently, I'll do just that. The first step is to build a Parser Line type: read that as a parser type which returns a Line if it succeeds.

To do this, we'll build our Line type "inside of" the Parser using its Applicative interface. That sounds really complex, but it's simple and looks quite pretty. We'll start with the YearMonthDay type as a warm-up

parseYMDWrong :: Parser YearMonthDay
parseYMDWrong =
  YearMonthDay <$> decimal
               <*> decimal
               <*> decimal

Here, decimal is a built-in Attoparsec parser which parses an integral type like Int. You can read this parser as nothing more than "parse three decimal numbers and use them to build my YearMonthDay type" and you'd be basically correct. The (<*>) operator (read as "apply") sequences the parses and collects their results into our YearMonthDay constructor function.

Unfortunately, as I indicated in the type, it's a little bit wrong. To wit, we're currently ignoring the '/' characters which delimit the numbers inside our YearMonthDay. We fix this by using the "sequence and throw away" operator (<*). It's a visual pun on (<*>), and we use it when we want to perform a parsing action... but we don't want to keep the result.

We use (<*) to augment the first two decimal parsers with their following '/' characters using the built-in char8 parser.

parseYMD :: Parser YearMonthDay
parseYMD =
  YearMonthDay <$> (decimal <* char8 '/')
               <*> (decimal <* char8 '/')
               <*> decimal

And we can test that this is a valid parser using Attoparsec's parseOnly function

>>> parseOnly parseYMD "2013/12/12"
Right (YearMonthDay {ymdYear = 2013, ymdMonth = 12, ymdDay = 12})

We'd like to now generalize this technique to the entire Line parser. There's one hitch, however. We'd like to parse ByteString fields like "SMITH, JOHN" which might contain spaces... while also delimiting each field of our Line by double spaces. This means that we need a special ByteString parser which consumes any character including single spaces... but quits the moment it sees two spaces in a row.

We can build this using the scan combinator. scan allows us to accumulate a state while consuming characters in our parse and determine when to stop that parse on the fly. We'll keep a boolean state—"was the last character a space?"—and stop whenever we see a new space while knowing the previous character was a space too.

parseStringField :: Parser S.ByteString
parseStringField = scan False step where
  step :: Bool -> Char -> Maybe Bool
  step b ' ' | b         = Nothing
             | otherwise = Just True
  step _ _               = Just False

We can again test this little piece using parseOnly. Let's try parsing three string fields.

>>> let p = (,,) <$> parseStringField <*> parseStringField <*> parseStringField
>>> parseOnly p "foo  bar  baz"
Right ("foo "," bar "," baz")
>>> parseOnly p "foo bar  baz quux  end"
Right ("foo bar "," baz quux "," end")
>>> parseOnly p "a sentence with no double space delimiters"
Right ("a sentence with no double space delimiters","","")

Depending on your actual file format, this might be perfect. It's worth noting that it leaves trailing spaces (these could be trimmed if desired) and it allows some space delimited fields to be empty. It's easy to continue to fiddle with this piece in order to fix these errors, but I'll leave it for now.
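For example, the trailing spaces could be trimmed after the fact; one possible (untested) tweak uses spanEnd from Data.ByteString.Char8:

-- Untested sketch: parse a field, then drop its trailing spaces.
parseTrimmedField :: Parser S.ByteString
parseTrimmedField = fst . S.spanEnd (== ' ') <$> parseStringField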

We can now build our Line parser. Like with parseYMD, we'll follow each field's parser with a delimiting parser, someSpaces which consumes two or more spaces. We'll use the MonadPlus interface to Parser to build this atop the built-in parser space by (1) parsing some spaces and (2) checking to be sure that we got at least two of them.

someSpaces :: Parser Int
someSpaces = do
  sps <- some space
  let count = length sps
  if count >= 2 then return count else mzero

>>> parseOnly someSpaces "  "
Right 2
>>> parseOnly someSpaces "    "
Right 4
>>> parseOnly someSpaces " "
Left "Failed reading: mzero"

And now we can build the line parser

lineParser :: Parser Line
lineParser =
  Line <$> (decimal <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (parseYMD <* someSpaces)
       <*> (decimal <* someSpaces)
       <*> (parseStringField <* someSpaces)
       <*> (parseStringField <* some space)

>>> parseOnly lineParser "0007     SMITH, JOHN                      43   3    1234567   001    12/06/2013   2004    ABC         SIZE XL      "
Right (Line { agent = 7
            , name = "SMITH, JOHN "
            , st = 43
            , ud = 3
            , targetNum = 1234567
            , xyz = 1
            , xDate = YearMonthDay {ymdYear = 12, ymdMonth = 6, ymdDay = 2013}
            , year = 2004
            , co = "ABC "
            , encoding = "SIZE XL "
            })

And then we can just cut off the header and parse each line.

parseFile :: S.ByteString -> [Either String Line]
parseFile = map (parseOnly lineParser) . drop 14 . S.lines
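Since (per the edit to the question) the header block repeats on every page, another option is to attempt the line parser on every line and keep only the successes. A rough sketch, using rights from Data.Either and skipSpace to absorb each line's leading whitespace (parseAnyPage is a made-up name):

import Data.Either (rights)

-- Rough sketch: page headers, blank lines, and column headings simply
-- fail to parse and are discarded.
parseAnyPage :: S.ByteString -> [Line]
parseAnyPage = rights . map (parseOnly (skipSpace *> lineParser)) . S.lines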