I'm trying to figure out the "right" way to parse a particular text file in Haskell.
In F#, I loop over each line, testing it against a regular expression to determine if it's a line I want to parse, and then if it is, I parse it using the regular expression. Otherwise, I ignore the line.
The file is a printable report, with headers on each page. Each record is one line, and each field is separated by two or more spaces. Here's an example:
MY COMPANY'S NAME
PROGRAM LISTING
STATE: OK PRODUCT: ProductName
(DESCRIPTION OF REPORT)
DATE: 11/03/2013
This is the first line of a a two-line description of the contents of this report. The description, as noted,
spans two lines. This is more text. I'm running out of things to write. Blah.
DIVISION CODE: 3 XYZ CODE: FAA3 AGENT CODE: 0007 PAGE NO: 1
AGENT TARGET NAME ST UD TARGET# XYZ# X-DATE YEAR CO ENCODING
----- ------------------------------ -- -- ------- ---- ---------- ---- ---------- ----------
0007 SMITH, JOHN 43 3 1234567 001 12/06/2013 2004 ABC SIZE XL
0007 SMITH, JANE 43 3 2345678 001 12/07/2013 2005 ACME YELLOW
0007 DOE, JOHN 43 3 3456789 004 12/09/2013 2008 MICROSOFT GREEN
0007 DOE, JANE 43 3 4567890 002 12/09/2013 2007 MICROSOFT BLUE
0007 BORGES, JORGE LUIS 43 3 5678901 001 12/09/2013 2008 DUFEMSCHM Y1500
0007 DEWEY, JOHN & 43 3 6789012 003 12/11/2013 2013 ERTZEVILI X1500
0007 NIETZSCHE, FRIEDRICH 43 3 7890123 004 12/11/2013 2006 NCORPORAT X7
I first built the parser to test each line to see whether it is a record. If it is, I just cut the line up by character position with my home-grown substring function. This works just fine.
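To make the position-based approach concrete, here is a rough sketch of what it could look like. It is not my actual code: the names `substring`, `trim`, `rulerSpans` and `sliceRecord` are made up for illustration, and it assumes the fields in the real file line up under the dashed ruler line, so the column positions can be read off the ruler instead of being hard-coded.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- | Characters [start, start + len) of a line; a short line simply yields less.
substring :: Int -> Int -> String -> String
substring start len = take len . drop start

-- | Strip leading and trailing whitespace from a sliced field.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- | Turn a ruler such as "----- ---  --" into (start, length) spans,
-- one span per run of dashes. Assumption: fields sit under the dashes.
rulerSpans :: String -> [(Int, Int)]
rulerSpans = go 0
  where
    go _ [] = []
    go i s@(c : rest)
      | c == '-' =
          let (run, after) = span (== '-') s
              n = length run
          in (i, n) : go (i + n) after
      | otherwise = go (i + 1) rest

-- | Cut one record line into trimmed fields using the given spans.
sliceRecord :: [(Int, Int)] -> String -> [String]
sliceRecord spans line = [trim (substring s n line) | (s, n) <- spans]
```

With the spans computed once from the ruler line, `sliceRecord spans` can be mapped over every record line. If the columns do not actually line up with the ruler, the spans would have to be written out by hand instead.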
Then I discovered that I did, indeed, have a regular expression library in my Haskell installation, so I decided to try using regular expressions like I do in F#. That failed miserably, as the library rejects perfectly valid regular expressions.
Then I thought, What about Parsec? But the learning curve for using that is getting steeper the higher I climb, and I find myself wondering if it is the right tool for such a simple task as parsing this report.
So I thought I'd ask some Haskell experts: how would you go about parsing this kind of report? I'm not asking for code, though if you've got some, I'd love to see it. I'm really asking for technique or technology.
Thanks!
P.S. The output is just a colon-separated file: a line of field names at the top, followed by the records, which can be imported into Excel for the end user.
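A minimal sketch of that output step, assuming each record has already been split into a list of field strings. The field names are copied from the column headings in the sample above; `fieldNames`, `toColonLine` and `renderReport` are hypothetical names, and nothing here escapes a colon that happens to occur inside a field.

```haskell
import Data.List (intercalate)

-- | Column headings as they appear in the sample report.
fieldNames :: [String]
fieldNames =
  [ "AGENT", "TARGET NAME", "ST", "UD", "TARGET#"
  , "XYZ#", "X-DATE", "YEAR", "CO", "ENCODING" ]

-- | Join the fields of one record with colons (no escaping is attempted).
toColonLine :: [String] -> String
toColonLine = intercalate ":"

-- | Header line first, then one line per record.
renderReport :: [[String]] -> String
renderReport records = unlines (map toColonLine (fieldNames : records))
```

Writing the result with `writeFile` would then produce the file that gets imported into Excel.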
Edit:
Thank you all so much for the great comments and answers!
Because I didn't make it clear originally: The first fourteen lines of the example repeat for every page of (print) output, with the number of records varying per page from zero to a full page (looks like 45 records). I apologize for not making that clear earlier, as it will probably affect some of the answers already offered.
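Since the header block repeats on every page and the number of records per page varies, dropping a fixed number of lines once at the top is not enough. One way to cope is to recognise record lines by their shape instead of by their position. The sketch below assumes, based only on the sample above, that every record starts with a four-digit agent code followed by a space and that no header line does; `isRecord` and `recordLines` are hypothetical names.

```haskell
import Data.Char (isDigit)

-- | Assumption: a record line begins with a four-digit agent code
-- followed by a space. Header, ruler and description lines do not.
isRecord :: String -> Bool
isRecord line =
  let (code, rest) = span isDigit line
  in length code == 4 && take 1 rest == " "

-- | Keep only the record lines of a whole report, across all pages.
recordLines :: String -> [String]
recordLines = filter isRecord . lines
```

A page with zero records simply contributes nothing to the result, so no special case is needed for it.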
My Haskell system currently is limited to Parsec (it doesn't have attoparsec) and Text.Regex.Base and Text.Regex.Posix. I'll have to see about installing attoparsec and/or additional Regex libraries. But for the time being, you've convinced me to keep at learning Parsec. Thank you for the very helpful code examples!
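For anyone following along, this is roughly the shape a Parsec parser for a single record line might take. It is a sketch and not taken from any of the answers. It assumes fields are separated by two or more spaces and that a single space inside a field (as in "SMITH, JOHN") belongs to the field. The names `separator`, `fieldChar`, `field`, `record` and `parseRecord` are my own.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- | The gap between two fields: two or more spaces.
separator :: Parser ()
separator = string "  " *> skipMany (char ' ')

-- | One character inside a field: any non-space character, or a single
-- space that is immediately followed by a non-space character.
fieldChar :: Parser Char
fieldChar = noneOf " \r\n" <|> try (char ' ' <* lookAhead (noneOf " \r\n"))

-- | A field is one or more field characters.
field :: Parser String
field = many1 fieldChar

-- | A whole record line: optional leading spaces, fields separated by
-- gaps, optional trailing spaces, then end of input.
record :: Parser [String]
record = do
  skipMany (char ' ')
  first <- field
  rest <- many (try (separator *> field))
  skipMany (char ' ')
  eof
  return (first : rest)

-- | Run the record parser on one line of the report.
parseRecord :: String -> Either ParseError [String]
parseRecord = parse record "report line"
```

The `try` around `separator *> field` is there so that trailing spaces at the end of a line are not mistaken for the start of another field. A blank field in the middle of a record would shift the later fields left, which is one reason position-based slicing may still be the safer choice for this report.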
Text.Regex and Text.Regex.PCRE? Text.Regex is a shadow package of Text.Regex.Posix, which likely doesn't support features you're used to using. PCRE is perl-esque regex, and has a much expanded feature offering. – Elliot Robinson
drop 14 . lines? Is it fair to say that the fields are "double space" delimited? – J. Abrahamson