0
votes

I'm trying to use the csv module to parse a specifically formatted delimited file. I'm using Python 3.5.

The format is provided by a third party, and I am having issues using the csv module to give a representation in all cases. Any columns that are specified as a text data type have double quoted values. Dates and numbers will not have any quote value between pipes(the delimiter). The issue arises in that in trying multiple formats, I either get left with a single middle double quote, or I lose information like \ -> empty space. I'm hoping I don't have to use regular expressions for this, so if there is a way around it with the csv module, that'd be great.

Rules:

escape character is a "\"

  • tab escape : \t
  • new line character: \n
  • backslash character: \\
  • inner quote character: \"
  • delimiter = |
  • dates have no quotes.
  • numbers, including NaN values(empty pipe ||) have no quotes

When I try various dialect parameters, I can't seem to correctly parse this csv file. It either converts backslashes to empty spaces, mis-places inner quotes, etc. Is there a possible way to use the csv module, or will I need to either do some post processing, or create my own regex?

import csv
import os

dialect_params = {'delimiter': '|'} # help needed here.

newline_sample = '"I went to dinner. \n Then I went to a show."'
quote_sample = '"I read the \"WSJ\", did you?"'
backslash_sample = '"Boasberg\\Wheeler Communications, Inc."'
na_sample = 'N\A'
date_sample = '2013-04-23'
number_sample = '1.3'
text_sample = '|'.join([newline_sample, quote_sample,
                        backslash_sample, na_sample,
                        date_sample, number_sample]) + '\n'

csv.reader(iter([text_sample]), **dialect_params)
1
Is that code correct? Can't see lines defined. - strubbly
Where you say backspace do you mean backslash? I'm surprised you have any trouble with that: the csv format does not special case backslash at all, by default. - strubbly
In your quote_sample do you really want literal \" or just a quote (Python will give you the latter)? - strubbly
"quote character: \n" really? Sure you didn't mean quote character: '"' (double quotes)? - jDo

1 Answers

0
votes

There is some confusion in your example because you don't differentiate between escaping in your input file and escaping in Python and unfortunately they are rather similar. It seems that your file contains literal backslashes, but your Python example does not. Bear in mind that in Python, this string '"\""' is three characters long, all three being double quote characters.

This is also leading to confusion when you try to test your code. For example, if your code successfully converts a pair of backslash characters in the input to a single backslash in the output, when you print the repr of that output (in a list, for example) then you will see TWO backslashes again. Here are some examples to demonstrate:

# A five character string consisting of two quotes, a backslash, a
# newline and another quote
s = '"\"\\\n"'
print(s)
# prints:
#  ""\
#  "
print([s])
# prints:
#  ['""\\\n"']

The easiest fix for this is to use "raw" strings in your test code. And to be careful when interpreting the output. Note the added r before the string definition:

# An eight character string with a lot of backslashes in
s = r'"\"\\\n"'
print(s)
# prints:
#  "\"\\\n"
print([s])
# prints:
#  ['"\\"\\\\\\n"']

OK, so if you clean up your Python test, what is the problem? The problem is that you would like to use the csv module to parse the delimiters and the enclosing quotes. But there is a problem with the quotes. If you tell csv to interpret the quotes then it will detect escaped quotes inside a field and stop the field there. If you then tell it that quotes can be escaped it will also interpret the escaped newlines as if they are escaped n characters.

import csv
import io

sample = r'''"ab \" cd \n ef"
"gh \\ ij \t kl"'''

# by default we have
#   doublequote = True
#   quotechar = '"'
# But this means that single quotes in the
# field are treated as ending the field
for l in csv.reader(io.StringIO(sample)):
    print(l[0])

# Setting
#   doublequote = False
#   escapechar = '\'
# handles the quote correctly but messes up the escaped newline
for l in csv.reader(io.StringIO(sample),doublequote = False,escapechar = '\\'):
    print(l[0])

# Setting
#   quotechar = None
# works correctly for the delimiters but not escaping or quoting
for l in csv.reader(io.StringIO(sample),quotechar = None):
    print(l[0])

I suggest you use the csv module to correctly interpret the delimiters but to ignore the quotes (as in the last example).

Then you can write your own code to interpret the quoted fields. First check for enclosing quotes and remove them. Then, it is straightforward to use str.replace() to replace each escape sequence with the desired character. Here's an example:

import csv
import io

escapes = [
    (r'\t','\t'),
    (r'\n','\n'),
    (r'\\','\\'),
    (r'\"','\"'),
]

sample = r'''"ab \" cd \n ef|12"
"ij \\ kl \t mn"|"o"'''

for l in csv.reader(io.StringIO(sample),delimiter = '|',quotechar = None):
    l = [f if f[0] != '"' else f[1:-1] for f in l]
    for old,new in escapes:
        l = [f.replace(old,new) for f in l]
    print(l)

Finally, note that this is a nice straightforward way to do it but it will not perform very well. There are tricksier possible solutions. For example, you could use ast.literal_eval since these escapes are compatible with Python escapes, or you could even use the codecs module, but I don't recommend these if you don't know what you're doing.