Exotic CSV Dialect Parsing

Question

I cannot find the right arguments for Python 3's csv.reader to parse this particular CSV dialect. The program that generates the CSV behaves as follows:

Parser information:

Quotation character: " (\x22)
Field Delimiter: ^ (\x5e)
Record Separator: \n (\x0a)
Escape Character: \ (\x5c)

How the generator that produces this format works:

If the specified record separator is found in a field, quote the field.
If the specified field separator is found in a field, quote the field.
If the specified quotation character is found in a field, quote the field and escape the quotation character
If the specified escape character is found in a field, do nothing...

^ This last point is what causes my problem: the first field of one particular row ends with a backslash, so the Python 3 csv parser treats the first field separator as escaped.

See below:

(xcve) ttucker@plato:~/tmp/csv$ python --version
Python 3.6.4
(xcve) ttucker@plato:~/tmp/csv$ cat test_csv.py 
import csv
with open('exotic_dialect.csv') as f:
    data = f.readlines()
reader = csv.reader(data, delimiter='^', quotechar='"',
                    escapechar='\\', quoting=csv.QUOTE_MINIMAL)
for row in reader:
    print(row)

(xcve) ttucker@plato:~/tmp/csv$ cat exotic_dialect.csv 
a^b^c
a|^b^c
"a\""^b^c
"a^"^b^c
a\^b^c
(xcve) ttucker@plato:~/tmp/csv$ python test_csv.py 
['a', 'b', 'c']
['a|', 'b', 'c']
['a"', 'b', 'c']
['a^', 'b', 'c']
['a^b', 'c']

^ This last list should have three fields, i.e. ['a\\', 'b', 'c'] (the first field being a followed by a single backslash).

So, my questions are:

Can this CSV dialect be parsed by the Python standard library (with some specific options I can't seem to find)?
Can it be parsed easily by some custom Python code? (Assume that the first field may end in any printable ASCII character.)

This will not work with the csv module as the escapechar is explicitly defined as escaping the delimiter. It can also escape the quotechar if doublequote is False. So there is no way to just escape the quotechar. — AChampion

Peter S · Accepted Answer · 2019-01-10T01:39:24

I'm not an expert in the subject, but

1) Probably not. You are expecting the same escape character to do two different things in this context (escape a quote, but stay literal in front of a delimiter), and a parser cannot tell the two apart. CSV parsers generally process escape characters first.

How would a CSV parser determine the correct behavior for this example?

fo\^o^bar
- ["fo\","o","bar"]
- ["fo^o","bar"]
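To check which of the two readings the standard library picks, here is a minimal sketch, assuming the same reader options as in the question:

```python
import csv

# Same options as in the question; the input is the ambiguous line above.
reader = csv.reader(['fo\\^o^bar'], delimiter='^', quotechar='"',
                    escapechar='\\', quoting=csv.QUOTE_MINIMAL)
rows = list(reader)
print(rows)  # [['fo^o', 'bar']]: the backslash is consumed as an escape
```

So csv.reader takes the second reading: the backslash escapes the delimiter and the row ends up with two fields.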

2) Personally, I would run the CSV through some pre-processing so that it can then be parsed correctly (for example, replace \^ with \\^).
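A rough sketch of that pre-processing idea follows. It generalizes the single \^ replacement: every backslash that is not directly followed by the quote character is doubled, so that csv.reader reads it back as one literal backslash. The helper names (double_literal_backslashes, parse_exotic) are made up for illustration, and the sketch assumes that a literal backslash never sits directly in front of a quote character in the data, since the format itself cannot distinguish that case from an escaped quote.

```python
import csv
import re

# A backslash NOT followed by the quote character is treated as literal
# data. Assumption: a literal backslash never directly precedes a quote.
_LITERAL_BACKSLASH = re.compile(r'\\(?!")')

def double_literal_backslashes(line):
    """Double literal backslashes so csv.reader sees an escaped backslash."""
    return _LITERAL_BACKSLASH.sub(lambda m: '\\\\', line)

def parse_exotic(lines):
    """Parse an iterable of lines in the dialect described in the question."""
    cleaned = (double_literal_backslashes(line) for line in lines)
    reader = csv.reader(cleaned, delimiter='^', quotechar='"',
                        escapechar='\\', quoting=csv.QUOTE_MINIMAL)
    return list(reader)

# The five lines of exotic_dialect.csv from the question.
sample = ['a^b^c\n', 'a|^b^c\n', '"a\\""^b^c\n', '"a^"^b^c\n', 'a\\^b^c\n']
for row in parse_exotic(sample):
    print(row)
```

With the sample file from the question, the last row now comes out with three fields, the first one being a followed by a single backslash. To use it on a real file, pass the open file object instead of the list, e.g. parse_exotic(open('exotic_dialect.csv', newline='')).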
