1
votes

I cannot seem to find the correct Python 3 csv.reader arguments to parse this particular CSV dialect. The thing generating the CSV behaves as follows:

Parser information:

  • Quotation character: " (\x22)
  • Field Delimiter: ^ (\x5e)
  • Record Separator: \n (\x0a)
  • Escape Character: \ (\x5c)

How the generator that produces this format works:

  • If the specified record separator is found in a field, quote the field.
  • If the specified field separator is found in a field, quote the field.
  • If the specified quotation character is found in a field, quote the field and escape the quotation character.
  • If the specified escape character is found in a field, do nothing...

^ This last point is what causes my problem: the first field of one particular row ends with a backslash, so the Python 3 csv parser treats the field separator that follows it as escaped.
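
For reference, the generator's quoting rules listed above can be sketched roughly as follows. This is only an illustration of the described behavior, not the generator's actual code; the names encode_field and encode_record are hypothetical.

```python
def encode_field(value, field_delim="^", record_delim="\n",
                 quote_char='"', escape_char="\\"):
    """Encode one field following the rules described in the question."""
    # Quote only when a delimiter, record separator, or quote is present.
    needs_quoting = any(c in value for c in (field_delim, record_delim, quote_char))
    if not needs_quoting:
        # An escape character alone triggers nothing, so a field that
        # ends in a backslash is written out verbatim and unquoted.
        return value
    # Only the quotation character is escaped; backslashes are left as-is.
    escaped = value.replace(quote_char, escape_char + quote_char)
    return quote_char + escaped + quote_char


def encode_record(fields, field_delim="^", record_delim="\n"):
    """Join encoded fields into one record (assumes the default dialect)."""
    return field_delim.join(encode_field(f) for f in fields) + record_delim
```

Under these assumptions, a first field ending in a backslash is emitted immediately before an unquoted ^, which is exactly the sequence csv.reader reads as an escaped delimiter.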

See below:

(xcve) ttucker@plato:~/tmp/csv$ python --version
Python 3.6.4
(xcve) ttucker@plato:~/tmp/csv$ cat test_csv.py 
import csv
with open('exotic_dialect.csv') as f:
    data = f.readlines()
reader = csv.reader(data, delimiter='^', quotechar='"',
                    escapechar='\\', quoting=csv.QUOTE_MINIMAL)
for row in reader:
    print(row)

(xcve) ttucker@plato:~/tmp/csv$ cat exotic_dialect.csv 
a^b^c
a|^b^c
"a\""^b^c
"a^"^b^c
a\^b^c
(xcve) ttucker@plato:~/tmp/csv$ python test_csv.py 
['a', 'b', 'c']
['a|', 'b', 'c']
['a"', 'b', 'c']
['a^', 'b', 'c']
['a^b', 'c']

^ This last list should have three fields, with the first one being a followed by a single backslash; i.e., ['a\\', 'b', 'c'] as Python would print it.

So, my questions are:
  1. Can this CSV dialect be parsed by the standard Python library (perhaps with some specific options I can't seem to find)?
  2. If not, can it be parsed easily with some custom Python code? (Assume that the first field may end in any printable ASCII character.)
2
This will not work with the csv module as the escapechar is explicitly defined as escaping the delimiter. It can also escape the quotechar if doublequote is False. So there is no way to just escape the quotechar. - AChampion

2 Answers

0
votes

I'm not an expert in the subject, but

1) Probably not. You are expecting the same escape character to do two different things in this context, and the parser has no way to tell them apart. These CSV parsers generally resolve escape characters first.

How would a CSV parser determine the correct behavior for this example?

  • fo\^o^bar
    • ["fo\","o","bar"]
    • ["fo^o","bar"]

2) I personally would run your CSV through some pre-processing, so that you can correctly parse your file (replace \^ with \\^)
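
A minimal sketch of that pre-processing idea, assuming the dialect from the question (the helper name parse_with_preprocessing is made up for illustration). It handles the sample file, but the last call shows a limitation: any other backslash outside quotes is still consumed by csv.reader as an escape.

```python
import csv


def parse_with_preprocessing(lines):
    """Double a backslash that sits directly before the delimiter,
    then let the standard csv module parse the patched lines."""
    patched = [line.replace("\\^", "\\\\^") for line in lines]
    reader = csv.reader(patched, delimiter="^", quotechar='"',
                        escapechar="\\", quoting=csv.QUOTE_MINIMAL)
    return list(reader)


# The five rows of exotic_dialect.csv from the question.
sample = [
    r'a^b^c',
    r'a|^b^c',
    r'"a\""^b^c',
    r'"a^"^b^c',
    r'a\^b^c',
]

for row in parse_with_preprocessing(sample):
    print(row)

# A backslash that is not followed by the delimiter is not patched,
# so csv.reader still treats it as an escape and drops it.
print(parse_with_preprocessing([r'a\b^c']))
```

So the replacement is enough for a trailing backslash, but it would have to cover every unquoted backslash to be safe, and that needs knowledge of whether the parser is inside a quoted field.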

0
votes

So, doing string replacements on the data first wasn't the best idea, because there is no way to decide whether an unescaped escape character should be escaped without understanding its context. I ended up just writing my own parser.

class XcCsv(object):
    def __init__(self, field_delim, record_delim, quote_char, escape_char):
        self._field_delim = field_delim
        self._record_delim = record_delim
        self._quote_char = quote_char
        self._escape_char = escape_char

        self._records = []
        self._record_buf = []
        self._field_buf = ""
        self._in_quote = False
        self._in_escape = False

    # Feed the input to the state machine one character at a time.
    # `data` may be a single string or an iterable of strings (e.g. lines).
    def _walker(self, data):
        for chunk in data:
            for char in chunk:
                self._parse_char(char)

    def _parse_char(self, char):
        if self._in_escape:
            self._field_buf += char
            self._in_escape = False
        elif char == self._escape_char:
            if self._in_quote:
                self._in_escape = True
            else:
                self._field_buf += char
        elif char == self._quote_char:
            # An escaped quote was already consumed by the first branch,
            # so a quote seen here always opens or closes a quoted field.
            self._in_quote = not self._in_quote
        elif char == self._field_delim:
            if self._in_quote:
                self._field_buf += char
            else:
                self._record_buf.append(self._field_buf)
                self._field_buf = ""
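        elif char == self._record_delim:
            if self._in_quote:
                # A quoted record separator is part of the field.
                self._field_buf += char
            else:
                self._record_buf.append(self._field_buf)
                self._records.append(self._record_buf)
                # Reset the buffers only when a record has really ended.
                self._record_buf = []
                self._field_buf = ""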
        else:
            self._field_buf += char

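    def reader(self, data):
        self._walker(data)
        # Flush a final record that has no trailing record separator.
        if self._field_buf or self._record_buf:
            self._record_buf.append(self._field_buf)
            self._records.append(self._record_buf)
            self._record_buf = []
            self._field_buf = ""
        for rec in self._records:
            print(rec)

# The dialect's quotation character is ", not '.
parser = XcCsv("^", "\n", '"', "\\")
with open("exotic_dialect.csv") as f:
    parser.reader(f.readlines())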
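
For a quick check of the same state-machine idea without needing the file on disk, here is a compact function-based sketch that returns the rows instead of printing them. The name parse_exotic is hypothetical, and it assumes the rules used above: the escape character only has meaning inside a quoted field and is a literal character everywhere else.

```python
def parse_exotic(text, field_delim="^", record_delim="\n",
                 quote_char='"', escape_char="\\"):
    """Parse text in the question's dialect into a list of records."""
    records, record, field = [], [], []
    in_quote = in_escape = False
    for ch in text:
        if in_escape:
            # The character after an escape is always taken literally.
            field.append(ch)
            in_escape = False
        elif ch == escape_char and in_quote:
            # Escapes are only honoured inside a quoted field.
            in_escape = True
        elif ch == quote_char:
            in_quote = not in_quote
        elif ch == field_delim and not in_quote:
            record.append("".join(field))
            field = []
        elif ch == record_delim and not in_quote:
            record.append("".join(field))
            records.append(record)
            record, field = [], []
        else:
            # Ordinary characters, unquoted backslashes, and quoted
            # delimiters all land here.
            field.append(ch)
    if field or record:
        # Final record without a trailing record separator.
        record.append("".join(field))
        records.append(record)
    return records


# Contents of exotic_dialect.csv from the question.
sample = 'a^b^c\na|^b^c\n"a\\""^b^c\n"a^"^b^c\na\\^b^c\n'
for row in parse_exotic(sample):
    print(row)
```

Returning a list rather than printing makes it easier to compare the result against the expected rows in a test.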