0 votes

People have been asking what this question is about, so I will try to sum it up: I am trying to detect the best-matching format (dialect) for parsing CSV files. That is probably the best description of what I am trying to do.

I have a CSV file with these contents:

710000 8454889 03 3 ;sometext;;48,05;65,82;;65,82
710001 8454889 03 3 ;sometext;;49,09;66,96;;66,96
710002 8454889 03 3 ;sometext;;12,63;17,22;;17,22

No quotechars, and ";" as the delimiter.

I have registered several CSV reader dialects:

csv.register_dialect('excel', delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, strict=True, skipinitialspace=True)
csv.register_dialect('semicolonquotes', delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL, strict=True, skipinitialspace=True)
csv.register_dialect('semicolonnonquotes', delimiter=';', quotechar=None, quoting=csv.QUOTE_NONE, strict=True, skipinitialspace=True)

And I have a script which tries to figure out which of those formats matches the file contents best. Unfortunately, for this example file it matches the first case, "excel", even though I would like it to match only "semicolonnonquotes".

Edit: The code I use to match the file is much like this:

dialects = csv.list_dialects()
for dialect in dialects:
    file.seek(0)
    reader = csv.reader(file, csv.get_dialect(dialect))
    try:
        reader.next()  # accept the first dialect that parses without error
        break
    except csv.Error:
        continue

Very simple code to see whether the reader throws an error when reading with a given dialect, wrapped in try/except to accept the first dialect that parses without error. Unfortunately, none of those dialects raises an error.

/Edit

I figured that if I set strict to True, it would raise an error when a row contains no quotechars. But apparently it does not work like that.
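A quick sketch (version-agnostic Python) of what strict mode actually does: it only makes the parser raise csv.Error on malformed quoted fields, such as stray text after a closing quote. It never demands that fields be quoted in the first place, so unquoted input sails through even with quoting=csv.QUOTE_ALL:

```python
import csv
import io

# strict=True does NOT require fields to be quoted: unquoted input
# parses without complaint even with QUOTE_ALL requested.
unquoted = io.StringIO(u'a,b,c\n')
rows = list(csv.reader(unquoted, quoting=csv.QUOTE_ALL, strict=True))
print(rows)  # [['a', 'b', 'c']]

# What strict mode DOES catch: junk after a closing quote inside a field.
malformed = io.StringIO(u'"ab"c,d\n')
try:
    list(csv.reader(malformed, strict=True))
except csv.Error as exc:
    print('strict mode raised:', exc)
```

In non-strict mode that malformed row would silently parse as ['abc', 'd'] instead of raising.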

The first dialect matches, and gets me CSV rows like:

['710000 8454889 03 3 ;sometext;;48', '05;65', '82;;65', '82']

Is there some way to tune this so that I get the results I desire:

['710000 8454889 03 3 ', 'sometext', '', '48,05', '65,82', '', '65,82']

Edit2

Reading through the docs, it seems that specifying quoting for csv.reader does next to nothing: http://docs.python.org/2.7/library/csv.html#csv.QUOTE_ALL

I guess this is where my problems come from.

/Edit2

Disclaimer: I know that CSV stands for COMMA-separated values. If there is no way to achieve what I want without extending the existing library, then I will accept that as an answer and force users to use CSV files that only contain commas as delimiters.

What code are you using to detect the CSV dialect? - Martijn Pieters
@MartijnPieters - Added some code in edit. - Odif Yltsaeb
Would it help to check the strictest dialect first, then the next strictest, et cetera? Can you order all your dialects by strictness? - RemcoGerlich
Yes, I could. And I actually thought they were ordered that way, as "semicolonnonquotes" is the least strict of them, only requiring ";" as the delimiter. The whole problem is that the file contains "," as a decimal separator, not as the field delimiter. In my eyes, the CSV reader should first look for quotechars and then for delimiters, and I guess it might be doing just that. But I kind of hoped it would also raise an error if quotechars are missing from the file. - Odif Yltsaeb
Nowadays, CSV stands for CHARACTER separated values. - martineau

2 Answers

0 votes

I'm not sure I understand what you're asking. However, if you know what the delimiter is and that there's no quoting, forget about csv dialects and just specify those things as formatting parameters when creating the csv.reader instance:

with open('contents.csv', 'rb') as file:
    reader = csv.reader(file, delimiter=';', quotechar=None)
    line = reader.next()
    print 'line 1: {}'.format(line)

Output:

line 1: ['710000 8454889 03 3 ', 'sometext', '', '48,05', '65,82', '', '65,82']
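The same idea in a version-agnostic sketch (on Python 3, reader.next() becomes next(reader)), run against an in-memory copy of the sample line. Passing quoting=csv.QUOTE_NONE explicitly makes the intent unmistakable: '"' and ',' are ordinary characters, and only ';' splits fields:

```python
import csv
import io

sample = u'710000 8454889 03 3 ;sometext;;48,05;65,82;;65,82\n'

# QUOTE_NONE: no character is treated as a quote, so only the ';'
# delimiter splits fields and the decimal commas are left alone.
reader = csv.reader(io.StringIO(sample), delimiter=';', quoting=csv.QUOTE_NONE)
row = next(reader)
print(row)
# ['710000 8454889 03 3 ', 'sometext', '', '48,05', '65,82', '', '65,82']
```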
0 votes

I think if you want a script to automagically figure out the best CSV dialect to use, you shouldn't base it on whether or not an error is thrown, but on how well a certain dialect performs (if at all).

"Most" CSV's are going to have similar amount of columns in each row (usually having the first row as the header). You can use this assumption to build a test case to test the variance in column counts [length of list] for each row in each dialect. The dialect which has the lowest variance in column count (or highest reoccurring count) after parsing all the rows, or a smaller sample, should be your best. If there is a tie, you may have to come up with some more trickery, but this should give you a starting point.

Just know that this assumption will fail horribly as soon as you start receiving CSVs without any order whatsoever. Which, in reality, happens all the time.