0 votes

People have been asking what this question is about, so I will try to sum it up: I am trying to detect the best-matching format (dialect) for parsing CSV files. That is probably the best description of what I am trying to do.

I have a CSV file with these contents:

710000 8454889 03 3 ;sometext;;48,05;65,82;;65,82
710001 8454889 03 3 ;sometext;;49,09;66,96;;66,96
710002 8454889 03 3 ;sometext;;12,63;17,22;;17,22

No quotechars, and ";" as the delimiter.

I have registered several CSV reader dialects:

csv.register_dialect('excel', delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL, strict=True, skipinitialspace=True)
csv.register_dialect('semicolonquotes', delimiter=';', quotechar='"', quoting=csv.QUOTE_ALL, strict=True, skipinitialspace=True)
csv.register_dialect('semicolonnonquotes', delimiter=';', quotechar=None, quoting=csv.QUOTE_NONE, strict=True, skipinitialspace=True)

And I have a script which tries to figure out which of those formats matches the file contents best. Unfortunately, for this example file it matches the first case, "excel", even though I would like it to match only "semicolonnonquotes".

Edit: The code I use to match the file is much like this:

dialects = csv.list_dialects()
for dialect in dialects:
    file.seek(0)
    reader = csv.reader(file, csv.get_dialect(dialect))
    try:
        reader.next()  # accept the first dialect that parses without error
        break
    except csv.Error:
        continue

Very simple code to see whether the reader throws an error when reading with a given dialect, wrapped in try/except to accept the first dialect that parses without error. Unfortunately, none of those dialects raises an error.

/Edit

I figured that if I set strict to True, it would raise an error when a row contains no quotechars. But apparently it does not work like that.
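A quick sketch (version-agnostic Python) of what strict mode actually does: it only makes the parser raise csv.Error on malformed quoted fields, such as stray text after a closing quote. It never demands that fields be quoted in the first place, so unquoted input sails through even with quoting=csv.QUOTE_ALL:

```python
import csv
import io

# strict=True does NOT require fields to be quoted: unquoted input
# parses without complaint even with QUOTE_ALL requested.
unquoted = io.StringIO(u'a,b,c\n')
rows = list(csv.reader(unquoted, quoting=csv.QUOTE_ALL, strict=True))
print(rows)  # [['a', 'b', 'c']]

# What strict mode DOES catch: junk after a closing quote inside a field.
malformed = io.StringIO(u'"ab"c,d\n')
try:
    list(csv.reader(malformed, strict=True))
except csv.Error as exc:
    print('strict mode raised:', exc)
```

In non-strict mode that malformed row would silently parse as ['abc', 'd'] instead of raising.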

The first dialect matches, and gets me CSV rows like:

['710000 8454889 03 3 ;sometext;;48', '05;65', '82;;65', '82']

Is there some way to tune this so that I get the results I desire:

['710000 8454889 03 3 ', 'sometext', '', '48,05', '65,82', '', '65,82']

Edit2

Reading through the docs, it seems that specifying quoting for csv.reader does next to nothing: http://docs.python.org/2.7/library/csv.html#csv.QUOTE_ALL

I guess this is where my problems come from.

/Edit2

Disclaimer: I know that CSV stands for COMMA-separated values. If there is no way to achieve what I want without extending the existing library, then I will accept that as an answer and force users to use CSV files that only contain commas as delimiters.

What code are you using to detect the CSV dialect? - Martijn Pieters
@MartijnPieters - Added some code in edit. - Odif Yltsaeb
Would it help to check the strictest dialect first, then the next strictest, et cetera? Can you order all your dialects by strictness? - RemcoGerlich
Yes, I could. And I actually thought they were ordered that way, as "semicolonnonquotes" is the least strict of them, only requiring ";" as the delimiter. The whole problem is that the file contains "," as a decimal separator, not as the field delimiter. In my eyes, the CSV reader should first look for quotechars and then for delimiters, and I guess it might be doing just that. But I kind of hoped it would also raise an error if quotechars are missing from the file. - Odif Yltsaeb
Nowadays, CSV stands for CHARACTER separated values. - martineau

2 Answers

0 votes

I'm not sure I understand what you're asking. However, if you know what the delimiter is and that there's no quoting, forget about csv dialects and just specify those things as formatting parameters when creating the csv.reader instance:

with open('contents.csv', 'rb') as file:
    reader = csv.reader(file, delimiter=';', quotechar=None)
    line = reader.next()
    print 'line 1: {}'.format(line)

Output:

line 1: ['710000 8454889 03 3 ', 'sometext', '', '48,05', '65,82', '', '65,82']
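The same idea in a version-agnostic sketch (on Python 3, reader.next() becomes next(reader)), run against an in-memory copy of the sample line. Passing quoting=csv.QUOTE_NONE explicitly makes the intent unmistakable: '"' and ',' are ordinary characters, and only ';' splits fields:

```python
import csv
import io

sample = u'710000 8454889 03 3 ;sometext;;48,05;65,82;;65,82\n'

# QUOTE_NONE: no character is treated as a quote, so only the ';'
# delimiter splits fields and the decimal commas are left alone.
reader = csv.reader(io.StringIO(sample), delimiter=';', quoting=csv.QUOTE_NONE)
row = next(reader)
print(row)
# ['710000 8454889 03 3 ', 'sometext', '', '48,05', '65,82', '', '65,82']
```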
0 votes

I think if you want a script to automagically figure out the best CSV dialect to use, you shouldn't base it on whether or not an error is thrown, but on how well a certain dialect performs (if at all).

"Most" CSV's are going to have similar amount of columns in each row (usually having the first row as the header). You can use this assumption to build a test case to test the variance in column counts [length of list] for each row in each dialect. The dialect which has the lowest variance in column count (or highest reoccurring count) after parsing all the rows, or a smaller sample, should be your best. If there is a tie, you may have to come up with some more trickery, but this should give you a starting point.

Just know that this assumption will fail horribly as soon as you start receiving CSVs without any order whatsoever. Which, in reality, happens all the time.