8
votes

I have a stack of CSV files I want to parse - the problem is that half of them have unescaped quote marks and commas inside the main field. They are not really CSV, but they do have a fixed number of identifiable fields. The dialect="excel" setting works perfectly on files without the extra " and , characters inside the field.

This data is old/unsupported. I am trying to push some life into it.

e.g.

"AAAAA
AAAA
AAAA
AAAA","AAAAAAAA


AAAAAA
AAAAA "AAAAAA" AAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAA, AAAAA
AAAAAAAAA AAAAA AAAAAAAAAA
AAAAA, "AAAAA", AAAAAAAAA
AAAAAAAA AAAAAAAA
AAAAAAA
"

This is tripping the file parser, which throws the error _csv.Error: newline inside string. I narrowed it down to this being the issue by removing the quote marks from inside the 2nd field, after which the csv.reader module parses the file OK.

Some of the fields are multi-line - I'm not sure if that's important to know.

I have been poking around at the dialect settings, and whilst I can find 'skipinitialspace', this doesn't seem to solve the problem.

To be clear - this is not valid 'CSV'; it's data that loosely follows a CSV structure, but has , and " characters inside the field text.

The lineterminator is \x0d\x0a

I have tried a number of different permutations of doublequote and the quoting variable in the dialect settings, but I can't get this to parse correctly.

I can not be confident that a ," or ", combination exists only on field boundaries.

This problem only exists for one (the last) of several fields in the file, and there are several thousand files.

3
Proper CSV uses two consecutive quote characters ("") to escape a quote character in a quoted field. Alternately, the field could be unquoted and no escape is necessary, although commas and newlines cannot appear in such a field. Single quotes like this are not normally valid. Is it possible to modify the options of the source of your CSV data to better conform to the expectations of most other CSV readers (including Python's)? - SingleNegationElimination
I inherited the data - and there is lots of it... So I can't do anything except build a parser that can pick it apart again. - Jay Gattuso
The files are not damaged; they're badly/poorly structured as CSV. "I inherited the data" is the exact 'excuse' - that's just what happens. The data is old / abandoned, and I want to resurrect it. - Jay Gattuso
PLEASE (if you want real help) edit your question to include coherently ALL the information you have dribbled out in comments plus what encoding is being used plus whether the non-quote-doubled field(s) appear(s) in constant columns plus what is the row separator (\n or \r\n) plus which of comma, \r, \n appear in which columns. As your data is "non-sharable", then mask it (and compress it) by substituting "A" for each contiguous string of alphabetic characters -- if that affects parsing then your data is absolutely stuffed -- and publish a comprehensive sample somewhere. - John Machin
@Johnmachin - noted, and edited. Thank you. - Jay Gattuso

3 Answers

11
votes

Have you tried passing csv.QUOTE_NONE via the quoting keyword arg? Without having some code or data to test this on, I have no way to know whether this actually works on your data, but it seems to work with the fragment you provided.

>>> import csv
>>> r = csv.reader(open('foo.csv', 'rb'), quoting=csv.QUOTE_NONE)
>>> for row in r: print row
... 
['"A"', '"B"', '"ccc "ccccccc" cccccc"']
4
votes

I'm not allowed to comment quite yet, so I'll post as an answer...

Assuming you are using commas as your delimiter, are there any commas within your data? If not, then you could do a massive find and replace to double all the quote characters after the first and before the last characters of the field prior to CSV processing.
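Under that assumption (no commas inside the actual data), a minimal sketch of the pre-processing step might look like the following. This treats one record as a single string, and the helper name is illustrative, not from the question:

```python
def escape_inner_quotes(record):
    """Double any quote characters that appear inside quoted fields,
    assuming commas never occur inside the field data itself."""
    fixed = []
    for field in record.split(','):
        if field.startswith('"') and field.endswith('"') and len(field) >= 2:
            # replace inner quotes with the standard CSV "" escape
            inner = field[1:-1].replace('"', '""')
            fixed.append('"' + inner + '"')
        else:
            fixed.append(field)
    return ','.join(fixed)
```

After this pass the records are standard CSV and should parse with the default excel dialect. Note this sketch sidesteps the question of where each record ends, which matters here because the fields are multi-line.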

2
votes

I would write a converter that parses the initial CSV and outputs a valid one. You can possibly use the ", or "\n sequences as the mechanism for determining the delimiting.
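A minimal sketch of such a converter (Python 3), assuming every field is quoted and that the "," and "\r\n" sequences occur only at true field/record boundaries - which the questioner warns may not always hold; the function names and separator constants are illustrative:

```python
import csv
import io

FIELD_SEP = '","'      # assumed: genuine field boundaries look like ","
RECORD_SEP = '"\r\n"'  # assumed: " CRLF " separates one record from the next

def convert(raw_text):
    """Split the loose format into rows of field strings, assuming every
    field is quoted and the separators never occur inside field data."""
    body = raw_text.strip()
    # drop the opening quote of the first field and the closing quote of the last
    if body.startswith('"'):
        body = body[1:]
    if body.endswith('"'):
        body = body[:-1]
    for record in body.split(RECORD_SEP):
        yield record.split(FIELD_SEP)

def rewrite_as_valid_csv(raw_text):
    """Re-emit the rows as standards-compliant CSV (inner quotes doubled)."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator='\r\n')
    writer.writerows(convert(raw_text))
    return out.getvalue()
```

Because csv.writer handles the "" escaping, the converted output can then be read back with the ordinary excel dialect.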