6
votes

I'm trying to parse CSV files from an external system which I have no control of.

  • comma is used as a separator
  • when cell contains comma then it's wrapped in quotes and all other quotes are escaped with another quote character.
  • (my problem) when cell was not wrapped in quotes then all quote characters are escaped with another quote nonetheless.

Example CSV:

qw""erty,"a""b""c""d,ef""""g"

Should be parsed as:

[['qw"erty', 'a"b"c"d,ef""g']]

However, I think that Python's csv module does not expect quote characters to be escaped when cell was not wrapped in quote chars in the first place. csv.reader(my_file) (with default doublequote=True) returns:

['qw""erty', 'a"b"c"d,ef""g']

Is there any way to parse this with python csv module ?

2
Why not replace all instances of '""' inside of double quotes with '\\"'? Then the csv reader should be able to work normally. - user554546
your quote character and escape character are the same? - Haleemur Ali
@JackManey "inside of double quotes" fragment was unnecessary, but overall I think it will work provided I escape all '\' characters as well. - matf

2 Answers

5
votes

Following on @JackManey comment where he suggested to replace all instances of '""' inside of double quotes with '\\"'.

Recognizing if we are currently inside of double quoted cells turned out to be unnecessary and we can replace all instances of '""' with '\\"'. Python documentation says:

On reading, the escapechar removes any special meaning from the following character

However this would still break in the case where original cell already contains escape characters, example: 'qw\\\\""erty' producing [['qw\\"erty']]. So we have to escape the escape characters before parsing too.

Final solution:

with open(file_path, 'rb') as f:
  content = f.read().replace('\\', '\\\\').replace('""', '\\"')
  reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
  return [row for row in reader]
0
votes

as @JackManey suggests, after reading the file, you can replace the two-double-quotes with a single-double-quote.

my_file_onequote = [col.replace('""', '"') for col in row for row in my_file]