I am using Python's csv module to parse data from a CSV file in my application. While testing the application, my colleague entered a piece of sample text copy-pasted from random website.
The sample text has double quotes inside the field and a comma within the double quotes. The commas outside of double quotes are correctly handled by the csv module but the comma inside the double quote is split into next column. I looked at the csv specification and the field does comply to the specification by escaping the double quotes by another set of double quotes.
I checked the file in libreoffice and it is handled correctly.
Here's one line from the csv data where I'm having a problem:
company_name,company_revenue,company_start_year,company_website,company_description,company_email
Acme Inc,80000000000000,2004,http://google.com,"The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as ""Acme Rocket-Powered Products, Inc."" based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled ""(For Tripping Road Runners)"".
Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.
While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.",[email protected]
Here's what it looks like in the debug log:
2014-08-27 21:35:53,922 - DEBUG: company_website=http://google.com
2014-08-27 21:35:53,923 - DEBUG: company_revenue=80000000000000
2014-08-27 21:35:53,923 - DEBUG: company_start_year=2004
2014-08-27 21:35:53,923 - DEBUG: account_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products
2014-08-27 21:35:53,924 - DEBUG: company_name=Acme Inc
2014-08-27 21:35:53,925 - DEBUG: company_email=Inc."" based in Fairfield
The relevant piece of code to handle csv parsing:
with open(csvfile, 'rU') as contactsfile:
# sniff for dialect of csvfile so we can automatically determine
# what delimiters to use
try:
dialect = csv.Sniffer().sniff(contactsfile.read(2048))
except:
dialect = 'excel'
get_total_jobs(contactsfile, dialect)
contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True, quoting=csv.QUOTE_MINIMAL)
# Start reading the rows
for row in contacts:
process_job()
for key, value in row.iteritems():
logging.debug("{}={}".format(key,value))
I understand that this is just junk data and we'll likely never encounter such a data but the csv files we receive are not within our control and we can have such an edge case. And since it's a valid csv file, which is handled correctly by libreoffice, it makes sense for me to handle it correctly as well.
I have searched for other questions on csv handling where people have had problems with either handling of quotes or comma within the field. I have both of these working fine, my problem is when a comma is nested within quotes within a field. There is a question with same problem which does solve the issue Comma in DoubleDouble Quotes in CSV File but it's a hackish way where I am not preserving the contents as they are given to me, which is a valid way as per RFC4180.