I got a CSV dump from SQL Server 2008 that has lines like this:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
parse_dbenhur
is pretty, but can it be rewritten to support the presence of both commas and quotes? parse_ugly
is, well, ugly.
# @dbenhur's excellent answer, which works 100% for what i originally asked for
SEP = /(?:,|\Z)/
QUOTED = /"([^"]*)"/
UNQUOTED = /([^,]*)/
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def parse_dbenhur(line)
line.scan(FIELD)[0...-1].map{ |matches| matches[0] || matches[1] }
end
def parse_ugly(line)
dumb_fields = line.chomp.split(',').map { |v| v.gsub(/\s+/, ' ') }
fields = []
open = false
dumb_fields.each_with_index do |v, i|
open ? fields.last.concat(v) : fields.push(v)
open = (v.start_with?('"') and (v.count('"') % 2 == 1) and dumb_fields[i+1] and dumb_fields[i+1].start_with?(' ')) || (open and !v.end_with?('"'))
end
fields.map { |v| (v.start_with?('"') and v.end_with?('"')) ? v[1..-2] : v }
end
lines = []
lines << 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00'
lines << 'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00'
lines << 'Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00'
lines << 'Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00'
require 'csv'
lines.each do |line|
puts
puts line
begin
c = CSV.parse_line(line)
puts "#{c.to_csv.chomp} (size #{c.length})"
rescue
puts "FasterCSV says: #{$!}"
end
a = parse_ugly(line)
puts "#{a.to_csv.chomp} (size #{a.length})"
b = parse_dbenhur(line)
puts "#{b.to_csv.chomp} (size #{b.length})"
end
Here's the output when I run it:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
FasterCSV says: Illegal quoting in line 1.
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Plumbing,196222006P,"REPLACE LEAD WATER SERVICE W/1"" COPPER",1996-08-09 00:00:00 (size 4)
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
FasterCSV says: Unclosed quoted field on line 1.
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Construction,197133031B,"""MORGAN SHOES"" ALT",1997-05-13 00:00:00 (size 4)
Electrical,197135021E,"SERVICE, "OUTLETS"",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS""",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS""""",1997-05-15 00:00:00 (size 5)
Electrical,197135021E,"SERVICE, "OUTLETS" FOOBAR",1997-05-15 00:00:00
FasterCSV says: Missing or stray quote in line 1
Electrical,197135021E,"SERVICE ""OUTLETS"" FOOBAR",1997-05-15 00:00:00 (size 4)
Electrical,197135021E,"""SERVICE"," ""OUTLETS"" FOOBAR""",1997-05-15 00:00:00 (size 5)
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00
Construction,198120036B,"""MERITER"",""DO IT CTR"", ""NCR"" AND ""TRACE"" ALTERATION",1998-04-30 00:00:00 (size 4)
Construction,198120036B,"""""MERITER""","""DO IT CTR"""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
Construction,198120036B,"""""""MERITER""""","""""DO IT CTR"""""," """"NCR"""" AND """"TRACE"""" ALTERATION""",1998-04-30 00:00:00 (size 6)
UPDATE
Note that the CSV uses double quotes when a field has a comma.
UPDATE 2
It's fine if commas are stripped out of the fields in question... my parse_ugly method doesn't preserve them.
UPDATE 3
I learned from the client that it's SQL Server 2008 that's exporting this strange CSV - which has been reported to Microsoft here and here
UPDATE 4
@dbenhur's answer worked perfectly for what I originally asked for, but pointed out that I neglected to show lines with both commas and quotes. I will accept d@benhur's answer - but I'm hoping it can be improved to work on all lines above.
HOPEFULLY FINAL UPDATE
This code works (and I would consider it "semantically correct"):
QUOTED = /"((?:[^"]|(?:""(?!")))*)"/
SEPQ = /,(?! )/
UNQUOTED = /([^,]*)/
SEPU = /,(?=(?:[^ ]|(?: +[^",]*,)))/
FIELD = /(?:#{QUOTED}#{SEPQ})|(?:#{UNQUOTED}#{SEPU})|\Z/
def parse_sql_server_2008_csv_line(line)
line.scan(FIELD)[0...-1].map{ |matches| (matches[0] || matches[1]).tr(',', ' ').gsub(/\s+/, ' ') }
end
Adapted from @dbenhur and @ghostdog74's answer in How can I process a CSV file with “bad commas”?