4 votes

Before Ruby 1.9.3, I was able to ingest rows containing characters with invalid encodings using Ruby's CSV library:

require 'csv'
CSV.open('file').each do |row|
  ... # deal with wrongly encoded characters here
end

In Ruby 1.9.3, CSV raises an exception on 'bad' rows -- ArgumentError: invalid byte sequence in UTF-8 -- which I cannot figure out how to rescue inside the block. I see two solutions, but both are slow:

1. ~8 times slower:

require 'csv'
File.open('file').each do |line|
  begin
    CSV.parse(line)
  rescue ArgumentError
    line.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '?')
    line.encode!('UTF-8', 'UTF-16')
    CSV.parse(line)
  end
end

2. ~2 times slower:

fix the encoding in the whole file before handing it to CSV.
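Solution #2 can be sketched as follows: read the file as raw bytes, replace every invalid byte sequence in one pass (round-tripping through UTF-16, as in solution #1, but for the whole file at once), then parse the cleansed text. The sample data and the `?` replacement character are illustrative assumptions.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical demo file containing an invalid UTF-8 byte (\xE9 is
# Latin-1 for "é", which is not a valid UTF-8 sequence).
dirty = Tempfile.new('dirty')
dirty.binmode
dirty.write("name,city\nJos\xE9,Paris\n")
dirty.close

# Read raw bytes, then fix the encoding en masse: converting through
# UTF-16 replaces every invalid byte sequence with '?'.
raw   = File.open(dirty.path, 'rb') { |f| f.read }
clean = raw.encode('UTF-16', 'UTF-8', :invalid => :replace, :replace => '?')
           .encode('UTF-8', 'UTF-16')

rows = CSV.parse(clean)
# rows[1] is now ["Jos?", "Paris"] -- valid UTF-8 throughout
```

This avoids the per-line rescue of solution #1, so CSV only ever sees valid input.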

What is a faster way to deal with rows containing 'bad' characters?

1
If you can, cleanse your input by reading/cleansing/writing the data as a text file, to fix it en masse. Then load it row by row as CSV. That will be a lot faster than trying to fix it on a row-by-row basis. – the Tin Man
Hm, isn't that the same as solution #2? – dimus

1 Answer

3 votes
CSV.open("file", "r:bom|utf-8").each do |row|
  # row arrives with the leading byte-order mark (if any) stripped
end
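For context, the `"r:bom|utf-8"` mode string tells Ruby to skip a leading byte-order mark when opening the file, so a BOM never reaches the parser as 'bad' characters in the first field. A minimal sketch, using a hypothetical temp file with a UTF-8 BOM:

```ruby
require 'csv'
require 'tempfile'

# Hypothetical demo file starting with the UTF-8 BOM bytes EF BB BF.
f = Tempfile.new('bom')
f.binmode
f.write("\xEF\xBB\xBFname,city\nAda,London\n")
f.close

# "bom|utf-8" strips the BOM, so the first header cell is "name",
# not "\xEF\xBB\xBFname".
header = nil
CSV.open(f.path, 'r:bom|utf-8').each do |row|
  header ||= row
end
# header is ["name", "city"]
```

Note this only helps when the offending bytes are a BOM; genuinely invalid byte sequences elsewhere in the file still need cleansing as in solution #2.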