Before Ruby 1.9.3, I was able to ingest rows containing invalidly encoded characters using Ruby's CSV library:
require 'csv'
CSV.open('file').each do |row|
  ... # deal with wrongly encoded characters here
end
In Ruby 1.9.3, CSV raises an exception on 'bad' rows -- ArgumentError: invalid byte sequence in UTF-8 -- which I cannot figure out how to catch inside the block. I see two workarounds, but both are slow:
1. ~8 times slower:
open('file').each do |line|
  begin
    CSV.parse(line)
  rescue ArgumentError
    line.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '?')
    line.encode!('UTF-8', 'UTF-16')
    CSV.parse(line)
  end
end
2. ~2 times slower: fix the encoding of the whole file before handing it to CSV.
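For reference, option 2 can be sketched like this: the same UTF-16 round trip as in option 1, but applied once to the entire file contents instead of per line (this assumes the file fits in memory; the sample file name and contents here are made up for illustration). The round trip through UTF-16 is needed because encoding a UTF-8 string to UTF-8 is a no-op and does not replace invalid bytes:

```ruby
require 'csv'

# Hypothetical sample file with an invalid UTF-8 byte (\xFF) in row 2
File.binwrite('file.csv', "a,b\nc,\xFFd\n")

# Repair invalid sequences once for the whole file, then parse
data = File.read('file.csv')
data.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '?')
data.encode!('UTF-8', 'UTF-16')

rows = CSV.parse(data)
# rows => [["a", "b"], ["c", "?d"]]
```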
What is a faster way to deal with rows containing 'bad' characters?