I'm parsing a CSV file that I've pulled from an FTP site, and I want to extract some specific fields to store in the database. The file appears to use an encoding I don't understand, and I believe CSV.parse isn't expecting that encoding either:

filename = "#{RAILS_ROOT}/spec/files/20120801.01.001.CSV"
filestream = File.new(filename, "r")
while (line = filestream.gets)
  puts "line: #{line}"
  CSV.parse(line) do |row|
    case row[0]
    when "RH"
      # do something
    when "SH"
      # do something else
    end
  end
end

The first line in the CSV file looks something like this:

"\376\377\000\"\000R\000H\000\"\000,\0002\0000\0004\0005\000/\0000\0008\000/\0000\0002\000 \0000\0005\000:\0005\0007\000:\0002\0001\000 \000-\0000\0007\0000\0000\000,\0002\0000\0001\0002\000/\0000\0008\000/\0000\0001\000 \0000\0000\000:\0000\0000\000:\0000\0000\000 \000-\0000\0004\0000\0000\000,\0002\0000\0001\0002\000/\0000\0008\000/\0000\0001\000 \0002\0003\000:\0005\0009\000:\0001\0004\000 \000-\0000\0007\0000\0000\000,\000\"\000Y\0003\000B\0003\0003\000Z\000N\000K\000A\000U\000B\000H\000N\000\"\000,\0000\0000\0001\000,\000\n"

I have a different CSV file that I created myself, and it prints out as human-readable text. What am I missing here? Do I need to apply some encoding to the CSV string before passing it to CSV.parse?

Here's the stacktrace:

CSV::IllegalFormatError
/Users/project/app/models/parse_csv.rb:5:in `parse'

I am forced to use ruby v1.8.7 at the moment.

I know that I could use CSV.open, but I'm intentionally trying to feed CSV.parse an IO stream so that I can grab CSV files from an FTP site using SFTP to stream the files into memory without having to store the CSV file to disk:

sftp.open_handle("/path/to/remote.file") do |handle|
  data = sftp.read(handle)
end

Thanks in advance for any ideas!

1 Answer

The line has double quotes in it, which may need to be escaped. I found this suggestion on ruby-forum.com:

It's just a guess, but maybe you could try replacing every double-quote character that isn't either preceded or followed by a comma with a single quote? Something like the untested code below:

line.gsub(/([^,])"([^,])/, "\\1'\\2")

It would probably require reading the whole file first, writing out a corrected version, and then calling the CSV methods on that, but it beats doing it by hand :).

Also, as an aside, I think instead of

while (line = filestream.gets)

you could do

filestream.each_line do |line|

which might be more rubyish (maybe?)
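
One more thing that may be worth checking: the line printed in the question starts with \376\377, which are the bytes 0xFE 0xFF, the UTF-16 big-endian byte order mark, and every character is preceded by \000. That pattern suggests the file could be UTF-16 rather than malformed CSV, in which case converting it to UTF-8 before calling CSV.parse might be enough. Below is a rough sketch under that assumption. The helper name utf16_to_utf8 and the sample data are made up for illustration, and the Iconv branch for Ruby 1.8.7 is untested here and assumes the local iconv handles the byte order mark itself.

```ruby
require 'csv'

# Hypothetical helper, not from the question: turn UTF-16 bytes that start
# with a byte order mark (BOM) into a UTF-8 string.
def utf16_to_utf8(raw)
  unless raw.respond_to?(:encode)
    # Ruby 1.8.7 has no String#encode; Iconv is part of its standard library.
    require 'iconv'
    return Iconv.conv('UTF-8', 'UTF-16', raw)
  end

  # Ruby 1.9 and later: inspect the first two bytes to pick the byte order.
  bytes = raw.dup.force_encoding('BINARY')
  source = case bytes[0, 2]
           when [0xFE, 0xFF].pack('C*') then 'UTF-16BE'
           when [0xFF, 0xFE].pack('C*') then 'UTF-16LE'
           else raise ArgumentError, 'no UTF-16 byte order mark found'
           end

  # Drop the two BOM bytes, then transcode the rest to UTF-8.
  bytes[2..-1].force_encoding(source).encode('UTF-8')
end

# Made-up sample resembling the start of the line in the question:
# BOM followed by UTF-16BE text.
sample = [0xFE, 0xFF].pack('C*') +
         "\"RH\",001,\n".encode('UTF-16BE').force_encoding('BINARY')

rows = CSV.parse(utf16_to_utf8(sample))
puts rows.first[0]
```

With the SFTP snippet from the question, the same helper could be applied to data before parsing, so nothing needs to be written to disk. Note that a UTF-16 stream should be converted as a whole (or at least in even-sized chunks) rather than split on "\n" first, because the newline is itself two bytes in that encoding.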