2
votes

I have a Rails application that accepts file uploads of CSV files. When developing the feature locally on my Mac, I received an "invalid byte sequence in UTF-8" error when trying to parse the uploaded file (using Ruby's standard library CSV).

So after doing some research and reading some answers to similar questions on StackOverflow, I tried using a gem to sniff out the character encoding (namely CharDet), and then when opening the file via the CSV library, I would specify the encoding. And this solved all my problems, and life was good.

    content = File.read(fullpath)
    self.file_encoding = CharDet.detect(content)['encoding']
    CSV.table(fullpath, :encoding => file_encoding, :header_converters => :downcase).headers

But then I deployed this code to the production Linux environment, and again with the "invalid byte sequence in UTF-8" errors. What a mystery (to me anyway)! After quite some time trying to resolve the error, I tried removing the code that specified the encoding upon opening the file. And miraculously it fixed the problem on production, but now local Mac development is broken.

Keep in mind, that in both cases I'm uploading the same file using the same browser. Does anyone have any insight on what is going on here?

By the way, versions of ruby are close, but not the same. The Mac is ruby 1.9.3-p0, and the Linux server is 1.9.2-p180. The app is Rails 3.2.6.

1
Have you tried running the same ruby version (which is just a good idea generally).Frederick Cheung

1 Answers

1
votes

A few thoughts:

  1. Have you confirmed the encoding of the file that you're uploading?
  2. Have you tested with 1.9.2-p180 on your Mac, as Frederick Cheung suggested?
  3. Have you tried outputting the results of CharDet.detect on each platform to see what the encoding of the received file (as opposed to the uploaded file) is? I wonder if some configuration is different between Apache on Linux and WEBrick on your Mac?
  4. Are you using the same version of CharDet on both platforms? What libraries does it use (e.g. iconv), and are they the same version on both platforms?

I'm not aware of any differences in behavior with regard to encoding between 1.9.2 and 1.9.3, but I haven't specifically researched it either. It could also be a difference in the configuration of the MRI build.