3
votes

I have a large file with two different encodings. The "main" part of the file is UTF-8, but some characters, like <80> (€ in isoxxx) or <9F> (ß in isoxxx), are in ISO-8859-1 encoding. I can use this to replace the invalid characters:

 string.encode("iso8859-1", "utf-8", {:invalid => :replace, :replace => "-"}).encode("utf-8")

The problem is that I need these wrongly encoded characters, so replacing them with "-" is useless for me. How can I fix the wrongly encoded characters in the document with Ruby?

EDIT: I've tried the :fallback option, but with no success (no replacements were made):

 string.encode("iso8859-1", "utf-8",
     :fallback => {"\x80" => "123"}
 )
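A note on why the :fallback hash above stays silent: as far as I can tell, :fallback only fires for characters that are *valid* in the source encoding but *undefined* in the destination; bytes that are invalid UTF-8 never reach it and raise instead. A minimal sketch of the difference:

```ruby
# where :fallback *does* fire -- "€" is valid UTF-8 but undefined in ISO-8859-1:
euro = "\u20AC".encode("iso8859-1", :fallback => { "\u20AC" => "EUR" })
puts euro   #=> EUR

# an invalid UTF-8 byte raises before the fallback is ever consulted:
bad = "\x80".dup.force_encoding("utf-8")
begin
  bad.encode("iso8859-1", :fallback => { "\x80" => "123" })
rescue Encoding::InvalidByteSequenceError => e
  puts e.class   #=> Encoding::InvalidByteSequenceError
end
```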
fallback will only work without the other options. See the link I posted earlier. – phoet
No, I've tried it without the additional options and it doesn't work :( – f00860

3 Answers

1
votes

I used the following code (Ruby 1.8.7). It tests each byte >= 128 to check whether it's the beginning of a valid UTF-8 sequence. If not, it's assumed to be ISO-8859-1 and converted to UTF-8.

Since your file is large, this procedure can be very slow!

class String
  # Grants each char in the final string is utf-8-compliant.
  # based on http://php.net/manual/en/function.utf8-encode.php#39986
  def utf8
    ret = ''

    # scan the string
    # I'd use self.each_byte do |b|, but I'll need to change i
    a = self.unpack('C*')
    i = 0
    l = a.length
    while i < l
      b = a[i]
      i += 1

      # if it's ascii, don't do anything.
      if b < 0x80
        ret += b.chr
        next
      end

      # check whether it's the beginning of a valid utf-8 sequence,
      # i.e. a lead byte 110xxxxx, 1110xxxx, ... that announces
      # n continuation bytes
      m = [0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe]
      n = 1
      n += 1 until n >= m.length || (b & m[n]) == m[n-1]

      # if it's not a valid lead byte, assume iso8859-1 and convert it
      if n >= m.length
        ret += [b].pack('U')
        next
      end

      # if yes, check if the rest of the sequence is utf8, too
      r = [b]
      u = false

      # n bytes matching 10bbbbbb follow?
      n.times do
        if i < l
          r << a[i]
          u = (a[i] & 0xc0) == 0x80
          i += 1
        else
          u = false
        end
        break unless u
      end

      # if the sequence was valid utf-8, keep its bytes; if not, convert them
      ret += r.pack(u ? 'C*' : 'U*')
    end

    ret
  end

  def utf8!
    replace utf8
  end
end

# let s be the string containing your file.
s2 = s.utf8

# or
s.utf8!
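The same byte-scanning idea can also be written for Ruby 1.9+, where strings carry encodings. Here is a condensed sketch under the same assumption (every byte that doesn't start a valid UTF-8 sequence is ISO-8859-1); the helper name `utf8_from_mixed` is mine, and it does not reject overlong encodings:

```ruby
# Walk the bytes: keep valid UTF-8 sequences as-is, and reinterpret
# any stray byte as ISO-8859-1 (its codepoint equals its byte value).
def utf8_from_mixed(str)
  bytes = str.bytes
  out = ''.dup.force_encoding('ASCII-8BIT')
  i = 0
  while i < bytes.length
    b = bytes[i]
    # number of continuation bytes expected after this lead byte
    n = case b
        when 0x00..0x7f then 0   # plain ASCII
        when 0xc0..0xdf then 1   # 2-byte sequence
        when 0xe0..0xef then 2   # 3-byte sequence
        when 0xf0..0xf7 then 3   # 4-byte sequence
        else nil                 # not a valid UTF-8 lead byte
        end
    seq = bytes[i, (n || 0) + 1]
    # valid lead byte with enough continuation bytes (each 10xxxxxx)?
    if n && seq.length == n + 1 && seq[1..-1].all? { |c| (c & 0xc0) == 0x80 }
      out << seq.pack('C*')
      i += n + 1
    else
      # treat the stray byte as ISO-8859-1 and encode it as UTF-8
      out << [b].pack('U').force_encoding('ASCII-8BIT')
      i += 1
    end
  end
  out.force_encoding('UTF-8')
end

puts utf8_from_mixed("\xC3\xA3b\xE3")   #=> ãbã
```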
1
votes

Here is a much faster version of my previous code, compatible with Ruby 1.8 and 1.9.

It identifies invalid UTF-8 chars with a regex and converts only those.

class String

  # Regexp for invalid UTF8 chars.
  # $1 will be the valid utf8 sequence;
  # $3 will be the invalid utf8 char.
  INVALID_UTF8 = Regexp.new(
    '(([\xc0-\xdf][\x80-\xbf]{1}|' +
    '[\xe0-\xef][\x80-\xbf]{2}|' +
    '[\xf0-\xf7][\x80-\xbf]{3}|' +
    '[\xf8-\xfb][\x80-\xbf]{4}|' +
    '[\xfc-\xfd][\x80-\xbf]{5})*)' +
    '([\x80-\xff]?)', nil, 'n')

  if RUBY_VERSION >= '1.9'
    # ensure each char is utf8, assuming that
    # bad characters are in the +encoding+ encoding
    def utf8_ignore!(encoding)

      # avoid bad characters errors and encoding incompatibilities
      force_encoding('ascii-8bit')

      # encode only invalid utf8 chars within string
      gsub!(INVALID_UTF8) do |s|
        $1 + $3.force_encoding(encoding).encode('utf-8').force_encoding('ascii-8bit')
      end

      # final string is in utf-8
      force_encoding('utf-8')
    end

  else
    require 'iconv'

    # ensure each char is utf8, assuming that
    # bad characters are in the +encoding+ encoding
    def utf8_ignore!(encoding)

      # encode only invalid utf8 chars within string
      gsub!(INVALID_UTF8) do |s|
        $1 + Iconv.conv('utf-8', encoding, $3)
      end

    end
  end

end

# "\xe3" = "ã" in iso-8859-1
# mix valid utf-8 chars with an invalid one, which is in iso-8859-1
a = "ãb\xe3"

a.utf8_ignore!('iso-8859-1')

puts a   #=> ãbã
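For completeness: on Ruby 2.1+ the built-in String#scrub covers this case directly. It yields each invalid byte sequence to a block, so the stray bytes can be re-decoded as ISO-8859-1 (again assuming every invalid byte is Latin-1):

```ruby
# a UTF-8 string with one stray ISO-8859-1 byte (0xE3 = "ã")
s = "\xC3\xA3b\xE3".dup.force_encoding('UTF-8')
s.valid_encoding?   #=> false

# scrub yields each invalid byte sequence; re-decode it as ISO-8859-1
fixed = s.scrub { |bytes| bytes.encode('UTF-8', 'ISO-8859-1') }

puts fixed          #=> ãbã
```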