37 votes

Hannibal episodes in TVDB have weird characters in them.

For example:

Œuf

So Ruby spits out:

./manifesto.rb:19:in `encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
    from ./manifesto.rb:19:in `to_json'
    from ./manifesto.rb:19:in `<main>'

Line 19 is:

puts @tree.to_json

Is there a way to deal with these non-UTF characters? I'd rather not replace them; can I convert them, or ignore them? I don't know, any help is appreciated.

The weird part is that the script works fine via cron; running it manually produces the error.

Set a proper codepage like ISO-8859-1 instead of ASCII-8BIT on the variable @tree, as in @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is used just for binary files. – Малъ Скрылевъ
I guess the cron environment is somehow resolving the default input encoding for you. I think your input is actually UTF-8 in the first place (C3 is a common byte to see at the start of a multi-byte character for European characters). – Neil Slater
@Малъ Скрылевъ: In this case I think the input may not be an ISO-8859 variant, but UTF-8 that has been incorrectly defaulted. Although with just one sample point, without the matching character, it could be anything. – Neil Slater
@NeilSlater Why do you think so? Isn't the char Œ in the ISO codepage? – Малъ Скрылевъ
If sudo solves the problem, the problem was in the default codepage. Please find out what the default codepage is and set it in Ruby for the default user, like this: Encoding.default_external = Encoding::UTF_8, replacing UTF-8 with the proper one. – Малъ Скрылевъ

4 Answers

19 votes

It seems you should use another encoding for the object. You should set the proper codepage on the variable @tree, for instance using ISO-8859-1 instead of ASCII-8BIT, by calling @tree.force_encoding('ISO-8859-1'), because ASCII-8BIT is used just for binary files.
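
A minimal sketch of that idea on made-up data (the question's strings would come from TVDB instead); note that force_encoding only relabels the bytes, so a follow-up encode('UTF-8') is still needed before to_json:

    require 'json'

    # simulate bytes that were read without an encoding: Ruby labels them ASCII-8BIT
    raw = "éuf".encode('ISO-8859-1').force_encoding('ASCII-8BIT')

    # calling { title: raw }.to_json at this point fails with an encoding error, as on line 19

    # force_encoding relabels the bytes as ISO-8859-1; encode then transcodes them to UTF-8
    fixed = raw.force_encoding('ISO-8859-1').encode('UTF-8')

    puts({ title: fixed }.to_json)   # => {"title":"éuf"}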

To find Ruby's current external encoding, issue:

Encoding.default_external
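
What it returns depends on the locale environment (LANG / LC_ALL) the process was started with, which may be why the cron run and the manual run behave differently here. As an illustration rather than output from the question's machine:

    Encoding.default_external
    # => #<Encoding:US-ASCII> when LANG/LC_ALL are unset or set to the C/POSIX locale
    # => #<Encoding:UTF-8>    when LANG ends in .UTF-8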

If running the script under sudo solves the problem, the problem was in the default codepage (encoding), so to resolve it you have to set the proper default codepage (encoding) in one of the following ways:

  1. In Ruby, to change the default encoding to UTF-8 (or another proper one), do as follows:

    Encoding.default_external = Encoding::UTF_8
    
  2. In bash, grep the currently valid setup:

    $ sudo env|grep UTF-8
    LC_ALL=ru_RU.UTF-8
    LANG=ru_RU.UTF-8
    

    Then set them properly in your .bashrc in a similar way (not necessarily with the ru_RU locale), for example:

    export LC_ALL=ru_RU.UTF-8
    export LANG=ru_RU.UTF-8
    
20 votes

File.open(yml_file, 'w') should be changed to File.open(yml_file, 'wb').
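
A small sketch of that change, with hypothetical yml_file and data names (neither comes from the original code):

    require 'yaml'

    yml_file = 'tree.yml'              # hypothetical path
    data     = { 'title' => 'Œuf' }

    # 'b' (binary mode) sets the file's external encoding to ASCII-8BIT,
    # so the bytes are written as-is with no encoding or newline conversion
    File.open(yml_file, 'wb') { |f| f.write(data.to_yaml) }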

2 votes

I just suffered through a number of hours trying to fix a similar problem. I'd checked my locales, database encoding, everything I could think of, and was still getting ASCII-8BIT-encoded data from the database.

Well, it turns out that if you store text in a binary field, it is automatically returned as ASCII-8BIT-encoded text, which makes sense; however, this can (obviously) cause problems in your application.

It can be fixed by changing the column type back to :text in your migrations.
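
A minimal migration sketch of that fix, assuming a hypothetical episodes table whose title column was created as :binary (the table, column, and class names are illustrative; on Rails versions before 5, drop the [6.0] version tag):

    class ChangeEpisodeTitleToText < ActiveRecord::Migration[6.0]
      def up
        # a :binary column always comes back as ASCII-8BIT; a :text column
        # comes back in the database's encoding (typically UTF-8)
        change_column :episodes, :title, :text
      end

      def down
        change_column :episodes, :title, :binary
      end
    end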

2 votes

I had the same problem when saving to the database. I'll offer one approach that I use (perhaps it will help someone).

If you know that your text sometimes contains strange characters, you can encode the text in some other format before saving, and then decode it again after it is returned from the database.

Example:

    require 'cgi'

    string = "Œuf"

    # before saving, encode the string;
    # the character "Œ" becomes "%C5%92" and the other characters stay the same
    text_to_save = CGI.escape(string)
    # => "%C5%92uf"

    # after loading from the database, decode it again
    CGI.unescape("%C5%92uf")
    # => "Œuf"