Character conversion in ruby 1.8.7 from pdftk unicode conversion results

Question

I am parsing titles from PDF files using pdftk, and the titles contain various language-specific characters.

This Ruby on Rails application uses Ruby 1.8.7 and Rails 2.3.14, so the encoding support built into Ruby 1.9 isn't an option for me right now.

Example of what I need to do:

If the title includes a ü, when I read the PDF metadata using pdftk (either on the command line or through the Ruby pdf-toolkit gem), the "ü" gets converted to &#252;

In my application, I really want this as the rendered ü, since that works fine for my needs in a web page and in an XML file.

I can convert the character explicitly in ruby using

>> string = "&#252;"
=> "&#252;"
>> string.gsub("&#252;","ü")
=> "ü"

but obviously I don't want to do this one by one.

I've tried using Iconv to do this, but I don't know which encodings to specify to get the rendered character. I thought maybe this was just UTF-8, but the conversion leaves the string unchanged:

>> Iconv.iconv("latin1", "utf-8","&#252;").join
=> "&#252;"

I am a little confused about which from/to encodings to use here to end up with the rendered character.

So how do I use Iconv or other tools to convert every character that pdftk has turned into one of these HTML codes?

Or how do I tell pdftk to do this when I read the PDF file in the first place?

Streamline · Accepted Answer · 2012-05-17T14:28:33

OK, I think the issue here is that the codes pdftk returns are HTML character references, so unescaping the HTML first is the path that works:

>> Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(string) ).join
=> "ü"
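
A plausible reading of why the Latin-1 step helps (an assumption about CGI.unescapeHTML on Ruby 1.8, not something confirmed in this thread): code points below 256 come back as single raw bytes, which is Latin-1 rather than UTF-8. The sketch below performs the same byte-to-UTF-8 widening without Iconv; latin1_to_utf8 is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: treat every byte of the input as a Latin-1 character
# and re-encode it as UTF-8. Uses only String#unpack and Array#pack.
def latin1_to_utf8(bytes)
  # "C*" reads each byte as an integer 0..255, which in Latin-1 equals the
  # Unicode code point; "U*" writes those code points out as UTF-8.
  bytes.unpack("C*").pack("U*")
end

# "\xFC" is the single Latin-1 byte for ü.
puts latin1_to_utf8("\xFC").unpack("C*").inspect   # prints [195, 188]
```

The two bytes 195 and 188 (0xC3 0xBC) are the UTF-8 encoding of ü, which is what the Iconv call above would be producing if that reading is right.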

Update:

Using the following

  pdf = PDF::Toolkit.open(file)
  pdf.title = Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(pdf.title)).join

This seems to work for most languages, but when I apply it to Japanese and Chinese titles it mangles them, and the result no longer matches the title as it appears in the PDF.

Update:

Getting closer: it appears that the HTML codes pdftk puts in the title for Japanese and Chinese already render correctly if I just unescape them and skip the Iconv conversion.

CGI.unescapeHTML(pdf.title)

This renders correctly.

So... how do I test pdf.title ahead of time to see whether it is Chinese or Japanese (double byte?) before applying the conversion needed for other languages?
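
One possible way to avoid detecting the language at all, sketched under the assumption that pdftk writes every non-ASCII character as a numeric reference (&#NNN; or &#xHHHH;): decode each reference straight to UTF-8 with Array#pack("U"), which is available in Ruby 1.8.7. decode_numeric_entities is a hypothetical helper, not part of pdftk or pdf-toolkit, and it deliberately leaves named entities such as &amp;amp; untouched.

```ruby
# Hypothetical helper: replace decimal (&#252;) and hexadecimal (&#xFC;)
# character references with the UTF-8 encoding of their code point.
def decode_numeric_entities(text)
  text.gsub(/&#(x[0-9a-f]+|[0-9]+);/i) do
    ref = $1
    # A leading "x" marks a hexadecimal reference; otherwise it is decimal.
    codepoint = ref[0, 1].downcase == "x" ? ref[1..-1].hex : ref.to_i
    # pack("U") emits the code point as UTF-8 regardless of its size,
    # so Latin and CJK characters go through the same path.
    [codepoint].pack("U")
  end
end

# Code points of the decoded strings, to show both ranges are handled.
puts decode_numeric_entities("M&#252;nchen").unpack("U*").inspect
puts decode_numeric_entities("&#26085;&#x672C;").unpack("U*").inspect
```

If the assumption holds, pdf.title = decode_numeric_entities(pdf.title) should cover Latin and CJK titles alike, with no Iconv step and no need to test for double-byte content first.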
