0
votes

I am trying to read an RTF file and extract the characters in it. For example, below is the RTF representation of ф:

{\rtf1\ansi\ansicpg1252\fromtext \fbidis \deff0{\fonttbl {\f0\fswiss\fcharset0 Arial;} {\f1\fmodern Courier New;} {\f2\fnil\fcharset2 Symbol;} {\f3\fmodern\fcharset0 Courier New;} {\f4\fswiss\fcharset204 Arial;}} {\colortbl\red0\green0\blue0;\red0\green0\blue255;} \uc1\pard\plain\deftab360 \f0\fs20 \htmlrtf{\f4\fs20\htmlrtf0 \'f4\htmlrtf\f0}\htmlrtf0 \par }

As you can see, the encoding declared in it is Windows-1252 (\ansicpg1252).

#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(decode encode);

binmode(STDOUT, ":utf8");
my $runtime = chr(0x0444);
print "theta || ".$runtime." ||";

my $hexstr = "0xF4";
my $num = hex $hexstr;
my $be_num = pack("N", $num);
$runtime = decode( "cp1252",$be_num);
print "\n".$runtime."\n";

$runtime = decode( "cp1251",$be_num);
print "\n".$runtime."\n";

Output

theta || ф ||
ô

ф

As you can see, with cp1252 I am getting ô. Am I missing something? I wanted to take the encoding from the RTF itself. I expected the script to print ф, but it printed ô.

1
Re "As you can see that with cp1252 i am getting ô. Am i missing something ?" Apparently, because character F4 in cp1252 is ô. I can't tell you what you're missing since I don't know RTF. - ikegami
Aside: my $be_num = pack("N", $num); produces "\0\0\0\xF4". To get "\xF4" as you want, use my $be_num = pack("C", $num); instead. my $be_num = chr($num); will also do. - ikegami
@ikegami if you save the above data in .rtf, you will get a ф. I wonder why it prints ф. It should print ô. - S Kr
Re "if you save the above data in .rtf, you will get a ф. I wonder why it prints ф.", I know, but I don't know RTF, so I can't help with that. Re "It should print ô", I cannot confirm or deny that for the same reason. All I could do is confirm that you are getting the correct output from your Perl script, so that's all I did. - ikegami
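The pack point raised in the comments can be checked directly. This is only a small sketch that compares the two templates on the same value; the variable name $byte is introduced here for illustration.

```perl
use strict;
use warnings;

my $num = hex "0xF4";

my $be_num = pack("N", $num);  # 32-bit big-endian integer: four bytes
my $byte   = pack("C", $num);  # unsigned char: a single byte

printf "N: %d byte(s), %s\n", length($be_num), unpack("H*", $be_num);  # N: 4 byte(s), 000000f4
printf "C: %d byte(s), %s\n", length($byte),   unpack("H*", $byte);    # C: 1 byte(s), f4
```

With the "N" template the decoded string carries three leading NUL characters, which are invisible on most terminals, so the printed result looks the same either way.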

1 Answer

4
votes

While the global codepage for the document is cp1252, there are local definitions that override it:

  • The \'f4 character is written with font f4: {\f4...\'f4.
  • But the definition for font f4 is: {\f4\fswiss\fcharset204 Arial;}
  • \fcharset204 sets the charset for this font to 204, i.e. Russian, which is codepage 1251 (according to http://msdn.microsoft.com/en-us/library/cc194829.aspx)

And with codepage 1251 you get the expected character ф.
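The lookup described above could be sketched in Perl roughly as follows. This is a minimal illustration, not a real RTF parser: the helpers font_encodings and decode_escape are hypothetical names, the charset table is deliberately partial, and the regex assumes font table entries shaped like the ones in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Partial \fcharsetN -> Encode name map (assumption: only a few common values).
my %charset_to_encoding = (
    0   => 'cp1252',  # ANSI / Western
    161 => 'cp1253',  # Greek
    162 => 'cp1254',  # Turkish
    204 => 'cp1251',  # Russian (Cyrillic)
    238 => 'cp1250',  # Eastern European
);

# Hypothetical helper: map font number -> encoding from {\fN...\fcharsetM ...;} entries.
sub font_encodings {
    my ($rtf) = @_;
    my %enc;
    while ($rtf =~ /\{\\f(\d+)[^{};]*?\\fcharset(\d+)/g) {
        $enc{$1} = $charset_to_encoding{$2} if exists $charset_to_encoding{$2};
    }
    return \%enc;
}

# Hypothetical helper: decode one \'hh escape with the encoding of the font in effect,
# falling back to the document-wide default when the font has no known charset.
sub decode_escape {
    my ($hex, $font, $encodings, $default) = @_;
    my $encoding = $encodings->{$font} // $default;
    return decode($encoding, chr(hex $hex));
}

my $rtf = <<'RTF';
{\fonttbl {\f0\fswiss\fcharset0 Arial;} {\f4\fswiss\fcharset204 Arial;}}
RTF

my $encodings = font_encodings($rtf);
binmode(STDOUT, ':encoding(UTF-8)');
print decode_escape('f4', 4, $encodings, 'cp1252'), "\n";  # ф (font f4 -> cp1251)
print decode_escape('f4', 0, $encodings, 'cp1252'), "\n";  # ô (font f0 -> cp1252)
```

A full solution would also have to track which font is active at each point in the document (including group nesting), which this sketch leaves to the caller.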

BTW, codepage 1252 is similar to Latin-1 and does not have a character ф at all (see http://en.wikipedia.org/wiki/Windows-1252).
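One way to confirm this from Perl is to try encoding ф into each codepage and see whether Encode accepts it. A small sketch, assuming a hypothetical helper named can_encode:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns 1 if $char can be represented in $encoding, else 0.
sub can_encode {
    my ($encoding, $char) = @_;
    my $copy = $char;  # encode() may modify its argument when a CHECK value is given
    return eval { encode($encoding, $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

printf "cp1252: %d\n", can_encode('cp1252', "\x{0444}");  # cp1252: 0
printf "cp1251: %d\n", can_encode('cp1251', "\x{0444}");  # cp1251: 1
```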