0
votes

I am running perl 5, version 24, subversion 3 (v5.24.3) built for MSWin32-x64-multi-thread (with 1 registered patch, see perl -V for more detail) (Active State).

I am trying to parse an HTML page encoded in UTF-8:

$request = new HTTP::Request('GET', $url);
$response = $ua->request($request);
$content = $response->content();

I parse $content as one giant string using the index and substr functions, and that works fine. The HTML page contains the string ÖBB, and I need to insert it into the database exactly as ÖBB. When I print it or insert it into the database, I get some other characters instead of Ö.

NOTE: this question is not database related; MySQL handles utf-8 just fine, so if I insert value "ÖBB" it will take it no problem.

I've looked at a great number of similar questions and answers here and in other forums, and I am none the wiser.

use utf8 and binmode(STDOUT, ":utf8") have not worked for me. I would greatly appreciate a code snippet that solves the issue. Thank you.

3
How do you get that HTML page (string) into your program? - zdim
What database do you use? What driver? How do you connect to the database? - choroba
Consider using utf8::all which tries to make Perl as UTF8 friendly as possible. - Schwern
When you say "print", do you mean print to the console? In that case, make sure you have the Unicode codepage active. Type chcp on the command line. If it says "65001", you're good in that regard. If not, type chcp 65001 to activate the correct codepage. - Holli
Is your database configured to support UTF-8? You did not indicate which database you use. If you use MySQL, see the following instructions. If you are on MS Windows, do not forget to change the code page in the terminal window with chcp 65001. - Polar Bear

3 Answers

1
votes

Decode inputs; encode outputs.


First of all, you don't decode your inputs.

$response->content returns the raw content that could be in any encoding. Use $response->decoded_content(); to get the decoded response if it's HTML.
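
As a minimal sketch of what decoding changes, assuming the page is UTF-8: the example below uses the core Encode module instead of a live request, and the byte string is only a stand-in for what $response->content would return, not the asker's actual page.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for $response->content: the raw UTF-8 bytes of "ÖBB".
my $bytes = "\xC3\x96BB";

# Decoding turns the bytes into Perl characters.
my $text = decode('UTF-8', $bytes);

# index and length count bytes on the raw string, characters on the decoded one.
printf "bytes: length %d, BB at offset %d\n", length($bytes), index($bytes, 'BB');
printf "chars: length %d, BB at offset %d\n", length($text),  index($text,  'BB');
```

On the decoded string, index and substr count characters, so Ö occupies one position rather than two. $response->decoded_content performs this decoding step using the charset declared in the response.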


Second of all, you might not be encoding your outputs.

You didn't specify which database driver you use. Most DBI drivers have an option you need to specify. For example, with MySQL, you want

my $dbh = DBI->connect(
   'dbi:mysql:...',
   $user, $password,
   {
      mysql_enable_utf8mb4 => 1,
      ...
   },
);

You mentioned use utf8;. That tells Perl that your source code is encoded using UTF-8 rather than ASCII. Do use it if your source code is encoded using UTF-8.

This is not directly related to your issue.
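
To illustrate what use utf8; affects, here is a small sketch that assumes the script file itself is saved as UTF-8. It only concerns string literals in the source, not data read from the network or a file.

```perl
use strict;
use warnings;

# The pragma is lexically scoped, so both cases fit in one file.
my $with    = do { use utf8; length("ÖBB") };   # literal read as 3 characters
my $without = do { no utf8;  length("ÖBB") };   # literal read as 4 bytes

print "with use utf8: $with, without: $without\n";
```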


You mentioned binmode(STDOUT, ":utf8"). That's a very poor way of writing

use open ':std', ':encoding(UTF-8)';

The above handles that for STDIN, STDOUT and STDERR, and does so at compile time. It also sets the default for files opened in scope of the pragma.

But that's assuming the terminal expects UTF-8. That would be the case if you used chcp 65001. For a version that handles whatever encoding the terminal expects, you can use the following:

BEGIN {
   require Win32;
   my $cie = "cp" . Win32::GetConsoleCP();
   my $coe = "cp" . Win32::GetConsoleOutputCP();
   my $ae  = "cp" . Win32::GetACP();

   binmode(STDIN,  ":encoding($cie)");
   binmode(STDOUT, ":encoding($coe)");
   binmode(STDERR, ":encoding($coe)");

   require open;
   "open"->import(":encoding($ae)");
}

This has a few more details.

This is not directly related to your issue.
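
To see what an output encoding layer such as the ones above does to a decoded string, here is a hedged sketch that writes through :encoding(UTF-8) into an in-memory buffer instead of a terminal, so the resulting bytes can be inspected. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Decoded text: the characters Ö, B, B.
my $text = "\x{D6}BB";

# Open an in-memory filehandle with an encoding layer, as a stand-in for STDOUT.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out);

# Show the bytes that were actually written.
print join(' ', map { sprintf '%02X', ord } split //, $buffer), "\n";
```

The layer turns the single character Ö into the two bytes C3 96, which is what a UTF-8 terminal or file expects to receive.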

0
votes

This is what worked:

use Win32::API;
binmode(STDOUT, ":unix:utf8");
$SetConsoleOutputCP= new Win32::API( 'kernel32.dll', 
'SetConsoleOutputCP', 'N','N' );
$SetConsoleOutputCP->Call(65001);

All this was on the surface and I simply overlooked it ;-)

For the MySQL database to accept UTF-8 encoded strings correctly, this connection parameter had to be enabled: mysql_enable_utf8 => 1

-3
votes

Several components are involved when you capture a webpage and output it to the screen.

For the moment, let's assume that you use Windows and run the following script in a terminal window.

First, confirm that your terminal supports UTF-8. Type the command chcp and see whether it outputs 65001.

If it does, you are set. If it does not, issue the command chcp 65001.

Run the script with the command perl script_name.pl, and you should get output that includes ÖBB in the terminal window.

use strict;
use warnings;

use utf8;
use feature 'say';

use HTTP::Tiny;

my $url = shift || 'https://www.thetrainline.com/en/train-companies/obb';

my $response = HTTP::Tiny->new->get($url);

if ($response->{success}) {
   my $html = $response->{content};

   # Only print when the pattern actually matched; otherwise $1 would be undefined.
   if ($html =~ m/(<p>Planning.+pets.<\/p>)/) {
      say $1;
   }
}

To store data in UTF-8 in a database, the database must be configured to support UTF-8.

For a MySQL database, the command looks like the following:

CREATE DATABASE mydb
  CHARACTER SET utf8
  COLLATE utf8_general_ci;

See the following MySQL documentation webpage.
