Convert accented characters of unknown charset to UTF-8

Question

I have a website where users may enter search terms containing accented characters. Since users come from various countries and use various operating systems, the accented characters they enter may be encoded in windows-1252, iso-8859-1, or even another iso-8859-X or windows-125X charset.

I am using Perl, and my index server is Solr 8, with all data in UTF-8. I can use decode and encode to convert the input if the source charset is known, but how can I convert accented characters in an unknown charset to UTF-8? How can I detect the charset of the input in Perl?

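use utf8;
use Encode qw(decode encode);
my $input = "caf\xE9";   # example input: raw bytes, known to be cp1252
# Works only because the source charset is known
my $utf8_bytes = encode("UTF-8", decode("cp1252", $input));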

Joop Eggen · Accepted Answer · 2020-07-19T00:57:58

The web page and the form need to specify UTF-8.

Then the browser can accept any script, and will send it to the server as UTF-8.

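Setting the form's encoding to UTF-8 also prevents the browser from sending special characters as HTML numeric entities such as &#259; (for ă), which it would do for characters it cannot represent in a narrower charset.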

Header:

Content-type: text/html; charset=UTF-8

With Perl (the empty line marks the end of the headers):

print "Content-Type: text/html; charset=UTF-8\n\n";
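The header only declares UTF-8; the bytes the script prints have to match it. A minimal sketch, assuming a plain CGI script that writes to STDOUT (the helper name utf8_html_header is hypothetical, introduced only for illustration):

```perl
use strict;
use warnings;
use utf8;    # the string literals in this source file are UTF-8

# Encode everything printed to STDOUT as UTF-8, matching the header.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: returns the header block, ending with the
# empty line that marks the end of the headers.
sub utf8_html_header {
    return "Content-Type: text/html; charset=UTF-8\n\n";
}

print utf8_html_header();
print "<p>Search: café, naïve, ăâî</p>\n";
```

Without the binmode call, Perl would write characters below U+0100 as single Latin-1 bytes, which contradicts the declared charset.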

HTML content; in HTML 5:

<!DOCTYPE html>
<html>
    <meta charset="UTF-8">
...
<form ... accept-charset="UTF-8"
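On the server side, the submitted term can then be decoded as UTF-8. The following is a rough sketch, not part of the answer above: it assumes the raw parameter bytes are already available, and the helper name decode_search_term is hypothetical. For clients that ignore the form's charset it falls back to windows-1252, which is a guess rather than real charset detection.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turns raw request bytes into a Perl character string.
sub decode_search_term {
    my ($bytes) = @_;
    # Try strict UTF-8 first. FB_CROAK makes decode die on malformed
    # input; LEAVE_SRC keeps $bytes unmodified for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;
    # Assumption: anything that is not valid UTF-8 is windows-1252.
    return decode('cp1252', $bytes);
}

my $from_utf8   = decode_search_term("caf\xC3\xA9");   # UTF-8 bytes
my $from_cp1252 = decode_search_term("caf\xE9");       # cp1252 bytes
# Both are now the same character string; encode it as UTF-8 for Solr.
```

Accented text in a single-byte charset is rarely valid UTF-8 by accident, so the strict decode is a reasonable first test, but the fallback cannot distinguish windows-1252 from the other iso-8859-X or windows-125X charsets.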
