Malformed UTF-8 character when matching Non Breaking Space

Question

I am using utf8 in my Perl program and have the following line of code:

$$pstring =~ s/\xA0/ /g;

which should clean out non breaking spaces from the string.

Under Ubuntu 16.04 and Perl v5.22.1 this is not an issue, but under Ubuntu 14.04 and v5.18.2 I get this error:

Malformed UTF-8 character (fatal)

Then I inspected the string I was trying to match and found that there were non-breaking spaces in there, which could be deleted by the regex

$$pstring =~ s/[\xC2\xA0]/ /g;

but not with

$$pstring =~ s/\xC2\xA0/ /g;

My question is: what is the difference between the last two (why does it only work with brackets), and is there another way of solving this?

Re "$$pstring", why do you have a reference to a scalar? It's possibly legit, but it's quite odd. — ikegami
Actually I had two consecutive \xA0 chars in my string. $$pstring =~ s/\xA0+/ /g; fixed it. — reencode

brian d foy · Accepted Answer · 2018-08-24T15:24:55

My guess is that you are dealing with a raw, UTF-8 encoded string. You haven't shown how you got it or said why you'd want to do that. A small and complete demonstration program that shows how you get the input, how you change it, and what ultimately complains, would help people find the problem. If you add that small demonstration program to your question I might be able to give a better (or even different) answer.

The non-breaking space has code number U+00A0. Under UTF-8 it encodes to the two octets \xC2 and \xA0. Everything with a code number above U+007F has a multi-octet encoding under UTF-8. Everything from U+0000 through U+007F is really just ASCII, so ASCII works as UTF-8.
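To see those octets for yourself, here is a minimal sketch using the core Encode module. The helper name utf8_hex is made up for this illustration; it is not something from the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 octets of a character string
# as space-separated uppercase hex.
sub utf8_hex {
    my ($chars) = @_;
    my $octets = encode('UTF-8', $chars);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

print utf8_hex("\x{00A0}"), "\n";   # NO-BREAK SPACE, two octets: C2 A0
print utf8_hex("A"),        "\n";   # ASCII, one octet: 41
print utf8_hex("\x{20AC}"), "\n";   # EURO SIGN, three octets: E2 82 AC
```

Anything in the ASCII range comes out as a single octet, which is why plain ASCII text is already valid UTF-8.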

If you have the UTF-8 encoded text with the non-breaking space and remove just the \xA0 octet, there's a lonely \xC2 left over. Depending on what comes after it, that may be a problem. UTF-8 is designed so that a decoder can recognize where the problem is and resynchronize, though. It could pick up at the next legally encoded character and leave a substitution character to mark the error. Or, the program can complain and give up.
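Both outcomes can be reproduced with Encode. This is only a sketch on a made-up sample string, and it shows how Encode reacts to the damaged octets, which is not necessarily the same code path that produced the fatal error in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 octets for "a", NO-BREAK SPACE, "b" (assumed sample, not yet decoded).
my $octets = "a\xC2\xA0b";

# Removing only the \xA0 octet leaves a lone \xC2 lead octet behind.
my $broken = $octets;
$broken =~ s/\xA0//g;

# Lenient decoding: the malformed octet is replaced by U+FFFD and decoding continues.
my $lenient = decode('UTF-8', $broken);
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $lenient), "\n";

# Strict decoding: the same input is rejected with an exception.
my $ok = eval { decode('UTF-8', $broken, Encode::FB_CROAK); 1 };
print $ok ? "decoded\n" : "decode failed: malformed UTF-8\n";
```

The first decode marks the damage with the substitution character; the second one complains and gives up.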

When you use the character class [\xC2\xA0], I'm guessing that it matches either of those octets anywhere they appear and replaces each one with a space. Since you don't report any other errors, I'm guessing that \xC2 doesn't appear anywhere else. Otherwise, other characters might change. Or, you're dealing with extended ASCII and removing the \xC2 leaves the right Latin-1 encoding. Does the number of substitutions reported by s/// equal the number (or double that) of non-breaking spaces?
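Here is a small sketch of that counting check, assuming the string really is raw UTF-8 octets. The sample input with two non-breaking spaces is invented for the demonstration.

```perl
use strict;
use warnings;

# Raw UTF-8 octets: "a", two NO-BREAK SPACEs, "b" (assumed sample input).
my $octets = "a\xC2\xA0\xC2\xA0b";

# Character class: matches \xC2 or \xA0, one octet at a time.
my $by_class = $octets;
my $n_class  = ($by_class =~ s/[\xC2\xA0]/ /g);

# Sequence: matches only the two-octet pair \xC2\xA0.
my $by_pair = $octets;
my $n_pair  = ($by_pair =~ s/\xC2\xA0/ /g);

printf "class: %d substitutions, length %d\n", $n_class, length($by_class);
printf "pair:  %d substitutions, length %d\n", $n_pair,  length($by_pair);
```

On octets, the character class reports twice as many substitutions as there are non-breaking spaces and leaves two spaces per character, while the sequence reports one substitution per character.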

If you have UTF-8 encoded text, read it as UTF-8:

open my $fh, '<:utf8', $filename or die ...

After you've read the data, don't worry about the encoding. Use the code numbers and Perl will figure it out. Or use the code names so future programmers know what you are doing without looking up the character:

$string =~ s/\x{00A0}/ /g;
$string =~ s/\N{NO-BREAK SPACE}/ /g;

When you are done, write it as UTF-8 text:

open my $fh, '>:utf8', $filename or die ...
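Put together, the read, change, write steps might look like the sketch below. To keep it self-contained it uses in-memory filehandles in place of real files, so the variable names are placeholders. It also uses :encoding(UTF-8), the validating form of the :utf8 layer; that substitution is my assumption about what you would want, not part of the snippets above.

```perl
use strict;
use warnings;
use charnames ':full';   # only needed for \N{...} names before Perl 5.16

# Stand-in for a UTF-8 encoded input file: "a", NO-BREAK SPACE, "b".
my $input_octets = "a\xC2\xA0b\n";

# Read: the layer decodes octets into characters.
open my $in, '<:encoding(UTF-8)', \$input_octets or die "Cannot read: $!";
my $string = do { local $/; <$in> };
close $in;

# Change: work with characters, not octets.
my $count = ($string =~ s/\N{NO-BREAK SPACE}/ /g);

# Write: the layer encodes characters back to octets.
my $output_octets = '';
open my $out, '>:encoding(UTF-8)', \$output_octets or die "Cannot write: $!";
print {$out} $string;
close $out;

print "replaced $count, output is ", length($output_octets), " octets\n";
```

Once the input is decoded, the substitution never sees \xC2 at all, so there is nothing left over to become malformed.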

The latest Learning Perl has a Unicode primer in the back that covers quite a bit of this.

Good luck!
