My guess is that you are dealing with a raw, UTF-8 encoded string. You haven't shown how you got it or said why you'd want to do that. A small and complete demonstration program that shows how you get the input, how you change it, and what ultimately complains, would help people find the problem. If you add that small demonstration program to your question I might be able to give a better (or even different) answer.
The non-breaking space has code number U+00A0. Under UTF-8 it encodes to the two octets \xC2 and \xA0. Everything with a code number above U+007F has a multi-octet encoding under UTF-8. Everything at or below U+007F is really just ASCII and encodes to the same single octet, so ASCII text is already valid UTF-8.
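To see those octets for yourself, here is a minimal sketch using the core Encode module. The helper name utf8_octets is mine, purely for illustration, and the three sample characters are arbitrary:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode one character to UTF-8 and list its octets.
sub utf8_octets {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ', map { sprintf('\\x%02X', ord($_)) } split //, $bytes;
}

my $ascii_a = utf8_octets("A");          # U+0041, at or below U+007F
my $nbsp    = utf8_octets("\x{00A0}");   # U+00A0, NO-BREAK SPACE
my $euro    = utf8_octets("\x{20AC}");   # U+20AC, EURO SIGN

print "U+0041 => $ascii_a\n";
print "U+00A0 => $nbsp\n";
print "U+20AC => $euro\n";
```

The ASCII letter comes out as one octet, the non-breaking space as two, and the euro sign as three.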
If you have the UTF-8 encoded text with the non-breaking space and remove just the \xA0 octet, there's a lonely \xC2 left over. Depending on what comes after it, that may be a problem. UTF-8 is designed so that a decoder can recognize where the problem is and recover, though. It could pick up at the next legally encoded character and leave a substitution character (U+FFFD) to mark the error. Or, the program can complain and give up.
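Both outcomes can be reproduced with Encode. This is only a sketch on made-up sample data, since I don't know what your real input looks like or which decoder is complaining in your case:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "a", NO-BREAK SPACE, "b" (sample data, not from the question).
my $octets = "a\xC2\xA0b";

# Remove only the \xA0 octet; the \xC2 lead octet is left behind.
(my $damaged = $octets) =~ s/\xA0//g;

# Lenient decode (Encode's default): the stray octet becomes U+FFFD.
my $lenient     = decode('UTF-8', $damaged);
my $code_points = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $lenient;

# Strict decode: FB_CROAK makes Encode die instead. Decode a copy, because
# a non-zero CHECK argument may modify the source string in place.
my $copy      = $damaged;
my $strict_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

print "lenient: $code_points\n";
print "strict decode ", ($strict_ok ? "succeeded" : "died"), "\n";
```

The lenient decode keeps going and marks the damage; the strict one gives up at the stray \xC2.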
When you use the character class [\xC2\xA0], I'm guessing that it gets rid of either of those octets anywhere they appear. Since you don't report any other errors, I'm guessing that neither octet appears as part of any other character. Otherwise, other characters might change. Or, you're dealing with extended ASCII and removing the \xC2 leaves the right Latin-1 encoding. Does the number of substitutions reported by s/// equal the number (or double that) of non-breaking spaces?
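Here is a rough sketch of how the character class can hurt other characters when it runs over raw octets, and how the count returned by s/// differs. The sample string and the hex_octets helper are invented for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical sample: raw UTF-8 octets for "voil", U+00E0 (\xC3\xA0),
# NO-BREAK SPACE (\xC2\xA0), and "x".
my $octets = "voil\xC3\xA0\xC2\xA0x";

# The class matches each octet separately, wherever it appears, so it also
# strips the \xA0 that belongs to U+00E0.
my $by_class    = $octets;
my $class_count = ($by_class =~ s/[\xC2\xA0]//g);

# Matching the two-octet sequence removes only whole non-breaking spaces.
my $by_sequence    = $octets;
my $sequence_count = ($by_sequence =~ s/\xC2\xA0//g);

# Render octets as hex so the output is readable on any terminal.
sub hex_octets { join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0] }

printf "class:    %d substitutions, left: %s\n", $class_count,    hex_octets($by_class);
printf "sequence: %d substitutions, left: %s\n", $sequence_count, hex_octets($by_sequence);
```

With one non-breaking space in the input, the class reports three substitutions and leaves a broken \xC3 behind, while the sequence match reports one and leaves the other character intact.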
If you have UTF-8 encoded text, read it as UTF-8:
open my $fh, '<:utf8', $filename or die ...
After you've read the data, don't worry about the encoding. Use the code numbers and Perl will figure it out. Or use the code names so future programmers know what you are doing without looking up the character:
$string =~ s/\x{00A0}/ /g;             # by code number
$string =~ s/\N{NO-BREAK SPACE}/ /g;   # or by name (needs "use charnames ':full'" before Perl 5.16)
When you are done, write it as UTF-8 text:
open my $fh, '>:utf8', $filename or die ...
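Putting the three steps together, a small self-contained sketch might look like this. The temporary file stands in for your real input, and I've used the stricter :encoding(UTF-8) layer in place of :utf8; both choices are my assumptions, not something from your question:

```perl
use strict;
use warnings;
use charnames ':full';    # only needed before Perl 5.16; harmless later
use File::Temp qw(tempfile);

# Hypothetical input file, written here only so the example runs on its own.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
binmode $tmp_fh;                       # raw octets, no layers
print {$tmp_fh} "one\xC2\xA0two\n";    # UTF-8 octets with a NO-BREAK SPACE
close $tmp_fh or die "Cannot close $filename: $!";

# Decode on the way in.
open my $in, '<:encoding(UTF-8)', $filename or die "Cannot read $filename: $!";
my $string = do { local $/; <$in> };
close $in;

# Work with characters, not octets.
my $count = ($string =~ s/\N{NO-BREAK SPACE}/ /g);

# Encode on the way out.
open my $out, '>:encoding(UTF-8)', $filename or die "Cannot write $filename: $!";
print {$out} $string;
close $out or die "Cannot close $filename: $!";

print "replaced $count non-breaking space(s)\n";
```

Because the substitution runs on decoded characters, it can't split a multi-octet sequence the way an octet-level match can.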
The latest Learning Perl has a Unicode primer in the back that covers quite a bit of this.
Good luck!
use diagnostics – mob
"$$pstring", Why do you have a reference to a scalar? It's possibly legit, but it's quite odd – ikegami