1
votes

htmlspecialchars() appears to be translating special chars like the following: āķūņūķī into their respective entity number:

&#257; &#311; &#363; &#326; &#363; &#311; &#299;

While some remain untranslated such as:

žš

I would like htmlspecialchars() (or some other function) not to translate these alphabetic characters, so that it only translates the following (as the php.net manual seems to indicate):

  1. '&' (ampersand) becomes '&amp;'
  2. '"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
  3. "'" (single quote) becomes '&#039;' only when ENT_QUOTES is set.
  4. '<' (less than) becomes '&lt;'
  5. '>' (greater than) becomes '&gt;'

The reason I need this is that, after a POST request, I run the user input through htmlspecialchars() before placing it back into a new set of HTML inputs. Characters such as &, ", ', < and > need to be translated so that they do not cause display errors. But I need special characters such as 'āķūņūķī' to remain unchanged; otherwise the user will be very confused.

2
The third parameter of htmlspecialchars() sets the encoding; maybe this helps. -- ccKep

2 Answers

5
votes

Set the third parameter to UTF-8:

echo htmlentities('āķūņūķī', ENT_QUOTES, 'UTF-8');

Before PHP 5.4, the default encoding for htmlspecialchars() and htmlentities() was ISO-8859-1; from PHP 5.4 on it is UTF-8.

Test case:

var_dump(htmlentities('āķūņūķī'));
var_dump(htmlentities('āķūņūķī', ENT_QUOTES, 'UTF-8'));

Output:

string(84) "&Auml;�&Auml;&middot;&Aring;&laquo;&Aring;�&Aring;&laquo;&Auml;&middot;&Auml;&laquo;"
string(14) "āķūņūķī"

http://codepad.org/MCaDosQ5
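
Since the question asks about htmlspecialchars() specifically, the same third argument applies there. This is a minimal sketch, assuming the input is valid UTF-8 and the page is served as UTF-8. The variable name $userInput and the input's name attribute are made up for illustration:

```php
<?php
// Hypothetical user input; in the question this would come from $_POST.
$userInput = 'āķūņūķī & <b>"žš"</b> it\'s';

// ENT_QUOTES converts both double and single quotes.
// Passing 'UTF-8' explicitly leaves multibyte letters untouched
// regardless of the PHP version's default encoding.
$escaped = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

// Safe to place inside a double-quoted attribute value.
echo '<input type="text" name="comment" value="' . $escaped . '">', "\n";
```

Only &, ", ', < and > should be replaced here; the accented letters pass through as they are.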

0
votes

Characters that have no single-byte representation in the encoding being used (ISO-8859-1 by default in older PHP versions) end up as numeric character references so that they are processed correctly.

The two characters you mentioned are probably left alone because they do have a single-byte form in a legacy encoding such as Windows-1252 (š is 0x9A, ž is 0x9E), even though their Unicode code points (U+0161 and U+017E) are above 255. The other characters have no single-byte form in that encoding and require multiple bytes in UTF-8.

As for decoding on the receiving side, look at htmlspecialchars_decode(). It is documented in the PHP manual.
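
A rough sketch of that round trip, assuming UTF-8 input and the same flags on both calls. The sample string is invented for illustration:

```php
<?php
// Sample text standing in for user input (an assumption for illustration).
$original = 'Tom & "Jerry" <āķū> it\'s';

$encoded = htmlspecialchars($original, ENT_QUOTES, 'UTF-8');

// Decode with the same quote flag so single quotes are restored as well.
$decoded = htmlspecialchars_decode($encoded, ENT_QUOTES);

var_dump($encoded);
var_dump($decoded === $original);
```

If the flags differ between the two calls, for example ENT_QUOTES when encoding but the default when decoding on an older PHP version, single quotes may stay encoded.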