Encoding with Unicode and non-Unicode characters in HTML

Question

I am using this package here: HTML.py 0.04

Here is what I am doing:

import html
h = html.HTML()
h.p('Some simple Euro: €1.14')
h.p(u'Some Euro: €1.14')

Now when I call unicode(h), I get an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 18: ordinal not in range(128)

What is the best way to handle this? I need to write the html to a file.

In Python 2, try to always use Unicode strings (u''). You can put from __future__ import unicode_literals at the top of your file so that unprefixed '' literals in that file are treated as u''. Personally, I'd avoid Python 2 if possible: it does a lot of implicit encoding conversion behind your back and gets confusing quickly. (comment by jeromej)

bobince · Accepted Answer · 2015-05-22T08:07:41

h.p('Some simple Euro: €1.14')

You should avoid byte strings ('' in Python 2, b'' in Python 3) for HTML content. The character model of HTML is Unicode, so only Unicode strings (u'') should be used.
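
Since the question is about getting the HTML into a file, here is a minimal sketch of keeping everything as Unicode text and encoding only at the point of writing. It does not use HTML.py itself: the markup string is an assumed stand-in for whatever unicode(h) would return, and the output path is a hypothetical temporary file.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Assumption: `markup` stands in for the text unicode(h) would produce.
markup = u'<p>Some simple Euro: \u20ac1.14</p>'

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'euro.html')

# io.open takes an explicit encoding, so the text is encoded to UTF-8
# on the way out instead of going through an implicit ASCII step.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(markup)

# Read the file back as raw bytes to see what was stored.
with io.open(path, 'rb') as f:
    raw = f.read()
```

io.open behaves the same way in Python 2 and Python 3, which is why it is used here rather than the Python 2 built-in open.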

You can get away with doing it wrong for simple ASCII characters. Because most common byte encodings are supersets of ASCII, Python 2 will implicitly convert ASCII byte strings to Unicode. But the € character isn't part of ASCII, so Python can't tell how to read it. If you have saved the source code above using the UTF-8 encoding then you have the byte string b'\xe2\x82\xac', which could mean €, â‚¬, 竄ｬ, or many other character sequences depending on what encoding is used.
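
The ambiguity can be seen by decoding the same three bytes under different encodings. This is only an illustrative sketch: the codec names cp1252 and shift_jis are my assumption about which encodings give the two garbled readings mentioned above, and the explicit ASCII decode stands in for the implicit conversion Python 2 attempts.

```python
# -*- coding: utf-8 -*-
text = u'Some simple Euro: \u20ac1.14'

# What a UTF-8 source file stores for the plain '' literal.
raw = text.encode('utf-8')

# The euro sign on its own, as UTF-8 bytes.
euro_bytes = u'\u20ac'.encode('utf-8')

# Read the same bytes under three encodings (the last two are assumed
# examples of "wrong" guesses, not something Python would pick itself).
readings = {}
for codec in ('utf-8', 'cp1252', 'shift_jis'):
    readings[codec] = euro_bytes.decode(codec)

# An explicit ASCII decode fails in the same way as the implicit one.
try:
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

The exception's start attribute holds the offset of the first byte that could not be decoded, which corresponds to the "position" reported in the traceback from the question.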
