urllib2, Google App Engine, and unicode question

Question

Hey guys, I'm just learning Google App Engine, so I'm running into a bunch of problems...

My current predicament is this. I have a datastore model:

class Website(db.Model):
    web_address = db.StringProperty()
    company_name = db.StringProperty()
    content = db.TextProperty()
    div_section = db.StringProperty()
    local_links = db.StringProperty()
    absolute_links = db.BooleanProperty()
    date_updated = db.DateTimeProperty()

and the problem I'm having is with the content property.

I'm using db.TextProperty() because I need to store the contents of a web page, which is longer than 500 bytes.

The problem I'm running into is that urllib2's readlines() seems to return unicode. When I put the result into a TextProperty(), it gets converted to ASCII. Some of the characters are >128, and it throws a UnicodeDecodeError.

Is there a simple way to bypass this? For the most part, I don't care about those characters...

My error is:

Traceback (most recent call last):
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 511, in __call__
    handler.get(*groups)
  File "/base/data/home/apps/game-job-finder/1.346504560470727679/main.py", line 61, in get
    x.content = website_data_joined
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 542, in __set__
    value = self.validate(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/db/__init__.py", line 2407, in validate
    value = self.data_type(value)
  File "/base/python_runtime/python_lib/versions/1/google/appengine/api/datastore_types.py", line 1006, in __new__
    return super(Text, cls).__new__(cls, arg, encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2124: ordinal not in range(128)

I would have thought that converting Unicode to ASCII would be "encoding" not "decoding". Are you sure it's not the other way around? — Luke Dunstan
could you add the snippet where you make the readline and put on datastore? — systempuntoout

Cameron · Accepted Answer · 2010-11-28T05:51:52

It would appear that the lines returned from readlines are not unicode strings, but rather byte strings (i.e. instances of str containing potentially non-ASCII characters). These bytes are the raw data received in the HTTP response body, and will represent different strings depending on the encoding used. They need to be "decoded" before they can be treated as text (bytes != characters).
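
To make the bytes-versus-characters point concrete, here is a minimal sketch that needs no network access. The sample bytes are made up for illustration (they are not from the asker's page); they contain the same 0xc2 byte the traceback complains about, as the first half of a UTF-8 encoded copyright sign.

```python
# Hypothetical sample: the UTF-8 bytes for "Copyright (c)-sign 2010".
raw = b"Copyright \xc2\xa9 2010"

# Decoding as ASCII fails on the first byte above 127.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    failing_position = exc.start                     # index of the offending byte
    failing_byte = raw[exc.start:exc.start + 1]      # the byte itself

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("utf-8")
```

This is why the error says "decode": the datastore type is trying to turn bytes into text using the ASCII codec.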

If the encoding is UTF-8, this code should work properly:

import urllib2
from google.appengine.ext import db

f = urllib2.urlopen('http://www.google.com')
website = Website()
website.content = db.Text(f.read(), encoding='utf-8-sig')    # 'utf-8-sig' also strips a leading BOM if present

Note that the actual encoding varies from website to website (sometimes even from page to page). The encoding should be declared in the Content-Type header of the HTTP response (see this question for how to get it). If it's not, it may be declared in a meta tag in the head of the HTML, in which case extracting it properly is much trickier:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Note that there are sites that do not specify an encoding, or specify the wrong encoding.
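
One way to cope with that, sketched under some assumptions: use whatever charset the Content-Type value declares, and fall back to a lossy UTF-8 decode if the declaration is missing, unknown, or wrong. The names charset_from_content_type and decode_body are hypothetical helpers, the UTF-8 default is my own assumption, and fetching the header value from the response is left out.

```python
import re

def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', header_value or "", re.I)
    return match.group(1).lower() if match else default

def decode_body(raw_bytes, header_value):
    """Decode a response body, tolerating a missing or wrong charset."""
    encoding = charset_from_content_type(header_value)
    try:
        return raw_bytes.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Declared charset was wrong or unknown: keep what can be read
        # and mark the undecodable bytes with U+FFFD.
        return raw_bytes.decode("utf-8", "replace")
```

The result of decode_body is already a unicode string, so it could be passed to db.Text without an encoding argument.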

If you really don't care about any characters but ASCII, you can ignore them and be done with it:

import urllib2
from google.appengine.ext import db

f = urllib2.urlopen('http://www.google.com')
website = Website()
content = unicode(f.read(), 'ascii', errors='ignore')    # Drop any bytes that are not valid ASCII
website.content = db.Text(content)    # No encoding needed: content is already a unicode string
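
For a sense of what the 'ignore' handler actually throws away, here is a small offline comparison with 'replace'. The sample bytes are invented for illustration.

```python
# Hypothetical sample: UTF-8 bytes containing a copyright sign (0xc2 0xa9).
raw = b"Copyright \xc2\xa9 2010"

# 'ignore' silently drops every byte that is not valid ASCII,
# so the two spaces around the sign end up adjacent.
ignored = raw.decode("ascii", "ignore")

# 'replace' keeps a visible U+FFFD marker for each bad byte instead.
replaced = raw.decode("ascii", "replace")
```

Either handler avoids the UnicodeDecodeError; 'replace' just makes the loss visible in the stored text.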
