6
votes

I'm trying to fetch the HTML of a page whose URL contains diacritics (í, č, ...). The problem is that urllib2.quote does not seem to work as I expected.

As far as I understand, quote should convert a URL that contains diacritics into a valid URL.

Here is an example:

url = 'http://www.example.com/vydavatelství/'

print urllib2.quote(url)

>> http%3A//www.example.com/vydavatelstv%C3%AD/

The problem is that it also changes the http:// prefix for some reason (the colon becomes %3A). Then urllib2.urlopen(req) raises an error:

response = urllib2.urlopen(req)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 400: Bad Request

1
Have you tried putting # -*- coding: utf-8 -*- at the top of your script? - thefragileomen

1 Answer

7
votes

-- TL;DR --

Two things. First, make sure you include the encoding declaration # -*- coding: utf-8 -*- at the top of your Python script (it is a source-encoding declaration, not a shebang). This lets Python know how the text in your file is encoded. Second, you need to specify the safe characters, which the quote method leaves unconverted. By default, only / is treated as safe. This means the : after http gets converted to %3A, which breaks your URL.

url = 'http://www.example.com/vydavatelství/'
print urllib2.quote(url, safe=':/')
>>> http://www.example.com/vydavatelstv%C3%AD/
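For comparison, here is a minimal sketch of the same fix on Python 3, where quote lives in urllib.parse instead of urllib2. This assumes Python 3 and is not the asker's Python 2.7 setup; the variable names default_quoted and fixed_quoted are only illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3 (assumption: the original question uses Python 2.7 / urllib2).
from urllib.parse import quote

url = 'http://www.example.com/vydavatelství/'

# With the default safe='/', the ':' after 'http' is percent-encoded too.
default_quoted = quote(url)

# Adding ':' to the safe characters leaves the scheme separator intact.
fixed_quoted = quote(url, safe=':/')

print(default_quoted)  # http%3A//www.example.com/vydavatelstv%C3%AD/
print(fixed_quoted)    # http://www.example.com/vydavatelstv%C3%AD/
```

Only the non-ASCII í is escaped in the second result, so the URL stays usable.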

-- A little more on this --

So the first problem here is that urllib2's documentation is pretty poor. Going off the link that Kamal provided, I see no mention of the quote method in the docs. That makes troubleshooting pretty difficult.

With that said, let me explain this a little bit.

urllib2.quote is the same function as urllib's quote (urllib2 simply imports it), and that one is documented pretty well. In Python 2 it accepts only string and safe; its Python 3 counterpart, urllib.parse.quote, takes four parameters:

urllib.parse.quote(string, safe='/', encoding=None, errors=None)
##   string: the string you're trying to encode
##     safe: string containing characters that should not be quoted. Default is '/'
## encoding: encoding used to turn non-ASCII characters into bytes. Default (None) means utf-8
##   errors: specifies how encoding errors are handled. Default (None) means 'strict', which raises a UnicodeEncodeError
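To make the encoding and errors parameters a bit more concrete, here is a small sketch using Python 3's urllib.parse.quote (an assumption, since the question itself is on Python 2, where quote only accepts string and safe). The sample strings and variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3: how `encoding` and `errors` change the result of quote().
from urllib.parse import quote

text = 'vydavatelství'

# encoding=None falls back to UTF-8, so 'í' becomes two escaped bytes.
utf8_quoted = quote(text)

# In Latin-1, 'í' is the single byte 0xED.
latin1_quoted = quote(text, encoding='latin-1')

# 'č' cannot be encoded as ASCII; with the default errors='strict' this raises.
try:
    quote('č', encoding='ascii')
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# errors='replace' substitutes '?' for the character, which is then quoted.
replaced_quoted = quote('č', encoding='ascii', errors='replace')

print(utf8_quoted)      # vydavatelstv%C3%AD
print(latin1_quoted)    # vydavatelstv%ED
print(strict_raised)    # True
print(replaced_quoted)  # %3F
```

In practice, leaving encoding and errors at their defaults is usually what you want for URLs, since UTF-8 is what most servers expect.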