python web crawler with ascii decoding

Question

I'm writing a web crawler for Wikipedia in Python. I extract the language information from the pages, which contains characters from many languages, such as Chinese and Japanese. When I get the strings I want and print them out, they appear to be encoded in ASCII, so the result looks like this:

...
('Vietnamese', 'vi', 'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t')
{'confidence': 1.0, 'encoding': 'ascii'}
('Turkish', 'tr', 'T\xc3\xbcrk\xc3\xa7e')
{'confidence': 1.0, 'encoding': 'ascii'}
('Ukrainian', 'uk', '\xd0\xa3\xd0\xba\xd1\x80\xd0\xb0\xd1\x97\xd0\xbd\xd1\x81\xd1\x8c\xd0\xba\xd0\xb0')
{'confidence': 1.0, 'encoding': 'ascii'}
('Chinese', 'zh', '\xe4\xb8\xad\xe6\x96\x87')
{'confidence': 1.0, 'encoding': 'ascii'}

My code:

def getLanguageContent(content):
    mainPattern = re.compile(matchReg)
    mainContentMatch = mainPattern.findall(content)
    return mainContentMatch

arr = getLanguageContent(getContentFromURL(sitePrefix))
print arr
for a in arr:
    a = str(a)
    print a

arr is a list like [('Simple English', 'simple', 'Simple English'), ('Arabic', 'ar', '\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9'), ....]

I want to know how I can deal with this problem and print the strings in their correct encoding. Thanks a lot.

'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t' is not coded in ASCII, that's clearly UTF-8. For that matter, you can't code 'Tiếng Việt' in ASCII, at least not without throwing away information (e.g., 'Tieng Viet'). — abarnert
Please show us the actual contents of arr, or the getContentFromURL function, or both, because otherwise it's impossible to do anything but guess at all the things you could be doing wrong here. — abarnert
I thought it was ASCII because the result of chardet.detect(a) is something like {'confidence': 1.0, 'encoding': 'ascii'} — 郑穗展
arr is actually a list of tuples, so I have to call str() first — 郑穗展
You definitely don't want to call str on a tuple, or a list of tuples! In fact, that's your whole problem. Let me edit my answer to explain. — abarnert

abarnert · Accepted Answer · 2014-12-13T05:44:44

First, 'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t' is not coded in ASCII. It's clearly UTF-8. For that matter, you can't code 'Tiếng Việt' in ASCII, at least not without throwing away information (e.g., 'Tieng Viet'). And when I run chardet.detect on all of the strings in your example, I get UTF-8, with confidences ranging from 0.7525 to 0.99.

Your problem is that arr is a list of tuples of strings, not a list of strings. When you call str(a) on a tuple, it calls repr on each element, then joins the results with commas and wraps the whole thing in parentheses. In Python 2, the repr of a byte string is always ASCII, with backslash escapes for non-ASCII and ASCII-but-not-printable characters. For example, str(('Vietnamese', 'vi', 'Tiếng Việt')) is "('Vietnamese', 'vi', 'Ti\\xe1\\xba\\xbfng Vi\\xe1\\xbb\\x87t')". That's not a useful string, and since every character in it is ASCII, that is presumably why chardet.detect(a) reports ascii with confidence 1.0.
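The effect is easy to reproduce in isolation. Here is a minimal sketch, written for Python 3 (an assumption on my part; the question's code is Python 2), where a bytes object stands in for a Python 2 byte string. The variable names row and as_text are just for illustration:

```python
# Python 3 sketch: bytes plays the role of a Python 2 str (byte string).
name = b'Ti\xe1\xba\xbfng Vi\xe1\xbb\x87t'  # UTF-8 bytes for the Vietnamese name

row = ('Vietnamese', 'vi', name)

# str() on a tuple uses repr() on each element, so the non-ASCII bytes
# come out as literal backslash escapes.
as_text = str(row)

# Every character of the result is plain ASCII, even though the
# original data was UTF-8.
only_ascii = all(ord(ch) < 128 for ch in as_text)

# The element itself is still intact and decodes cleanly.
decoded = name.decode('utf-8')
```

Here only_ascii comes out True, which matches what an encoding detector would conclude about the stringified tuple, while decoded holds the real text.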

Instead of trying to figure out how to make a useless string useful, just use the useful strings you already have. Don't call str on a list of tuples of strings, or on each tuple of strings. Just use the strings inside each tuple. For example:

for language, code, name in arr:
    print name

That will (assuming your console can handle UTF-8) print out Tiếng Việt. Or, if you want to decode it to unicode, just uname = name.decode('utf-8'). Or, if you want to call chardet.detect(name), it'll verify that it's UTF-8 with 0.7525 confidence. And so on.
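If you want to decode everything up front, something along these lines should work. This is a sketch in Python 3 (again an assumption; in Python 2 the same .decode('utf-8') call applies to the plain strings), and decode_rows is a hypothetical helper name, not something from your code. It also assumes the page really is UTF-8, as Wikipedia's output appears to be here:

```python
# Sample data shaped like arr in the question, as UTF-8 byte strings.
rows = [
    (b'Simple English', b'simple', b'Simple English'),
    (b'Turkish', b'tr', b'T\xc3\xbcrk\xc3\xa7e'),
    (b'Chinese', b'zh', b'\xe4\xb8\xad\xe6\x96\x87'),
]


def decode_rows(rows, encoding='utf-8'):
    """Decode every field of every tuple; hypothetical helper for illustration."""
    return [tuple(field.decode(encoding) for field in row) for row in rows]


decoded_rows = decode_rows(rows)

# Unpack each tuple instead of calling str() on it.
names = [name for language, code, name in decoded_rows]
```

Each entry in names is then real text that you can print or write to a file with whatever encoding your console or output needs.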
