I'm trying to scrape a page in Japanese using Python, curl, and BeautifulSoup. I then save the text to a MySQL database that uses utf-8 encoding, and display the resulting data using Django.

Here is an example URL:

https://www.cisco.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=930026&CurrentPage=180

I have a function I use to extract the HTML as a string:

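from StringIO import StringIO
from pycurl import Curl

def get_html(url):
    c = Curl()
    storage = StringIO()  # collects the raw response body
    c.setopt(c.URL, str(url))
    # read and write cookies in the same file
    cookie_file = 'cookie.txt'
    c.setopt(c.COOKIEFILE, cookie_file)
    c.setopt(c.COOKIEJAR, cookie_file)
    c.setopt(c.WRITEFUNCTION, storage.write)
    c.perform()
    c.close()
    # note: this is an undecoded byte string, not unicode
    return storage.getvalue()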

I then pass it to BeautifulSoup:

html = get_html(str(scheduled_import.url))
soup = BeautifulSoup(html)

The result is then parsed and saved to a database. I then use Django to output the data as JSON. Here is the view I'm using:

def get_jobs(request):
    jobs = Job.objects.all().only(*fields)
    joblist = []
    for job in jobs:
        job_dict = {}
        for field in fields:
            job_dict[field] = getattr(job, field)
        joblist.append(job_dict)
    return HttpResponse(dumps(joblist), mimetype='application/javascript')

The resulting page displays escaped byte sequences such as:

\xe3\x82\xb7\xe3\x83\xa3\xe3\x83\xaa\xe3\x82\xb9\xe3\x83\x88

\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9
\xe3\x82\xb7\xe3\x82\xb9\xe3\x82\xb3\xe3\x82\xb7\xe3\x82\xb9\xe3\x83\x86\xe3\x83\xa0\xe3\x82\xba\xe3\x81\xae\xe3\x82\xb3\xe3\x83\xa9\xe3\x83\x9c\xe3\x83\xac\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe4\xba\x8b\xe6\xa5\xad\xe9\x83\xa8\xe3\x81\xa7\xe3\x81\xaf\xe3\x80\x81\xe4\xba\xba\xe3\x82\x92\xe4\xb8\xad\xe5\xbf\x83\xe3\x81\xa8\xe3\x81\x97\xe3\x81\x9f\xe3\x82\xb3\xe3\x83\x9f\xe3\x83\xa5\xe3\x83\x8b\xe3\x82\xb1\xe3\x83\xbc\xe3\x82\xb7\xe3\x83\xa7\xe3\x83\xb3\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8a\xe3\

instead of Japanese.

I've been researching all day. So far I have converted my DB to utf-8 and tried decoding the text from iso-8859-1 and re-encoding it as utf-8.

Basically I have no idea what I'm doing and would appreciate any help or suggestions I can get so I can avoid spending another day trying to figure this out.

You forgot to tell Beautiful Soup the encoding. Get it from the response headers. - Ignacio Vazquez-Abrams
I believe BeautifulSoup automatically sets the encoding based on the page's meta tag, according to crummy.com/software/BeautifulSoup/bs3/documentation.html: "A <META> tag may specify an encoding for the document." Also, soup.originalEncoding outputs 'iso-8859-1'. - Ryan Rogers
You're assuming that the page has a META tag to read. - Ignacio Vazquez-Abrams
In this case it does; I should've mentioned that. - Ryan Rogers
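
To illustrate the suggestion in the comments above, here is a minimal sketch of reading the charset from a Content-Type header and decoding the body with it before any parsing happens. It is written for Python 3 and uses only the standard library. The function names charset_from_content_type and decode_body, the utf-8 fallback, and the sample header value are all assumptions made for illustration; they are not from the question, and nothing is actually fetched here.

```python
from email.message import Message


def charset_from_content_type(content_type, fallback="utf-8"):
    """Return the charset named in a Content-Type header value, or the fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or fallback


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Sample values standing in for a real response (assumed, not fetched).
raw = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"
text = decode_body(raw, "text/html; charset=UTF-8")
```

Handing the parser an already decoded string means it does not have to guess the encoding, which sidesteps a wrong guess such as the 'iso-8859-1' reported above.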

1 Answers

The examples you posted are the escaped representation of a UTF-8 encoded byte string: each \xNN is a single byte, not a character. You need to convert this into a Python unicode string. Usually string encoding and decoding will do the job. If you are not sure which conversion is correct, simply experiment with it in the Python console.

Try my_new_string = my_string.decode('utf-8') to get the Python unicode string. This should display correctly in Django templates, can be saved to the DB, and so on. As a quick check, you can also try print my_new_string and you should see it output Japanese characters.
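
As a rough console experiment along those lines, the sketch below decodes the second sample from the question. It is written for Python 3, where the byte string is a bytes literal, so it is an adaptation rather than the exact Python 2 call above. It also shows what happens if the same bytes are decoded as iso-8859-1, the encoding mentioned in the comments, and how that particular mistake can be reversed. The variable names are only illustrative.

```python
# The second sample from the question, as a Python 3 bytes literal.
raw = b"\xe8\x81\xb7\xe5\x8b\x99\xe5\x86\x85\xe5\xae\xb9"

# Decoding as UTF-8 turns the 12 bytes into 4 Japanese characters.
text = raw.decode("utf-8")

# Decoding the same bytes as ISO-8859-1 never fails, because every byte
# maps to some character, but the result is 12 unrelated characters.
garbled = raw.decode("iso-8859-1")

# If nothing was lost along the way, the wrong decode can be undone by
# encoding back to the original bytes and decoding them as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(raw), len(text), len(garbled), repaired == text)
```

The repair step only works while the garbled text still holds one character per original byte; once it has been truncated or re-encoded lossily, the original bytes cannot be recovered this way.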