3
votes

I am trying to take data that uses Unicode escape sequences and print it with emojis. It is currently in a .txt file, but I will write it to an Excel file later. I am getting an error that I am not sure what to do with. This is the text I am reading:

"Thanks @UglyGod \ud83d\ude4f https:\\/\\/t.co\\/8zVVNtv1o6\"
"RT @Rosssen: Multiculti beatdown \ud83d\ude4f https:\\/\\/t.co\\/fhwVkjhFFC\"

And here is my code:

sampleFile= open('tweets.txt', 'r').read()
splitFile=sampleFile.split('\n')
for line in sampleFile:
    x=line.encode('utf-8')
    print(x.decode('unicode-escape'))

This is the error message:

UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string

Any ideas? This is how the data was originally generated:

class listener(StreamListener):

    def on_data(self, data):
        # Check for a field unique to tweets (if missing, return immediately)
        if "in_reply_to_status_id" not in data:
            return
        with open("see_no_evil_monkey.csv", 'a') as saveFile:
            try:
                saveFile.write(json.dumps(data) + "\n")
            except BaseException as e:
                print("failed on data", str(e))
                time.sleep(5)
        return True

    def on_error(self, status):
        print (status)
2
How was tweets.txt generated? - MattDMo
You are using 'unicode-escape' to decode a bytes object that was previously encoded with 'utf8'; 'unicode-escape' cannot read strings encoded with 'utf8'. I believe the simplest solution to your problem would be to pass the correct encoding to the open function when reading from the file. - Dean Fenster
So this is the code that was used to generate the original data from Twitter: - Patrick Reid
Hey, I added how the file information was generated - Patrick Reid

2 Answers

3
votes

Your emoji 🙏 (U+1F64F) is represented in the file as a surrogate pair, \ud83d\ude4f. Python's 'unicode-escape' codec does not combine the two halves into one character; it decodes them as two lone surrogates, which cannot be encoded as UTF-8 for printing. Look at exactly how your tweets.txt file was generated, and try writing the original tweets, emoji included, as UTF-8. This will make reading and processing the text file much easier.
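
To make the surrogate pair concrete, here is a minimal sketch, assuming Python 3. The helper name combine_surrogates is hypothetical and only for illustration. It applies the standard UTF-16 arithmetic to the two halves, then shows one possible way to merge lone surrogates that are already sitting in a str.

```python
def combine_surrogates(high: int, low: int) -> str:
    """Combine a UTF-16 surrogate pair into the single character it encodes."""
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return chr(code_point)


# The two halves from the tweet text: \ud83d and \ude4f
emoji = combine_surrogates(0xD83D, 0xDE4F)
print(hex(ord(emoji)))  # 0x1f64f

# 'unicode-escape' leaves the halves as two separate lone surrogates
halves = b"\\ud83d\\ude4f".decode("unicode-escape")
print(len(halves))  # 2

# Round-tripping through UTF-16 with the 'surrogatepass' error handler
# lets the decoder see the halves as one pair and merge them
merged = halves.encode("utf-16", "surrogatepass").decode("utf-16")
print(merged == emoji)  # True
```

The UTF-16 round trip is a workaround for text that has already been decoded the wrong way; it does not replace fixing how the file is read.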

3
votes

This is how the data was originally generated... saveFile.write(json.dumps(data) + "\n")

You should use json.loads() instead of .decode('unicode-escape') to read JSON text:

#!/usr/bin/env python3
import json

with open('tweets.txt', encoding='ascii') as file:
    for line in file:
        text = json.loads(line)
        print(text)
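
As a rough round-trip sketch of why this works: json.dumps() escapes non-ASCII characters as \uXXXX pairs by default, and json.loads() reverses that, joining surrogate pairs back into one character. The payload below is made up for illustration. The second json.loads() call rests on an assumption: that data in on_data was already a JSON string, so json.dumps(data) wrapped it a second time. If your lines decode straight to a dict, the second call is not needed.

```python
import json

# Hypothetical payload standing in for what the stream delivers (an assumption)
raw = json.dumps({"text": "Thanks @UglyGod \U0001F64F"})  # JSON text, ASCII-only by default
print("\\ud83d\\ude4f" in raw)  # True: the emoji was written as an escaped surrogate pair

# What the listener writes if `data` is already JSON text: a JSON string wrapping JSON text
line = json.dumps(raw) + "\n"

first = json.loads(line)  # undoes the outer layer, giving the inner JSON text back
tweet = json.loads(first) if isinstance(first, str) else first
text = tweet["text"]
print(len(text) == len("Thanks @UglyGod ") + 1)  # True: the pair became one character
```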