0
votes

I am trying to read a CSV file from an AWS S3 bucket and get this error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. My code is below; I am using Python 3.7.

        from io import BytesIO, StringIO
        import boto3
        import pandas as pd
        import gzip
        s3 = boto3.client('s3', aws_access_key_id='######',
                          aws_secret_access_key='#######')

        response = s3.get_object(Bucket='#####', Key='raw.csv')
        # print(response)
        # The UnicodeDecodeError is raised by the decode() call on this line
        s3_data = StringIO(response.get('Body').read().decode('utf-8'))

        data = pd.read_csv(s3_data)
        print(data.head())

Kindly help me resolve this issue.

2
Are you working on Windows or on Linux? Maybe it's an encoding problem with your .py file. - papanito
@papanito I am working on Linux. If it's a Linux-specific problem, let me know how I can resolve it. - suman
Sorry, maybe I misunderstood something. You get the error in the line s3_data = StringIO(response.get('Body').read().decode('utf-8')? - papanito

2 Answers

4
votes

Using gzip worked for me:

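import gzip

import boto3
import pandas as pd

client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
                      aws_secret_access_key=aws_secret_access_key)

# Bucket and Key must be strings; the values here are placeholders.
csv_obj = client.get_object(Bucket='####', Key='###')

# Decompress the streamed body in text mode, then parse it as CSV.
body = csv_obj['Body']
with gzip.open(body, 'rt') as gf:
    csv_file = pd.read_csv(gf)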
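
A likely reason gzip helps here: byte 0x8b at position 1 matches the second byte of the gzip magic number (0x1f 0x8b), which suggests the object is gzip-compressed rather than plain text. Below is a minimal sketch for checking this on the raw bytes before decoding. The names maybe_gunzip and GZIP_MAGIC are hypothetical, and the in-memory sample only stands in for response['Body'].read().

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw: bytes) -> bytes:
    """Return decompressed bytes if raw looks like gzip, else raw unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Stand-in for response['Body'].read(): a small gzip-compressed CSV.
sample = gzip.compress(b"a,b\n1,2\n")

# Byte 1 is 0x8b, the same byte the UTF-8 decoder complained about.
print(hex(sample[1]))
print(maybe_gunzip(sample).decode("utf-8"))
```

With pandas, the result could be wrapped as pd.read_csv(io.BytesIO(maybe_gunzip(raw))). pd.read_csv also accepts compression='gzip' for file-like input, which may be simpler if the object is known to be compressed.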
2
votes

The error means the CSV file you're fetching from this S3 bucket is not encoded as UTF-8.

Unfortunately the CSV file format is quite under-specified and doesn't really carry information about the character encoding used inside the file... So either you need to know the encoding, or you can guess it, or you can try to detect it.

If you'd like to guess, popular encodings are ISO-8859-1 (also known as Latin-1) and Windows-1252 (which is roughly a superset of Latin-1). ISO-8859-1 doesn't have a character defined for 0x8b (so that's not the right encoding), but Windows-1252 uses that code to represent a left single angle quote (‹).

So maybe try .decode('windows-1252')?
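
If you go the guessing route, one way to sketch it is to try a short list of candidate encodings in order and keep the first one that decodes without error. The helper name decode_with_fallback and the candidate list are assumptions for illustration. Note that a clean decode does not prove the guess is right: binary or compressed data can also decode "successfully" into garbage under a single-byte encoding.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    raw: bytes,
    encodings: Iterable[str] = ("utf-8", "windows-1252"),
) -> Tuple[str, str]:
    """Try each encoding in order; return (text, encoding_used)."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0xe9 and 0x8b are not valid UTF-8 here, but both are defined in Windows-1252.
text, used = decode_with_fallback(b"caf\xe9 \x8b")
print(used, text)
```

Putting UTF-8 first is deliberate: it is strict, so it fails fast on data in other encodings, whereas a single-byte encoding accepts almost any byte sequence.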

If you'd like to detect it, look into the chardet Python module which, given a file or BytesIO or similar, will try to detect the encoding of the file, giving you what it thinks the correct encoding is and the degree of confidence it has in its detection of the encoding.

Finally, instead of calling decode() explicitly and wrapping the result in a StringIO object, I suggest storing the raw bytes in an io.BytesIO and letting pd.read_csv() decode the CSV via its encoding argument.

import io

s3_data = io.BytesIO(response.get('Body').read())

data = pd.read_csv(s3_data, encoding='windows-1252')

As a general practice, you want to delay decoding as long as you can. In this particular case, having access to the raw bytes can be quite useful, since you can write a copy of them to a local file (which you can then inspect with a text editor or in Excel).

Also, if you want to detect the encoding (using chardet, for example), you need to do so before decoding, which again requires the raw bytes. That's another advantage of using BytesIO here.