0 votes

Hello Stack Overflow community,

I am having some issues reading Parquet files. The problems start after I upload the Parquet file to Azure Data Lake Storage Gen2 using Python.

I am using the official Microsoft documentation: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python

Apart from the authentication, this is the relevant part:

def upload_file_to_directory():
    try:
        # get clients for the file system and the target directory
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")
        directory_client = file_system_client.get_directory_client("my-directory")

        # create the destination file and read the local file's contents
        file_client = directory_client.create_file("uploaded-file.txt")
        local_file = open("C:\\file-to-upload.txt", 'r')
        file_contents = local_file.read()

        # append the contents and flush to commit the upload
        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
        file_client.flush_data(len(file_contents))

    except Exception as e:
        print(e)

When I use this code to upload a small CSV file, it works fine: the file is uploaded, and when I download it I can open it without any problems.

If I convert the same data frame to a small Parquet file and upload it, the upload also works fine. But when I download the file and try to open it, I get this error message:

ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

If I read the Parquet file directly, without uploading it first, it works fine.
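
For reference, this is roughly how I convert and read the file locally (a minimal sketch; the real data frame and paths differ):

import pandas as pd

# placeholder data; my actual data frame is different
df = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})

# converting to Parquet works
df.to_parquet("C:\\file-to-upload.parquet")

# reading the local Parquet file back also works fine
df2 = pd.read_parquet("C:\\file-to-upload.parquet")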

Does anyone have a suggestion for how to modify the code so that the Parquet file doesn't get corrupted?

Thanks!


2 Answers

1 vote

I'm not sure what's wrong with your code (it seems incomplete), but you can try the code below; it works on my side:

import numpy as np
import pandas as pd

# service_client is the authenticated DataLakeServiceClient from the setup above
try:
    file_system_client = service_client.get_file_system_client(file_system="my-file-system")

    directory_client = file_system_client.get_directory_client("my-directory")

    file_client = directory_client.create_file("data.parquet")

    # to_parquet() without a path returns the Parquet file contents as bytes
    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]},
                      index=list('abc')).to_parquet()

    file_client.append_data(data=df, offset=0, length=len(df))

    file_client.flush_data(len(df))

except Exception as e:
    print(e)
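
Note that DataFrame.to_parquet() called without a path argument returns the Parquet file contents as bytes, so the whole file, footer included, is appended in one call and then flushed.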

0 votes

I just resolved this error in my project today.

I am using pyarrow.parquet.write_table to write my Parquet file.

I was passing a native Python file object to the where parameter, which somehow caused the footer never to be written.

When I switched to using PyArrow output streams instead of native Python file objects, the footer got written correctly on stream close, which resolved this issue for me.
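
For anyone who wants to compare, here is a minimal sketch of the change (the table, data, and file name are placeholders, not the actual code from my project):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"one": [-1.0, 2.5], "two": ["foo", "bar"]})

# before: passing a native Python file object as where
# (in my project this left the footer unwritten)
# with open("data.parquet", "wb") as f:
#     pq.write_table(table, where=f)

# after: a PyArrow output stream; the footer is written correctly when the stream closes
with pa.OSFile("data.parquet", "wb") as sink:
    pq.write_table(table, where=sink)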