Hello stackoverflow community,
I am having trouble reading Parquet files. The problem starts after I upload a Parquet file to Azure Data Lake Storage Gen2 using Python.
I am following the official Microsoft documentation: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python
Apart from the authentication, this is the relevant part:
def upload_file_to_directory():
    try:
        file_system_client = service_client.get_file_system_client(file_system="my-file-system")
        directory_client = file_system_client.get_directory_client("my-directory")
        file_client = directory_client.create_file("uploaded-file.txt")
        local_file = open("C:\\file-to-upload.txt", 'r')
        file_contents = local_file.read()
        file_client.append_data(data=file_contents, offset=0, length=len(file_contents))
        file_client.flush_data(len(file_contents))
    except Exception as e:
        print(e)
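One thing I wondered about (just a guess on my part, not from the Azure docs): Python's text mode applies universal-newline translation and a text encoding step, so reading a binary file with open(..., 'r') can change its bytes before they ever reach append_data. A small self-contained sketch of that effect:

```python
# Sketch (not from the Azure docs): shows that reading a binary file in
# text mode ('r') can silently alter its content, because text mode
# performs universal-newline translation (b"\r\n" is collapsed to "\n").
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"PAR1\r\nPAR1")  # 10 bytes, contains a \r\n pair

with open(path, "r") as f:   # text mode, like in my upload code
    text_contents = f.read()

with open(path, "rb") as f:  # binary mode
    raw_contents = f.read()

print(len(text_contents))  # 9  -- the \r\n pair became a single \n
print(len(raw_contents))   # 10 -- bytes unchanged
```

So the length (and content) of what gets appended would no longer match the original file.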
When I use this code to upload a small CSV file, it works fine: the file is uploaded, and when I download it I can open it without any problems.
If I convert the same data frame to a small Parquet file and upload it, the upload also succeeds. But when I download the file and try to open it, I get this error message:
ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
If I read the Parquet file locally, without the upload/download round trip, it works fine.
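For reference, the Parquet format places a 4-byte magic marker, PAR1, at both the start and the end of every valid file, which is what the error message refers to. A quick sanity check on the downloaded file could look like this (a sketch; looks_like_parquet is just a helper name I made up):

```python
# Sketch: checks for the "PAR1" magic bytes that the Parquet format
# requires at both the start and the end of a valid file.
import os

def looks_like_parquet(path):
    with open(path, "rb") as f:      # binary mode, so bytes are untouched
        header = f.read(4)
        f.seek(-4, os.SEEK_END)      # jump to the last 4 bytes
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"
```

Running this against the downloaded file should show whether the bytes were mangled somewhere along the way.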
Does anyone have a suggestion on how to modify the code so that I don't corrupt my Parquet file?
Thanks!