18 votes

I'm trying to overwrite the Parquet files that I write to S3 with pyarrow. I've looked through the documentation and haven't found anything.

Here is my code:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs.core import S3FileSystem

# Connect to S3 with the default credentials
s3 = S3FileSystem(anon=False)
output_dir = "s3://mybucket/output/my_table"

# Load the CSV and convert it to an Arrow table
my_csv = pd.read_csv("file.csv")
my_table = pa.Table.from_pandas(my_csv, preserve_index=False)

pq.write_to_dataset(my_table,
                    output_dir,
                    filesystem=s3,
                    use_dictionary=True,
                    compression='snappy')

Is there something like a mode="overwrite" option in the write_to_dataset function?


2 Answers

2 votes

I think the best way to do it is with AWS Data Wrangler, which offers three different write modes:

  1. append
  2. overwrite
  3. overwrite_partitions

Example:

import awswrangler as wr

wr.s3.to_parquet(
    df=df,
    path="s3://...",
    mode="overwrite",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])
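
For completeness, the same library can read the dataset back into pandas. A minimal sketch, assuming awswrangler 1.x or later (where the S3 helpers live under wr.s3):

import awswrangler as wr

# Read every Parquet file under the dataset prefix into one DataFrame
df = wr.s3.read_parquet(path="s3://...", dataset=True)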
0 votes

Sorry, there's no such option yet, but the way I work around it is to use boto3 to delete the files before writing them.

import boto3

# Delete every object under the dataset prefix before rewriting it
resource = boto3.resource('s3')
resource.Bucket('mybucket').objects.filter(Prefix='output/my_table').delete()
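
Putting the two steps together, here is a minimal sketch of the full workaround, reusing the bucket, prefix, and write call from the question:

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from s3fs.core import S3FileSystem

# 1. Delete the existing files under the target prefix
boto3.resource('s3').Bucket('mybucket').objects.filter(Prefix='output/my_table').delete()

# 2. Rewrite the dataset exactly as in the question
s3 = S3FileSystem(anon=False)
my_table = pa.Table.from_pandas(pd.read_csv("file.csv"), preserve_index=False)
pq.write_to_dataset(my_table,
                    "s3://mybucket/output/my_table",
                    filesystem=s3,
                    use_dictionary=True,
                    compression='snappy')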