I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.
Here is my approach to partitioning and writing the data:
df = df.withColumn('year', f.year(f.col('date_col'))).withColumn('month',f.month(f.col('date_col'))).withColumn('day',f.dayofmonth(f.col('date_col')))
df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')
This properly creates the parquet files, including the nested folder structure. However I do not want the year, month, or day columns in the parquet files.