
I can save a parquet file partitioned by a column that looks like a timestamp but is actually a string. When I try to load that parquet back into Spark using spark.read.load(), it automatically infers the partitioned column as a date, causing me to lose all my time information. Is there a way to read the parquet file back in with the partitioned column as a string, or better yet, have it automatically parsed into a timestamp given a specified format? Here's an example:

test_df = spark.createDataFrame(
    [
        ('2020-01-01T00-00-01', 'hello'),
    ],
    [
        'test_dt', 'col1'
    ]
)
test_df.write.save('hdfs:///user/test_write', 'parquet', mode='overwrite', partitionBy='test_dt')
test_read = spark.read.load('hdfs:///user/test_write', 'parquet')
test_read.show(1)

This returns:

+-----+----------+
| col1|   test_dt|
+-----+----------+
|hello|2020-01-01|
+-----+----------+
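
(Not in the original post, but a quick way to see the inference happening: printing the schema of the re-read frame should show test_dt typed as a date rather than a string, consistent with the truncated value above.)

test_read.printSchema()  # expect test_dt to come back as date, not string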
That's expected. Partitioned data doesn't preserve type information (it is just directory structure). You'll have to cast. - user10938362

1 Answer


If you set spark.sql.sources.partitionColumnTypeInference.enabled to false before reading, Spark will not attempt to infer the data types of partition columns; they will all be treated as strings.
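
A minimal sketch putting that together, assuming the same path as in the question (the to_timestamp format string is my assumption, based on the sample value '2020-01-01T00-00-01'):

from pyspark.sql import functions as F

# Turn off partition column type inference so test_dt is read back as a string
spark.conf.set('spark.sql.sources.partitionColumnTypeInference.enabled', 'false')

test_read = spark.read.load('hdfs:///user/test_write', 'parquet')

# Optionally parse the string into a proper timestamp; the pattern below
# matches the 'yyyy-MM-ddTHH-mm-ss' shape of the sample value
test_parsed = test_read.withColumn(
    'test_dt',
    F.to_timestamp('test_dt', "yyyy-MM-dd'T'HH-mm-ss")
)
test_parsed.show(1, truncate=False)

As far as I know there is no read option that parses partition values with a custom timestamp format automatically, so the explicit to_timestamp step is the closest you can get to the second part of your question.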