
I can save a parquet file partitioned by a column that looks like a timestamp but is actually a string. When I try to load that parquet back into Spark using spark.read.load(), it automatically infers the partitioned column as a date, causing me to lose all my time information. Is there a way to read the parquet file back in with the partitioned column as a string, or better yet, have it automatically parsed into a timestamp given a specified format? Here's an example:

test_df = spark.createDataFrame(
    [
        ('2020-01-01T00-00-01', 'hello'),
    ],
    [
        'test_dt', 'col1'
    ]
)
test_df.write.save('hdfs:///user/test_write', 'parquet', mode='overwrite', partitionBy='test_dt')
test_read = spark.read.load('hdfs:///user/test_write', 'parquet')
test_read.show(1)

This returns:

+-----+----------+
| col1|   test_dt|
+-----+----------+
|hello|2020-01-01|
+-----+----------+
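
(Not in the original post, but a quick way to see the inference happening: printing the schema of the re-read frame should show test_dt typed as a date rather than a string, consistent with the truncated value above.)

test_read.printSchema()  # expect test_dt to come back as date, not string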
That's expected. Partitioned data doesn't preserve type information (it is just directory structure). You'll have to cast. - user10938362

1 Answer


If you set spark.sql.sources.partitionColumnTypeInference.enabled to false before reading, Spark will not attempt to infer the data types of partition columns; they will all be treated as strings.
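
A minimal sketch putting that together, assuming the same path as in the question (the to_timestamp format string is my assumption, based on the sample value '2020-01-01T00-00-01'):

from pyspark.sql import functions as F

# Turn off partition column type inference so test_dt is read back as a string
spark.conf.set('spark.sql.sources.partitionColumnTypeInference.enabled', 'false')

test_read = spark.read.load('hdfs:///user/test_write', 'parquet')

# Optionally parse the string into a proper timestamp; the pattern below
# matches the 'yyyy-MM-ddTHH-mm-ss' shape of the sample value
test_parsed = test_read.withColumn(
    'test_dt',
    F.to_timestamp('test_dt', "yyyy-MM-dd'T'HH-mm-ss")
)
test_parsed.show(1, truncate=False)

As far as I know there is no read option that parses partition values with a custom timestamp format automatically, so the explicit to_timestamp step is the closest you can get to the second part of your question.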