3
votes

I am adding partition columns to a Spark DataFrame. The new columns contain the year, month, and day, which I derive from a timestamp column in my DataFrame.

DataFrame dfPartition = df.withColumn("year", df.col("date").substr(0, 4));
dfPartition = dfPartition.withColumn("month", dfPartition.col("date").substr(6, 2));
dfPartition = dfPartition.withColumn("day", dfPartition.col("date").substr(9, 2));

I can see the correct values of the columns when I print the DataFrame, e.g. 2016 01 08.

But when I export this DataFrame to a Hive table like

dfPartition.write().partitionBy("year", "month","day").mode(SaveMode.Append).saveAsTable("testdb.testtable");

I see that the generated directory structure is missing the leading zeroes. I tried casting the columns to String, but that did not work.

Is there a way to keep two-digit day/month values in the Hive partitions?

Thanks

What type does your date column have? If it is an integral type, that is probably why the leading zeroes are dropped during parsing. - Matthias Kricke
It's a String, and I also do a substring on that column (which returns a String too). I can also see the 0's in the DataFrame output. - user1544460
This looks like a bug in Spark: you do have the correct datatypes (strings), which should retain the 0's. Feel free to ask on the Spark users' list; they may ask you to file a report. - WestCoastProjects

2 Answers

2
votes

Per the Spark documentation, partition-column type inference is enabled by default. The OP's string values, because they are interpretable as ints, were converted accordingly. If this is undesirable for the Spark session as a whole, you can disable it by setting the corresponding Spark configuration attribute to false:

SparkSession.builder.config("spark.sql.sources.partitionColumnTypeInference.enabled", value = false)

or by running the corresponding SET key=value command in SQL. Otherwise, you can counteract it at the column level with Spark's native format_string function, as J.Doe suggests in the other answer.
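To see why inference drops the zeros, here is a plain-Java sketch (no Spark needed): once a partition value like "01" is inferred as an integer, the leading zero cannot survive the round trip back to a string.

```java
// Sketch (plain Java, no Spark): what integer type inference does to "01".
public class InferenceDemo {
    public static void main(String[] args) {
        String partitionValue = "01";                     // directory value month=01
        int inferred = Integer.parseInt(partitionValue);  // inferred as an int -> 1
        System.out.println(inferred);                     // prints 1
        System.out.println(String.valueOf(inferred));     // "1" -- the zero is gone
    }
}
```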

0
votes

Refer to Add leading zeros to Columns in a Spark Data Frame.

That answer shows how to add leading 0's:

val df2 = df
  .withColumn("month", format_string("%02d", $"month"))

I tried this in my own code using the snippet below, and it worked!

.withColumn("year", year(col("my_time")))
.withColumn("month", format_string("%02d", month(col("my_time"))))    // pad with leading 0's
.withColumn("day", format_string("%02d", dayofmonth(col("my_time")))) // pad with leading 0's
.withColumn("hour", format_string("%02d", hour(col("my_time"))))      // pad with leading 0's
.writeStream
.partitionBy("year", "month", "day", "hour")
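For reference, Spark's format_string follows java.lang.String.format semantics, so the %02d padding behaves like this plain-Java sketch (no Spark needed):

```java
// Sketch (plain Java): the %02d padding that format_string applies per row.
public class PadDemo {
    public static void main(String[] args) {
        System.out.println(String.format("%02d", 1));   // "01" -- zero-padded to width 2
        System.out.println(String.format("%02d", 12));  // "12" -- already 2 digits, unchanged
    }
}
```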