4 votes

Suppose I use partitionBy to save some data to disk, e.g. partitioned by date, so my data looks like this:

/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...
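
For context, this layout is what partitionBy produces on write. A minimal sketch (the column name and path are the ones from this example; the sample rows are made up):

import spark.implicits._

// Writing with partitionBy("d") creates one subdirectory per distinct
// value of d, exactly like the listing above.
val df = Seq(("01-01-2018", 1), ("02-01-2018", 2)).toDF("d", "value")
df.write.partitionBy("d").parquet("/mydata")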

When I read the data back using the Hive config into a DataFrame, e.g.

val df = sparkSession.sql(s"select * from $database.$tableName")

I know that:

  • Filter queries on column d will be pushed down
  • No shuffles will occur if I try to partition by d (e.g. GROUP BY d); one way to verify both is shown in the sketch below
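
One way to verify both claims is to inspect the physical plan (a sketch, reusing the sparkSession, database and tableName values from this question):

import sparkSession.implicits._

val df = sparkSession.sql(s"select * from $database.$tableName")

// If pushdown works, the predicate on d appears as a PartitionFilter
// on the scan instead of a Filter applied after reading every file.
df.filter($"d" === "01-01-2018").explain()

// If the no-shuffle claim holds, no Exchange node should appear in this plan.
df.groupBy($"d").count().explain()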

BUT, suppose I don't know what the partition key is (some upstream job writes the data and follows no conventions). How can I get Spark to tell me which column is the partition key, in this case d? Similarly, how does this work if there are multiple partition levels (e.g. by month, then week, then day)?

Currently the best code we have is really ugly:

def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  // `desc` lists the data columns first, then a "# Partition Information"
  // section headed by a "# col_name" row, then the partition columns.
  val cols = sparkSession
    .sql(s"desc $databaseTableName")
    .select("col_name")
    .collect()
    .map(_.getAs[String](0))
    .dropWhile(r => !r.matches("# col_name"))
  // `cols` is empty for non-partitioned tables; otherwise drop the
  // "# col_name" header itself and keep the partition column names.
  if (cols.isEmpty) Set.empty[String] else cols.tail.toSet
}
are you sure no shuffles will occur in this case? I thought only bucketed hive tables have this behavior? - Raphael Roth
@RaphaelRoth I might be out of date, Spark changes the way files are read into partitions seemingly on every release (so what was once true isn't always true). - samthebest

3 Answers

4 votes

Assuming you don't have = and / in your partitioned column values, you can do:

import org.apache.spark.sql.AnalysisException
import spark.implicits._

// "show partitions" throws an AnalysisException for non-partitioned
// tables, so the sql() call itself belongs inside the try.
val partitionedCols: Set[String] = try {
  spark.sql("show partitions database.test_table")
    .map(_.getAs[String](0))
    .first.split('/').map(_.split("=")(0)).toSet
} catch {
  case e: AnalysisException => Set.empty[String]
}

You should get a Set[String] with the partitioned column names.
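
For reference, show partitions returns one row per partition, so for the question's table the rows would look like this (illustrative values):

d=01-01-2018
d=02-01-2018

and for multi-level partitioning something like month=01/week=02/day=03, which is why splitting on / and then on = recovers the column names.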

1 vote

You can use SQL statements to get this info: either show create table <tablename>, describe extended <tablename>, or show partitions <tablename>. The last one gives the simplest output to parse:

import spark.implicits._ // needed for .as[String]

val partitionCols = spark.sql("show partitions <tablename>").as[String].first.split('/').map(_.split("=").head)
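
Note that this throws an AnalysisException if the table is not partitioned, and a NoSuchElementException if it is partitioned but currently has no partitions. A sketch that folds both failure modes into an empty result (same placeholder table name as above):

import scala.util.Try

val partitionCols: Array[String] = Try {
  spark.sql("show partitions <tablename>").as[String].first.split('/').map(_.split("=").head)
}.getOrElse(Array.empty[String])
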
0 votes

Use the table metadata to get the partition column names as a comma-separated string. First check whether the table is partitioned; if it is, get the partition columns:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list}

val table = "default.country"

def isTablePartitioned(spark: SparkSession, table: String): Boolean = {
  import spark.implicits._
  val colDetails = spark.sql(s"describe extended $table")
    .select(collect_list(col("col_name"))).as[Array[String]].first
  colDetails.exists(_.contains("# Partition Information"))
}

def getPartitionColumns(spark: SparkSession, table: String): String = {
  import spark.implicits._
  // The partition columns sit between "# Partition Information" and
  // "# Detailed Table Information" in the describe extended output.
  val pat = """(?ms)^\s*#( Partition Information)(.+)(Detailed Table Information)\s*$""".r
  val colDetails = spark.sql(s"describe extended $table")
    .select(collect_list(col("col_name"))).as[Array[String]].first
  val flattened = colDetails.filter(_.trim.nonEmpty).mkString("\n")
  val partitionColumns = pat.findAllIn(flattened).matchData
    .collect { case pat(_, body, _) => body }
    .head
    .split("\n")
    .filterNot(_.contains("#"))
    .filter(_.nonEmpty)
  partitionColumns.mkString(",")
}

if (isTablePartitioned(spark, table))
  getPartitionColumns(spark, table)
else
  "--NO_PARTITIONS--"
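
For reference, the slice of the describe extended output that the regex pulls apart looks roughly like this in the col_name column (illustrative, for a table partitioned by a single column d):

# Partition Information
# col_name
d
# Detailed Table Information

so after dropping the lines containing #, only the partition column names remain.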

Note: the other two answers rely on show partitions returning at least one row, so they fail if the table is partitioned but currently empty.
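
If parsing SQL output feels fragile, the Catalog API exposes the same metadata directly. A minimal sketch, assuming Spark 2.x+, a SparkSession named spark, and the default.country table from above:

// listColumns returns one entry per column, with an isPartition flag
// marking the partition columns; this also works for empty tables.
val partitionCols: Set[String] = spark.catalog
  .listColumns("default.country")
  .collect()
  .filter(_.isPartition)
  .map(_.name)
  .toSet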