Suppose I use partitionBy to save some data to disk, e.g. by date so my data looks like this:
/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...
When I read the data using Hive config and DataFrame, so
val df = sparkSession.sql(s"select * from $database.$tableName")
I can know that:
- Filter queries on column
dwill push down - No shuffles will occur if I try to partition by
d(e.g.GROUP BY d)
BUT, suppose I don't know what the partition key is (some upstream job writes the data, and has no conventions). How can I get Spark to tell me which is the partition key, in this case d. Similarly if we have multiple partitions (e.g. by month, week, then day).
Currently the best code we have is really ugly:
def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
val cols = sparkSession.
sql(s"desc $databaseTableName")
.select("col_name")
.collect
.map(_.getAs[String](0))
.dropWhile(r => !r.matches("# col_name"))
if (cols.isEmpty) {
Set()
} else {
cols.tail.toSet
}
}