4 votes

I am trying to parse out the Location from a Hive partitioned table in Spark using this query:

val dsc_table = spark.sql("DESCRIBE FORMATTED data_db.part_table")

I was not able to find any query or any other way in Spark to specifically select the Location field from this output.


4 Answers

3 votes

The df.inputFiles method in the DataFrame API returns the file paths. It is a best-effort snapshot of the files that compose this DataFrame.

spark.read.table("DB.TableName").inputFiles
Array[String] = Array(hdfs://test/warehouse/tablename)

2 votes

You can use Spark's table-reading utility:

import org.apache.spark.sql.functions.input_file_name

spark.read.table("myDB.myTable").select(input_file_name()).take(1)

This will result in a string like:

19/06/18 09:59:55 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
res1: Array[org.apache.spark.sql.Row] = Array([hdfs://nameservice1/my/path/to/table/store/part-00000-d439163c-9fc4-4768-8b0b-c963f5f7d3d2.snappy.parquet])

I used take(1) only to limit the output to one row here; drop it if you want all the locations. From this result you can parse the string in case you want only the directory part.
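For instance, stripping the file name off such a path takes only plain string operations. A minimal sketch, assuming a made-up file path of the kind input_file_name returns (not real output):

```scala
// Hypothetical file path of the kind input_file_name returns (example value)
val filePath = "hdfs://nameservice1/my/path/to/table/part-00000-abc.snappy.parquet"

// Drop everything from the last '/' onward to recover the table directory
val tableDir = filePath.substring(0, filePath.lastIndexOf('/'))

println(tableDir) // hdfs://nameservice1/my/path/to/table
```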

1 vote

You can also use the .toDF method on the desc formatted output, then filter the resulting dataframe.

DataFrame API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
  .toDF // convert to a DataFrame with 3 columns: col_name, data_type, comment
  .filter('col_name === "Location") // filter on col_name
  .collect()(0)(1)
  .toString

Result:

String = hdfs://nn:8020/location/part_table

(or)

RDD API:

scala> :paste
spark.sql("desc formatted data_db.part_table")
  .collect()
  .filter(r => r(0).equals("Location")) // filter on the r(0) value
  .map(r => r(1)) // get only the location
  .mkString // convert to a string
  .split("8020")(1) // change the split based on your namenode port, etc.

Result:

String = /location/part_table
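Splitting on the port number is fragile. As an alternative sketch, java.net.URI can strip the scheme and authority regardless of the namenode host or port (the sample location string below is the one from the result above):

```scala
import java.net.URI

// Location string as produced by desc formatted (sample value from above)
val location = "hdfs://nn:8020/location/part_table"

// getPath drops the scheme (hdfs://) and authority (nn:8020), leaving only the path
val path = new URI(location).getPath

println(path) // /location/part_table
```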

1 vote

I didn't find an answer for PySpark, so here is one:

from pyspark.sql import functions as F

table_location = (spark.sql("describe formatted DB.TableName")
    .filter(F.col("col_name") == "Location")
    .select("data_type")
    .toPandas().astype(str)["data_type"].values[0])