0
votes

I am using the OSS version of Delta Lake along with Spark 3.0.1. My current use case requires me to discover all the current partitions in a given Delta table.

My data is stored in './data/raw' and is partitioned by the column sensorId (the path given is relative to my Python script).

I am trying to use the SHOW PARTITIONS syntax as mentioned in the documentation. However, I am getting errors.

This is what my code looks like:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TestScript").getOrCreate()
df = spark.sql("SHOW PARTITIONS delta.`./data/raw`")
df.show()

The spark-submit command looks as follows:

spark-submit --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" test_script.py

And I get the following error:

pyspark.sql.utils.AnalysisException: Database 'delta' not found;

My other question related to this is whether SHOW PARTITIONS will give me all the partitions or whether it puts a limit on the result. If there is a limit, what is the best way to discover/get all the partitions of a Delta table?

Are you using Databricks? Is the table saved as a Delta table? Is the path you are giving the one where the physical Parquet files for the Delta table are saved? - Nikunj Kakadiya
The question is about OSS Delta, not Databricks' one... - Alex Ott
@AlexOtt Should I remove the tag? I thought that since Delta is developed by Databricks, I could reach the relevant audience. - ganapathy

1 Answer

0
votes

You can see the partitions in a couple of ways. The documentation link that you have given shows both ways of doing it.

  1. If you have saved your data as a Delta table, you can get the partition information by providing the table name instead of the delta path, and it will return the partition information (a sketch of this follows the list).

    spark.sql("SHOW Partitions schema.tableName").show()

  2. You can also use the option where you specify the path where the physical files for the table live. In your case you are passing a relative path based on where your Python script is present, but that would not work. You need to pass the exact path of the S3 bucket, or whatever storage you are using, for it to work (see the second sketch after the list).

    spark.sql("SHOW Partitions delta.`dbfs:/mnt/S3/tables/data/raw` ").show()

To answer your last question, about whether SHOW PARTITIONS will give you all the partitions: the answer is yes, but if you check the result using df.show() it will show you only the first 20 rows.

If you want to see all the rows/partitions for the table, you can run count on the DataFrame and then pass that as the first argument to the show method.

count = df.count()
df.show(count, truncate=False)
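If SHOW PARTITIONS still gives you trouble, an alternative sketch that avoids it entirely is to load the table and select the distinct values of the partition column; this enumerates every partition with no display limit (the absolute path below is a placeholder):

    # Enumerate partitions by taking distinct values of the partition column.
    partitions = (
        spark.read.format("delta")
        .load("/absolute/path/to/data/raw")  # placeholder: use your real path
        .select("sensorId")
        .distinct()
        .collect()
    )
    for row in partitions:
        print(row["sensorId"])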