10
votes

I have recently started exploring Databricks and ran into a situation where I need to drop a column from a delta table. When I worked with PostgreSQL it was as easy as

ALTER TABLE main.metrics_table 
DROP COLUMN metric_1;

I was looking through the Databricks documentation on DELETE, but it only covers deleting rows that match a predicate.

I've also found docs on DROP DATABASE, DROP FUNCTION and DROP TABLE, but absolutely nothing on how to delete a column from a delta table. What am I missing here? Is there a standard way to drop a column from a delta table?


3 Answers

10
votes

There is no drop column option on Databricks tables: https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#delta-schema-constructs

Remember that, unlike a relational database, there are physical Parquet files in your storage; your "table" is just a schema that has been applied to them.

In the relational world you can update the table metadata to remove a column easily, but in the big data world you have to rewrite the underlying files.

Technically, Parquet can handle schema evolution (see Schema evolution in parquet format), but the Databricks implementation of Delta does not. It is probably just too complicated to be worth it.

Therefore the solution in this case is to create a new table and insert into it only the columns you want to keep from the old table, as sketched below.
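
A minimal sketch of that approach in PySpark, assuming the main.metrics_table from the question with metric_1 as the column to drop; metric_2 and metric_3 are placeholder names for the columns being kept, and the _new suffix is just an illustrative temporary name:

# Create a new Delta table containing every column except metric_1.
spark.sql("""
    CREATE TABLE main.metrics_table_new
    USING DELTA
    AS SELECT metric_2, metric_3   -- list every column you want to keep
    FROM main.metrics_table
""")

# Once the new table has been checked, swap it in for the old one.
spark.sql("DROP TABLE main.metrics_table")
spark.sql("ALTER TABLE main.metrics_table_new RENAME TO main.metrics_table")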

5
votes

One way that I figured out to make this work is to first drop the table and then recreate it from the dataframe with the overwriteSchema option set to true. You also need to use mode = overwrite so that it recreates the physical files using the new schema that the dataframe contains.

Breakdown of the steps:

  1. Read the table into a dataframe.
  2. Drop the columns that you don't want in your final table.
  3. Drop the actual table from which you read the data.
  4. Now save the newly created dataframe (with the columns dropped) under the same table name.
  5. But make sure you use two options when saving the dataframe as a table: .mode("overwrite").option("overwriteSchema", "true")

The above steps will recreate the same table with the extra column(s) removed (see the sketch below). Hope it helps someone facing a similar issue.
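
A minimal sketch of those steps in PySpark, assuming the main.metrics_table from the question with metric_1 as the column to remove; it also assumes the table is small enough to cache, since step 3 deletes the files that the lazily evaluated dataframe would otherwise re-read:

# 1. Read the table into a dataframe and 2. drop the unwanted column.
df = spark.table("main.metrics_table").drop("metric_1")

# Materialise the dataframe before the source table (and its files) disappears.
df.cache()
df.count()

# 3. Drop the original table.
spark.sql("DROP TABLE main.metrics_table")

# 4. and 5. Save the dataframe under the same name, overwriting both data and schema.
(df.write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .saveAsTable("main.metrics_table"))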

3
votes

Use the code below:

df = spark.sql("Select * from <DB Name>.<Table Name>")

df1 = df.drop("<Column Name>")

spark.sql("DROP TABLE if exists <DB Name>.<TableName>_OLD")

spark.sql("ALTER TABLE <DB Name>.<TableName> RENAME TO <DB Name>.<Table Name>_OLD ")

df1.write.format("delta").mode("OVERWRITE").option("overwriteSchema", "true").saveAsTable("<DB Name>.<Table Name>")