
I'm writing a PySpark script in a Databricks notebook to insert/update/query Cassandra tables, but I cannot find a way to delete rows from a table. I tried Spark SQL:

spark.sql("DELETE from users_by_email where email_address IN ('[email protected]')")

I also don't see that it's possible to delete data using a DataFrame. Is there any workaround?

Instead of dropping that row you can just filter out those rows - Mahesh Gupta

2 Answers


You can load the data into a DataFrame and filter out the rows you want to delete:

import pyspark.sql.functions as f

df = spark.sql("SELECT * FROM users_by_email")
# Keep every row EXCEPT the ones you want to delete
df_filtered = df.filter(f.col("email_address") != "[email protected]")

Then you can save the filtered DataFrame back with the overwrite option, or write it to a new table.
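For reference, a minimal sketch of the overwrite step using the Spark Cassandra Connector. The keyspace name `my_keyspace` is an assumption, and note that overwriting via the connector requires the `confirm.truncate` option because it truncates the table before writing:

```python
def cassandra_write_options(keyspace: str, table: str) -> dict:
    """Options for overwriting a Cassandra table via the Spark Cassandra Connector.
    confirm.truncate must be "true": overwrite mode truncates the table first."""
    return {
        "keyspace": keyspace,
        "table": table,
        "confirm.truncate": "true",
    }

def overwrite_table(df, keyspace: str, table: str) -> None:
    # df is the filtered DataFrame; this call needs a live Spark session
    # with the connector on the classpath, plus a reachable Cassandra cluster.
    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(**cassandra_write_options(keyspace, table))
       .mode("overwrite")
       .save())
```

Be aware this rewrites the whole table, so it is only practical for small tables; for large ones, issuing real CQL deletes is cheaper.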


Spark SQL does not support UPDATE or DELETE statements against Cassandra tables. You need to use an external Python API (the DataStax driver) in your code for deletion.

You can check the Python API below, which provides a .delete() method:

https://docs.datastax.com/en/developer/python-driver/3.18/api/cassandra/cqlengine/models/#cassandra.cqlengine.models.Model-methods
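Alongside the cqlengine mapper linked above, you can also issue the DELETE directly with the plain DataStax driver (`pip install cassandra-driver`). A hedged sketch, where the contact points and keyspace name `my_keyspace` are assumptions:

```python
def delete_by_email_cql(table: str, n_emails: int) -> str:
    """Build a parameterized DELETE statement; values are bound at execute time."""
    placeholders = ", ".join(["%s"] * n_emails)
    return f"DELETE FROM {table} WHERE email_address IN ({placeholders})"

def delete_users(emails, contact_points=("127.0.0.1",), keyspace="my_keyspace"):
    # Requires cassandra-driver and a reachable cluster; import lazily so the
    # helper above stays usable without the driver installed.
    from cassandra.cluster import Cluster
    cluster = Cluster(list(contact_points))
    try:
        session = cluster.connect(keyspace)
        session.execute(delete_by_email_cql("users_by_email", len(emails)),
                        tuple(emails))
    finally:
        cluster.shutdown()
```

This can be called from the same Databricks notebook as the PySpark code, since the driver is an ordinary Python library.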