0
votes

We are saving a DataFrame, but first we need to check that it is not empty.

To achieve this we are using df.isEmpty(), which is a very common practice when saving a DataFrame.

My concern is that df.isEmpty, head(1), and limit(1) all perform an action, which executes the whole plan once, and then the save triggers (executes) the plan a second time. Isn't that very bad? Is there a better way of doing this?

In most of the code examples and blogs I came across, this is the common way of saving non-empty DataFrames: check for emptiness (which triggers an action and executes the plan), then save (which triggers an action and executes the whole plan again).

1
df.isEmpty, head(1), and limit(1) are the best options you've got. They just grab the first row, so they are not that slow. - Salim

1 Answer

1
votes

I wouldn't use df.rdd.isEmpty. This approach converts the DataFrame to an RDD, which may not take advantage of the underlying optimizer (the Catalyst optimizer) and can slow down the process.
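To make the distinction concrete, here is a minimal sketch of the three emptiness checks mentioned in this thread, side by side. The object name `EmptinessChecks` and the helper names are made up for illustration, and the local SparkSession is only there so the snippet can run on its own; `Dataset.isEmpty` assumes Spark 2.4 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only to group the three checks.
object EmptinessChecks {
  // Stays in the Dataset API (Spark 2.4+).
  def viaDataset(df: DataFrame): Boolean = df.isEmpty

  // Fetches at most one row to the driver; works on older versions too.
  def viaHead(df: DataFrame): Boolean = df.head(1).isEmpty

  // Converts to an RDD first; shown only for contrast.
  def viaRdd(df: DataFrame): Boolean = df.rdd.isEmpty()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emptiness-checks")
      .getOrCreate()
    import spark.implicits._

    val nonEmpty = Seq(1, 2, 3).toDF("n")
    val empty = nonEmpty.filter($"n" > 10)

    println(s"nonEmpty: ${viaDataset(nonEmpty)}, ${viaHead(nonEmpty)}, ${viaRdd(nonEmpty)}")
    println(s"empty:    ${viaDataset(empty)}, ${viaHead(empty)}, ${viaRdd(empty)}")

    spark.stop()
  }
}
```

All three return the same answer; the difference is only in how Spark gets there, so it is worth comparing them on your own plan before settling on one.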

Use count(), but be sure to persist your data first so that the save does not execute the plan a second time.

dataframe.persist() // cache the data so the plan is not executed twice
if (dataframe.count() > 0) { // first action: triggers the plan and fills the cache
  dataframe
    .write
    .mode("overwrite")
    .format("desired.format")
    .save("foo/bar") // second action: thanks to the earlier persist(), this reads the cached data instead of re-running the plan
}
dataframe.unpersist() // release the cache; the data is no longer needed
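If you need this in several places, the same pattern can be wrapped in a small helper. This is only a sketch: `saveIfNonEmpty` is a hypothetical function, not part of Spark, and the format and path are whatever you pass in. The try/finally makes sure the cache is released even if the write throws.

```scala
import org.apache.spark.sql.DataFrame

object SaveIfNonEmpty {
  // Hypothetical helper (not a Spark API).
  // Returns true if the DataFrame had rows and was written, false otherwise.
  def saveIfNonEmpty(df: DataFrame, format: String, path: String): Boolean = {
    df.persist() // mark for caching; nothing is computed yet
    try {
      val nonEmpty = df.count() > 0 // first action: runs the plan and fills the cache
      if (nonEmpty) {
        df.write
          .mode("overwrite")
          .format(format)
          .save(path) // second action: served from the cached data
      }
      nonEmpty
    } finally {
      df.unpersist() // always release the cache, even if the write fails
    }
  }
}
```

Usage would look like `SaveIfNonEmpty.saveIfNonEmpty(dataframe, "parquet", "foo/bar")`. Keep in mind that caching costs memory (and possibly disk), so this trade is mostly worth it when the plan is expensive to recompute.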

Hope it helps