I want to know how long a DataFrame or RDD is kept alive and when it is removed. Is this different for DataFrames and RDDs?
- Are all parent DataFrames kept alive in memory until the last DataFrame/RDD is written to disk or displayed on screen?
When a transformation is applied to a DataFrame/RDD, a new DataFrame/RDD is created. In that case, will 10 transformations create 10 DataFrames/RDDs, and will they stay alive until the end of the application or until the final DataFrame/RDD is written to disk? Please see the sample code below.
import org.apache.spark.sql.functions.{col, sum}
val transformDF1 = readDF.withColumn("new_column", sometransformation)
val transformDF2 = transformDF1.groupBy("col1", "col2").agg(sum("col3"))
transformDF2.write.format("text").save(path)
What about the case where we chain the transformations together before assigning the result to a variable, like below?
val someDF = df
  .where(col("some_col") === "some_val")
  .withColumn("some-page", col("other_page") + 1)
  .drop("other_page")
  .select(col("col1"), col("col2"))
val someDF1 = someDF.join(someotherDF, joincond, "inner").select("somecols")
val finalDF = someDF1.distinct()
finalDF.write.save(path)
In the above code:
- someDF is created from a chain of transformations on the df DataFrame, and each transformation in the chain creates a DataFrame. Does every DataFrame created in the chain remain alive in memory until finalDF is written to a file, or does only the DataFrame from the last transformation, which is assigned to the variable someDF, remain in memory? If the latter, until when is someDF retained? If the former, until when are all of them retained in memory?
- What about the other DataFrame, someDF1? What is its lifetime?
- If the intermediate DataFrames in a chain are not retained once control moves to the next transformation, is it better to chain as many transformations as possible to keep more memory available? Or will GC become a bottleneck when transformations are chained heavily?
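To make the questions above concrete, here is a minimal sketch of how one might observe what Spark actually holds on to. It assumes a local SparkSession and made-up data from `spark.range`; the object name `LifetimeSketch` and the column names are hypothetical and not part of my real job. It relies on transformations being lazy (they only build a query plan) and on `Dataset.storageLevel` reporting `StorageLevel.NONE` unless the DataFrame has been explicitly cached or persisted.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical, self-contained sketch; names and data are illustrative only.
object LifetimeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // assumption: run locally for the experiment
      .appName("lifetime-sketch")
      .getOrCreate()

    // Each call returns a new DataFrame object, but no data is computed yet.
    val base  = spark.range(0, 1000).withColumnRenamed("id", "col1")
    val step1 = base.where(col("col1") % 2 === 0)
    val step2 = step1.withColumn("col2", col("col1") + 1)

    // Neither intermediate result is stored: nothing was cached explicitly.
    assert(step1.storageLevel == StorageLevel.NONE)
    assert(step2.storageLevel == StorageLevel.NONE)

    // Shows the single plan that the chained transformations collapse into.
    step2.explain()

    // Only an explicit cache/persist asks Spark to retain the computed rows.
    step2.cache()
    step2.count()                  // action: triggers execution and fills the cache
    assert(step2.storageLevel != StorageLevel.NONE)

    // Releases the cached blocks again.
    step2.unpersist()

    spark.stop()
  }
}
```

If this reading is right, assigning an intermediate step to a `val` versus chaining it would differ only in how long a small plan object stays referenced on the driver, not in how much data is held on the executors. I would appreciate confirmation of whether that is correct.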