Getting the number of rows in a Spark dataframe without counting

Question

I am applying many transformations on a Spark DataFrame (filter, groupBy, join). I want to have the number of rows in the DataFrame after each transformation.

I currently call count() after each transformation, but every call triggers a separate action, which is inefficient.

Is there any way to get the row counts without triggering any action beyond the original job?

randal25 · Accepted Answer · 2019-05-17T14:56:39

You could create one accumulator per transformation and increment it in a map placed right after that transformation. Once your final action has run, each accumulator holds the row count for its step.

val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
val groupByCounter = spark.sparkContext.longAccumulator("group-counter")
val joinCounter = spark.sparkContext.longAccumulator("join-counter")

myDataFrame
    .filter(col("x") === lit(3))
    .map(x => {
      filterCounter.add(1)
      x
    })
    .groupBy(col("x"))
    .agg(max("y"))
    .map(x => {
      groupByCounter.add(1)
      x
    })
    .join(myOtherDataframe, col("x") === col("y"))
    .map(x => {
      joinCounter.add(1)
      x
    })
    .count()

println(s"count for filter = ${filterCounter.value}")
println(s"count for group by = ${groupByCounter.value}")
println(s"count for join = ${joinCounter.value}")
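
As a self-contained variant of the same idea, here is a minimal sketch. It is not the code above verbatim: calling map on a DataFrame needs an Encoder[Row], which spark.implicits._ does not provide, so this version swaps in a round-trip through the RDD API instead. The helper name counted, the object name RowCountSketch, the sample data, and the max_y alias are all made up for illustration. Keep in mind that accumulator updates made inside transformations can be applied more than once if a task is retried or a stage is recomputed, so treat the values as approximate in that case. The RDD round-trip also hides each step from the query optimizer, and depending on the Spark version and adaptive execution settings, part of the work may start when .rdd is called rather than at the final action.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, max}
import org.apache.spark.util.LongAccumulator

object RowCountSketch {
  // Hypothetical helper: passes every row through unchanged and adds 1 to `acc` per row.
  // Going through the RDD API avoids the need for an implicit Encoder[Row].
  def counted(df: DataFrame, acc: LongAccumulator): DataFrame = {
    val passedThrough = df.rdd.map { row => acc.add(1); row }
    df.sparkSession.createDataFrame(passedThrough, df.schema)
  }

  // Runs a small filter + groupBy pipeline and returns
  // (rows after filter, rows after groupBy, result of the final count()).
  def run(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._

    val filterCounter = spark.sparkContext.longAccumulator("filter-counter")
    val groupByCounter = spark.sparkContext.longAccumulator("group-counter")

    // Made-up sample data, only for illustration.
    val myDataFrame = Seq((3, 10), (3, 20), (4, 30)).toDF("x", "y")

    val filtered = counted(myDataFrame.filter(col("x") === lit(3)), filterCounter)
    val grouped = counted(
      filtered.groupBy(col("x")).agg(max("y").as("max_y")),
      groupByCounter
    )

    // The accumulators are read only after the final action has completed.
    val finalCount = grouped.count()
    (filterCounter.value.longValue, groupByCounter.value.longValue, finalCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-count-sketch")
      .getOrCreate()

    val (afterFilter, afterGroupBy, finalCount) = run(spark)
    println(s"count for filter = $afterFilter")
    println(s"count for group by = $afterGroupBy")
    println(s"final count = $finalCount")

    spark.stop()
  }
}
```

Wrapping the pass-through step in a helper keeps the pipeline readable and makes it easy to remove the counting once it is no longer needed.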
