
I need to get all the columns along with the count, in a Scala RDD.

Col1  col2  col3  col4
us    A     Q1    10
us    A     Q3    10
us    A     Q2    20
us    B     Q4    10
us    B     Q5    20
uk    A     Q1    10
uk    A     Q3    10
uk    A     Q2    20
uk    B     Q4    10
uk    B     Q5    20

I want result like:

Col1  col2  col3  col4  count
us    A     Q1    10    3
us    A     Q3    10    3
us    A     Q2    20    3
us    B     Q4    10    2
us    B     Q5    20    2
uk    A     Q1    10    3
uk    A     Q3    10    3
uk    A     Q2    20    3
uk    B     Q4    10    2
uk    B     Q5    20    2

This is something like a GROUP BY on col1, col2 that gets counts, but I need col3 and col4 in the result as well.

I am trying this in a Scala RDD:

val Top_RDD_1 = RDD.groupBy(f => (f._1, f._2)).mapValues(_.toList)

This produces

RDD[((String, String), List[(String, String, String, Double, Double, Double)])]

That is, ((col1, col2), List((col1, col2, col3, col4))); for example, ((us, A), List((us, A, Q1, 10), (us, A, Q3, 10), (us, A, Q2, 20))).

How can I take the list's count and access the list values?

Please help me with Spark Scala RDD code.

Thanks, Balaji.


2 Answers


I can't see a way to do this in one "scan" of the RDD: you'll have to calculate the counts using reduceByKey and then join the result back to the original RDD. To do that efficiently (without recomputing the input) you should cache/persist the input before the join:

// Key each record by (col1, col2); cache, since keyed is used twice.
val keyed: RDD[((String, String), (String, String, String, Int))] = input
  .keyBy { case (c1, c2, _, _) => (c1, c2) }
  .cache()

// Count the records per key.
val counts: RDD[((String, String), Int)] = keyed.mapValues(_ => 1).reduceByKey(_ + _)

// Join each record with its key's count and flatten back to plain tuples.
val result = keyed.join(counts).values.map {
  case ((c1, c2, c3, c4), count) => (c1, c2, c3, c4, count)
}
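For reference, here is a sketch of the same count-then-join logic on plain Scala collections (assuming the four columns are typed (String, String, String, Int)); on a real RDD the identical chain works once the data is loaded with sc.parallelize:

```scala
object JoinCountDemo extends App {
  // Sample rows mirroring the question's "us" data (assumed types).
  val input = List(
    ("us", "A", "Q1", 10), ("us", "A", "Q3", 10), ("us", "A", "Q2", 20),
    ("us", "B", "Q4", 10), ("us", "B", "Q5", 20)
  )

  // Key every row by (col1, col2), as keyBy does on an RDD.
  val keyed = input.map { case row @ (c1, c2, _, _) => ((c1, c2), row) }

  // Count rows per key, like mapValues(_ => 1).reduceByKey(_ + _).
  val counts = keyed.groupBy(_._1).map { case (k, rows) => (k, rows.size) }

  // Join each row with its group's count, like join(counts).values.map.
  val result = keyed.map { case (k, (c1, c2, c3, c4)) => (c1, c2, c3, c4, counts(k)) }

  println(result)
}
```

The (us, A) rows come out with count 3 and the (us, B) rows with count 2, matching the expected output in the question.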

Here is the Python code:

sales = sc.parallelize([["US","A","Q1", 10], ["US","A","Q2", 20], ["US","B","Q3", 10], ["UK","A","Q1", 10], ["UK","A","Q2", 20], ["UK","B","Q3", 10]])  # sample RDD data

def func(data):
    ldata = list(data)                  # converting iterator class to list
    size = len(ldata)                   # count(*) of the list
    return [i + [size] for i in ldata]  # adding count(*) to the list

sales_count = sales.groupBy( lambda w: (w[0], w[1])).mapValues(func)
# Result: [(('US', 'A'), [['US', 'A', 'Q1', 10, 2], ['US', 'A', 'Q2', 20, 2]]), (('US', 'B'), [['US', 'B', 'Q3', 10, 1]]), (('UK', 'A'), [['UK', 'A', 'Q1', 10, 2], ['UK', 'A', 'Q2', 20, 2]]), (('UK', 'B'), [['UK', 'B', 'Q3', 10, 1]])]

finalResult = sales_count.flatMap(lambda res: res[1])
# Result:  [['US', 'A', 'Q1', 10, 2], ['US', 'A', 'Q2', 20, 2], ['US', 'B', 'Q3', 10, 1], ['UK', 'A', 'Q1', 10, 2], ['UK', 'A', 'Q2', 20, 2], ['UK', 'B', 'Q3', 10, 1]]

# Both the above operations can be combined to one statement
finalResult = sales.groupBy( lambda w: (w[0], w[1])).mapValues(func).flatMap(lambda res: res[1])

Note: a custom function is really helpful here, as shown. You can easily convert the same code into Scala.
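A Scala sketch of that conversion, written here against plain collections so it is self-contained (the same groupBy / flatMap chain applies to an RDD built with sc.parallelize; names and the Int type for the amount column are assumptions):

```scala
object GroupCountDemo extends App {
  // Sample data matching the answer's sales list (amount assumed Int).
  val sales = List(
    ("US", "A", "Q1", 10), ("US", "A", "Q2", 20), ("US", "B", "Q3", 10),
    ("UK", "A", "Q1", 10), ("UK", "A", "Q2", 20), ("UK", "B", "Q3", 10)
  )

  // Same idea as the Python func: append the group size to every row.
  def addCount(rows: Iterable[(String, String, String, Int)]) = {
    val size = rows.size
    rows.map { case (c1, c2, c3, c4) => (c1, c2, c3, c4, size) }
  }

  // groupBy then flatten each group, mirroring groupBy/mapValues/flatMap.
  val finalResult = sales
    .groupBy { case (c1, c2, _, _) => (c1, c2) }
    .flatMap { case (_, rows) => addCount(rows) }
    .toList

  println(finalResult)
}
```

Each (US, A) and (UK, A) row ends up with count 2, and each B row with count 1, as in the Python result above.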