How to group by a select number of fields in an RDD looking for duplicates based on those fields

Question

I am new to Scala and Spark. I am working in the Spark Shell.
I need to Group By and sort by the first three fields of this file, looking for duplicates. If I find duplicates within the group, I need to append a counter to the third field, starting at "1" and incrementing by "1", for each record in the duplicate group. Resetting the counter back to "1" when reading a new group. When no duplicates are found, then just append the counter which would be "1".

CSV File contains the following:
("00111","00111651","4444","PY","MA")
("00111","00111651","4444","XX","MA")
("00112","00112P11","5555","TA","MA")

val csv = sc.textFile("file.csv")
val recs = csv.map(line => line.split(",")

If I apply the logic properly on the above example, the resulting RDD of recs would look like this:
("00111","00111651","44441","PY","MA")
("00111","00111651","44442","XX","MA")
("00112","00112P11","55551","TA","MA")

It's a bit "please write my code" as it stands. What have you tried? What approach have you considered? — The Archetypal Paul
Again. It's not "write my code". I am new to SCALA. I was able to identify the duplicates using the following: var addrDupes = loadAddresses.map(a => (a.field1,a.field2,a.field3)).countByValue.toList Just can't figure out how to change the field — denairPete
It wasn't an accusation, but a problem statement without any code will tend to get interpreted that way. Anyway, you have two answers now, one from me. — The Archetypal Paul
I'll remember to include my code next time. However my actual code includes 300 Java classes and a much more complex data structure that I'm dealing with. The example I posted was the most basic aspect of what I am trying to do without including all of the other stuff. — denairPete

Justin Pihony Justin Pihony · Accepted Answer · 2015-07-24T17:29:30

How about group the data, change it and put it back:

val csv = sc.parallelize(List(
  "00111,00111651,4444,PY,MA",
  "00111,00111651,4444,XX,MA",
  "00112,00112P11,5555,TA,MA"
))
val recs = csv.map(_.split(","))
val grouped = recs.groupBy(line=>(line(0),line(1), line(2)))
val numbered = grouped.mapValues(dataList=>
      dataList.zipWithIndex.map{case(data, idx) => data match {
          case Array(fst,scd,thd,rest@_*) => Array(fst,scd,thd+(idx+1)) ++ rest
      }
    })
numbered.flatMap{case (key, values)=>values}

How to group by a select number of fields in an RDD looking for duplicates based on those fields

2 Answers