Spark Scala GroupBy column and sum values

Question

I am a newbie in Apache-spark and recently started coding in Scala.

I have a RDD with 4 columns that looks like this: (Columns 1 - name, 2- title, 3- views, 4 - size)

aa    File:Sleeping_lion.jpg 1 8030
aa    Main_Page              1 78261
aa    Special:Statistics     1 20493
aa.b  User:5.34.97.97        1 4749
aa.b  User:80.63.79.2        1 4751
af    Blowback               2 16896
af    Bluff                  2 21442
en    Huntingtown,_Maryland  1 0

I want to group based on Column Name and get the sum of Column views.

It should be like this:

aa   3
aa.b 2
af   2
en   1

I have tried to use groupByKey and reduceByKey but I am stuck and unable to proceed further.

Why have you bet on RDD API if "I am a newbie in Apache-spark and recently started coding in Scala."? Why not Spark SQL's Dataframe API? — Jacek Laskowski
I have improved my answer below to include two alternatives ways to achive the result. — eugenio calabrese

eugenio calabrese eugenio calabrese · Accepted Answer · 2018-03-30T13:21:13

This should work, you read the text file, split each line by the separator, map to key value with the appropiate fileds and use countByKey:

sc.textFile("path to the text file")
 .map(x => x.split(" ",-1))
 .map(x => (x(0),x(3)))
 .countByKey

To complete my answer you can approach the problem using dataframe api ( if this is possible for you depending on spark version), example:

val result = df.groupBy("column to Group on").agg(count("column to count on"))

another possibility is to use the sql approach:

val df = spark.read.csv("csv path")
df.createOrReplaceTempView("temp_table")
val result = sqlContext.sql("select <col to Group on> , count(col to count on) from temp_table Group by <col to Group on>")

Spark Scala GroupBy column and sum values

2 Answers