
I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded it and split each line by tab. So I want to do something like this:

val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)

I want to put the data in a DataFrame, but I'm having some trouble with the syntax:

val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
        .count()

Can someone help check if this is correct?


1 Answer


Spark needs to know the schema of the DataFrame. There are many ways to specify the schema; here is one option:

import spark.implicits._ // needed for .toDF, assuming a SparkSession named `spark`

val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()
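
Note that your original RDD code sums the third column with reduceByKey(_ + _), while count() only counts the rows per key. If the sum is what you actually want, the DataFrame equivalent is agg with the built-in sum function (a short sketch using the df defined above):

import org.apache.spark.sql.functions.sum

df
  .groupBy("a")
  .agg(sum("b")) // per-key sum of column b, the DataFrame analogue of reduceByKey(_ + _)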
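
If you'd rather declare the schema explicitly instead of inferring it from a tuple, createDataFrame with a StructType also works (a minimal sketch, assuming file is an RDD[String] and a SparkSession named spark):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// declare the two columns we keep, by name and type
val schema = StructType(Seq(
  StructField("a", StringType, nullable = true),
  StructField("b", IntegerType, nullable = true)
))

val rows = file
  .map(_.split("\t"))
  .map(l => Row(l(0), l(2).toInt))

val df2 = spark.createDataFrame(rows, schema)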