1 vote

I want to convert a Spark DataFrame into another DataFrame in a specific manner, as follows:

I have a Spark DataFrame:

col  des
A    a
A    b
B    b
B    c

As a result of the operation I would like to have a Spark DataFrame like this:

col  des
A    a,b
B    b,c

I tried to use:

result <- summarize(groupBy(df, df$col), des = n(df$des))

As a result I obtained the count. Is there any parameter of summarize (or agg) that converts a column into a list or something similar, with the assumption that all operations are done on Spark?

Thank you in advance

You can use collect_list(); however, it is not integrated into SparkR yet, I'm afraid. – mtoto
Unfortunately it is not :( I found that it is implemented in PySpark; that is one possible solution. – Meyk

2 Answers

2 votes

Here is the solution in Scala; you will need to adapt it for SparkR.

  import org.apache.spark.sql.functions.{collect_list, struct}
  import spark.implicits._

  val dataframe = spark.sparkContext.parallelize(Seq(
    ("A", "a"),
    ("A", "b"),
    ("B", "b"),
    ("B", "c")
  )).toDF("col", "desc")

  // collect the "desc" values of each group into a list
  dataframe.groupBy("col").agg(collect_list(struct("desc")).as("desc")).show

Hope this helps!
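For reference, a minimal SparkR sketch of the same idea. It assumes a newer SparkR release in which collect_list and concat_ws are exposed (as the comments above note, collect_list was not yet available in SparkR at the time), and that df is the DataFrame from the question:

    library(SparkR)

    # collect the "des" values of each group into an array (runs on Spark)
    result <- agg(groupBy(df, df$col), des = collect_list(df$des))

    # flatten each array into the comma-separated string from the desired output
    result <- withColumn(result, "des", concat_ws(",", result$des))

    showDF(result)

Collecting the bare column (rather than a struct, as in the Scala snippet) and then applying concat_ws yields exactly the col/des layout requested in the question.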

1 vote

SparkR code:

    sc <- sparkR.init()
    sqlContext <- sparkRSQL.init(sc)

    # create a local R data frame
    df <- data.frame(col = c("A","A","B","B"), des = c("a","b","b","c"))

    # convert it to a Spark DataFrame
    sdf <- createDataFrame(sqlContext, df)

    # register it as a temporary table so it can be queried with SQL
    registerTempTable(sdf, "sdf")

    head(sql(sqlContext, "SQL QUERY"))

Write the corresponding SQL query in place of "SQL QUERY" and execute it; the aggregation itself then runs on Spark.
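For instance, a sketch of a query that produces the desired comma-separated output, assuming the Spark SQL functions collect_list and concat_ws are available (on Spark 1.x, collect_list may require initializing a HiveContext instead of a plain SQLContext):

    # the grouping and string aggregation run on Spark;
    # only the result of head() is pulled back into R
    head(sql(sqlContext,
             "SELECT col, concat_ws(',', collect_list(des)) AS des
              FROM sdf
              GROUP BY col"))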