
I have a simple CSV file with columns (id: Integer, Name: String, Gender: String). I want to add a column to this dataframe using withColumn. I know the first parameter is the column name, but I'm not sure how to pass the second parameter, which should be the count grouped by gender. Can someone help me with this?

I tried df.withColumn("Count", df.groupby("Gender").count()), but it gives a type mismatch error saying it expects a Column but is getting a DataFrame. I also tried wrapping it in col(), expecting that to cast it:

val newDf=df.withColumn("count",col(df.groupBy("Gender").count()))

I expect a new dataframe with count column added.


1 Answer


I think this is what you are looking for: a window function partitioned by gender, combined with the count function.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import spark.implicits._

val df = Seq(
  (1, "name1", "M"),
  (2, "name2", "M"),
  (3, "name3", "M"),
  (4, "name4", "M"),
  (5, "name5", "F"),
  (6, "name6", "F"),
  (7, "name7", "M")
).toDF("id", "Name", "Gender")

val window = Window.partitionBy("Gender")

df.withColumn("count", count($"Gender").over(window))
  .show(false)

Output:

+---+-----+------+-----+
|id |Name |Gender|count|
+---+-----+------+-----+
|5  |name5|F     |2    |
|6  |name6|F     |2    |
|1  |name1|M     |5    |
|2  |name2|M     |5    |
|3  |name3|M     |5    |
|4  |name4|M     |5    |
|7  |name7|M     |5    |
+---+-----+------+-----+
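
If you prefer not to use a window function, an aggregate plus a join gives the same result: compute one count row per gender with groupBy, then join those counts back onto the original rows. A minimal sketch, assuming the same `df` as above (the names `counts` and `withCounts` are just illustrative):

```scala
import org.apache.spark.sql.functions.count

// One row per gender with its count
val counts = df.groupBy("Gender").agg(count("*").as("count"))

// Join the per-gender counts back onto every original row
val withCounts = df.join(counts, Seq("Gender"))
withCounts.show(false)
```

The window version avoids the shuffle-and-join and keeps the original row order within each partition, while the groupBy/join version can be easier to reason about; for small data either is fine.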