
My intention is to do the equivalent of this basic SQL:

select shipgrp, shipstatus, count(*) cnt 
from shipstatus group by shipgrp, shipstatus

The examples I have seen for Spark DataFrames aggregate over some other column, e.g.:

df.groupBy($"shipgrp", $"shipstatus").agg(sum($"quantity"))

But my case above needs no other column, only a row count per group. What is the syntax or combination of method calls for that?

Update: A reader suggested that this question is a duplicate of "dataframe: how to groupBy/count then filter on count in Scala", but that one is about filtering by count; there is no filtering here.

@AbuShoeb That other one is about filtering by count, so it's not a clear duplicate. – WestCoastProjects
You are right, thanks! – Abu Shoeb
You might retract your "duplicate" vote ;) – WestCoastProjects
Done! Still learning :-P – Abu Shoeb

1 Answer


You can use count("*") inside Spark's agg function in the same way:

df.groupBy("shipgrp", "shipstatus").agg(count("*").as("cnt"))

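// count is defined in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.count

val df = Seq(("a", 1), ("a", 1), ("b", 2), ("b", 3)).toDF("A", "B")

// One row per distinct (A, B) pair; row order is not guaranteed
df.groupBy("A", "B").agg(count("*").as("cnt")).show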
+---+---+---+
|  A|  B|cnt|
+---+---+---+
|  b|  2|  1|
|  a|  1|  2|
|  b|  3|  1|
+---+---+---+
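
Two other routes should give the same result, sketched here as a standalone Scala script with Spark on the classpath. The sample rows, the app name, and the local master setting are made up for illustration; only the table and column names come from the question. The first route calls count() directly on the grouped data, which names its output column "count", so it is renamed to match the SQL alias. The second runs the original SQL against a temporary view.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("group-count-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the shipstatus table.
val shipstatus = Seq(
  ("g1", "shipped"),
  ("g1", "shipped"),
  ("g1", "pending"),
  ("g2", "shipped")
).toDF("shipgrp", "shipstatus")

// groupBy(...).count() adds a column named "count"; rename it to "cnt".
val counted = shipstatus
  .groupBy("shipgrp", "shipstatus")
  .count()
  .withColumnRenamed("count", "cnt")

// The same query through Spark SQL, using a temporary view.
shipstatus.createOrReplaceTempView("shipstatus")
val viaSql = spark.sql(
  "select shipgrp, shipstatus, count(*) cnt from shipstatus group by shipgrp, shipstatus")

counted.show()
viaSql.show()
```

The agg(count("*").as("cnt")) form in the answer is the one to keep when more aggregates (sum, max, and so on) need to sit next to the count in the same call.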