0
votes

I have following df:-

result state clubName
win XYZ club1
win XYZ club2
win XYZ club1
win PQR club3

I need state wise max wining clubName

val byState =Window.partitionBy("state").orderBy('state)

I tried creating a window but does not helps..

Expected Result :-

Some like this in sql

select temp.res
(select count(result) as res
from table
group by clubName) temp
group by state

e.g

state max_count_of_wins clubName
XYZ 2 club1
2
What do you mean by state wise max wining clubName, can you add the expected result aswell ?koiralo
added as per comment @koiraloDot Net Dev 19
club1 and club2 have the same number of wins. why do you choose club1? and why there is no row for state PQR in your result?blackbishop
club1 has one win. Why does your sample result has two? In general, most of the time, if you can express what you need in sql, you can just use spark.sql to get it. Your sql isn't actually valid though. Perhaps, it makes sense to start with fixing that.Dima
sorry..I have edited the input.. @blackbishopDot Net Dev 19

2 Answers

0
votes

You can get the win count for each club, then assign a rank for each club ordered by wins, and filter those rows with rank = 1.

import org.apache.spark.sql.expressions.Window

val df2 = df.withColumn(
    "wins", 
    count(when(col("result") === "win", 1))
    .over(Window.partitionBy("state","clubName"))
).withColumn(
    "rn", 
    row_number().over(Window.partitionBy("state").orderBy(desc("wins")))
).filter("rn = 1").selectExpr("state", "wins as max_count_of_wins", "clubName")

df2.show
+-----+-----------------+--------+
|state|max_count_of_wins|clubName|
+-----+-----------------+--------+
|  PQR|                1|   club3|
|  XYZ|                2|   club1|
+-----+-----------------+--------+
0
votes

You can also use a SQL-dialect with SparkSQL (find doc here):

df.sql("""
SELECT tt.name, tt.state, MAX(tt.nWins) as max_count_of_wins
FROM (
  SELECT t1.clubName as name, t1.state as state, COUNT(1) as nWins
  FROM Table1 t1
  WHERE t1.result = 'win'
  GROUP BY state, name
  ) as tt
GROUP BY tt.state;
""")

where the table in the dataframe would be named Table1 and your dataframe df.

p.s. if you want to try it yourself use the initialization

CREATE TABLE Table1
    (`result` varchar(3), `state` varchar(3), `clubName` varchar(5))
;
    
INSERT INTO Table1
    (`result`, `state`, `clubName`)
VALUES
    ('win', 'XYZ', 'club1'),
    ('win', 'XYZ', 'club2'),
    ('win', 'XYZ', 'club1'),
    ('win', 'PQR', 'club3')
;

on http://sqlfiddle.com.