How compute the percentile in PySpark dataframe for each key?

Question

I have a PySpark dataframe consists of three columns x, y, z.

X may have multiple rows in this dataframe. How can I compute the percentile of each key in x separately?

+------+---------+------+
|  Name|     Role|Salary|
+------+---------+------+
|   bob|Developer|125000|
|  mark|Developer|108000|
|  carl|   Tester| 70000|
|  carl|Developer|185000|
|  carl|   Tester| 65000|
| roman|   Tester| 82000|
| simon|Developer| 98000|
|  eric|Developer|144000|
|carlos|   Tester| 75000|
| henry|Developer|110000|
+------+---------+------+

The output needed:

+------+---------+------+----------
|  Name|     Role|Salary|      50%|
+------+---------+------+----------
|   bob|Developer|125000|117500.0 |
|  mark|Developer|108000|117500.0 |
|  carl|   Tester| 70000|72500.0  |
|  carl|Developer|185000|117500.0 |
|  carl|   Tester| 65000|72500.0  |
| roman|   Tester| 82000|72500.0  |
| simon|Developer| 98000|117500.0 |
|  eric|Developer|144000|117500.0 |
|carlos|   Tester| 75000|72500.0  |
| henry|Developer|110000|117500.0 |
+------+---------+------+---------

@, I dont have idea how it will be the output, im just looking to split the role into several range. What i mean by split, is to take for example the developer between [0-25% of the developer salary], then [25-50%], [50-75%] and [75%-100%]. Im doing that because i have a huge dataframe and i nedd to know (filter) for example the developper with high salary, etc — bib
You need to take the time to figure out what you want the output to be, and describe it in detail. What you're saying in your most recent comment and what you're asking in the question appear to be two different things. As it stands, this question may be closed as "Too Broad" or "Unclear what you're asking". — pault

Ala Tarighati Ala Tarighati · Accepted Answer · 2019-08-09T14:24:23

Try groupby + F.expr:

import pyspark.sql.functions as F

df1 = df.groupby('Role').agg(F.expr('percentile(Salary, array(0.25))')[0].alias('%25'),
                             F.expr('percentile(Salary, array(0.50))')[0].alias('%50'),
                             F.expr('percentile(Salary, array(0.75))')[0].alias('%75'))
df1.show()

Output:

+---------+--------+--------+--------+
|     Role|     %25|     %50|     %75|
+---------+--------+--------+--------+
|   Tester| 68750.0| 72500.0| 76750.0|
|Developer|108500.0|117500.0|139250.0|
+---------+--------+--------+--------+

Now you might join df1 with the original dataframe:

df.join(df1, on='Role', how='left').show()

Output:

+---------+------+------+--------+--------+--------+
|     Role|  Name|Salary|     %25|     %50|     %75|
+---------+------+------+--------+--------+--------+
|   Tester|  carl| 70000| 68750.0| 72500.0| 76750.0|
|   Tester|  carl| 65000| 68750.0| 72500.0| 76750.0|
|   Tester| roman| 82000| 68750.0| 72500.0| 76750.0|
|   Tester|carlos| 75000| 68750.0| 72500.0| 76750.0|
|Developer|   bob|125000|108500.0|117500.0|139250.0|
|Developer|  mark|108000|108500.0|117500.0|139250.0|
|Developer|  carl|185000|108500.0|117500.0|139250.0|
|Developer| simon| 98000|108500.0|117500.0|139250.0|
|Developer|  eric|144000|108500.0|117500.0|139250.0|
|Developer| henry|110000|108500.0|117500.0|139250.0|
+---------+------+------+--------+--------+--------+

How compute the percentile in PySpark dataframe for each key?

2 Answers