I've done a fair amount of research but haven't found a post that addresses what I want to do.
I have a PySpark DataFrame my_df which is sorted by the value column in descending order-
+----+-----+
|name|value|
+----+-----+
| A| 30|
| B| 25|
| C| 20|
| D| 18|
| E| 18|
| F| 15|
| G| 10|
+----+-----+
The sum of all the values in the value column is 136. I want to keep the top rows whose combined value stays within x% of 136. In this example, let's say x=80. Then the target sum = 0.8*136 = 108.8. Hence, the new DataFrame will consist of the leading rows whose running total does not exceed 108.8.
In our example, the cutoff falls at row D (since the combined value up to D = 30+25+20+18 = 93, and adding row E's 18 would push the total past 108.8).
However, the hard part is that I also want to include the immediately following rows with duplicate values. In this case, I also want to include row E, since it has the same value as row D, i.e. 18.
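To make the rule concrete, here is the logic in plain Python over the example rows (just to pin down what I mean; the list literal mirrors the table above, and this is obviously not a Spark solution):

```python
rows = [("A", 30), ("B", 25), ("C", 20), ("D", 18), ("E", 18), ("F", 15), ("G", 10)]

total = sum(v for _, v in rows)   # 136
target = 0.8 * total              # 108.8

# Keep the leading rows whose running total stays within the target...
selected = []
running = 0
for name, value in rows:
    if running + value > target:
        break
    running += value
    selected.append((name, value))

# ...then also keep the immediately following rows that tie the last kept value.
cutoff_value = selected[-1][1]
for name, value in rows[len(selected):]:
    if value != cutoff_value:
        break
    selected.append((name, value))

print(selected)  # [('A', 30), ('B', 25), ('C', 20), ('D', 18), ('E', 18)]
```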
I want to slice my_df by passing a percentage variable x, for example 80 as discussed above. The new DataFrame should consist of the following rows-
+----+-----+
|name|value|
+----+-----+
| A| 30|
| B| 25|
| C| 20|
| D| 18|
| E| 18|
+----+-----+
One thing I could do here is iterate through the DataFrame (which is ~360k rows) much like the plain-Python sketch above, but I guess that defeats the purpose of Spark.
Is there a concise function for what I want here?
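For reference, the closest I have gotten on the Spark side is a cumulative-sum-over-a-window sketch like the one below. slice_by_percentage is just a name I made up, and I have only checked it against the toy table above:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

def slice_by_percentage(df, x):
    """Keep the top rows (by value) whose running total stays within x% of the
    grand total, plus any immediately following rows that tie the cutoff value."""
    total = df.agg(F.sum("value")).collect()[0][0]
    target = (x / 100.0) * total

    # Running total over rows ordered by value descending (name breaks ties so the
    # frame is deterministic). A window with no partitionBy pulls everything into
    # a single partition, which should still be OK for ~360k rows.
    w = (Window.orderBy(F.desc("value"), F.asc("name"))
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    cum = df.withColumn("cum_value", F.sum("value").over(w))

    # The cutoff value is the smallest value among the rows that fit within the target.
    cutoff_value = (cum.filter(F.col("cum_value") <= target)
                       .agg(F.min("value"))
                       .collect()[0][0])

    # Keep rows within the target plus rows tying the cutoff value; because the data
    # is sorted by value descending, those tying rows are exactly the immediately
    # following duplicates.
    return (cum.filter((F.col("cum_value") <= target) |
                       (F.col("value") == cutoff_value))
               .drop("cum_value"))

slice_by_percentage(my_df, 80).show()
```

On the toy data this gives the five rows above, but it feels clunky (two collect() round trips) for something this simple, so I am wondering whether there is a cleaner way.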
How do you sort the DataFrame? Is it based on the value? Or value and name? – pault
value – kev