I want to make a grouped filter using dplyr, such that within each group only the row with the minimum value of variable x is returned.
My problem is: As expected, in the case of multiple minima all rows with the minimum value are returned. But in my case, I only want the first row if multiple minima are present.
Here's an example:
df <- data.frame(
  A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
  y=rnorm(9)
)
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, x == min(x))
As expected, all minima are returned:
Source: local data frame [6 x 3]
Groups: A
  A x           y
1 A 1 -1.04584335
2 A 1  0.97949399
3 B 2  0.79600971
4 C 5 -0.08655151
5 C 5  0.16649962
6 C 5 -0.05948012
With ddply, I would have approached the task this way:
library(plyr)
ddply(df, .(A), function(z) {
  z[z$x == min(z$x), ][1, ]
})
... which works:
  A x           y
1 A 1 -1.04584335
2 B 2  0.79600971
3 C 5 -0.08655151
Q: Is there a way to approach this in dplyr? (For speed reasons)
filter(df.g, rank(x) == 1)? – hadley
Does rank(x)==1 give the desired results? – Ricardo Saporta
1) I don't think min_rank helps here. He needs the first min value (look at the plyr solution). 2) In whatever programming language you write, the algorithmic complexity of rank (ties=min, max, first etc.) will be bigger than just computing min. – Arun
rank(x, ties.method="first")==1 works, as min and min_rank do not differentiate between multiple minima. – Felix S
I don't consider which.min to be premature optimisation. AFAIK it's a natural choice, reads well, is easy to understand, and is fast, as it happens to be O(n) too. – Arun
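For reference, the two suggestions from the comments could be sketched roughly as follows. This is only a minimal sketch: it assumes a dplyr version that provides slice(), the names first.by.rank and first.by.which.min are made up for illustration, and set.seed() is added only so that y is reproducible.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, not part of the original example
df <- data.frame(
  A=c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  x=c(1, 1, 2, 2, 3, 4, 5, 5, 5),
  y=rnorm(9)
)
df.g <- group_by(df, A)

# Variant 1: rank with ties.method="first" gives rank 1 only to the
# first of several tied minima within each group
first.by.rank <- filter(df.g, rank(x, ties.method="first") == 1)

# Variant 2: which.min returns the position of the first minimum,
# and slice keeps that single row per group
first.by.which.min <- slice(df.g, which.min(x))

as.data.frame(first.by.rank)
as.data.frame(first.by.which.min)
```

Both variants should return one row per group (rows 1, 4 and 7 of df here). Going by Arun's comments, the which.min variant avoids the ranking step entirely and is therefore probably the cheaper of the two.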