I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset in which I'm trying to apply a custom, more involved function and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.
Related
- Custom Function not recognized by ddply {plyr}...
- How do I pass variables to a custom function in ddply?
- r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)
Here's an example data set:
library(plyr)
df <- data.frame(id = rep(letters[1:3], each = 3),
                 value = 1:9)
Normally, I'd use ddply like so:
df_ply_1 <- ddply(df, .(id), mutate, mean = mean(value))
My visualization of this is that ddply splits df into "mini" data frames based on grouped combos of id, and then I add a new column by calling mean() on a column name that exists in df. So, my attempt to implement a function extended this idea:
# actually, my logical extension of the above was to use:
# ddply(..., mean = function(value) { mean(value) })
df_ply_2 <- ddply(df, .(id), mutate,
                  mean = function(df) { mean(df$value) })
Error: attempt to replicate an object of type 'closure'
None of the help I found on custom functions uses mutate, but that seems inconsistent, or at least annoying, to me, as the analog to my implemented solution is:
df_mean <- function(df) {
  temp <- data.frame(mean = rep(mean(df$value), nrow(df)))
  temp
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
In-line, it looks like I have to do this:
df_ply_4 <- df
df_ply_4$mean <- ddply(df, .(id), function(x) {
  temp <- data.frame(mean = rep(mean(x$value), length(x$value)))
  temp
})$mean
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Thanks for helping me "get it"!
Update after @Gregor's answer
Awesome answer, and I think I now get it. I was, indeed, confused about what mutate and summarize meant... thinking they were arguments to ddply regarding how to handle the result vs. actually being the functions themselves. So, thanks for that big insight.
Also, it really helped to understand that without mutate/summarize, I need to return a data.frame, which is the reason I have to cbind a column with the name of the column in the df that gets returned.
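For anyone following along, here is a minimal sketch of that pattern as I now understand it. The helper name add_group_mean and the result name df_ply_6 are made up for illustration and do not come from the answer. The function gets one per-id piece and hands back the whole piece with the new column attached, so there is nothing to pull out with $ afterwards.

```r
library(plyr)

df <- data.frame(id = rep(letters[1:3], each = 3),
                 value = 1:9)

# Hypothetical helper: receives one "mini" data frame (all rows for one id)
# and returns that same piece with a mean column bound on.
# cbind() recycles the single mean to every row of the piece.
add_group_mean <- function(piece) {
  cbind(piece, mean = mean(piece$value))
}

# No mutate/summarize here: the custom function itself is what ddply applies,
# and ddply stacks the returned data frames back together.
df_ply_6 <- ddply(df, .(id), add_group_mean)
df_ply_6
```

Because the function returns every row of each piece, the result has the same number of rows as df regardless of how many rows each id has.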
Lastly, if I do use mutate, it's helpful to now realize I can return a vector result and get the right result. Thus, I can do this, which I've now understood after reading your answer:
# I also caught that the code above doesn't do the right thing
# and recycles the single value returned by mean() vs. repeating it like
# I expected. Now that I know it's taking a vector, I know I need to return
# a vector the same length as my mini df
custom_mean <- function(x) {
  rep(mean(x), length(x))
}
df_ply_5 <- ddply(df, .(id), mutate,
                  mean = custom_mean(value))
Thanks again for your in-depth answer!
Update per @Gregor's last comment
Hmmm. I used rep(mean(x), length(x)) due to this observation for df_ply_3's result (I admit to not actually looking at it closely when I ran it the first time making this post, I just saw that it didn't give me an error!):
df_mean <- function(x) {
  data.frame(mean = mean(x$value))
}
df_ply_3 <- df
df_ply_3$mean <- ddply(df, .(id), df_mean)$mean
df_ply_3
  id value mean
1  a     1    2
2  a     2    5
3  a     3    8
4  b     4    2
5  b     5    5
6  b     6    8
7  c     7    2
8  c     8    5
9  c     9    8
So, I'm thinking that my code running without error was actually an accident, based on the fact that I had 3 id values, each repeated 3 times. Thus the actual return was the equivalent of summarize (one row per id value), which was then recycled. Testing that theory appears accurate if I update my data frame like so:
df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
                 value = 1:10)
I get an error when trying to use the df_ply_3 method with df_mean():
Error in `$<-.data.frame`(`*tmp*`, "mean", value = c(2, 5, 8, 10)) :
replacement has 4 rows, data has 10
So, the mini df passed to df_mean returns a df where mean is the result of taking the mean of the value vector (returns one value). So, my output was just a data.frame of three values, one per id group. I'm thinking the mutate way sort of "remembers" that it was passed a mini data frame, and then repeats the single output to match its length?
In any case, thanks for commenting on df_ply_5; indeed, if I remove the rep() bit and just return mean(x), it works great!
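To convince myself, here is a small sketch of what I believe is happening, using the 10-row data frame from above. The names piece_a, group_means, via_mutate and via_wrapper are just illustrative. As far as I can tell, mutate doesn't "remember" anything: it is simply called on each mini data frame, and a length-1 result gets recycled to that piece's number of rows.

```r
library(plyr)

df <- data.frame(id = c(rep(letters[1:3], each = 3), "d"),
                 value = 1:10)

# mutate() is an ordinary function: called on a single piece,
# the one value from mean() is recycled down the new column.
piece_a <- df[df$id == "a", ]
mutate(piece_a, mean = mean(value))

# A function returning a one-row data frame per piece gives a
# summarize-style result: one row per id, not one row per input row.
group_means <- ddply(df, .(id), function(piece) {
  data.frame(mean = mean(piece$value))
})
nrow(group_means)  # one row per id group

# Two spellings of the per-row version: mutate passed as the function,
# and mutate called inside an anonymous function.
via_mutate  <- ddply(df, .(id), mutate, mean = mean(value))
via_wrapper <- ddply(df, .(id), function(d) mutate(d, mean = mean(value)))
all.equal(via_mutate$mean, via_wrapper$mean)
```

This also explains the error above: group_means has one row per id, so assigning its mean column straight into a 10-row data frame cannot line up unless the row counts happen to divide evenly.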
ddply(df, .(id), function(d) mutate(d, mean = mean(value))) – baptiste
custom_mean function... thanks to recycling if you want the same value multiple times you can just return one value, it's a nice feature! Notice that your df_ply_1, your df_ply_5 and @baptiste's comment code are all slightly different, but the returns are all the same. – Gregor Thomas
mutate actually is the function passed to ddply() (vs. my thought that it was some argument telling ddply() how to return the result), it makes sense that I could call it like that as the function vs. trying to specify an additional function to "mutate on." – Hendy