As a learning exercise and because I'd like to do something similar with my own data, I'm trying to copy the answer to this example exactly but implement it in Python via rpy2.
This is turning out to be trickier than I thought because plyr uses a lot of convenient syntax (e.g. as.quoted variables, summarize, functions) that I haven't found easy to port to rpy2. Without even getting to the ggplot2 segment, this is what I've been able to manage so far, using **{} to pass the arguments whose names start with '.':
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
stats = importr('stats')
plyr = importr('plyr')
bs = importr('base')
r = ro.r
df = ro.DataFrame
mms = df( {'delicious': stats.rnorm(100),
'type':bs.sample(bs.as_factor(ro.StrVector(['peanut','regular'])), 100, replace=True),
'color':bs.sample(bs.as_factor(ro.StrVector(['r','g','y','b'])), 100, replace=True)} )
# first define a function, then use it in ddply call
myfunc = r('''myfunc <- function(var) {paste('n =', length(var))} ''')
mms_cor = plyr.ddply(**{'.data':mms,
'.variables':ro.StrVector(['type','color']),
'.fun':myfunc})
This runs without error, but printing the resulting mms_cor gives the following, which suggests the function isn't working as intended inside the ddply call. I think what is being calculated is the length of the mms data.frame, which is 3 (its number of columns), because other inputs to myfunc return different values:
type color V1
1 peanut b n = 3
2 peanut g n = 3
3 peanut r n = 3
4 peanut y n = 3
5 regular b n = 3
6 regular g n = 3
7 regular r n = 3
8 regular y n = 3
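The hunch about the 3 can be illustrated without R. An R data.frame is a list of columns, so its length is the column count, not the row count. Below is a plain-Python analogy, using a dict of columns as a stand-in; the name mms_like and the values are made up for illustration, and this is not how rpy2 stores a DataFrame:

```python
# Stand-in for an R data.frame: a dict mapping column names to equal-length columns.
# This is an analogy only; mms_like and its values are hypothetical.
mms_like = {
    "delicious": [0.1, -0.4, 1.2, 0.7, -1.1],
    "type": ["peanut", "regular", "peanut", "peanut", "regular"],
    "color": ["r", "g", "y", "b", "r"],
}

n_columns = len(mms_like)            # analogous to R's length(df): number of columns
n_rows = len(mms_like["delicious"])  # analogous to R's nrow(df): number of rows

print("length-like:", n_columns)
print("nrow-like:", n_rows)
```

If the same thing happens to each piece that ddply hands to the function, every group would report the column count regardless of how many rows it holds.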
Ideally I would get this to work with summarize, as done in the example answer, to have multiple calculations/label the output, but I couldn't get this to work either, and it really becomes awkward syntax-wise:
mms_cor = plyr.ddply(plyr.summarize, n=bs.paste('n =', bs.length('delicious')),
**{'.data':mms,'.variables':ro.StrVector(['type','color'])})
This gives the same output as above with 'n = 1'. I know it's reflecting the length of the 1-item vector 'delicious', but can't figure out how to make this a variable instead of a string, or which variable it would be (which is why I moved toward the function above). Additionally, it would be useful to know how one might get the as.quoted variable syntax (e.g. ddply(.data=mms, .(type, color), ...)) to work with rpy2. I know plyr has several as_quoted methods, but I can't figure out how to use them because documentation and examples are tricky to find.
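On the syntax point: the **{} unpacking used in the calls above is ordinary Python rather than anything specific to rpy2. A name like .data is not a valid Python identifier, so it cannot be written as a keyword directly, but a dict with string keys can be unpacked into keyword arguments. A minimal sketch, where show_kwargs is a hypothetical stand-in for an R function wrapper:

```python
def show_kwargs(**kwargs):
    """Hypothetical stand-in for a wrapped R function; returns the argument names it received."""
    return sorted(kwargs)


# Writing show_kwargs(.data="mms") would be a SyntaxError.
# Unpacking a dict lets the dotted names through, and they can be
# mixed with ordinary keywords such as n=1.
received = show_kwargs(n=1, **{".data": "mms", ".variables": ["type", "color"]})
print(received)
```

Ordinary keywords and the unpacked dict can appear in the same call, which keeps the dotted names confined to one place.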
Any help is greatly appreciated. Thanks.
Edit:
lgautier's solution: fix myfunc by using nrow rather than length.
myfunc = r('''myfunc <- function(var) {paste('n =', nrow(var))} ''')
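As a sanity check on what the corrected function should return, here is a plain-Python sketch of the same per-group row count. The rows are made up and stand in for mms; it does not call rpy2 or plyr, it only mirrors the idea of counting rows per (type, color) group and pasting 'n =' in front:

```python
from collections import Counter

# Made-up rows standing in for mms; each tuple is (type, color).
rows = [
    ("peanut", "r"), ("peanut", "r"), ("peanut", "g"),
    ("regular", "b"), ("regular", "b"), ("regular", "b"),
]

# Count rows per (type, color) group, which is what nrow(var) gives for each piece.
group_sizes = Counter(rows)

# Build one label per group, mirroring paste('n =', nrow(var)).
labels = {group: f"n = {size}" for group, size in sorted(group_sizes.items())}

for (mm_type, color), label in labels.items():
    print(mm_type, color, label)
```

With the nrow version the labels should differ between groups, unlike the constant 'n = 3' from the length version.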
Solution for ggplot2, in case it is useful for others. Note that I had to add x and y values to mms_cor as a workaround, since I'm using aes_string (I can't get aes to work in the Python environment):
rggplot2 = importr('ggplot2')  # note ggplot2 import above doesn't take 'mapping' kwarg
p = rggplot2.ggplot(data=mms, mapping=rggplot2.aes_string(x='delicious')) + \
rggplot2.geom_density() + \
rggplot2.facet_grid('type ~ color') + \
rggplot2.geom_text(data=mms_cor, mapping=rggplot2.aes_string(x='x', y='y', label='V1'), colour='black', inherit_aes=False)
p.plot()
length(var) is the length of the R data.frame, which is 3. nrow(var) is the number of rows. – lgautier
Put the R code in a function such as plot_stuff, source that into your rpy session, and call that function with the appropriate parameters. This also makes debugging the R code easier. – Paul Hiemstra