13
votes

I'd like to use pandas for all my analysis along with numpy but use Rpy2 for plotting my data. I want to do all analyses using pandas dataframes and then use full plotting of R via rpy2 to plot these. py2, and am using ipython to plot. What's the correct way to do this?

Nearly all commands I try fail. For example:

  • I'm trying to plot a scatter between two columns of a pandas DataFrame df. I'd like the labels of df to be used in x/y axis just like would be used if it were an R dataframe. Is there a way to do this? When I try to do it with r.plot, I get this gibberish plot:

In: r.plot(df.a, df.b) # df is pandas DataFrame

yields:

Out: rpy2.rinterface.NULL

resulting in the plot:

enter image description here

As you can see, the axes labels are messed up and it's not reading the axes labels from the DataFrame like it should (the X axis is column a of df and the Y axis is column b).

  • If I try to make a histogram with r.hist, it doesn't work at all, yielding the error:

    In: r.hist(df.a)
    Out: 
    ...
    vectors.pyc in <genexpr>((x,))
        293         if l < 7:
        294             s = '[' + \
    --> 295                 ', '.join((p_str(x, max_width = math.floor(52 / l)) for x in self[ : 8])) +\
        296                 ']'
        297         else:
    
    vectors.pyc in p_str(x, max_width)
        287                     res = x
        288                 else:
    --> 289                     res = "%s..." % (str(x[ : (max_width - 3)]))
        290             return res
        291 
    
    TypeError: slice indices must be integers or None or have an __index__ method
    

And resulting in this plot:

enter image description here

Any idea what the error means? And again here, the axes are all messed up and littered with gibberish data.

EDIT: This error occurs only when using ipython. When I run the command from a script, it still produces the problematic plot, but at least runs with no errors. It must be something wrong with calling these commands from ipython.

  • I also tried to convert the pandas DataFrame df to an R DataFrame as recommended by the poster below, but that fails too with this error:

    com.convert_to_r_dataframe(mydf) # mydf is a pandas DataFrame
    ----> 1 com.convert_to_r_dataframe(mydf)
    in convert_to_r_dataframe(df, strings_as_factors)
        275     # FIXME: This doesn't handle MultiIndex
        276 
    --> 277     for column in df:
        278         value = df[column]
        279         value_type = value.dtype.type
    
    TypeError: iteration over non-sequence
    

How can I get these basic plotting features to work on Pandas DataFrame (with labels of plots read from the labels of the Pandas DataFrame), and also get the conversion between a Pandas DF to an R DF to work?

EDIT2: Here is a complete example of a csv file "test.txt" (http://pastebin.ca/2311928) and my code to answer @dale's comment:

import rpy2
from rpy2.robjects import r
import rpy2.robjects.numpy2ri
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
from rpy2.robjects.lib import grid
from rpy2.robjects.lib import ggplot2
rpy2.robjects.numpy2ri.activate()
from numpy import *
import scipy

# load up pandas df
import pandas
data = pandas.read_table("./test.txt")
# plotting a column fails
print "data.c2: ", data.c2
r.plot(data.c2)
# Conversion and then plotting also fails
r_df = com.convert_to_r_dataframe(data)
r.plot(r_df)

The call to plot the column of "data.c2" fails, even though data.c2 is a column of a pandas df and therefore for all intents and purposes should be a numpy array. I use the activate() call so I thought it would handle this column as a numpy array and plot it.

The second call to plot the dataframe data after conversion to an R dataframe also fails. Why is that? If I load up test.txt from R as a dataframe, I'm able to plot() it and since my dataframe was converted from pandas to R, it seems like it should work here too.

When I do try rmagic in ipython, it does not fire up a plot window for some reason, though it does not error. I.e. if I do:

In [12]: X = np.array([0,1,2,3,4])

In [13]: Y = np.array([3,5,4,6,7])
In [14]: import rpy2

In [15]: from rpy2.robjects import r

In [16]: import rpy2.robjects.numpy2ri

In [17]: import pandas.rpy.common as com

In [18]: from rpy2.robjects.packages import importr

In [19]: from rpy2.robjects.lib import grid

In [20]: from rpy2.robjects.lib import ggplot2


In [21]: rpy2.robjects.numpy2ri.activate()

In [22]: from numpy import *

In [23]: import scipy

In [24]: r.assign("x", X)
Out[24]: 
<Array - Python:0x592ad88 / R:0x6110850>
[       0,        1,        2,        3,        4]

In [25]: r.assign("y", Y)
<Array - Python:0x592f5f0 / R:0x61109b8>
[       3,        5,        4,        6,        7]

In [27]: %R plot(x,y)

There's no error, but no plot window either. In any case, I'd like to stick to rpy2 and not rely on rmagic if possible.

Thanks.

3
you can export csv and import back or use rpylocojay
@locojay: How do I use rpy with pandas dataframe?user248237
take a look at rpy.sourceforge.net/rpy/doc/rpy_html/DataFrame-class.html which uses std python ds... take same approach using pandas dflocojay

3 Answers

7
votes

[note: Your code in "edit 2" is working here (Python 2.7, rpy2-2.3.2, R-1.15.2).]

As @dale mentions it whenever R objects are anonymous (that is no R symbol exists for the object) the R deparse(substitute()) will end up returning the structure() of the R object, and a possible fix is to specify the "xlab" and "ylab" parameters; for some plots you'll have to also specify main (the title).

An other way to work around that is to use R's formulas and feed the data frame (more below, after we work out the conversion part).

Forget about what is in pandas.rpy. It is both broken and seem to ignore features available in rpy2.

An earlier quick fix to conversion with ipython can be turned into a proper conversion rather easily. I am considering adding one to the rpy2 codebase (with more bells and whistles), but in the meantime just add the following snippet after all your imports in your code examples. It will transparently convert pandas' DataFrame objects into rpy2's DataFrame whenever an R call is made.

from collections import OrderedDict
py2ri_orig = rpy2.robjects.conversion.py2ri
def conversion_pydataframe(obj):
    if isinstance(obj, pandas.core.frame.DataFrame):
        od = OrderedDict()
        for name, values in obj.iteritems():
            if values.dtype.kind == 'O':
                od[name] = rpy2.robjects.vectors.StrVector(values)
            else:
                od[name] = rpy2.robjects.conversion.py2ri(values)
        return rpy2.robjects.vectors.DataFrame(od)
    elif isinstance(obj, pandas.core.series.Series):
        # converted as a numpy array
        res = py2ri_orig(obj) 
        # "index" is equivalent to "names" in R
        if obj.ndim == 1:
            res.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
        else:
            res.dimnames = ListVector(ro.conversion.py2ri(obj.index))
        return res
    else:
        return py2ri_orig(obj) 
rpy2.robjects.conversion.py2ri = conversion_pydataframe

Now the following code will "just work":

r.plot(rpy2.robjects.Formula('c3~c2'), data)
# `data` was converted to an rpy2 data.frame on the fly
# and the a scatter plot c3 vs c2 (with "c2" and "c3" the labels on
# the "x" axis and "y" axis).

I also note that you are importing ggplot2, without using it. Currently the conversion will have to be explicitly requested. For example:

p = ggplot2.ggplot(rpy2.robjects.conversion.py2ri(data)) +\
    ggplot2.geom_histogram(ggplot2.aes_string(x = 'c3'))
p.plot()
6
votes

You need to pass in the labels explicitly when calling the r.plot function.

r.plot([1,2,3],[1,2,3], xlab="X", ylab="Y")

When you plot in R, it grabs the labels via deparse(substitute(x)) which essentially grabs the variable name from the plot(testX, testY). When you're passing in python objects via rpy2, it's an anonymous R object and akin to the following in R:

> deparse(substitute(c(1,2,3)))
[1] "c(1, 2, 3)"

which is why you're getting the crazy labels.

A lot of times it's saner to use rpy2 to only push data back and forth.

r.assign('testX', df.A)
r.assign('testY', df.B)
%R plot(testX, testY)

rdf = com.convert_to_r_dataframe(df)
r.assign('bob', rdf)
%R plot(bob$$A, bob$$B)

http://nbviewer.ipython.org/4734581/

5
votes

use rpy. the conversion is part of pandas so you don't need to do it yoursef http://pandas.pydata.org/pandas-docs/dev/r_interface.html

In [1217]: from pandas import DataFrame

In [1218]: df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
   ......:                index=["one", "two", "three"])
   ......:

In [1219]: r_dataframe = com.convert_to_r_dataframe(df)

In [1220]: print type(r_dataframe)
<class 'rpy2.robjects.vectors.DataFrame'>