Non-conformable arrays in quantile regression model from pandas DataFrame using rpy2

Question

I am doing a quantile regression on the engel dataset with rpy2 (2.7.6):

import statsmodels as sm

from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri

pandas2ri.activate()

quantreg = importr('quantreg')

data = sm.datasets.engel.load_pandas().data

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

However this generates the following error:

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
Traceback (most recent call last):

  File "<ipython-input-22-02ee1015737c>", line 1, in <module>
    quantreg.rq('foodexp ~ income', data=data, tau=0.5)

  File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)

  File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
    res = super(Function, self).__call__(*new_args, **new_kwargs)

RRuntimeError: Error in y - x %*% z$coef : non-conformable arrays

From what I understand, non-conformable arrays in this case would mean there are some missing values or the 'arrays' being used are different sizes. I can confirm that this is NOT the case:

data.count()
Out[26]: 
income     235
foodexp    235
dtype: int64

data.shape
Out[27]: (235, 2)

What else could this error mean? Is it possible that the conversion from DataFrame to data.frame in rpy2 is not working correctly or maybe I'm missing something here? Can anyone else confirm this error?

Just in case here is some info regarding the version of R and Python.

R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Dec  7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]
on win32

Any help would be appreciated.

Edit 1:

If I load the dataset directly from R I don't get an error:

from rpy2.robjects import r

r.data('engel')
data = r['engel']

qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

So I think there is something wrong with the conversion with pandas2ri. The same error occurs when I try to convert the DataFrame to data.frame manually with pandas2ri.py2ri.

Edit 2:

Interestingly enough, if I used the deprecated pandas.rpy.common.convert_to_r_dataframe the error is gone:

import pandas.rpy.common as com

rdata = com.convert_to_r_dataframe(data)

qreg = quantreg.rq('foodexp ~ income', data=rdata, tau=0.5)

There is definitely a bug in pandas2ri which is also confirmed here.

From an R perspective I cannot be much help since I get the expected result from rq('foodexp ~ income', data=engel, tau=0.5). I'm wondering if you are actually getting substitution of the engel dataset into the R environment. — IRTFM
If you do everything in R, does it work? If so, then the problem is probably in the conversion from pandas to R. — BrenBarn
@BrenBarn If I do everything in R I don't get the error. I can also confirm that it has to be something wrong with the pandas2ri conversion by loading the dataset directly from R in Python (see edit). — pbreach

lgautier lgautier · Accepted Answer · 2016-01-02T02:41:02

As answered on the rpy2 issue tracker:

The root of the issue seems to be that the columns in the pandas data frame are converted to Array objects each with only one column.

>>> pandas2ri.py2ri_pandasdataframe(data) 
<DataFrame - Python:0x7f8af3c2afc8 / R:0x92958b0>
[Array, Array]
  income: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x7f8af57ef908 / R:0x92e1bf0>
[420.157651, 541.411707, 901.157457, ..., 581.359892, 743.077243, 1057.676711]
  foodexp: <class 'rpy2.robjects.vectors.Array'>
  <Array - Python:0x7f8af3c2ab88 / R:0x92e7600>
[255.839425, 310.958667, 485.680014, ..., 468.000798, 522.601906, 750.320163]

The distinction is a subtle one, but this seems to be confusing the quantreg package. There are other R functions appear to be working independently of whether the objects is an array with one column or a vector.

Turning the columns to R vectors appears to be what is required to solve the problem:

from rpy2.robjects.vectors import FloatVector
mydata=pandas2ri.py2ri_pandasdataframe(data)

from rpy2.robjects.packages import importr
base=importr('base')
mydata[0]=base.as_vector(mydata[0])
mydata[1]=base.as_vector(mydata[1])
# now this is working
qreg = quantreg.rq('foodexp ~ income', data=mydata, tau=0.5)

Now I would like to gather more data about whether this could solve the issue without breaking things else. For this I turned the fix into a custom converter derived from the pandas converter:

from rpy2.robjects import default_converter
from rpy2.robjects.conversion import Converter, localconverter

from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri, vectors
import numpy

my_converter = Converter('my converter',
                         template=pandas2ri.converter)
base=importr('base')

def ndarray_forcevector(obj):
    func=numpy2ri.converter.py2ri.registry[numpy.ndarray]
    # current conversion as performed by numpy
    res=func(obj)
    if len(obj.shape) == 1:
        # force into an R vector
        res=base.as_vector(res)
    return res

@my_converter.py2ri.register(pandas2ri.PandasSeries)
def py2ri_pandasseries(obj):
    # this is a copy of the function with the same name in pandas2ri, with
    # the call to ndarray_forcevector() as the only difference
    if obj.dtype == '<M8[ns]':
        # time series
        d = [vectors.IntVector([x.year for x in obj]),
             vectors.IntVector([x.month for x in obj]),
             vectors.IntVector([x.day for x in obj]),
             vectors.IntVector([x.hour for x in obj]),
             vectors.IntVector([x.minute for x in obj]),
             vectors.IntVector([x.second for x in obj])]
        res = vectors.ISOdatetime(*d)
        #FIXME: can the POSIXct be created from the POSIXct constructor ?
        # (is '<M8[ns]' mapping to Python datetime.datetime ?)
        res = vectors.POSIXct(res)
    else:
        # converted as a numpy array
        res = ndarray_forcevector(obj)
    # "index" is equivalent to "names" in R
    if obj.ndim == 1:
        res.do_slot_assign('names',
                           vectors.StrVector(tuple(str(x) for x in obj.index)))
    else:
        res.do_slot_assign('dimnames',
                          vectors.SexpVector(conversion.py2ri(obj.index)))
    return res

The easiest way to use this new converter might be in a context manager:

with localconverter(default_converter + my_converter) as cv:
    qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)

Non-conformable arrays in quantile regression model from pandas DataFrame using rpy2

Edit 1:

Edit 2:

1 Answers