ValueError: setting an array element with a sequence (data preprocessing)

Question

dataset = pd.read_csv('train_data.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[0,1,2,3,4,5,6,7,8,9,10,11,12])],remainder='passthrough')
X = np.array(ct.fit_transform(X))
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X, y)                 #Error is thrown here

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'csr_matrix'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: setting an array element with a sequence.

Welcome to StackOverflow! Just in case there might be no answer popping up soon, the code seems to lack the import statements, so it doesn't seem to be a self contained snippet. If you make it easier for others to run it directly and see the bug locally, where possible, you increase the chance of good answers. — E. T.

StupidWolf · Accepted Answer · 2021-02-26T19:04:52

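Your error comes from np.array(ct.fit_transform(X)). The one-hot encoding step returns a sparse matrix (type csr_matrix) here, and wrapping it in np.array does not convert it to a dense array. It produces a 0-dimensional object array that holds the sparse matrix, which the regressor then cannot convert to floats. You can convert it to dense using: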

ct.fit_transform(X).toarray()  # returns a regular numpy.ndarray (todense() returns a numpy.matrix)

But this can be costly in memory, and it is unnecessary because the regressor accepts a sparse matrix. You can simply pass it in directly, as illustrated below with an example dataset:

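import numpy as np
import pandas as pd

# 13 categorical columns (v0..v12), one numeric column (v14) and a target y
dataset = pd.DataFrame(np.random.choice(['A','B','C'],(50,13)),
                       columns=["v"+ str(i) for i in range(13)])
dataset['v14'] = np.random.uniform(0,1,50)
dataset['y'] = np.random.normal(0,1,50)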

X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),
                                      [0,1,2,3,4,5,6,7,8,9,10,11,12])],
                       remainder='passthrough')

Now transform X and keep it sparse:

X = ct.fit_transform(X)
type(X)
scipy.sparse.csr.csr_matrix
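
To get a rough feel for why keeping X sparse matters, here is a minimal sketch that compares the memory used by a sparse matrix with its dense equivalent. The matrix is a made-up, one-hot-like example (one nonzero per row), not the data from the question, and the names X_sparse, sparse_bytes and dense_bytes are only illustrative.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot-like matrix: 1000 rows, 500 columns, a single 1 per row
rng = np.random.default_rng(0)
rows = np.arange(1000)
cols = rng.integers(0, 500, size=1000)
data = np.ones(1000)
X_sparse = sparse.csr_matrix((data, (rows, cols)), shape=(1000, 500))

# A CSR matrix stores only three arrays: values, column indices, row pointers
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

# The dense version stores every cell, zeros included
dense_bytes = X_sparse.toarray().nbytes

print("sparse:", sparse_bytes, "bytes")
print("dense: ", dense_bytes, "bytes")
```

The exact sparse size depends on the index dtype SciPy picks, but the dense array always needs rows × columns × 8 bytes for float64, so the gap grows quickly as the number of encoded categories increases.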

Then regress:

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X, y)

The above does not throw an error.
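
To see what goes wrong in the original code, the following sketch reproduces the root cause on a tiny sparse matrix. It only assumes NumPy and SciPy; the 3x3 identity matrix and the variable names are illustrative and not part of the question's data.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the one-hot encoded output
X_sparse = sparse.csr_matrix(np.eye(3))

# np.array does not densify a sparse matrix; it wraps the whole
# object in a 0-dimensional array with dtype=object
wrapped = np.array(X_sparse)
print(wrapped.shape, wrapped.dtype)

# toarray() performs the actual conversion to a 2-D float array
dense = X_sparse.toarray()
print(dense.shape, dense.dtype)
```

The wrapped array has no rows or columns at all, which is why the estimator ends up trying to call float() on the csr_matrix itself and raises the error shown in the question.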
