Couldn't load pyspark data frame to decision tree algorithm. It says can't work with pyspark data frame

Question

I am working on IBM's data platform. I was able to load the data into a PySpark data frame and create a Spark SQL table. After splitting the data set, I fed it into a scikit-learn random forest, and it raised errors saying that the Spark SQL data can't be loaded and that ndarrays are required.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np

X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
RM = RandomForestRegressor()
RM.fit(X_train.reshape(1,-1),y_train)

Error:

TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

After this error, I converted the query results to pandas like this:

x = spark.sql('select Id,YearBuilt,MoSold,YrSold,Fireplaces FROM Train').toPandas()
y = spark.sql('Select SalePrice FROM Train where SalePrice is not null').toPandas()

Error:

AttributeError                            Traceback (most recent call last)
in ()
      5 X_train,y_train,X_test,y_test = train_test_split(x,y,test_size = 0.1,random_state = 42)
      6 RM = RandomForestRegressor()
----> 7 RM.fit(X_train.reshape(1,-1),y_train)

/opt/ibm/conda/miniconda3.6/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068
   5069     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'reshape'

I've done something like this, x = spark.sql('select Id,YearBuilt,MoSold,YrSold,Fireplaces FROM Train').toPandas() y = spark.sql('Select SalePrice FROM Train where SalePrice is not null').toPandas() — Muntakimur Rahaman
Edit your question to include the example(s) with supporting code. — samkart

rbcvl · Accepted Answer · 2019-11-22T08:45:01

As the sklearn documentation says:

"""
    X : array-like or sparse matrix, shape = [n_samples, n_features]
"""
regr = RandomForestRegressor()
regr.fit(X, y)

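First, scikit-learn expects X to be array-like. Your first attempt passed a pyspark.sql.DataFrame, which is what the TypeError reports. After toPandas() you are passing a pandas.DataFrame, which scikit-learn can convert to an array.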
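A minimal sketch of what a working version of the flow could look like, assuming the data has already been converted with toPandas(). The frame below is a hypothetical stand-in for the Train table: the column names come from the question, but the values are synthetic. Two things to note: train_test_split returns its results in the order X_train, X_test, y_train, y_test, and no reshape is needed because X already has shape (n_samples, n_features).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for spark.sql(...).toPandas(); values are synthetic.
rng = np.random.default_rng(42)
train = pd.DataFrame({
    "YearBuilt": rng.integers(1900, 2010, size=100),
    "MoSold": rng.integers(1, 13, size=100),
    "YrSold": rng.integers(2006, 2011, size=100),
    "Fireplaces": rng.integers(0, 4, size=100),
    "SalePrice": rng.integers(50_000, 500_000, size=100),
})

# Take features and target from the same frame so the rows stay aligned.
X = train[["YearBuilt", "MoSold", "YrSold", "Fireplaces"]].to_numpy()
y = train["SalePrice"].to_numpy()

# Order of the returned values: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# X_train is already (n_samples, n_features), so no reshape is required.
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

In the question, x and y come from two separate queries, and only the second one filters out null SalePrice values, so the two results may not have the same number of rows. Selecting features and target in a single query, as assumed here, avoids that mismatch.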

Second, reshape() is a method of NumPy arrays, not of the pandas DataFrame object, which is why your second attempt raises the AttributeError.

import numpy as np
x = np.array([[2,3,4], [5,6,7]]) 
np.reshape(x, (3, -1))
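To connect this to the pandas case, here is a small sketch using a hypothetical two-row frame (made-up values) in place of the toPandas() result. It shows that the DataFrame itself has no reshape attribute, while the array returned by to_numpy() does.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for the result of toPandas().
df = pd.DataFrame({"YearBuilt": [2003, 1976], "Fireplaces": [0, 1]})

# A DataFrame does not expose reshape, hence the AttributeError.
has_reshape = hasattr(df, "reshape")

# Converting to a NumPy array gives an object that does support reshape.
arr = df.to_numpy()           # shape (2, 2): two samples, two features
row = arr[0].reshape(1, -1)   # one sample as shape (1, n_features)
```

Note that reshape(1, -1) is intended for a single sample, for example when predicting on one row. Applied to the whole training matrix, it would collapse every row into one sample, and the sample count would no longer match y.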

Hope this helps.
