0
votes

I am new to Machine Learning and to Python. I am trying to build a Random Forest model in order to predict cement strength. There are two .csv files: train_data.csv and test_data.csv.

This is what I have done so far. I am trying to compute the r2_score of the predictions here.

df=pd.read_csv("train_data(1).csv")
X=df.drop('strength',axis=1)
y=df['strength']
model=RandomForestRegressor()
model.fit(X,y)
X_test=pd.read_csv("test_data.csv")
y_pred=model.predict(X_test)
acc_R=metrics.r2_score(y,y_pred)
acc_R

The problem here is that the shape of y and y_pred is different. So I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [721, 309]

How do I correct this? Can someone explain to me what I am doing wrong?

2 Answers

0
votes
import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

df_train = pd.read_csv("train_data(1).csv")
X_train = df_train.drop('strength', axis=1)
y_train = df_train['strength']
model = RandomForestRegressor()
model.fit(X_train, y_train)
df_test = pd.read_csv("test_data.csv")
X_test = df_test.drop('strength', axis=1)  # only works if test_data.csv has a 'strength' column
y_test = df_test['strength']  # true values for the test rows
y_pred = model.predict(X_test)
acc_R = metrics.r2_score(y_test, y_pred)  # both have one entry per test row
acc_R
0
votes

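You need to compare y_pred with y_test, not with y, which you used to train the model. y has 721 entries (the training rows) while y_pred has 309 (the test rows), which is where the error comes from: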

acc_R=metrics.r2_score(y_test,y_pred)

For this to work, test_data.csv must also contain the true 'strength' values for the test rows; those are your y_test.
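
To see why the two arguments must have the same length, here is R² written out by hand. This is only an illustrative sketch, not scikit-learn's actual implementation; the function name r2 and the numbers are made up for the example. Each true value is paired with the prediction for the same row, so 721 training labels cannot be paired with 309 test predictions.

```python
def r2(y_true, y_pred):
    """Coefficient of determination, written out by hand for illustration."""
    if len(y_true) != len(y_pred):
        # Each true value is paired with one prediction, so the lengths must match.
        raise ValueError(
            f"inconsistent numbers of samples: [{len(y_true)}, {len(y_pred)}]"
        )
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_test = [30.0, 40.0, 50.0]  # made-up true strengths
y_pred = [32.0, 38.0, 50.0]  # made-up predictions for the same three rows
score = r2(y_test, y_pred)
print(score)
```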

Try the following:

df=pd.read_csv("train_data(1).csv")
X=df.drop('strength',axis=1)
y=df['strength']
model=RandomForestRegressor()
model.fit(X,y)
df1=pd.read_csv("test_data.csv") # read the test data
X_test=df1.drop('strength',axis=1) # features used for prediction
y_test=df1['strength'] # true values for the rows in X_test
y_pred=model.predict(X_test) # predicted values, one per test row
acc_R=metrics.r2_score(y_test,y_pred) # compare true and predicted values
acc_R
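
If test_data.csv turns out not to have a 'strength' column (the original code passes the whole file to predict, which suggests this may be the case), you cannot score the test predictions at all. One common workaround is to hold out part of the labeled training data and score on that instead. Below is a rough sketch of that idea. The data is synthetic because I don't have your files: the 8 feature columns and the formula for y are made up, and only the 721 row count comes from your error message.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled training data (hypothetical features).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(721, 8))
y = 40 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, size=721)

# Keep 30% of the labeled rows aside for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

# Predictions and true values now come from the same held-out rows.
y_val_pred = model.predict(X_val)
val_r2 = r2_score(y_val, y_val_pred)
print(val_r2)
```

With your real data you would replace X and y with df.drop('strength',axis=1) and df['strength'], then still call model.predict on test_data.csv to get the final predictions.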