I'm new to data science and to Random Forest. I've been trying to compute adjusted R-squared and RMSE after fitting a Random Forest to a dataset of shape (1239, 29).

import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error
from sklearn.model_selection import train_test_split
X = df.loc[:, df.columns != 'PRODUCTMONTHLYREVENUE_LINE']
y = df.loc[:,['PRODUCTMONTHLYREVENUE_LINE']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

I split the data into training and test sets before fitting the Random Forest.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
model = RandomForestRegressor(oob_score=True)
model.fit(X_train, y_train)
y_predict=model.predict(X_test)

Now, when I try to compute RMSE, I get an error that I did not get with the OLS model.

RMSE= np.sqrt(mean_squared_error(y_predict,y))
RMSE

I get the error below:

ValueError                                Traceback (most recent call last)
in ()
----> 1 RMSE= np.sqrt(mean_squared_error(y_predict,y))
      2 RMSE

2 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    210     if len(uniques) > 1:
    211         raise ValueError("Found input variables with inconsistent numbers of"
--> 212                          " samples: %r" % [int(l) for l in lengths])
    213
    214

ValueError: Found input variables with inconsistent numbers of samples: [248, 1239]

1 Answer

Compare the predictions against y_test instead of y:

RMSE= np.sqrt(mean_squared_error(y_predict,y_test))
RMSE

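The variable y holds the labels for the whole dataset (1239 rows). You held out 20% of the data as a test set, so y_predict contains predictions for those 248 test rows only. Compare it against y_test, the labels for the same 248 rows, not against the full label array y.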
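
Since the question also asks about adjusted R-squared, here is a minimal sketch of computing both metrics on the test split. It runs on synthetic data of a similar shape, because the original df is not available; the random features, the target formula, and the names rmse, r2 and adj_r2 are assumptions for illustration only. Adjusted R-squared is computed with the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of test rows and p the number of features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (assumption): 1239 rows, 28 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1239, 28))
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1239)  # 1-D target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=1)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Both arrays must describe the same rows, so check the shapes first.
print(y_predict.shape, y_test.shape)

# RMSE on the held-out test rows.
rmse = np.sqrt(mean_squared_error(y_test, y_predict))

# Adjusted R-squared: penalize R-squared for the number of features p.
r2 = r2_score(y_test, y_predict)
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("RMSE:", rmse)
print("R2:", r2)
print("Adjusted R2:", adj_r2)
```

Printing the two shapes before calling any metric is a quick way to catch the mismatch from the question: with y in place of y_test, the lengths would be 248 and 1239.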