I have a dataset with two columns, ["A", "B"], where "A" is the input variable and "B" is the target variable. All values are integers.

My code:

X.shape
>>(2540, 1)

y.shape
>>(2540, 1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

import numpy as np
from sklearn.model_selection import train_test_split
np.random.seed(4)  # fix the seed so the random split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Linear Regression from Sklearn

from sklearn.linear_model import LinearRegression

regr = LinearRegression(fit_intercept=True)
regr.fit(X_train, y_train)

print('Coefficients: ', regr.coef_)
print('Intercept: ', regr.intercept_)
>>Coefficients:  [[43.95569425]]
>>Intercept:  [100.68681298]

I got an R2 value of 0.93.

The last record in X_train is 3687 and the corresponding y_train value is 212.220001.

I used the last record for prediction, like this:

regr.predict([[3687]] )
>>array([161825.22279211])

I do not understand what is happening. I expected the predicted value to be around 212, but the predicted value is 161825.

Could you please explain the reason? Thanks.

1) Fitting a linear regression to an arbitrary data set does not guarantee that the prediction for every data point in the training set will be reasonable. 2) To evaluate performance, you should not look at individual predictions. 3) You should evaluate on the held-out data X_test and y_test instead of a data point from X_train. 4) Is 3687 a scaled value? - Mathias Müller

1 Answer

Perhaps you need to pass the value through the same scaler before feeding it to the regression, since the model was fitted on scaled inputs. Try regr.predict(scaler.transform([[3687]])).
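
To see the effect, here is a minimal sketch on synthetic data. The data is an assumption made up for illustration (integers in a similar range to yours with a roughly linear target), not your actual dataset, so the numbers will differ from yours. It compares predicting on the raw value with predicting on the scaled value, and also shows make_pipeline as one possible way to make the scaling happen automatically.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): y is roughly 0.05 * x + 30.
rng = np.random.default_rng(0)
X = rng.integers(1000, 5000, size=(2540, 1)).astype(float)
y = 0.05 * X + 30 + rng.normal(0, 1, size=X.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Fit the scaler on the training inputs, then fit the model on scaled inputs.
scaler = StandardScaler().fit(X_train)
regr = LinearRegression().fit(scaler.transform(X_train), y_train)

raw = [[3687]]

# Raw value fed to a model that only ever saw scaled inputs: far too large.
wrong = regr.predict(raw)[0, 0]

# Same value passed through the scaler first: on the scale of the target.
right = regr.predict(scaler.transform(raw))[0, 0]

# A pipeline applies the scaler inside fit() and predict(), so raw values work.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)
piped = pipe.predict(raw)[0, 0]

print("raw input:     ", wrong)
print("scaled input:  ", right)
print("pipeline:      ", piped)
print("R2 on test set:", pipe.score(X_test, y_test))
```

The model's coefficient is expressed per standard deviation of the input, so multiplying it by an unscaled value like 3687 inflates the result by several orders of magnitude, which matches the kind of output you are seeing.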