0
votes

I am trying to build a house price prediction model with sklearn linear regression and I am getting a negative score.

What am I doing wrong?

Dataset: this is the dataset

[Screenshot of the dataset]

Details are below:

Shape of dataframe: (23435, 190)

Code:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import ShuffleSplit
    from sklearn.model_selection import cross_val_score

    properties_five = pd.read_csv('house_test.csv')

    X = properties_five.drop('price', axis='columns')
    y = properties_five['price']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

    lr_clf = LinearRegression()
    lr_clf.fit(X_train, y_train)
    print(lr_clf.score(X_train, y_train))
    print(lr_clf.score(X_test, y_test))

    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

    print(cross_val_score(LinearRegression(), X, y, cv=cv))

Score on training data: 0.0025884591059242013

Score on test data: -1.6566338615525985e+24

3
Please share your code output - gtomer
thanks, I have updated my question with this - biggest_boy

3 Answers

2
votes

Your code seems fine - except the line df = pd.read_csv('house_test.csv') should probably be properties_five = pd.read_csv('house_test.csv') to match the next lines.

When I run it on this data set, I get the following output:

0.7307587542204755
0.465770160153375
[0.64358885 0.67211318 0.67817097 0.53631898 0.67390831]

Perhaps linear regression simply performs poorly on your data set, or else your data set contains errors. A negative R² score means the model does worse than "constant regression", that is, always predicting the mean of y.
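To make the "constant regression" comparison concrete, here is a minimal sketch of the R² formula, 1 - SS_res / SS_tot, in plain Python. The helper name `r_squared` and all of the numbers are made up for illustration; they are not taken from the asker's data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical prices, invented for this example only.
y_true = [100.0, 150.0, 200.0, 250.0, 300.0]

# "Constant regression": always predict the mean of y.
mean_y = sum(y_true) / len(y_true)
constant_pred = [mean_y] * len(y_true)

# Predictions that are further from the truth than the mean is.
bad_pred = [1000.0, -500.0, 2000.0, 0.0, 900.0]

r2_constant = r_squared(y_true, constant_pred)
r2_bad = r_squared(y_true, bad_pred)

print(r2_constant)  # 0.0: the constant predictor is the baseline
print(r2_bad)       # negative: worse than always predicting the mean
```

The score drops below zero as soon as the squared error of the predictions exceeds the squared error of the mean predictor, and it has no lower bound, which is how a value as extreme as the one in the question can arise.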

0
votes

Please share your outputs. Also, linear regression is sensitive to outliers, so you should standardize the numerical variables.

0
votes

You read the file into a variable named df, so in the lines that follow you should replace properties_five with df. Also try standardizing/normalizing the dataset; I hope that will help reduce the error. For example, you can find details here.
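One way to try the standardization suggested in these answers is to put a scaler and the regression into a single scikit-learn pipeline, so the scaling is learned on the training split only. This is a minimal sketch on synthetic data: the `area` and `rooms` features and their coefficients are assumptions made up for the example, not columns from house_test.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the house data (invented; not the asker's file).
rng = np.random.default_rng(0)
area = rng.uniform(50, 300, size=500)   # hypothetical feature on a large scale
rooms = rng.integers(1, 7, size=500)    # hypothetical feature on a small scale
X = np.column_stack([area, rooms])
y = 1000 * area + 5000 * rooms + rng.normal(0, 10000, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)

# The scaler is fitted on the training split and reused on the test split.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score, test_score)
```

With the asker's data, X and y would instead come from the DataFrame as in the question. Whether scaling alone fixes a test score as extreme as the one reported depends on the data, so treat this as a sketch of the suggestion, not a guaranteed remedy.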