Linear regression in scikit-learn

Question

I started learning machine learning in Python using pandas and scikit-learn. I tried to use the LinearRegression().fit method:

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
house_data = pd.read_csv(r"C:\Users\yassine\Desktop\ml\OC-tp-ML\house_data.csv")
y = house_data[["price"]] 
x = house_data[["surface","arrondissement"]] 
X = house_data.iloc[:, 1:3].values  
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
model = LinearRegression()
model.fit(x_train, y_train)

When I run the code, I get this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Can you help me, please?

The error tells you the problem: you have NaN values, infinite values, or extremely large values that scikit-learn can't handle. Check for rows containing NaN in your data and try to remove them. - G. Anderson
I got this:
house_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 827 entries, 0 to 826
Data columns (total 3 columns):
price             827 non-null int64
surface           822 non-null float64
arrondissement    822 non-null float64
dtypes: float64(2), int64(1)
memory usage: 19.5 KB
- Yass Abbah
Please do not use the comments space for posting code and results; edit and update your post instead. - desertnaut

Charles Landau · Accepted Answer · 2018-12-13T16:20:43

Many machine learning models, including scikit-learn's LinearRegression, cannot handle missing values, so you may need to impute the data as part of your data cleaning process. The values you fill in affect the fitted predictions (ŷ), so I usually start by imputing the mean. If you aren't comfortable imputing the missing data, you can drop the observations that contain NaN, provided they make up only a small proportion of your data.

Imputing the mean can look like this:

df = df.fillna(df.mean())
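
As a rough illustration of what that one-liner does, here is a minimal sketch on a toy frame. The column names are borrowed from the question, but the values are made up, and whether a mean is meaningful for a district number like arrondissement is a separate modelling question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for house_data; the values are invented for illustration.
df = pd.DataFrame({
    "price": [1000, 1500, 2000, 2500],
    "surface": [20.0, np.nan, 40.0, 60.0],
    "arrondissement": [1.0, 2.0, np.nan, 3.0],
})

# Count missing values per column before imputing.
print(df.isna().sum())

# df.mean() returns one mean per column; fillna matches them by column name,
# so each NaN is replaced by the mean of its own column.
df_imputed = df.fillna(df.mean())
print(df_imputed)
```

Each column is filled independently, so a gap in surface never receives a value computed from arrondissement.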

Imputing to zero can look like this:

df = df.fillna(0)

Imputing a custom value can look like this (my_func and args are placeholders for your own function and its arguments):

df = df.fillna(my_func(args))

Dropping altogether can look like:

df = df.dropna()
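
If you drop rows, it is probably safest to do so before selecting the features and the target, so that both come from the same cleaned frame and stay aligned. A minimal sketch, again with made-up values and the column names from the question:

```python
import numpy as np
import pandas as pd

# Toy stand-in for house_data; the values are invented for illustration.
house_data = pd.DataFrame({
    "price": [1000, 1500, 2000, 2500, 3000],
    "surface": [20.0, np.nan, 40.0, 60.0, 80.0],
    "arrondissement": [1.0, 2.0, np.nan, 3.0, 4.0],
})

# Drop every row that has a NaN in any column the model will use.
clean = house_data.dropna(subset=["price", "surface", "arrondissement"])

# Select features and target from the same cleaned frame so rows stay aligned.
X = clean[["surface", "arrondissement"]]
y = clean["price"]

print(len(house_data), "rows before,", len(clean), "rows after")
```

Dropping NaN from the features alone would leave the target with extra rows, and the fit would then fail on mismatched lengths.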

To make sure infinite values are also caught by these methods, convert them to NaN first. Note that replace returns a new DataFrame, so assign the result:

df = df.replace([np.inf, -np.inf], np.nan)
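
Putting the steps together, a cleaned-up version of the workflow from the question might look like the sketch below. The data is synthetic (generated with a fixed seed, with one NaN and one inf injected on purpose), and dropping rows is just one of the options above; swap in fillna if you prefer to impute.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for house_data.csv; the relationship is invented.
rng = np.random.default_rng(0)
n = 100
surface = rng.uniform(20, 120, size=n)
arrondissement = rng.integers(1, 5, size=n).astype(float)
price = 30 * surface + 100 * arrondissement + rng.normal(0, 50, size=n)
house_data = pd.DataFrame(
    {"price": price, "surface": surface, "arrondissement": arrondissement}
)

# Inject the kinds of values that trigger the ValueError.
house_data.loc[3, "surface"] = np.nan
house_data.loc[7, "arrondissement"] = np.inf

# Turn inf into NaN, then drop every row that still has a NaN.
clean = house_data.replace([np.inf, -np.inf], np.nan).dropna()

X = clean[["surface", "arrondissement"]]
y = clean["price"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

model = LinearRegression()
model.fit(x_train, y_train)  # no ValueError: the inputs are finite
print(model.coef_, model.intercept_)
```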
