I started learning machine learning in Python using pandas and scikit-learn. I tried to use the LinearRegression().fit method:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

house_data = pd.read_csv(r"C:\Users\yassine\Desktop\ml\OC-tp-ML\house_data.csv")
y = house_data[["price"]]
x = house_data[["surface", "arrondissement"]]
X = house_data.iloc[:, 1:3].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)
model = LinearRegression()
model.fit(x_train, y_train)

When I run the code, I get this message:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Can you help me, please?

The error tells you the problem: you have NaN values, infinite values, or extremely large values that scikit-learn can't handle. Check for NaN rows in your data and try to remove them. – G. Anderson
Use house_data.info() to check for null values. – BENY
I got this from house_data.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 827 entries, 0 to 826
Data columns (total 3 columns):
price             827 non-null int64
surface           822 non-null float64
arrondissement    822 non-null float64
dtypes: float64(2), int64(1)
memory usage: 19.5 KB
– Yass Abbah
Please do not use the comments space for posting code & results. Edit & update your post instead. – desertnaut

1 Answer

Machine learning models may require you to impute missing data as part of your data cleaning process. In linear regression, whatever you fill the gaps with feeds directly into the fitted predictions (ŷ), so I usually start by imputing the mean. If you aren't comfortable imputing the missing data, you can drop the observations that contain NaN, provided they make up only a small proportion of your data.

Imputing the mean can look like this:

df = df.fillna(df.mean())

Imputing to zero can look like this:

df = df.fillna(0)

Imputing to a custom result can look like:

df = df.fillna(my_func(args))

Dropping altogether can look like:

df = df.dropna()
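
Before dropping, it can help to check how many rows are actually affected. The following is a minimal sketch on hypothetical toy data that only mirrors the question's column names (the values are made up, not taken from house_data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the question's columns (not the real file).
df = pd.DataFrame({
    "price": [1000, 1500, 2000, 2500],
    "surface": [30.0, np.nan, 55.0, 70.0],
    "arrondissement": [1.0, 2.0, np.nan, 4.0],
})

# Number of missing values in each column.
nan_per_column = df.isna().sum()

# Share of rows that contain at least one NaN.
nan_row_share = df.isna().any(axis=1).mean()

# Drop the affected rows; only sensible if the share above is small.
cleaned = df.dropna()
```

What counts as "small" is a judgment call for your dataset; the share is only there to inform that decision.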

To make sure inf values are caught by these methods, convert them to NaN first. Note that replace returns a new DataFrame, so assign the result:

df = df.replace([np.inf, -np.inf], np.nan)
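
Putting the pieces together for the code in the question might look roughly like the sketch below. The data here is a hypothetical stand-in for house_data.csv (same column names, invented values), and mean imputation is just one of the options listed above:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for house_data.csv (invented values).
house_data = pd.DataFrame({
    "price": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500],
    "surface": [30.0, 45.0, np.nan, 70.0, 85.0, np.inf, 110.0, 125.0],
    "arrondissement": [1.0, 2.0, 3.0, np.nan, 1.0, 2.0, 3.0, 4.0],
})

# Convert infinities to NaN so that fillna/dropna can handle them.
house_data = house_data.replace([np.inf, -np.inf], np.nan)

# Impute each column's mean into its missing entries.
house_data = house_data.fillna(house_data.mean())

x = house_data[["surface", "arrondissement"]]
y = house_data["price"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1
)

# With no NaN or inf left, fit no longer raises the ValueError.
model = LinearRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
```

One caveat: imputing before the split lets the test rows influence the imputed mean. If that matters for your evaluation, compute the mean on x_train only and apply it to both sets.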