ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). Why?

Question

I have gone through the similar questions, but none of them answers my problem. I am using a random forest classifier as follows:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
clf.predict(X_test)

It's giving me this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

However, when I run X_train.describe() I don't see any missing values. In fact, I took care of the missing values before splitting my data.

When I do the following:

np.where(X_train.values >= np.finfo(np.float32).max)

I get:

(array([], dtype=int64), array([], dtype=int64))

And for these commands:

np.any(np.isnan(X_train))  # True
np.all(np.isfinite(X_train))  # False

And after getting the above results, I also tried this:

X_train.fillna(X_train.mean())

but I still get the same error.

Please tell me where I'm going wrong. Thank you!

SkippyElvis · Accepted Answer · 2019-07-19T18:58:13

Solution
X_train = X_train.fillna(X_train.mean())

Explanation
np.any(np.isnan(X_train)) evaluates to True, so X_train contains at least one NaN value. Per the pandas documentation, DataFrame.fillna() returns a new DataFrame with the missing values filled (unless inplace=True is passed) and leaves the original unchanged. You must therefore assign the return value back to X_train: X_train = X_train.fillna(X_train.mean())
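
A minimal sketch of the difference, assuming a small made-up frame that stands in for X_train (the column names and values are hypothetical): the bare call leaves the NaN values in place, while the reassigned call removes them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train with one missing value per column.
X_train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# fillna() returns a new DataFrame; here the result is thrown away,
# so X_train itself is unchanged.
X_train.fillna(X_train.mean())
still_has_nan = bool(X_train.isna().any().any())

# Reassigning keeps the filled copy.
X_train = X_train.fillna(X_train.mean())
has_nan_after = bool(X_train.isna().any().any())

print(still_has_nan, has_nan_after)
```

Checking X_train.isna().any().any() right before clf.fit() is a quick way to confirm that the fill actually took effect.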

Example

>>> import pandas as pd
>>> import numpy as np
>>> 
>>> a = pd.DataFrame(np.arange(25).reshape(5, 5))
>>> a.loc[2, 2] = np.nan  # label-based assignment instead of chained indexing
>>> 
>>> a
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   NaN  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>> 
>>> a.fillna(1)
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   1.0  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>> 
>>> a
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   NaN  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>> 
>>> a = a.fillna(1)
>>> a
    0   1     2   3   4
0   0   1   2.0   3   4
1   5   6   7.0   8   9
2  10  11   1.0  13  14
3  15  16  17.0  18  19
4  20  21  22.0  23  24
>>>
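
The error message also mentions infinity, which fillna() does not treat as missing. If the error persists after the reassignment, a sketch along these lines may help to find and handle infinite values as well. The frame X and its contents are hypothetical, and replacing infinities with the column mean is only one possible choice, not something the question prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN and one infinite value.
X = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, np.inf, 6.0]})

# Count problem entries per column to see where they are.
nan_counts = X.isna().sum()
inf_counts = np.isinf(X).sum()

# Convert +/-inf to NaN first so that a single fillna() handles both,
# then reassign the result as in the answer above.
X = X.replace([np.inf, -np.inf], np.nan)
X = X.fillna(X.mean())

all_finite = bool(np.isfinite(X.to_numpy()).all())
print(nan_counts.to_dict(), inf_counts.to_dict(), all_finite)
```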
