3
votes

First, I have checked the other posts concerning this error, and none of them solves my issue.

I am using RandomForest, and I am able to build the forest and make predictions, but sometimes, while the forest is being built, I get the following error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

The error occurs with the same dataset: sometimes training fails, and most of the time it does not. When it does fail, it happens sometimes at the start and sometimes in the middle of training.

Here's my code:

import pandas as pd
from sklearn import ensemble
import numpy as np

def azureml_main(dataframe1 = None, dataframe2 = None):

    # Execution logic goes here

    Input = dataframe1.values
    InputData = Input[:, :15]    # columns 0-14 are the inputs
    InputTarget = Input[:, 16:]  # columns from 16 onwards are the targets

    limitTrain = 2175

    clf = ensemble.RandomForestClassifier(n_estimators=10000, n_jobs=4)

    # Keep 10 of the 15 input columns as features
    features = np.empty([len(InputData), 10])
    j = 0
    for i in range(0, 14):
        if i in (1, 4, 5, 6, 8, 9, 10, 11, 13, 14):
            features[:, j] = InputData[:, i]
            j += 1
    # NOTE: range(0, 14) stops at i == 13, so the i == 14 case never runs;
    # features[:, 9] is never assigned and keeps whatever bytes np.empty()
    # happened to allocate.

    clf.fit(features[:limitTrain, :],
            np.asarray(InputTarget[:limitTrain, 1], dtype=np.float32))

    res = clf.predict_proba(features[limitTrain+1:, :])

    # Collapse the class probabilities into a single label in column 4
    listreu = np.empty([len(res), 5])
    for i in range(len(res)):
        if res[i, 0] > 0.5:
            listreu[i, 4] = 0
        elif res[i, 1] > 0.5:
            listreu[i, 4] = 1
        elif res[i, 2] > 0.5:
            listreu[i, 4] = 2
        else:
            listreu[i, 4] = 3

    listreu[:, 0] = features[limitTrain+1:, 0]
    listreu[:, 1] = InputData[limitTrain+1:, 2]
    listreu[:, 2] = InputData[limitTrain+1:, 3]
    listreu[:, 3] = features[limitTrain+1:, 1]

    # Return value must be of a sequence of pandas.DataFrame
    return pd.DataFrame(listreu),

I ran my code locally and on Azure ML Studio; the error occurs in both cases.

I am sure it is not due to my dataset, since most of the time I don't get the error, and I generate the dataset myself from different inputs.

This is a part of the dataset I use:

EDIT: I may have found the problem: I had 0 values which were not real zeros. The values were something like

3.0x10^-314
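
A check along these lines (a minimal sketch over the features array built above) can flag such cells before fitting, since unwritten np.empty cells can hold denormals like this but also NaN or infinity:

import numpy as np

# Smallest normal float32 (~1.18e-38): any non-zero value below it is denormal
tiny = np.finfo(np.float32).tiny

# Flag non-finite cells and suspiciously tiny non-zero cells
suspicious = ~np.isfinite(features) | ((features != 0) & (np.abs(features) < tiny))
print("suspicious cells:", np.argwhere(suspicious))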

4
Is it possible for you to share the data and the complete code? If yes, please do. Also, please check that you are using the latest versions of all libraries. – Vivek Kumar
@VivekKumar I added a part of my dataset, and the code in the question is all the code I use. Locally I use the latest version of scikit-learn, numpy 14.4.4 instead of 14.4.5, and I do not use pandas. In Azure ML Studio, Microsoft manages the environment; it uses Anaconda 4.0 / Python 3.5. – Thomas

4 Answers

3
votes

I would presume that somewhere in your dataframe you sometimes have NaN values.

These can simply be removed using:

dataframe1 = dataframe1.dropna()

However, with this approach you could be losing some valuable training data, so it may be worth looking into .fillna() or sklearn.preprocessing.Imputer to fill in values for the NaN cells in the dataframe.
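
For example, a minimal sketch of both options, pick one (assuming the NaN cells sit in numeric columns; Imputer was scikit-learn's imputation class at the time, later replaced by SimpleImputer):

import pandas as pd
from sklearn.preprocessing import Imputer

# Option 1: fill every NaN cell with a constant, here 0
dataframe1 = dataframe1.fillna(0)

# Option 2: replace each NaN with the mean of its column
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
dataframe1 = pd.DataFrame(imputer.fit_transform(dataframe1.values),
                          columns=dataframe1.columns)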

Without seeing the source of dataframe1 it is hard to give a complete answer, but it is possible that some sort of train/test split is going on, so that the dataframe being passed in only contains NaN values some of the time.

0
votes

Since I corrected the problem described in my edit, I get no more errors. I just replaced the 3.0x10^-314 values with zeros.
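
For reference, a minimal sketch of that replacement (assuming features is the float array fed to clf.fit; the threshold is the smallest normal float32, so real zeros are unaffected):

import numpy as np

# Zero out denormal leftovers such as 3.0e-314
tiny = np.finfo(np.float32).tiny  # ~1.18e-38
features[np.abs(features) < tiny] = 0.0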

-1
votes

Some time ago I got unstable errors when I used an explicit number of CPUs in a parameter, such as your n_jobs = 4. Try not using n_jobs at all, or use n_jobs = -1 for automatic CPU count detection. Maybe it will help.
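
Applied to the classifier from the question, the suggestion would look like this (a sketch of the idea, not a confirmed fix):

from sklearn import ensemble

# Let scikit-learn/joblib detect the CPU count instead of hard-coding it
clf = ensemble.RandomForestClassifier(n_estimators=10000, n_jobs=-1)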

-1
votes

Try to use float64 instead of float32.

EDIT:

  • Show us the dataset that did it
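
Applied to the question's code, the cast would look roughly like this (a sketch; whether it helps depends on where the bad values come from, since scikit-learn's tree estimators convert input to float32 internally):

import numpy as np

# Cast features and target to float64 before fitting
X = features[:limitTrain, :].astype(np.float64)
y = np.asarray(InputTarget[:limitTrain, 1], dtype=np.float64)
clf.fit(X, y)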