I'm trying to classify using sklearn's decision tree classifier. I've stored my training and testing datasets in two separate pandas dataframes. I'm calling the classifier like so:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(features_in_training_set, class_labels_in_training_set)
predictions = classifier.predict(features_in_testing_set)

However, I'm receiving this error, which seems to be common when classifying with the tree.

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I know that there are no missing values in either dataset, because I fill them in using the imputer method. The printout of my dataframes shows this, but to double check I've also tried df.isna(), and the outputs are all False. I don't think I have infinite values, as the frames consist of binary values. I don't want to remove rows or columns, as I don't want to reduce my dataset. I also don't want to replace the values based on any other criteria.

I'm not quite sure how to find which columns are too large for dtype float32, or how to change them if they are. I have a feeling that it could be my timestamp column. Here's a snippet of the training dataframe, as it's quite large:

             time      A     B 
0       1.518999e+09   1     1
1       1.518999e+09   1     0 
2       1.518999e+09   0     1
3       1.518999e+09   0     0

This answer might help: stackoverflow.com/a/45745154/8345749 - Joe Patten
Change the scale of your time data to log10; at a time scale this high you are essentially using a constant value. Log10 should help improve this. - d_kennetz
The largest float32 is 3.4028234664e38 (as available from Wikipedia, cppreference, etc.). - Davis Herring
If you think that time column is causing the error, you can leave that out of training features as a test. Otherwise, you need to show some data. - Vivek Kumar
@JoePatten this shows me how to delete the rows that include NaN or infinite numbers. I specifically said that I didn't want to do that. - user3755632

1 Answer

Try checking the summary of the dataframe. For example, if the summary shows missing samples for any feature, it means there are null values in some of the observations.
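One way to run that summary check is sketched below. The small frame is a made-up stand-in for the training data (the column names follow the question's snippet, and the NaN is planted for illustration). Counting missing values per column may be more reliable than reading the printed output of df.isna(), since pandas truncates the display of large frames.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training frame; one NaN is planted in "time".
df = pd.DataFrame({
    "time": [1.518999e9, 1.518999e9, np.nan, 1.518999e9],
    "A": [1, 1, 0, 0],
    "B": [1, 0, 1, 0],
})

# Non-null counts per column; a count below len(df) means missing values.
df.info()

# Number of missing values in each column, independent of display truncation.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value.
print(missing_per_column[missing_per_column > 0].index.tolist())
```

The same check can be repeated on the testing dataframe, since predict validates its input as well.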

It is also possible that a divide-by-null scenario is being encountered somewhere.
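A division by zero typically leaves an infinite value behind, and df.isna() does not flag infinities. The sketch below is one possible way to locate the offending columns without dropping or replacing anything. The helper name report_bad_columns and the sample frame are made up for illustration, and the check assumes the features are numeric.

```python
import numpy as np
import pandas as pd

def report_bad_columns(df):
    """Count NaN, infinite, and float32-overflowing entries per numeric column."""
    numeric = df.select_dtypes(include=[np.number]).astype(np.float64)
    float32_max = np.finfo(np.float32).max  # about 3.4e38
    values = numeric.to_numpy()
    report = pd.DataFrame({
        "nan": np.isnan(values).sum(axis=0),
        "inf": np.isinf(values).sum(axis=0),
        # Finite values whose magnitude cannot be represented as float32.
        "too_large": (np.isfinite(values) & (np.abs(values) > float32_max)).sum(axis=0),
    }, index=numeric.columns)
    # Keep only the columns that have at least one problem.
    return report[report.sum(axis=1) > 0]

# Hypothetical data: "A" holds an infinity, "B" a value above the float32 maximum.
df = pd.DataFrame({
    "time": [1.518999e9, 1.518999e9, 1.518999e9],
    "A": [1.0, np.inf, 0.0],
    "B": [1.0, 0.0, 1e39],
})
print(report_bad_columns(df))
```

Note that timestamps around 1.5e9 are far below the float32 maximum, so in this sketch the time column is not reported. If the real data behaves the same way, the error more likely comes from a NaN or an infinity in some other column or in the testing set.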