I'm trying to classify using sklearn's decision tree classifier. I've stored my training and testing datasets in two separate pandas dataframes. I'm calling the classifier like so:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(features_in_training_set, class_labels_in_training_set)
predictions = classifier.predict(features_in_testing_set)
However, I'm receiving this error, which seems to be common when classifying with the tree.
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I know that there are no missing values in either dataset, because I fill them in using the imputer. Printing the dataframes confirms this, but to double check I've also tried df.isna() and the outputs are all False. I don't think I have infinite values, as the frames consist of binary values. I don't want to remove rows or columns, as I don't want to reduce my dataset. I also don't want to replace values based on any other criteria.
I'm not quite sure how to find which columns hold values too large for dtype float32, or how to change them if they do. I have a feeling that it could be my timestamp column. Here's a snippet of the training dataframe, as it's quite large:
time A B
0 1.518999e+09 1 1
1 1.518999e+09 1 0
2 1.518999e+09 0 1
3 1.518999e+09 0 0
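One way to check each column for the three conditions named in the error message is sketched below. It is only a sketch: the frame `df` is a hypothetical stand-in built from the snippet above, and `summarize_problem_values` is an illustrative helper name, not part of pandas or sklearn. It only inspects numeric columns, so object-dtype columns would need a separate look.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training frame, built from the snippet above.
df = pd.DataFrame({
    "time": [1.518999e+09, 1.518999e+09, 1.518999e+09, 1.518999e+09],
    "A": [1, 1, 0, 0],
    "B": [1, 0, 1, 0],
})

def summarize_problem_values(frame):
    """Count NaN, infinite, and float32-overflowing values per numeric column."""
    # Work in float64 so large finite values are not lost before the comparison.
    numeric = frame.select_dtypes(include=[np.number]).astype(np.float64)
    float32_max = np.finfo(np.float32).max
    is_inf = np.isinf(numeric)
    return pd.DataFrame({
        "nan": numeric.isna().sum(),
        "inf": is_inf.sum(),
        # Finite values whose magnitude exceeds the float32 range.
        "too_large": ((numeric.abs() > float32_max) & ~is_inf).sum(),
    })

report = summarize_problem_values(df)
print(report)
```

Any column with a nonzero count in this report would be a candidate for the error. For the four rows shown here every count is zero, so if the full frame gives the same result the problem may lie in the test set or in a column that is not numeric.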
The maximum value of float32 is 3.4028234664e38 (as available from Wikipedia, cppreference, etc.). – Davis Herring
If the time column is causing the error, you can leave that out of training features as a test. Otherwise, you need to show some data. – Vivek Kumar
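The test suggested in the last comment could look roughly like this. It is a minimal sketch: the feature values come from the snippet in the question, but the `labels` series is an invented placeholder, since the question does not show the class labels.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Feature values copied from the snippet in the question.
train = pd.DataFrame({
    "time": [1.518999e+09, 1.518999e+09, 1.518999e+09, 1.518999e+09],
    "A": [1, 1, 0, 0],
    "B": [1, 0, 1, 0],
})
# Placeholder labels, invented for illustration only.
labels = pd.Series([1, 0, 1, 0])

# Leave the time column out of the features and fit on the rest.
features_without_time = train.drop(columns=["time"])

classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(features_without_time, labels)
predictions = classifier.predict(features_without_time)
print(predictions)
```

If fitting and predicting succeed on the real data once the time column is dropped from both the training and testing features, that column is the likely source of the ValueError. If the error persists, the cause is in one of the remaining columns.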