I wrote a Python program that uses a machine learning algorithm to make predictions on data. I use the RandomForestClassifier class from scikit-learn to build a random forest for the predictions.
The purpose of the program is to predict whether an unknown astrophysical source is a pulsar or an AGN. It trains the forest on known data, in which each source is labelled as pulsar or AGN, and then makes predictions on unknown data, but it doesn't work. The program predicts that the unknown sources are all pulsars or all AGN; it rarely predicts anything different, and when it does, the result is not correct.
Below I describe the steps of my program.
It creates a data frame, all_df, with the data for all the sources. It has ten columns, nine used as predictors and one as the target:
predictors=all_df[['spec_index','variab_index','flux_density','unc_ene_flux100','sign_curve','h_ratio_12','h_ratio_23','h_ratio_34','h_ratio_45']]
targets=all_df['type']
The type column contains the label “pulsar” or “agn” for each source.
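Since the forest ends up predicting almost only one class, one quick check is how many sources of each class the type column holds. A minimal sketch, assuming a small made-up stand-in for all_df (the real values are not shown here):

```python
import pandas as pd

# Hypothetical stand-in for all_df; the values are invented for illustration.
all_df = pd.DataFrame({
    "spec_index": [2.1, 2.4, 1.9, 2.2, 2.6, 1.2],
    "type": ["agn", "agn", "agn", "agn", "agn", "pulsar"],
})

# Fraction of rows belonging to each class in the target column,
# sorted from the most to the least frequent class.
class_fractions = all_df["type"].value_counts(normalize=True)
print(class_fractions)
```

If one label dominates strongly, a classifier can reach a high accuracy by predicting that label almost everywhere, so this number is worth knowing before reading any score.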
The predictors and targets are used later in the program to train the forest.
The program splits the predictors and the targets into two sets, a training set with 70% of the rows of all_df and a test set with the remaining 30%, using the train_test_split function from scikit-learn:
pred_train, pred_test, tar_train, tar_test=train_test_split(predictors, targets, test_size=0.3)
The rows in these sets are shuffled, so the program resets the index of each set, without changing the order of the rows:
pred_train=pred_train.reset_index(drop=True)
pred_test=pred_test.reset_index(drop=True)
tar_train=tar_train.reset_index(drop=True)
tar_test=tar_test.reset_index(drop=True)
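The split and index reset above can be sketched on made-up data as follows. The stratify and random_state arguments are not in my program; they are added here as an assumption, to keep the pulsar/agn proportions equal in both sets and to make the split repeatable. The data values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: two of the nine predictor columns, 80 agn and 20 pulsar rows.
rng = np.random.default_rng(0)
n = 100
predictors = pd.DataFrame({
    "spec_index": rng.normal(2.0, 0.3, n),
    "variab_index": rng.normal(50.0, 10.0, n),
})
targets = pd.Series(["agn"] * 80 + ["pulsar"] * 20, name="type")

# 70/30 split; stratify keeps the class proportions the same in both sets.
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.3, stratify=targets, random_state=0
)

# Renumber the rows 0..N-1 without changing their (shuffled) order.
pred_train = pred_train.reset_index(drop=True)
pred_test = pred_test.reset_index(drop=True)
tar_train = tar_train.reset_index(drop=True)
tar_test = tar_test.reset_index(drop=True)

print(tar_train.value_counts(normalize=True))
print(tar_test.value_counts(normalize=True))
```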
After that, the program creates and trains the random forest:
clf=RandomForestClassifier(n_estimators=1000,oob_score=True,max_features=None,max_depth=None,criterion='gini')#,random_state=1)
clf=clf.fit(pred_train,tar_train)
Now the program makes predictions on the test set:
predictions=clf.predict(pred_test)
At this point, the program seems to work.
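A minimal sketch of how "seems to work" could be checked per class on the test set, using a confusion matrix, a classification report and the out-of-bag score. The data here is synthetic and invented for illustration, and the forest is smaller than mine (100 trees) only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data (invented): 150 "agn" and 50 "pulsar" rows.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y = np.array(["agn"] * 150 + ["pulsar"] * 50)

pred_train, pred_test, tar_train, tar_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(pred_train, tar_train)
predictions = clf.predict(pred_test)

# Rows are the true classes, columns the predicted classes, in this order.
labels = ["agn", "pulsar"]
cm = confusion_matrix(tar_test, predictions, labels=labels)
print(cm)

# Precision and recall per class, which plain accuracy can hide.
print(classification_report(tar_test, predictions, labels=labels))

# Out-of-bag accuracy estimated from the training rows each tree did not see.
print("OOB score:", clf.oob_score_)
```

The off-diagonal entries of the matrix show how many sources of each class are misclassified, which a single accuracy value does not reveal.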
Now it passes another data frame, containing the unknown data, to the forest created above, and I get the bad result described before. Can you help me? The problem could be an offset in RandomForestClassifier, but changing the RandomForestClassifier options produced no significant improvement. If needed, I can give further explanations. Thanks in advance.
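To make the step with the unknown data concrete, here is a rough sketch of passing a second data frame to a trained forest. The name unknown_df, the source_name column and all values are hypothetical, and only three of the nine predictors are used. It selects the predictor columns by name, so that they arrive in the same order as in training, and prints summary statistics of both sets side by side. Whether a column or scale mismatch is the cause in my case is only a guess.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Subset of the predictor names, for brevity.
predictor_columns = ["spec_index", "variab_index", "flux_density"]

# Invented training data with a simple rule for the label.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(120, 3)), columns=predictor_columns)
train_df["type"] = np.where(train_df["spec_index"] > 0, "pulsar", "agn")

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_df[predictor_columns], train_df["type"])

# Hypothetical unknown data: same predictors, different column order,
# plus an extra column that must not reach the forest.
unknown_df = pd.DataFrame(rng.normal(size=(40, 3)), columns=predictor_columns)
unknown_df = unknown_df[["flux_density", "spec_index", "variab_index"]]
unknown_df["source_name"] = [f"src_{i}" for i in range(40)]

# Fail early if a predictor is missing, then select by name in training order.
missing = [c for c in predictor_columns if c not in unknown_df.columns]
if missing:
    raise ValueError(f"unknown_df lacks predictor columns: {missing}")
unknown_pred = unknown_df[predictor_columns]

unknown_predictions = pd.Series(clf.predict(unknown_pred), index=unknown_df.index)
print(unknown_predictions.value_counts())

# Side-by-side mean and standard deviation of each predictor in both sets.
summary = pd.concat({
    "train": train_df[predictor_columns].describe().loc[["mean", "std"]],
    "unknown": unknown_pred.describe().loc[["mean", "std"]],
})
print(summary)
```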
Bye, Fabio
PS: I tried cross-validation too: I split the training set into training and test subsets again, with the same proportions (0.7 and 0.3), in order to create, train and test the forest before testing it on the initial test set, and I modified the RandomForestClassifier options to try to obtain better results, but I saw no improvement.
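What the PS describes is a single extra 70/30 split. For comparison, k-fold cross-validation, which is a different technique from the one I used, can be sketched with cross_val_score. The data below is synthetic and invented, and the choice of 5 folds and of balanced accuracy as the score is an assumption made for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic two-class data (invented): 150 "agn" and 50 "pulsar" rows.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(150, 2)),
    rng.normal(3.0, 1.0, size=(50, 2)),
])
y = np.array(["agn"] * 150 + ["pulsar"] * 50)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Five folds, each keeping the class proportions of the full set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Balanced accuracy averages the recall of the two classes, so predicting
# only the majority class scores about 0.5 instead of a high value.
scores = cross_val_score(clf, X, y, cv=cv, scoring="balanced_accuracy")
print(scores.mean(), scores.std())
```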
pred_test['flux_density'].plot()etc. - Kris