I wrote a Python program that uses a machine learning algorithm to make predictions on data. I use the RandomForestClassifier class from scikit-learn to build a random forest and make predictions with it.

The purpose of the program is to predict whether an unknown astrophysical source is a pulsar or an AGN. It trains the forest on known data, in which each source is labelled as pulsar or AGN, and then makes predictions on unknown data. But it doesn't work: the program predicts that the unknown sources are all pulsars or all AGN. It rarely predicts anything different, and when it does, the result is not correct.

Below I describe the steps of my program.

It creates a data frame, all_df, with data for all the sources. It has ten columns: nine used as predictors and one as the target:

predictors=all_df[['spec_index','variab_index','flux_density','unc_ene_flux100','sign_curve','h_ratio_12','h_ratio_23','h_ratio_34','h_ratio_45']]
targets=all_df['type']

The type column contains the label “pulsar” or “agn” for each source.

The predictors and targets are used later in the program to train the forest.

The program splits the predictors and the targets into two sets, a training set (70% of all_df) and a test set (the remaining 30%), using the train_test_split function from scikit-learn:

pred_train, pred_test, tar_train, tar_test=train_test_split(predictors, targets, test_size=0.3)

The rows in these sets are shuffled, so the program resets the index of each set, without changing the order of the rows:

pred_train=pred_train.reset_index(drop=True)
pred_test=pred_test.reset_index(drop=True)
tar_train=tar_train.reset_index(drop=True)
tar_test=tar_test.reset_index(drop=True)
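
A minimal sketch of the split step, for anyone who wants to reproduce it. The data frame here is a synthetic stand-in with made-up values and only two of the nine predictors, and the import path assumes a recent scikit-learn (`sklearn.model_selection`). The `stratify` and `random_state` arguments are not in the original call; they are added on the assumption that the two classes may be unbalanced and that a reproducible split is wanted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for all_df (hypothetical values, not the real catalogue).
rng = np.random.default_rng(0)
n = 200
all_df = pd.DataFrame({
    "spec_index": rng.normal(2.0, 0.3, n),
    "variab_index": rng.lognormal(3.0, 1.0, n),
    "type": np.where(rng.random(n) < 0.2, "pulsar", "agn"),
})
predictors = all_df[["spec_index", "variab_index"]]
targets = all_df["type"]

# stratify=targets keeps the pulsar/agn ratio the same in both subsets;
# random_state makes the split reproducible between runs.
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=0.3, stratify=targets, random_state=0
)

# Renumber the rows 0..n-1; the order of the rows is left as it is.
pred_train = pred_train.reset_index(drop=True)
pred_test = pred_test.reset_index(drop=True)
tar_train = tar_train.reset_index(drop=True)
tar_test = tar_test.reset_index(drop=True)

# The class fractions in the two subsets should be nearly identical.
train_frac = tar_train.value_counts(normalize=True)
test_frac = tar_test.value_counts(normalize=True)
print(train_frac)
print(test_frac)
```

Without `stratify`, a random split of a small or unbalanced catalogue can leave one subset with noticeably fewer sources of the rarer class.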

After that, the program creates and trains the random forest:

clf=RandomForestClassifier(n_estimators=1000,oob_score=True,max_features=None,max_depth=None,criterion='gini')#,random_state=1)
clf=clf.fit(pred_train,tar_train)

Now the program makes predictions on the test set:

predictions=clf.predict(pred_test)    

At this point, the program seems to work.
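
To put a number on "seems to work", the test predictions can be compared with the known labels. The sketch below assumes synthetic data (two made-up predictors with an artificial class dependence) and fewer trees than in the question to keep it quick. It prints a confusion matrix, a per-class report, the out-of-bag score and the count of each predicted class, which shows whether the forest is already favouring one class on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical values): the predictors are shifted
# for pulsars so that there is something for the forest to learn.
rng = np.random.default_rng(1)
n = 300
labels = np.where(rng.random(n) < 0.3, "pulsar", "agn")
is_pulsar = labels == "pulsar"
X = pd.DataFrame({
    "spec_index": rng.normal(2.2, 0.3, n) - 0.6 * is_pulsar,
    "sign_curve": rng.normal(1.0, 1.0, n) + 3.0 * is_pulsar,
})
y = pd.Series(labels, name="type")

pred_train, pred_test, tar_train, tar_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
clf.fit(pred_train, tar_train)
predictions = clf.predict(pred_test)

# Rows are the true classes, columns the predicted ones, ordered as in labels=.
cm = confusion_matrix(tar_test, predictions, labels=["agn", "pulsar"])
print(cm)
print(classification_report(tar_test, predictions))

# Out-of-bag accuracy, estimated from the training set alone.
print("OOB accuracy:", clf.oob_score_)

# How many sources were assigned to each class.
predicted_counts = pd.Series(predictions).value_counts()
print(predicted_counts)
```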

Now it passes another data frame, containing the unknown data, to the forest created above, and I get the bad result described earlier. Can you help me? The problem could be some kind of offset in RandomForestClassifier, but changing its options made no significant difference. If needed, I can give further explanations. Thanks in advance.

Bye, Fabio

PS: I tried a form of cross-validation too: I split the training set again into training and test subsets, with the same proportions (0.7 and 0.3), so that I could build, train and test the forest before testing it on the initial test set. I changed the RandomForestClassifier options to try to get better results, but saw no improvement.

Could it be that the distributions of the predictors in your "test" data set aren't the same as those in your "unknown" data set? I would suggest doing some exploratory analysis of these distributions before trying to fix your prediction model (which might not be the culprit here). - Kris
Just make some plots like pred_test['flux_density'].plot() etc. - Kris
As @kris suggested, analyse the response distribution, and if it's not right (the distribution in training is very different from the one in test), you can do some stratified sampling. - abhiieor
Setting both max_depth and max_features to None causes your model to overfit heavily. Why did you set these particular values? Even the default ones should be better than this. - lejlot
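
One way to check the overfitting concern is to compare the accuracy on the training set with the out-of-bag (OOB) accuracy for different settings. The sketch below uses synthetic data from `make_classification`, not the real catalogue, and the value `max_depth=5` is an arbitrary illustration. Note that max_depth=None is already the library default, so the "library defaults" row differs from the question's settings only in max_features, where the classifier default considers roughly sqrt(n_features) candidates per split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class problem with nine predictors (stand-in data).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=4, random_state=0
)

settings = {
    "question settings": dict(max_features=None, max_depth=None),
    "library defaults": dict(),
    "shallower trees": dict(max_depth=5),  # hypothetical value, for illustration
}

results = {}
for name, params in settings.items():
    clf = RandomForestClassifier(
        n_estimators=200, oob_score=True, random_state=0, **params
    )
    clf.fit(X, y)
    train_acc = clf.score(X, y)  # accuracy on the data the forest was fit on
    oob_acc = clf.oob_score_     # accuracy on samples each tree did not see
    results[name] = (train_acc, oob_acc)
    print(f"{name}: train={train_acc:.3f}  oob={oob_acc:.3f}")
```

A large gap between the two numbers suggests the trees are memorising the training set; which setting comes out best depends on the data.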

1 Answer

Thanks for answering, guys.

As suggested, I plotted the predictors in the “test” data and in the “unknown” data. The distributions are generally similar, but I would prefer to use histograms to confirm it. So I tried to make histograms, but I couldn't, for either the test or the unknown data, using:

pylab.hist(unid_df.spec_index,bins=30)

I got: TypeError: len() of unsized object

I haven't found a solution yet, and I don't know whether this error can negatively affect the predictions.
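
The plotting call is separate from the classifier, so the error itself does not change the predictions. One possible cause (an assumption, since the thread does not show the column's contents) is that the column is not stored as plain floats, for example text values or missing entries. The sketch below builds a hypothetical unid_df with such a column, converts it to numbers, drops what cannot be converted, and passes a plain NumPy array to the histogram function.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for unid_df: the column is stored as text and
# contains a missing entry and an unparseable value.
unid_df = pd.DataFrame(
    {"spec_index": ["2.1", "1.8", None, "2.4", "bad", "1.9"]}
)

# A dtype other than float here is a sign that the values are not numeric.
print(unid_df["spec_index"].dtype)

# Convert to numbers; anything that cannot be parsed becomes NaN,
# and the NaN entries are then dropped.
values = pd.to_numeric(unid_df["spec_index"], errors="coerce").dropna().to_numpy()

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=30)
ax.set_xlabel("spec_index")
ax.set_ylabel("number of sources")
# plt.show()  # uncomment in an interactive session
```

If the column already has a float dtype and no missing values, the cause lies elsewhere, possibly in the library versions in use.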

Additional information: the ranges of the various predictors differ by orders of magnitude. For corresponding predictors, the ranges in the test and unknown data are generally the same, but in a few cases the test data range is orders of magnitude larger than that of the corresponding predictor in the unknown data. This is due to a few points whose values are much larger than most of the other points in the set.
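
Because a few extreme points can dominate a min/max range, quantiles give a steadier comparison between the two sets. The sketch below uses synthetic stand-ins for pred_test and unid_df (made-up values, two predictors only). It tabulates a few quantiles side by side and runs a two-sample Kolmogorov-Smirnov test per predictor with `scipy.stats.ks_2samp`. As a side note, tree splits depend only on the ordering of values within each predictor, so different scales between predictors are not in themselves a problem for a random forest.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Synthetic stand-ins for the test and unknown predictor tables.
rng = np.random.default_rng(2)
pred_test = pd.DataFrame({
    "flux_density": rng.lognormal(-25.0, 1.0, 300),
    "spec_index": rng.normal(2.2, 0.3, 300),
})
unid_df = pd.DataFrame({
    "flux_density": rng.lognormal(-25.5, 0.8, 200),
    "spec_index": rng.normal(2.3, 0.3, 200),
})

# Quantiles are much less sensitive to a few outliers than min/max.
quantiles = [0.01, 0.25, 0.5, 0.75, 0.99]
summary = pd.concat(
    {"test": pred_test.quantile(quantiles),
     "unknown": unid_df.quantile(quantiles)},
    axis=1,
)
print(summary)

# Two-sample KS test per predictor: a small p-value suggests the two
# samples were not drawn from the same distribution.
ks_results = {}
for col in pred_test.columns:
    stat, p_value = ks_2samp(pred_test[col], unid_df[col])
    ks_results[col] = (stat, p_value)
    print(f"{col}: KS statistic={stat:.3f}, p-value={p_value:.3g}")
```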

Thanks again. Bye, Fabio