I'm learning deep learning concepts with Python and I've come quite far with my project. This open project's purpose is to detect liver cancer so that patients can avoid a biopsy and be treated sooner than usual.
I have a dataset of 427 patients for which the methylation rate of genetic markers (2687 columns) has been measured from 0 to 1 (0 = not methylated, 1 = fully methylated).
I used XGBoost and got a node graph in which the features have been renamed by XGBoost. So my first problem is that I don't know which markers are really represented by the labels in this graph. Apparently, with 3 tests (a decision tree with 6 "yes"/"no" branches, fig. a), XGBoost can determine whether a patient has liver cancer or not.
So, since I'm not very experienced and not a native English speaker, I'd like some of your advice to finish this, as far as my skills allow:
2 : Is there a simple way to turn these labels XGBoost chose back into the real markers' names, so I can test my whole model with only these 3? Unless I misunderstood what this graph shows?
3 : I got this feature importance graph (fig. b) and, again, I'd like to find a way to build the model with only the "important" markers (features). So instead of having 2680+ columns (markers) per patient, I'd need way fewer features for the same accuracy. (My model is currently 99.5% accurate.)
fig. a: decision tree nodes plotted by XGBoost
fig. b (feature importance), as a link because you'll need to zoom in: https://cdn.discordapp.com/attachments/314114979332882432/579000210760531980/features_importances.png
I have my whole notebook but I don't know how to show you the interesting parts of the code (because you'd have to import the dataset, etc.). Even the code that worked a day ago to get the shape of the feature importances (which should simply return 2687) doesn't work for me anymore: I now get "'Booster' object has no attribute 'feature_importances_'" when I run the cells, and I don't know why...
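My current guess, from searching around, is that feature_importances_ only exists on the scikit-learn wrapper (xgb.XGBClassifier), not on the low-level Booster that xgb.train() returns, and on a Booster the closest equivalent is get_score(). A rough, untested sketch of both routes, where booster, X and y are placeholder names for whatever the notebook really uses:

import xgboost as xgb

# low-level API: xgb.train() returns a Booster, which has get_score()
# but no feature_importances_ attribute
importance = booster.get_score(importance_type='gain')   # dict like {'f93': ..., 'f1754': ...}

# scikit-learn API: XGBClassifier does expose feature_importances_
clf = xgb.XGBClassifier()
clf.fit(X, y)
print(clf.feature_importances_.shape)   # should print (2687,)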
For reference, when I do
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=100, num_boost_round=100, early_stopping_rounds=10, metrics="error", as_pandas=True, seed=123)
cv_results
I get 0.0346 for the train error mean, 0.00937 for the train error std and 0.135 for the test error std.
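For completeness, cv_results is a pandas DataFrame with one row per boosting round, so I think the matching test error mean can be read from the last row, something like:

print(cv_results.tail(1))                                       # final round: train/test error mean and std
print("approx. CV accuracy:", 1 - cv_results["test-error-mean"].iloc[-1])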
At the moment I don't really have an error; I just don't know how to translate those XGBoost labels back into the corresponding features. XGBoost returns nodes named things like f1754 or f93, while my features in the dataset look like "cg000001052" (they are CpG markers, fig. c).
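From what I've read (so please correct me if I'm wrong), when the DMatrix is built from a plain NumPy array, XGBoost just names the features f0, f1, f2, ... in column order, so fN should simply be column N of the table it was given. A small untested sketch, where df stands for my full methylation DataFrame (427 x 2687, same column order as the DMatrix) and y for the labels:

import xgboost as xgb

print(df.columns[93])     # should be the CpG name behind 'f93'
print(df.columns[1754])   # and the one behind 'f1754'

# or pass the real names when building the DMatrix, so the tree plot shows them directly
data_dmatrix = xgb.DMatrix(data=df.values, label=y, feature_names=list(df.columns))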
fig. c: dataset format, showing how CpG marker names (columns) are displayed in the dataset
Then I'll build another model with only these features considered important, to see if it's still extremely accurate with thousands fewer markers.
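Roughly, my plan is: take the importance scores, sort them in descending order, keep the top few, map the f-numbers back to CpG column names, and re-run the cross-validation on just those columns. An untested sketch, again with booster, df, y and params as placeholder names:

import xgboost as xgb

scores = booster.get_score(importance_type='gain')                  # e.g. {'f93': ..., 'f1754': ...}
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
top_markers = [df.columns[int(name[1:])] for name, _ in top]        # 'f93' -> df.columns[93]

reduced_dmatrix = xgb.DMatrix(data=df[top_markers].values, label=y, feature_names=top_markers)
cv_small = xgb.cv(dtrain=reduced_dmatrix, params=params, nfold=10, num_boost_round=100,
                  early_stopping_rounds=10, metrics="error", as_pandas=True, seed=123)
print(cv_small.tail(1))    # compare with the full-feature cv_results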
If you really need some parts of the notebook, I can provide them. At the moment I'm just lost in my searches; I can't find the kind of answer I'm looking for, even though the basic idea I have is simple.
As a newbie, I'd say I noticed that f93 in the node graph is the second most important feature in the importance ranking (I displayed it in descending order! Quite a feat for me, honestly the hardest part of the project).
Now I feel close to the end. The purpose was to reduce the number of markers needed, and with these results I feel really close, but then I get lost :(
Any help is very welcome!