I'm learning deep learning concepts with Python and I've come quite far with my project. This open project's purpose is to detect liver cancer so that patients can avoid a biopsy and be treated sooner than usual.
I have a dataset of 427 patients for which the methylation rate of genetic markers (2687 columns) has been measured from 0 to 1 (0 = not methylated, 1 = fully methylated).
I used XGBoost and got a node graph in which the features have been renamed by XGBoost. So my first problem is that I don't know which markers are really represented by the labels in this graph. Apparently, with 3 tests (a decision tree with 6 "yes"/"no" branches, fig. a), XGBoost can determine whether a patient has liver cancer or not.
So, since I'm not very experienced and not a native English speaker, I'd like some of your advice to finish this, as far as my skills allow:
2 : Is there a simple way to turn these labels XGBoost chose back into the real markers' names, so I can test my whole model with only these 3? Unless I misunderstood what this graph shows?
3 : I got this feature importance graph (fig. b) and, again, I'd like to find a way to build the model with only the "important" markers (features). So instead of having 2680+ columns (markers) per patient, I'd need way fewer features for the same accuracy. (My model is currently 99.5% accurate.)
fig. a: decision tree nodes plotted by XGBoost
fig. b (feature importance), as a link because you'll need to zoom in: https://cdn.discordapp.com/attachments/314114979332882432/579000210760531980/features_importances.png
I have my whole notebook but I don't know how to show you the interesting parts of the code (because you'd have to import the dataset, etc.). Even the code that worked a day ago to get the shape of the feature importances (which should simply return 2687) doesn't work for me anymore: I now get "'Booster' object has no attribute 'feature_importances_'" when I run the cells, and I don't know why...
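My current guess, from searching around, is that feature_importances_ only exists on the scikit-learn wrapper (xgb.XGBClassifier), not on the low-level Booster that xgb.train() returns, and on a Booster the closest equivalent is get_score(). A rough, untested sketch of both routes, where booster, X and y are placeholder names for whatever the notebook really uses:

import xgboost as xgb

# low-level API: xgb.train() returns a Booster, which has get_score()
# but no feature_importances_ attribute
importance = booster.get_score(importance_type='gain')   # dict like {'f93': ..., 'f1754': ...}

# scikit-learn API: XGBClassifier does expose feature_importances_
clf = xgb.XGBClassifier()
clf.fit(X, y)
print(clf.feature_importances_.shape)   # should print (2687,)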
For reference, when I do
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=100, num_boost_round=100, early_stopping_rounds=10, metrics="error", as_pandas=True, seed=123)
cv_results
I get 0.0346 for the train error mean, 0.00937 for the train error std and 0.135 for the test error std.
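For completeness, cv_results is a pandas DataFrame with one row per boosting round, so I think the matching test error mean can be read from the last row, something like:

print(cv_results.tail(1))                                       # final round: train/test error mean and std
print("approx. CV accuracy:", 1 - cv_results["test-error-mean"].iloc[-1])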
At the moment I don't really have an error; I just don't know how to translate those XGBoost labels back into the corresponding features. XGBoost returns nodes named things like f1754 or f93, while my features in the dataset look like "cg000001052" (they are CpG markers, fig. c).
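From what I've read (so please correct me if I'm wrong), when the DMatrix is built from a plain NumPy array, XGBoost just names the features f0, f1, f2, ... in column order, so fN should simply be column N of the table it was given. A small untested sketch, where df stands for my full methylation DataFrame (427 x 2687, same column order as the DMatrix) and y for the labels:

import xgboost as xgb

print(df.columns[93])     # should be the CpG name behind 'f93'
print(df.columns[1754])   # and the one behind 'f1754'

# or pass the real names when building the DMatrix, so the tree plot shows them directly
data_dmatrix = xgb.DMatrix(data=df.values, label=y, feature_names=list(df.columns))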
fig. c: dataset format, showing how CpG marker names (columns) are displayed in the dataset
Then I'll build another model with only these features considered important, to see if it's still extremely accurate with thousands fewer markers.
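Roughly, my plan is: take the importance scores, sort them in descending order, keep the top few, map the f-numbers back to CpG column names, and re-run the cross-validation on just those columns. An untested sketch, again with booster, df, y and params as placeholder names:

import xgboost as xgb

scores = booster.get_score(importance_type='gain')                  # e.g. {'f93': ..., 'f1754': ...}
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
top_markers = [df.columns[int(name[1:])] for name, _ in top]        # 'f93' -> df.columns[93]

reduced_dmatrix = xgb.DMatrix(data=df[top_markers].values, label=y, feature_names=top_markers)
cv_small = xgb.cv(dtrain=reduced_dmatrix, params=params, nfold=10, num_boost_round=100,
                  early_stopping_rounds=10, metrics="error", as_pandas=True, seed=123)
print(cv_small.tail(1))    # compare with the full-feature cv_results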
If you really need some parts of the notebook, I can provide them. At the moment I'm just lost in my searches; I can't find the kind of answer I'm looking for, even though the basic idea I have is simple.
As a newbie, I'd say I noticed that f93 in the node graph is the second most important feature in the importance ranking (I displayed it in descending order! Quite a feat for me, honestly the hardest part of the project).
Now I feel close to the end. The purpose was to reduce the number of markers needed, and with these results I feel really close, but then I get lost :(
Any help is very welcome!