1
votes

I have a question about machine learning and decision trees. I work in computational biology (long RNA secondary structure prediction).

I have a program which predicts the accuracy of a predicted RNA secondary structure. The input arguments to the program are:

  • stem length (L) - one of 3, 4, 5, 6, 7, or 8
  • gap size (G) - one of 0, 1, 2, 3, 4, 5, 6, 7, or 8
  • chunk length (C) - one of 60, 70, 80, 90, 100, 120, 130, 140, or 150

I want to know, for a given RNA sequence of length S, which combination of L, G, and C gives the maximum accuracy.

I have a training data set of 50 sequence files with sequence lengths S, and for each of these sequence files the L, G, C input parameter combination that gives the maximum accuracy is already known.

Is there a way to know which specific L, G, and C values to use to get the maximum accuracy, without trying every value in the L, G, and C ranges?
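For scale, the exhaustive search being avoided can be sketched as follows. This is only a minimal illustration built from the ranges listed above; `score` is a hypothetical callable standing in for one run of the accuracy program on a fixed sequence, not part of the original program.

```python
from itertools import product

# Parameter ranges exactly as listed in the question.
L_VALUES = [3, 4, 5, 6, 7, 8]
G_VALUES = [0, 1, 2, 3, 4, 5, 6, 7, 8]
C_VALUES = [60, 70, 80, 90, 100, 120, 130, 140, 150]

# Every (L, G, C) triple: 6 * 9 * 9 = 486 combinations per sequence.
GRID = list(product(L_VALUES, G_VALUES, C_VALUES))


def best_combination(score):
    """Return the (L, G, C) triple with the highest score.

    `score` is a hypothetical callable (L, G, C) -> accuracy that stands
    in for one run of the accuracy-prediction program on one sequence.
    """
    return max(GRID, key=lambda lgc: score(*lgc))
```

With these ranges the brute-force approach costs 486 program runs per sequence, which is the cost a learned predictor would try to avoid.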

2 Answers

1
votes

Your problem statement is not very clear.

You want a supervised learning algorithm that learns from your 50 training examples and creates a predictor program that takes as input a "sequence file" and produces as output values of L, G and C for that sequence file.

Is that correct?

There are many choices for supervised learning algorithms. What exactly is the data in the sequence file? Is it a vector of real numbers? What structure does it have? If you had to determine L, G and C "by hand" for a sequence file, could you do it? How would you do it?
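As one very simple instance of such a predictor, here is a minimal sketch that uses only the sequence length S as the input feature and a 1-nearest-neighbour rule. Both choices are assumptions made for illustration, not something the question specifies, and the training pairs are invented placeholders.

```python
def nearest_neighbour_predictor(training):
    """Build a predictor from (S, (L, G, C)) training pairs.

    The returned function maps a new sequence length S to the best-known
    (L, G, C) of the training sequence whose length is closest to S.
    """
    examples = sorted(training)

    def predict(s):
        # Pick the training example with the smallest length difference.
        _, params = min(examples, key=lambda ex: abs(ex[0] - s))
        return params

    return predict


# Placeholder training pairs, invented for illustration only.
TOY_TRAINING = [
    (500, (4, 1, 80)),
    (1200, (5, 2, 100)),
    (3000, (7, 3, 140)),
]

predict = nearest_neighbour_predictor(TOY_TRAINING)
print(predict(1000))  # closest training length is 1200
```

A decision tree or any other supervised learner could replace the nearest-neighbour rule. Which features to extract from the sequence file remains the open question.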

1
votes

I think you'd like to find the single "biggest influence factor" among the three parameters (L, G, C). That is not a very common requirement, but a few ideas may help you.

  1. You could build several decision trees: one using all three parameters, three using two parameters each (one for every pair chosen from the three), and three using a single parameter each. That gives seven decision trees. Comparing their accuracy shows the contribution of each parameter to the final prediction.

  2. You could calculate the correlation matrix of the three parameters and the final prediction, which shows the weight of each parameter's contribution to the final result.
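The two ideas above can be sketched roughly as follows. The code enumerates the seven parameter subsets (one model would be trained per subset) and computes the Pearson correlation of each parameter with the accuracy. The row format, a dict with keys `L`, `G`, `C` and `accuracy`, is an assumption made for illustration.

```python
from itertools import combinations
from math import sqrt

PARAMS = ("L", "G", "C")

# All non-empty parameter subsets: 3 singles + 3 pairs + 1 triple = 7.
SUBSETS = [c for r in (1, 2, 3) for c in combinations(PARAMS, r)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Both sequences must vary; a constant sequence gives a zero denominator.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def correlations_with_accuracy(rows):
    """Correlate each parameter with accuracy.

    `rows` is assumed to be a list of dicts with keys L, G, C, accuracy,
    one per program run.
    """
    acc = [r["accuracy"] for r in rows]
    return {p: pearson([r[p] for r in rows], acc) for p in PARAMS}
```

Pearson correlation only measures linear association, so a parameter with a strong but non-linear effect on accuracy can still score close to zero.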

PS. Your requirement (choosing only one factor for the machine learning) is generally not a good idea, because the contribution of different factors to the prediction is normally not linear. That is to say, combining more factors in the input vector will normally give you a better result.

Good luck!