
Guys!

I am new to machine learning and have a question about it. I am trying to use the caret package in R to get started and to work with my dataset.

I have a training dataset (Dataset1) with mutation information for my gene of interest, let's say Gene A.

In Dataset1, the mutation status of Gene A is recorded as Mut or Not-Mut. I used Dataset1 with an SVM model to predict this status (I chose SVM because it was more accurate than LVQ or GBM). As a first step, I split the dataset into training and test groups; the dataset already has a column that marks each sample as train or test. Then I ran 10-fold cross-validation, tuned the model, and assessed its performance on the test set using a ROC curve. Everything works fine up to this step.

I have another dataset, Dataset2, which has no mutation information for Gene A. What I want to do now is apply my tuned SVM model from Dataset1 to Dataset2 to see whether it can predict the mutation status of Gene A in Dataset2 as Mut/Not-Mut. I have gone through the caret package guide but could not work out how to do this. I am stuck here and don't know what to do.

I am not sure whether I chose the right approach. Any suggestions or help would be really appreciated.

Here is my code up to the point where I tuned the model on the first dataset.

Selecting the training and test sets from the first dataset:

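library(caret)      # train(), trainControl(), confusionMatrix()
library(doParallel) # registerDoParallel()
library(pROC)       # roc()

M_train <- Dataset1[Dataset1$Case=='train',-1] # training rows, first column dropped

M_test <- Dataset1[Dataset1$Case=='test',-1] # test rows, first column dropped

y <- as.factor(M_train$Class) # Target variable for training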


ctrl <- trainControl(method="repeatedcv", # repeated k-fold cross validation
                     number=10, # 10 folds (the caret default, made explicit)
                     repeats=5, # do 5 repetitions of cv
                     summaryFunction=twoClassSummary, # Use AUC to pick the best model
                     classProbs=TRUE)


#Use the expand.grid to specify the search space for GBM
#(left over from the GBM run; the SVM call below uses tuneLength, not this grid)
#Note that the default search grid selects 3 values of each tuning parameter

grid <- expand.grid(interaction.depth = seq(1,4,by=2), # tree depths 1 and 3
                    n.trees=seq(10,100,by=10), # let iterations go from 10 to 100
                    shrinkage=c(0.01,0.1), # Try 2 values for learning rate
                    n.minobsinnode = 20)


# Set up for parallel processing
#set.seed(1951)
registerDoParallel(cores=4) # register 4 parallel workers


#Train and Tune the SVM
#Predictors only: the label (Class) and the train/test flag (Case) must not be in x
predictors <- setdiff(names(M_train), c("Case","Class"))

svm.tune <- train(x=M_train[, predictors],
                  y=as.factor(M_train$Class),
                  method = "svmRadial",
                  tuneLength = 9, # 9 values of the cost parameter
                  preProc = c("center","scale"),
                  metric="ROC",
                  trControl=ctrl) # same resampling control as for gbm

#Finally, assess the performance of the model using the test data set.

#Make predictions on the test data with the SVM Model
svm.pred <- predict(svm.tune,M_test[, predictors])

confusionMatrix(svm.pred,as.factor(M_test$Class))

svm.probs <- predict(svm.tune,M_test[, predictors],type="prob") # Gen probs for ROC

svm.ROC <- roc(response=as.factor(M_test$Class),
               predictor=svm.probs$mut,
               levels=levels(as.factor(M_test$Class))) # pROC expects c(control, case)

plot(svm.ROC,main="ROC for SVM built with GA selected features")

So this is where I am stuck: how can I use the svm.tune model to predict the mutation status of Gene A in Dataset2?

Thanks in advance,


1 Answer


Now you just take the model you built and tuned and predict from it using predict():

D2.predictions <- predict(svm.tune, newdata = Dataset2)

The key is to be sure that you have ALL of the same predictor variables in this set, with the same column names (and, in my paranoid world, in the same order).
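One way to make that check explicit is a small helper that stops if any training predictor is missing from the new data and otherwise returns the columns in the training order. This is only a sketch in base R: align_predictors is a made-up name (it is not part of caret), and the toy data frames and gene column names below are invented stand-ins for the real data.

```r
# Hypothetical helper (not a caret function): make new data match the
# predictors used in training, by name and by order.
align_predictors <- function(newdata, predictor_names) {
  missing_cols <- setdiff(predictor_names, names(newdata))
  if (length(missing_cols) > 0) {
    stop("New data is missing predictors: ",
         paste(missing_cols, collapse = ", "))
  }
  # Keep only the training predictors, in the training order
  newdata[, predictor_names, drop = FALSE]
}

# Toy stand-ins for the real data (made-up values, for illustration only)
train_x  <- data.frame(geneB = c(1.2, 0.4), geneC = c(3.1, 2.2))
Dataset2 <- data.frame(SampleID = c("s1", "s2"),
                       geneC = c(2.9, 3.3),
                       geneB = c(0.7, 1.1))

Dataset2_x <- align_predictors(Dataset2, names(train_x))
print(names(Dataset2_x))
```

With the real model, predictor_names would be the column names of whatever data frame was passed as x to train(), and the aligned data frame is what goes into predict(svm.tune, newdata = ...).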

D2.predictions will contain your predicted classes for the unlabeled data.
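
If class probabilities are wanted as well as the predicted classes, predict() on a caret model also accepts type = "prob", as in the question's own test-set code. Below is a minimal end-to-end sketch of that workflow, assuming the caret and kernlab packages are installed. It uses simulated data from caret's twoClassSim() rather than the real datasets, and the names labeled, unlabeled, fit and results are invented for illustration; the resampling settings are deliberately smaller than in the question so that it runs quickly.

```r
library(caret)  # assumes caret and kernlab (used by svmRadial) are installed

set.seed(1951)
# Simulated stand-ins: 'labeled' plays Dataset1, 'unlabeled' plays Dataset2
labeled   <- twoClassSim(200)
unlabeled <- twoClassSim(50)
unlabeled$Class <- NULL   # the second dataset has no known labels

# Every column except the label is a predictor
predictors <- setdiff(names(labeled), "Class")

ctrl <- trainControl(method = "cv", number = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)

fit <- train(x = labeled[, predictors],
             y = labeled$Class,
             method = "svmRadial",
             tuneLength = 3,
             preProc = c("center", "scale"),
             metric = "ROC",
             trControl = ctrl)

# Predicted class and per-class probabilities for the unlabeled samples
pred_class <- predict(fit, newdata = unlabeled[, predictors])
pred_prob  <- predict(fit, newdata = unlabeled[, predictors], type = "prob")

results <- data.frame(predicted = pred_class, pred_prob)
head(results)
```

Without labels in the second dataset there is nothing to compute a confusion matrix or ROC curve against, so the test-set performance from the first dataset is the only available estimate of how reliable these predictions are.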