Hi all,
I am new to machine learning and have a question about it. I am using the caret package in R to get started and to work with my dataset.
I have a training dataset (Dataset1) with mutation information for my gene of interest, say Gene A.
In Dataset1, the mutation status of Gene A is recorded as Mut or Not-Mut. I used Dataset1 with an SVM model to predict this status (I chose SVM because it was more accurate than LVQ or GBM). As a first step, I split the dataset into training and test sets, using the train/test labels that are already present in the dataset. Then I ran repeated 10-fold cross-validation, tuned the model, and assessed its performance on the test set with a ROC curve. Everything works fine up to this step.
I have a second dataset, Dataset2, which has no mutation information for Gene A. What I want to do now is apply the SVM model tuned on Dataset1 to Dataset2, to see whether it can predict the mutation status of Gene A in Dataset2 as Mut/Not-Mut. I have gone through the caret package guide but could not work out how to do this. I am stuck here and don't know what to do.
I am also not sure whether I chose the right approach. Any suggestions or help would be much appreciated.
Here is my code up to the point where the model is tuned on the first dataset.
Selecting training and test sets from the first dataset:
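library(caret)      # train, trainControl, confusionMatrix
library(doParallel) # registerDoParallel
library(pROC)       # roc
M_train <- Dataset1[Dataset1$Case == 'train', -1] # training rows, first column dropped
M_test <- Dataset1[Dataset1$Case == 'test', -1]   # test rows, first column dropped
y <- as.factor(M_train$Class)                     # target variable for training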
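ctrl <- trainControl(method = "repeatedcv",             # repeated k-fold cross-validation
                     number = 10,                       # 10 folds (caret's default, made explicit)
                     repeats = 5,                       # do 5 repetitions of CV
                     summaryFunction = twoClassSummary, # use AUC to pick the best model
                     classProbs = TRUE)                 # class probabilities are needed for ROC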
# Use expand.grid to specify the search space for the GBM model
# (this grid is not used by the svmRadial call below)
# Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(interaction.depth = seq(1, 4, by = 2), # tree depths 1 and 3
                    n.trees = seq(10, 100, by = 10),       # let iterations go from 10 to 100
                    shrinkage = c(0.01, 0.1),              # try 2 values for the learning rate
                    n.minobsinnode = 20)
# Set up for parallel processing
#set.seed(1951)
registerDoParallel(cores = 4) # 4 parallel workers
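# Train and tune the SVM
# Use only the feature columns as predictors; passing M_train as-is would
# include the Class column and leak the label into the model.
features <- setdiff(names(M_train), "Class")
svm.tune <- train(x = M_train[, features],
                  y = y,
                  method = "svmRadial",
                  tuneLength = 9,                 # 9 values of the cost parameter
                  preProc = c("center", "scale"),
                  metric = "ROC",
                  trControl = ctrl)               # same as for gbm above
# Finally, assess the performance of the model using the test data set.
# Make predictions on the test data with the SVM model
svm.pred <- predict(svm.tune, M_test[, features])
confusionMatrix(svm.pred, factor(M_test$Class, levels = levels(y)))
svm.probs <- predict(svm.tune, M_test[, features], type = "prob") # class probabilities for ROC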
svm.ROC <- roc(response = as.factor(M_test$Class),
               predictor = svm.probs$mut,
               levels = levels(y)) # the two class labels, taken from the training factor
plot(svm.ROC, main = "ROC for SVM built with GA selected features")
So this is where I am stuck: how can I use the svm.tune model to predict the mutation status of Gene A in Dataset2?
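To make the question concrete, this is roughly what I imagine the missing step might look like. It is only a sketch and I am not sure it is right: the helper name `predict_on_new_data` and the `feature_names` argument are names I made up, and it assumes that Dataset2 contains the same feature columns (same names, measured on the same scale) as the training data and that the model was trained on feature columns only. It relies on the generic `predict()` with `newdata` and `type = "prob"`, which caret's train objects accept.

```r
# Hypothetical helper (not part of caret): apply a fitted model to a new,
# unlabeled data frame and return predicted classes plus class probabilities.
predict_on_new_data <- function(model, newdata, feature_names) {
  # Assumption: the new data must provide every feature used in training.
  missing_cols <- setdiff(feature_names, names(newdata))
  if (length(missing_cols) > 0) {
    stop("New data is missing feature columns: ",
         paste(missing_cols, collapse = ", "))
  }

  # Keep only the training features, in the same order as in training.
  x_new <- newdata[, feature_names, drop = FALSE]

  # Predicted class labels (for a caret model, the default type is "raw").
  classes <- predict(model, newdata = x_new)

  # Class probabilities, one column per class level.
  probs <- predict(model, newdata = x_new, type = "prob")

  data.frame(predicted_class = classes, probs)
}

# Intended usage with the objects from my code (not run here):
# feature_names <- setdiff(names(M_train), "Class")
# Dataset2_pred <- predict_on_new_data(svm.tune, Dataset2, feature_names)
# table(Dataset2_pred$predicted_class)
```

If this is the right idea, I assume the centering and scaling from preProc would be reapplied automatically to Dataset2 with the training-set parameters, but I would appreciate confirmation of that as well.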
Thanks in advance,