
I have a training set that looks like this:

Name     Day       Area     X        Y      Month  Night
ATTACK   Monday    LA       -122.41  37.78  8      0
VEHICLE  Saturday  CHICAGO  -1.67    3.15   2      0
MOUSE    Monday    TAIPEI   -12.5    3.1    9      1

Name is the outcome/dependent variable.

Here is what my code looks like so far, in case it helps:

# Dummy-encode the outcome and build the design matrix of predictors
ynn  <- model.matrix(~ Name, data = trainDF)
mnn  <- model.matrix(~ Day + Area + X + Y + Month + Night, data = trainDF)
# Syntactically valid class labels for caret (needed when classProbs = TRUE)
yCat <- make.names(trainDF$Name, unique = FALSE, allow_ = TRUE)

I then set up the tuning parameters:

nnTrControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5,
                            verboseIter = TRUE, returnData = FALSE,
                            returnResamp = "all", classProbs = TRUE,
                            summaryFunction = multiClassSummary,
                            allowParallel = TRUE)
nnGrid <- expand.grid(.size = c(1, 4, 7), .decay = c(0, 0.001, 0.1))
model  <- train(y = yCat, x = mnn, method = 'nnet', linout = TRUE, trace = FALSE,
                trControl = nnTrControl, metric = "logLoss", tuneGrid = nnGrid)

When I ran this, it was still running over 20 hours later, so I had to stop it.

I read in the link below that it's possible to parallelize caret's resampling using registerDoMC: R caret nnet package in Multicore

However, that only seems to address cores. My machine has 2 cores with 2 threads each. Is there a way to get a speedup from the threads as well, beyond registerDoMC(2)? A sketch of what I was considering is below.
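This is only what I was planning to try, not something I've verified helps; registerDoMC accepts a worker count, and the doParallel alternative is my guess at using all four logical threads:

# Backend option 1: doMC (fork-based, Unix only)
library(doMC)
registerDoMC(cores = 2)                       # one worker per physical core

# Backend option 2: doParallel, using every logical CPU reported by the OS
library(doParallel)
cl <- makeCluster(parallel::detectCores())    # 4 logical threads on my machine
registerDoParallel(cl)
# ... call train() here; caret picks up whichever backend is registered ...
stopCluster(cl)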

I also see in the link below that the user had to set up seeds for each resample: Fully reproducible parallel models using caret. Do I also have to do that for my code? Why was it not needed in the first link? What if I used xgboost instead of nnet?


1 Answer


If you want to reproduce your results, you will have to set a seed on every thread you spawn, because each worker draws its own random number stream when it starts. Depending on your OS, each thread will most likely be scheduled on a separate CPU core; that is up to the OS job scheduler. A sketch of how to hand caret per-resample seeds is below.

Regarding xgboost versus nnet, the most important question is whether you care about the model's properties. If you are just starting with machine learning, xgboost may be a bit easier to work with than nnet. If computational performance is your biggest concern, try running the problem on a smaller subset first.
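With caret you do not have to seed each worker yourself: trainControl has a seeds argument that takes one vector of seeds per resample plus a single seed for the final model. A minimal sketch, assuming the 3 x 5 repeated CV and the 9-row tuning grid from your question:

library(caret)

set.seed(123)                                  # seed used to generate the seed list itself
n_resamples <- 3 * 5                           # number = 3, repeats = 5
n_tune      <- 3 * 3                           # 3 sizes x 3 decay values

seeds <- vector(mode = "list", length = n_resamples + 1)
for (i in 1:n_resamples) seeds[[i]] <- sample.int(10000, n_tune)
seeds[[n_resamples + 1]] <- sample.int(10000, 1)   # single seed for the final fit

nnTrControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5,
                            classProbs = TRUE, summaryFunction = multiClassSummary,
                            allowParallel = TRUE, seeds = seeds)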

One thing I would do first is run an MCA (multiple correspondence analysis), which is available in the FactoMineR package. This will let you see how much variance your variables account for; you could drop variables that contribute very little and thereby speed up the learning task.
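A minimal sketch of running MCA on the categorical columns with FactoMineR; the column selection below is only an assumption based on the example data in your question:

library(FactoMineR)

catVars <- trainDF[, c("Day", "Area", "Night")]
catVars[] <- lapply(catVars, as.factor)        # MCA expects factor columns

mcaRes <- MCA(catVars, graph = FALSE)
mcaRes$eig                                     # variance explained per dimension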