I have a training dataset and a testing dataset with approximately 1300 and 400 samples, respectively. I run a grid search that creates a number of deep networks (softmax output, ReLU hidden layers, trained with gradient descent) with varying numbers of hidden nodes in a prespecified number of hidden layers. For example, if I tell it to check all single-layer models, the grid search creates 100 networks with 1, 2, 3, ..., 100 hidden nodes in that single layer. For each model and each epoch, the grid search trains the model and tests it by feeding it random batches of the training/testing data at a prespecified batch size. After each epoch of training, the program writes out an AUC value for every model, so I end up with 100 output files, each containing the AUC values after every epoch for one model. I can then go through these files with a parser to see which model and which number of epochs are optimal.
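For concreteness, here is a minimal sketch of the kind of loop the grid search runs (assuming TensorFlow/Keras and a two-class problem; the variable names, feature count, epoch count, and the synthetic stand-in data are just placeholders for my actual setup):

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score

BATCH_SIZE = 32    # prespecified batch size
N_EPOCHS = 50      # prespecified number of training epochs
MAX_NODES = 100    # check single-layer models with 1..100 hidden nodes

# Synthetic stand-in data with roughly the sizes described above.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1300, 20)).astype("float32")
y_train = tf.keras.utils.to_categorical(rng.integers(0, 2, 1300), 2)
X_test = rng.normal(size=(400, 20)).astype("float32")
y_test = tf.keras.utils.to_categorical(rng.integers(0, 2, 400), 2)

for n_hidden in range(1, MAX_NODES + 1):
    # One single-hidden-layer network: ReLU hidden layer, softmax output,
    # trained with plain gradient descent (SGD).
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(),
                  loss="categorical_crossentropy")

    # One output file per model, holding the AUC after every epoch.
    with open(f"auc_hidden_{n_hidden:03d}.txt", "w") as out:
        for epoch in range(N_EPOCHS):
            # One epoch of training on shuffled mini-batches.
            model.fit(X_train, y_train, batch_size=BATCH_SIZE,
                      epochs=1, shuffle=True, verbose=0)
            # Test-set AUC after this epoch.
            probs = model.predict(X_test, verbose=0)[:, 1]
            auc = roc_auc_score(y_test[:, 1], probs)
            out.write(f"{epoch + 1}\t{auc:.4f}\n")
```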
However, when I rerun the grid search, the best models from the first run are not the same as those from subsequent runs. I attribute this to the random batches fed to the models during training and testing, but then how can I actually find the "optimal model"?
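For reference, here is a minimal sketch of how I could make a single run reproducible by fixing the seeds before each model is built (assuming TensorFlow/Keras; the helper name, seed value, and hyperparameters are arbitrary placeholders), though that alone doesn't tell me which model is genuinely best:

```python
import tensorflow as tf

def build_and_train(n_hidden, X_train, y_train, seed=42):
    # Seeds Python's `random`, NumPy, and TensorFlow in one call, so repeated
    # runs see the same weight initialisation and the same shuffled batches.
    tf.keras.utils.set_random_seed(seed)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(n_hidden, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(),
                  loss="categorical_crossentropy")
    model.fit(X_train, y_train, batch_size=32, epochs=10,
              shuffle=True, verbose=0)
    return model
```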