I am trying to plot the different learning outcomes when using batch gradient descent, stochastic gradient descent, and mini-batch stochastic gradient descent.
Everywhere I look, I read that batch_size=1 is the same as plain SGD and batch_size=len(train_data) is the same as batch gradient descent.
I know that stochastic gradient descent uses only a single data sample for every update, while batch gradient descent uses the entire training data set to compute the gradient of the objective function for each update.
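To make sure I have the terminology straight, this is roughly how I picture one epoch of each variant in plain NumPy (w, X, y, grad_fn and lr are just placeholders for this sketch, not part of my actual code):

import numpy as np

def sgd_epoch(w, X, y, grad_fn, lr=0.001):
    # Stochastic gradient descent: one update per single sample
    for i in np.random.permutation(len(X)):
        w = w - lr * grad_fn(w, X[i:i+1], y[i:i+1])
    return w

def batch_gd_epoch(w, X, y, grad_fn, lr=0.001):
    # Batch gradient descent: one update per epoch, using the whole training set
    return w - lr * grad_fn(w, X, y)

def mini_batch_sgd_epoch(w, X, y, grad_fn, lr=0.001, batch_size=32):
    # Mini-batch SGD: one update per batch of batch_size samples
    for start in range(0, len(X), batch_size):
        w = w - lr * grad_fn(w, X[start:start+batch_size], y[start:start+batch_size])
    return w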
However, when setting the batch_size in Keras, the opposite seems to be happening. Take my code for example, where I have set batch_size equal to the length of my training data:
import tensorflow as tf
from tensorflow import keras
import tensorflow_addons as tfa

input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 250
weights_initializer = keras.initializers.GlorotUniform()

# A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
        # Layer to be used as an entry point into a network
        keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
        # Dense layer 1
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_1'),
        # Dense layer 2
        keras.layers.Dense(hidden_layer_size, activation='relu',
                           kernel_initializer=weights_initializer,
                           name='Layer_2'),
        # Activation is linear since we are doing regression
        keras.layers.Dense(output_size, activation='linear', name='Output_layer')
    ])
    # Use the SGD optimizer; change batch_size to get batch GD, SGD or mini-batch SGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)
    # Compile the model
    model.compile(optimizer=optimizer,
                  loss='mean_squared_error',      # mean of squared errors between labels and predictions
                  metrics=['mean_squared_error']) # mean squared error between y_true and y_pred
    # Initialize the TimeStopping callback
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
    # Train the network
    history = model.fit(normed_train_data, train_labels,
                        epochs=n_epochs,
                        batch_size=hparams['batch_size'],
                        verbose=1,
                        #validation_split=0.2,
                        callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"),
                                   time_stopping_callback])
    return history

train_val_model("logs/sample", {'batch_size': len(normed_train_data)})
When running this, the output seems to show a single update for each epoch, i.e. SGD:
As can be seen, underneath every epoch it says 1/1, which I assume means a single update iteration. If I instead set batch_size=1, I get 90000/90000, which is the size of my entire data set (training-time-wise this also makes sense).
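If I read the progress bar correctly, the number shown per epoch is just the number of batches, i.e. the number of gradient updates, which I assume Keras computes as something like this (a sketch, not Keras's actual code):

import math

def steps_per_epoch(n_samples, batch_size):
    # Number of batches (= gradient updates) run in one epoch
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(90000, 1))      # 90000 -> what I see with batch_size=1
print(steps_per_epoch(90000, 90000))  # 1     -> what I see with batch_size=len(train_data)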
So, my question is: is batch_size=1 actually batch gradient descent rather than stochastic gradient descent, and is batch_size=len(train_data) actually stochastic gradient descent rather than batch gradient descent?
Since you already have loss='mean_squared_error', you don't need to re-include it in metrics. – desertnaut