1
votes

I am trying to plot the different learning outcome when using Batch gradient descent, Stochastic gradient descent and mini-batch stochastic gradient descent.

Everywhere i look, i read that a batch_size=1 is the same as having a plain SGD and a batch_size=len(train_data) is the same as having the Batch gradient descent.

I know that stochastic gradient descent is when you use only one single data sample for every update and batch gradient descent uses the entire training data set to compute the gradient of the objective function / update.

However, when implementing the batch_size using keras, it seems to be the opposite that is happening. Take my code for example, where I have set the batch_size equal to the length of my training_data

input_size = len(train_dataset.keys())
output_size = 10
hidden_layer_size = 250
n_epochs = 250

weights_initializer = keras.initializers.GlorotUniform()

#A function that trains and validates the model and returns the MSE
def train_val_model(run_dir, hparams):
    model = keras.models.Sequential([
            #Layer to be used as an entry point into a Network
            keras.layers.InputLayer(input_shape=[len(train_dataset.keys())]),
            #Dense layer 1
            keras.layers.Dense(hidden_layer_size, activation='relu', 
                               kernel_initializer = weights_initializer,
                               name='Layer_1'),
            #Dense layer 2
            keras.layers.Dense(hidden_layer_size, activation='relu', 
                               kernel_initializer = weights_initializer,
                               name='Layer_2'),
            #activation function is linear since we are doing regression
            keras.layers.Dense(output_size, activation='linear', name='Output_layer')
                                ])
    
    #Use the stochastic gradient descent optimizer but change batch_size to get BSG, SGD or MiniSGD
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.0,
                                        nesterov=False)
    
    #Compiling the model
    model.compile(optimizer=optimizer, 
                  loss='mean_squared_error', #Computes the mean of squares of errors between labels and predictions
                  metrics=['mean_squared_error']) #Computes the mean squared error between y_true and y_pred
    
    # initialize TimeStopping callback 
    time_stopping_callback = tfa.callbacks.TimeStopping(seconds=5*60, verbose=1)
    
    #Training the network
    history = model.fit(normed_train_data, train_labels, 
         epochs=n_epochs,
         batch_size=hparams['batch_size'], 
         verbose=1,
         #validation_split=0.2,
         callbacks=[tf.keras.callbacks.TensorBoard(run_dir + "/Keras"), time_stopping_callback])
    
    return history

train_val_model("logs/sample", {'batch_size': len(normed_train_data)})

When running this, the output seems to show a single update for each epoch i.e. SGD Batch_size=len(input_data):

As can be seen underneath every epoch it says 1/1 which I assume means a single update iteration. If I on the other hand set the batch_size=1 I get 90000/90000 which is the size of my entire data-set (training time wise this also makes sense).

So, my question is, batch_size=1 is actually Batch gradient descent and not stochastic gradient descent and batch_size=len(train_data) is actually stochastic gradient descent and not batch gradient descent?

2
BTW, since you use loss='mean_squared_error', you don't need to re-include it in metricsdesertnaut

2 Answers

1
votes

There are actually three (3) cases:

  • batch_size = 1 means indeed stochastic gradient descent (SGD)
  • A batch_size equal to the whole of the training data is (batch) gradient descent (GD)
  • Intermediate cases (which are actually used in practice) are usually referred to as mini-batch gradient descent

See A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size for more details and references. Truth is, in practice, when we say "SGD" we usually mean "mini-batch SGD".

These definitions are in fact fully compliant with what you report from your experiments:

  • With batch_size=len(train_data) (GD case), only one update is indeed expected per epoch (since there is only one batch), hence the 1/1 indication in Keras output.

  • In contrast, with batch_size = 1 (SGD case), you expect as many updates as samples in your training data (since this is now the number of your batches), i.e. 90000, hence the 90000/90000 indication in Keras output.

i.e. the number of updates per epoch (which Keras indicates) is equal to the number of batches used (and not to the batch size).

0
votes

batch_size is the size of how large each update will be.

Here, batch_size=1 means the size of each update is 1 sample. By your definitions, this would be SGD.

If you have batch_size=len(train_data), that means that each update to your weights will require the resulting gradient from your entire dataset. This is actually just good old gradient descent.

Batch gradient descent is somewhere in the middle, where the batch_size isn't 1 and the batch size isn't your entire training dataset. Take 32 for example. Batch gradient descent would update your weights every 32 examples, so it smooths out the ruggedness of SGD with just 1 example (where outliers may have a lot of impact) and yet has the benefits that SGD has over regular gradient descne.t