I am new to Keras. I have read many blog posts about deep-learning classification using Keras, but even after reading a lot of them I cannot figure out how each author chose the number of units for the first dense layer right after the flatten layer in their code. For example:
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Dropout, Flatten

def createModel():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=input_shape))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Conv2D(64, (3, 3), padding='same', activation='relu'))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nClasses, activation='softmax'))
    return model
My doubts:
- How did the programmer decide on the value '512' for this dense layer?
- Is it totally random? I know that in this example the flatten layer's output has 256 values, so my logic was that they multiplied it by 2 to get 512. But this logic does not hold in any other example I have read.
- How does this dense layer affect the training?
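My current understanding (this is an assumption on my part, not something the blog posts stated) is that the 512 is a free hyperparameter, and the layer's parameter count then follows from it and the flatten size as inputs * units + units (one bias per unit). A small sketch of that arithmetic:

```python
# My assumption: a Dense layer after Flatten has
#   weights = flat_size * units, plus one bias per unit.
def dense_param_count(flat_size, units):
    return flat_size * units + units

# e.g. a flatten output of 256 values feeding Dense(512):
print(dense_param_count(256, 512))  # 256*512 + 512 = 131584
```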
If I put in too large a value, as in my code below (going by the same logic, I multiplied my flatten size of 86400 by 2, i.e. 172800), I get the following error:
from keras.layers import Activation

model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(96, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(172800))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(4))
model.add(Activation('softmax'))
model.summary()
ValueError: rng_mrg cpu-implementation does not support more than (2**31 -1) samples
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'. HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
This is the summary of my model without the first dense layer:
Layer (type) Output Shape Param #
=================================================================
conv2d_4 (Conv2D) (None, 254, 254, 32) 896
_________________________________________________________________
activation_4 (Activation) (None, 254, 254, 32) 0
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 127, 127, 32) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 127, 127, 32) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 125, 125, 64) 18496
_________________________________________________________________
activation_5 (Activation) (None, 125, 125, 64) 0
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, 62, 62, 64) 0
_________________________________________________________________
dropout_5 (Dropout) (None, 62, 62, 64) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 60, 60, 96) 55392
_________________________________________________________________
activation_6 (Activation) (None, 60, 60, 96) 0
_________________________________________________________________
max_pooling2d_6 (MaxPooling2 (None, 30, 30, 96) 0
_________________________________________________________________
dropout_6 (Dropout) (None, 30, 30, 96) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 86400) 0
_________________________________________________________________
activation_7 (Activation) (None, 86400) 0
_________________________________________________________________
dropout_7 (Dropout) (None, 86400) 0
_________________________________________________________________
dense_2 (Dense) (None, 4) 345604
_________________________________________________________________
activation_8 (Activation) (None, 4) 0
Total params: 420,388
Trainable params: 420,388
Non-trainable params: 0
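As a sanity check on the summary above, the 345,604 parameters of dense_2 do match the flatten size times the number of output classes, plus biases (assuming Dense params = inputs * units + units):

```python
flat_size = 86400  # output size of flatten_2 in the summary
units = 4          # Dense(4), my number of classes
params = flat_size * units + units
print(params)  # 345604, matching dense_2 in model.summary()
```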
When I eliminate this layer altogether my code works, and it also works if I put in a smaller value, but I don't want to blindly set this parameter without understanding the reason.
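My guess (an assumption on my part) about the ValueError: with Dense(172800) after a flatten output of 86400, the weight matrix alone would hold 86400 * 172800 values, which is far beyond the 2**31 - 1 limit mentioned in the error message, so Theano's random initializer refuses to sample that many values:

```python
flat_size = 86400
units = 172800
weights = flat_size * units          # size of the weight matrix alone
print(weights)                       # 14929920000
print(weights > 2**31 - 1)           # True: exceeds the rng_mrg limit
```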