I have a simple model trained on MNIST with 600 nodes in a hidden layer.
Some precursors...
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential, Model
from keras.layers import Dense, Dropout, InputLayer, Activation
from keras.optimizers import RMSprop, Adam
import numpy as np
import h5py
import matplotlib.pyplot as plt
from keras import backend as K
import tensorflow as tf
MNIST Loading
batch_size = 128
num_classes = 10
epochs = 50
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')
# One hot conversion
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
Designing the model
model = Sequential()
###Model###
model.add(Dense(600, input_dim=784))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
tfcall = keras.callbacks.TensorBoard(log_dir='./keras600logs', histogram_freq=1, batch_size=batch_size, write_graph=True)
model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=10,  # EPOCHS
                    verbose=1,
                    validation_data=(x_test, y_test),
                    callbacks=[tfcall])
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Now comes the new part. I want to be able to dynamically (i.e. with every new input image) define a 'mask' that turns off some of the 600 neurons in the hidden layer, preventing them from passing their activations on to the output layer.
mask_i = [0, 0, 1, 0, 1, .... 0, 1, 0, 0] (1x600)
such that, for an input image i, the mask indices with a 1 correspond to nodes that are shut off while processing image i.
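To make the intended effect concrete, here is a tiny numpy illustration of the semantics I have in mind (the arrays are just random placeholders):

import numpy as np
hidden_activations = np.random.rand(600)     # stand-in for the 600 ReLU outputs
mask_i = np.random.randint(0, 2, size=600)   # 1 = shut this node off for image i
masked = hidden_activations * (1 - mask_i)   # masked nodes contribute 0 downstream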
What is the best way to go about doing this?
Do we add another input node with weights TOWARDS the hidden layer of -100000000, so that it overwhelms whatever activation is normally there (and ReLU does the rest)? This is essentially hacking the bias dynamically.
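Roughly, the first idea would look something like this with the functional API (just a sketch; I use a Lambda plus Add to stand in for fixed one-to-one weights of -1e9 from the mask input):

from keras.layers import Input, Dense, Activation, Lambda, Add
from keras.models import Model

img = Input(shape=(784,))
mask = Input(shape=(600,))                  # 1 = shut this node off
pre = Dense(600)(img)                       # pre-activation of the hidden layer
penalty = Lambda(lambda m: -1e9 * m)(mask)  # huge negative 'bias' on masked nodes
hidden = Activation('relu')(Add()([pre, penalty]))  # ReLU zeroes them out
out = Dense(10, activation='softmax')(hidden)
masked_model = Model(inputs=[img, mask], outputs=out)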
Or do we create another hidden layer in which each of the 600 nodes is connected to exactly one node of the first hidden layer (its counterpart) with a dynamic weight of either 0 (off) or 1 (proceed as normal), and then fully connect that new hidden layer to the output?
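The second idea seems equivalent to an element-wise multiply of the hidden activations by (1 - mask), which would be something like (again just a sketch):

from keras.layers import Input, Dense, Lambda, Multiply
from keras.models import Model

img = Input(shape=(784,))
mask = Input(shape=(600,))                # 1 = shut this node off
hidden = Dense(600, activation='relu')(img)
gate = Lambda(lambda m: 1.0 - m)(mask)    # 0 for masked nodes, 1 otherwise
gated = Multiply()([hidden, gate])        # masked activations become 0
out = Dense(10, activation='softmax')(gated)
masked_model = Model(inputs=[img, mask], outputs=out)
# e.g. masked_model.predict([x_batch, mask_batch]); an all-zero mask recovers the original model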
Both of these seem a bit hackish; I wanted to know what others out there think.