4
votes

I am experimenting with the MNIST dataset to learn the Keras library. MNIST has 60k training images and 10k validation images.

In both sets, I'd like to introduce augmentation on 30% of the images.

datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
datagen.fit(training_images)
datagen.fit(validation_images)

This does not augment the images, and I am not sure how to use the model.fit_generator method. My current model.fit call is as follows:

model.fit(training_images, training_labels, validation_data=(validation_images, validation_labels), epochs=10, batch_size=200, verbose=2)

How do I apply augmentation on some of the images in this dataset?

Do you want to apply augmentation to both data sets, or only to 30% of them? – Marcin Możejko
On both (training, validation), but I don't want to apply it to all of them. I want to keep ~70% of the dataset as it is. – Grimlock
Do you want a different 30% of images augmented each epoch, or a fixed 30% of images? – Yu-Yang
The 30% is just to give a hint that I want a small part of the dataset modified. I am okay with a "fixed 30% of images", but "a different 30% of images augmented each epoch" sounds like a better idea! – Grimlock

1 Answer

4
votes

I'd try to define my own generator in the following manner:

from keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split
import numpy

def partial_flow(array, flags, generator, aug_percentage, batch_size):
    # Split the data into a part that will be augmented and a part that won't.
    # train_test_split returns (array_train, array_test, flags_train, flags_test),
    # so the "test" portion (aug_percentage of the data) is the augmented one.
    not_aug_array, aug_array, not_aug_flags, aug_flags = train_test_split(
        array,
        flags,
        test_size=aug_percentage)
    # Each batch mixes augmented and not augmented samples in the same proportion.
    aug_split_size = int(batch_size * aug_percentage)
    # A generator without any augmentation yields the untouched data.
    not_augmented_gen = ImageDataGenerator()
    aug_gen = generator.flow(
        x=aug_array,
        y=aug_flags,
        batch_size=aug_split_size)
    not_aug_gen = not_augmented_gen.flow(
        x=not_aug_array,
        y=not_aug_flags,
        batch_size=batch_size - aug_split_size)
    # Yielding data
    while True:
        # Getting augmented data
        aug_x, aug_y = next(aug_gen)
        # Getting not augmented data
        not_aug_x, not_aug_y = next(not_aug_gen)
        # Concatenating both parts into one full batch
        current_x = numpy.concatenate([aug_x, not_aug_x], axis=0)
        current_y = numpy.concatenate([aug_y, not_aug_y], axis=0)
        yield current_x, current_y

Now you could run training with:

 batch_size = 200
 datagen = ImageDataGenerator(horizontal_flip=True, vertical_flip=True)
 model.fit_generator(partial_flow(training_images, training_labels, datagen, 0.3, batch_size),
                     steps_per_epoch=int(training_images.shape[0] / batch_size),
                     epochs=10,
                     validation_data=partial_flow(validation_images, validation_labels, datagen, 0.3, batch_size),
                     validation_steps=int(validation_images.shape[0] / batch_size))

Note that aug_percentage is 0.3, since test_size=aug_percentage assigns that fraction of the data to the augmented split, and the ImageDataGenerator carrying the actual augmentations is passed in as the generator argument.
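If you want to sanity-check the mixing logic without pulling in Keras, here is a numpy-only sketch of the same idea. The names (shuffled_batches, mixed_batches) and the plain shuffling generator standing in for ImageDataGenerator.flow are my own illustration, not Keras API:

```python
import numpy as np

def shuffled_batches(x, y, batch_size):
    # Stand-in for ImageDataGenerator.flow: yields shuffled batches forever.
    while True:
        idx = np.random.permutation(x.shape[0])
        for start in range(0, x.shape[0] - batch_size + 1, batch_size):
            sel = idx[start:start + batch_size]
            yield x[sel], y[sel]

def mixed_batches(x, y, aug_percentage, batch_size):
    # Split the data once: aug_percentage of it goes to the "augmented" stream.
    n_aug = int(x.shape[0] * aug_percentage)
    perm = np.random.permutation(x.shape[0])
    aug_x, aug_y = x[perm[:n_aug]], y[perm[:n_aug]]
    plain_x, plain_y = x[perm[n_aug:]], y[perm[n_aug:]]
    # Each batch keeps the same augmented/plain proportion.
    aug_split = int(batch_size * aug_percentage)
    aug_gen = shuffled_batches(aug_x, aug_y, aug_split)
    plain_gen = shuffled_batches(plain_x, plain_y, batch_size - aug_split)
    while True:
        ax, ay = next(aug_gen)
        px, py = next(plain_gen)
        yield np.concatenate([ax, px]), np.concatenate([ay, py])

x = np.random.rand(1000, 28, 28)
y = np.random.randint(0, 10, size=1000)
bx, by = next(mixed_batches(x, y, aug_percentage=0.3, batch_size=200))
print(bx.shape, by.shape)  # (200, 28, 28) (200,)
```

With 1000 samples, 300 land in the augmented stream; each batch of 200 then contains 60 "augmented" and 140 untouched samples, which is exactly what partial_flow does with the real Keras generators.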