Keras - GRU layer with recurrent dropout - loss: 'nan', accuracy: 0

Question

Problem description

I am going through "Deep Learning in Python" by François Chollet (publisher webpage, notebooks on github). Replicating examples from Chapter 6 I encountered problems with (I believe) GRU layer with recurrent dropout.

The code in which I had first observed those errors is quite long, so I decided to stick to the simplest problem, which could replicate the error: classifying IMDB reviews into "positive" and "negative" categories.

When I use a GRU layer with recurrent dropout training loss (after couple of batches of first epoch) takes "value" of nan, while training accuracy (from the start of second epoch) takes the value of 0.

   64/12000 [..............................] - ETA: 3:05 - loss: 0.6930 - accuracy: 0.4844
  128/12000 [..............................] - ETA: 2:09 - loss: 0.6926 - accuracy: 0.4766
  192/12000 [..............................] - ETA: 1:50 - loss: 0.6910 - accuracy: 0.5573
(...) 
 3136/12000 [======>.......................] - ETA: 59s - loss: 0.6870 - accuracy: 0.5635
 3200/12000 [=======>......................] - ETA: 58s - loss: 0.6862 - accuracy: 0.5650
 3264/12000 [=======>......................] - ETA: 58s - loss: 0.6860 - accuracy: 0.5650
 3328/12000 [=======>......................] - ETA: 57s - loss: nan - accuracy: 0.5667   
 3392/12000 [=======>......................] - ETA: 57s - loss: nan - accuracy: 0.5560
 3456/12000 [=======>......................] - ETA: 56s - loss: nan - accuracy: 0.5457
(...)
11840/12000 [============================>.] - ETA: 1s - loss: nan - accuracy: 0.1593
11904/12000 [============================>.] - ETA: 0s - loss: nan - accuracy: 0.1584
11968/12000 [============================>.] - ETA: 0s - loss: nan - accuracy: 0.1576
12000/12000 [==============================] - 83s 7ms/step - loss: nan - accuracy: 0.1572 - val_loss: nan - val_accuracy: 0.0000e+00
Epoch 2/20

   64/12000 [..............................] - ETA: 1:16 - loss: nan - accuracy: 0.0000e+00
  128/12000 [..............................] - ETA: 1:15 - loss: nan - accuracy: 0.0000e+00
  192/12000 [..............................] - ETA: 1:16 - loss: nan - accuracy: 0.0000e+00
(...)
11840/12000 [============================>.] - ETA: 1s - loss: nan - accuracy: 0.0000e+00
11904/12000 [============================>.] - ETA: 0s - loss: nan - accuracy: 0.0000e+00
11968/12000 [============================>.] - ETA: 0s - loss: nan - accuracy: 0.0000e+00
12000/12000 [==============================] - 82s 7ms/step - loss: nan - accuracy: 0.0000e+00 - val_loss: nan - val_accuracy: 0.0000e+00
Epoch 3/20

   64/12000 [..............................] - ETA: 1:18 - loss: nan - accuracy: 0.0000e+00
  128/12000 [..............................] - ETA: 1:18 - loss: nan - accuracy: 0.0000e+00
  192/12000 [..............................] - ETA: 1:16 - loss: nan - accuracy: 0.0000e+00
(...)

Localizing the problem

To find out the solution I wrote the code presented below, which goes through several models (GRU/LSTM, {no dropout, only "normal" dropout, only recurrent dropout, "normal" and recurrent dropout, rmsprop/adam}) and presents loss and accuracy of all those models. (It also creates smaller, separate graphs for each model.)

# Based on examples from "Deep Learning with Python" by François Chollet:
## Constants, modules:
VERSION = 2

import os
from keras import models
from keras import layers
import matplotlib.pyplot as plt
import pylab

## Loading data:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = \
    imdb.load_data(num_words=10000)

from keras.preprocessing import sequence
x_train = sequence.pad_sequences(x_train, maxlen=500)
x_test = sequence.pad_sequences(x_test, maxlen=500)


## Dictionary with models' hyperparameters:
MODELS = [
    # GRU:
    {"no": 1,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 2,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 3,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 4,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    {"no": 5,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 6,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 7,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 8,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    # LSTM:
    {"no": 9,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 10,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 11,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 12,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    {"no": 13,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 14,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 15,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 16,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},
]

## Adding name:
for model_dict in MODELS:
    model_dict["name"] = f"{model_dict['layer_type']}"
    model_dict["name"] += f"_d{model_dict['dropout']}" if model_dict['dropout'] is not None else f"_dN"
    model_dict["name"] += f"_rd{model_dict['recurrent_dropout']}" if model_dict['recurrent_dropout'] is not None else f"_rdN"
    model_dict["name"] += f"_{model_dict['optimizer']}"

## Fucntion - defing and training model:
def train_model(model_dict):
    """Defines and trains a model, outputs history."""

    ## Defining:
    model = models.Sequential()
    model.add(layers.Embedding(10000, 32))

    recurrent_layer_kwargs = dict()
    if model_dict["dropout"] is not None:
        recurrent_layer_kwargs["dropout"] = model_dict["dropout"]
    if model_dict["recurrent_dropout"] is not None:
        recurrent_layer_kwargs["recurrent_dropout"] = model_dict["recurrent_dropout"]

    if model_dict["layer_type"] == 'GRU':
        model.add(layers.GRU(32, **recurrent_layer_kwargs))
    elif model_dict["layer_type"] == 'LSTM':
        model.add(layers.LSTM(32, **recurrent_layer_kwargs))
    else:
        raise ValueError("Wrong model_dict['layer_type'] value...")
    model.add(layers.Dense(1, activation='sigmoid'))

    ## Compiling:
    model.compile(
        optimizer=model_dict["optimizer"],
        loss='binary_crossentropy',
        metrics=['accuracy'])

    ## Training:
    history = model.fit(x_train, y_train,
                        epochs=20,
                        batch_size=64,
                        validation_split=0.2)

    return history

## Multi-model graphs' parameters:
graph_all_nrow = 4
graph_all_ncol = 4
graph_all_figsize = (20, 20)

assert graph_all_nrow * graph_all_nrow >= len(MODELS)

## Figs and axes of multi-model graphs:
graph_all_loss_fig, graph_all_loss_axs = plt.subplots(graph_all_nrow, graph_all_ncol, figsize=graph_all_figsize)
graph_all_acc_fig, graph_all_acc_axs = plt.subplots(graph_all_nrow, graph_all_ncol, figsize=graph_all_figsize)

## Loop trough all models:
for i, model_dict in enumerate(MODELS):
    history = train_model(model_dict)

    ## Metrics extraction:
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    epochs = range(1, len(loss) + 1)

    ## Single-model grph - loss:
    graph_loss_fname = fr"{os.path.basename(__file__).replace('.py', '')}"
    graph_loss_fname += fr"_v{VERSION}_{model_dict['no']}_{model_dict['name']}_loss_graph.png"

    graph_loss_fig, graph_loss_ax = plt.subplots()
    graph_loss_ax.plot(epochs, loss, 'bo', label='Training loss')
    graph_loss_ax.plot(epochs, val_loss, 'b', label='Validation loss')
    graph_loss_ax.legend()
    graph_loss_fig.suptitle("Training and validation loss")
    graph_loss_fig.savefig(graph_loss_fname)
    pylab.close(graph_loss_fig)


    ## Single-model grph - accuracy:
    graph_acc_fname = fr"{os.path.basename(__file__).replace('.py', '')}"
    graph_acc_fname += fr"_v{VERSION}_{model_dict['no']}_{model_dict['name']}_acc_graph.png"

    graph_acc_fig, graph_acc_ax = plt.subplots()
    graph_acc_ax.plot(epochs, acc, 'bo', label='Training accuracy')
    graph_acc_ax.plot(epochs, val_acc, 'b', label='Validation accuracy')
    graph_acc_ax.legend()
    graph_acc_fig.suptitle("Training and validation acc")
    graph_acc_fig.savefig(graph_acc_fname)
    pylab.close(graph_acc_fig)

    ## Position of axes on multi-model graph:
    i_row = i // graph_all_ncol
    i_col = i % graph_all_ncol

    ## Adding model metrics to multi-model graph - loss:
    graph_all_loss_axs[i_row, i_col].plot(epochs, loss, 'bo', label='Training loss')
    graph_all_loss_axs[i_row, i_col].plot(epochs, val_loss, 'b', label='Validation loss')
    graph_all_loss_axs[i_row, i_col].set_title(fr"{model_dict['no']}. {model_dict['name']}")

    ## Adding model metrics to multi-model graph - accuracy:
    graph_all_acc_axs[i_row, i_col].plot(epochs, acc, 'bo', label='Training acc')
    graph_all_acc_axs[i_row, i_col].plot(epochs, val_acc, 'b', label='Validation acc')
    graph_all_acc_axs[i_row, i_col].set_title(fr"{model_dict['no']}. {model_dict['name']}")


## Saving multi-model graphs:
# Output files are quite big (8000x8000 PNG), you may want to decrease DPI.
graph_all_loss_fig.savefig(fr"{os.path.basename(__file__).replace('.py', '')}_ALL_loss_graph.png", dpi=400)
graph_all_acc_fig.savefig(fr"{os.path.basename(__file__).replace('.py', '')}_ALL_acc_graph.png", dpi=400)

Please find two main graphs below: Loss - binary crossentropy, Accuracy (I am not allowed te embed images in post due to low reputation).

I have also obtained similarly strange problems in regression model - the MAE was in range of several thousands - in the problem where $y$ range was maybe of several tens. (I decided not to include this model here, because it would make this question even longer.)

Versions of modules and libraries, hardware

Modules:

Keras                    2.3.1
Keras-Applications       1.0.8
Keras-Preprocessing      1.1.0
matplotlib               3.1.3
tensorflow-estimator     1.14.0
tensorflow-gpu           2.1.0
tensorflow-gpu-estimator 2.1.0

keras.json file:

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "tensorflow",
    "image_data_format": "channels_last"
}

CUDA - I have CUDA 10.0 and CUDA 10.1 installed on my system.
CUDnn - I have three versions: cudnn-10.0 v7.4.2.24, cudnn-10.0 v7.6.4.38, cudnn-9.0 v7.4.2.24
GPU: Nvidia GTX 1050Ti 4gb
Windows 10 Home

Questions

Do you know what may be the reason of this behavior?
Is it possible that this is caused by multiple CUDA and CUDnn installations? Before observing the problem I have trained several models (both from book and my own ones) and the seemed to behave mor or less as expected, while having 2 CUDA and 2 CUDnn versions (those above without cudnn-10.0 v7.6.4.38) installed.
Is there any official/good source of adequate combinations of keras, tensorflow, CUDA, CUDnn (and other relevant things e.g. maybe Visual Studio)? I cannot really find any authoritative and up-to-date source.

I hope I've described everything clearly enough. If you have any questions, please ask.

user351437 user351437 · Accepted Answer · 2020-05-31T09:52:50

I finally found the solution (sort of). It's enough to change keras to tensorflow.keras.

Revised Code

# Based on examples from "Deep Learning with Python" by François Chollet:
## Constants, modules:
VERSION = 2

import os
#U: from keras import models
#U: from keras import layers
from tensorflow.keras import models
from tensorflow.keras import layers

import matplotlib.pyplot as plt
import pylab

## Loading data:
from keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = \
    imdb.load_data(num_words=10000)

from keras.preprocessing import sequence

x_train = sequence.pad_sequences(x_train, maxlen=500)
x_test = sequence.pad_sequences(x_test, maxlen=500)

## Dictionary with models' hyperparameters:
MODELS_ALL = [
    # GRU:
    {"no": 1,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 2,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 3,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 4,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    {"no": 5,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 6,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 7,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 8,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    # LSTM:
    {"no": 9,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 10,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 11,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 12,
     "layer_type": "LSTM",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    {"no": 13,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": None},

    {"no": 14,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": None},

    {"no": 15,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 16,
     "layer_type": "LSTM",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},
]

MODELS_GRU_RECCURENT = [
    # GRU:
    {"no": 3,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 4,
     "layer_type": "GRU",
     "optimizer": "rmsprop",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},

    {"no": 7,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": None,
     "recurrent_dropout": 0.3},

    {"no": 8,
     "layer_type": "GRU",
     "optimizer": "adam",
     "dropout": 0.3,
     "recurrent_dropout": 0.3},
]

MODELS = MODELS_ALL   # "MODELS = MODELS_ALL" or "MODELS = MODELS_GRU_RECCURENT"

## Adding name:
for model_dict in MODELS:
    model_dict["name"] = f"{model_dict['layer_type']}"
    model_dict["name"] += f"_d{model_dict['dropout']}" if model_dict['dropout'] is not None else f"_dN"
    model_dict["name"] += f"_rd{model_dict['recurrent_dropout']}" if model_dict['recurrent_dropout'] is not None else f"_rdN"
    model_dict["name"] += f"_{model_dict['optimizer']}"


## Fucntion - defing and training model:
def train_model(model_dict):
    """Defines and trains a model, outputs history."""

    ## Defining:
    model = models.Sequential()
    model.add(layers.Embedding(10000, 32))

    recurrent_layer_kwargs = dict()
    if model_dict["dropout"] is not None:
        recurrent_layer_kwargs["dropout"] = model_dict["dropout"]
    if model_dict["recurrent_dropout"] is not None:
        recurrent_layer_kwargs["recurrent_dropout"] = model_dict["recurrent_dropout"]

    if model_dict["layer_type"] == 'GRU':
        model.add(layers.GRU(32, **recurrent_layer_kwargs))
    elif model_dict["layer_type"] == 'LSTM':
        model.add(layers.LSTM(32, **recurrent_layer_kwargs))
    else:
        raise ValueError("Wrong model_dict['layer_type'] value...")
    model.add(layers.Dense(1, activation='sigmoid'))

    ## Compiling:
    model.compile(
        optimizer=model_dict["optimizer"],
        loss='binary_crossentropy',
        metrics=['accuracy'])

    ## Training:
    history = model.fit(x_train, y_train,
                        epochs=20,
                        batch_size=64,
                        validation_split=0.2)

    return history


## Multi-model graphs' parameters:
graph_all_nrow = 4
graph_all_ncol = 4
graph_all_figsize = (20, 20)

assert graph_all_nrow * graph_all_nrow >= len(MODELS)

# fig and axes of multi-model graphs:
graph_all_loss_fig, graph_all_loss_axs = plt.subplots(graph_all_nrow, graph_all_ncol, figsize=graph_all_figsize)
graph_all_acc_fig, graph_all_acc_axs = plt.subplots(graph_all_nrow, graph_all_ncol, figsize=graph_all_figsize)

## Loop trough all models:
for i, model_dict in enumerate(MODELS):
    history = train_model(model_dict)

    ## Metrics extraction:
    loss = history.history['loss']
    val_loss = history.history['val_loss']
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    epochs = range(1, len(loss) + 1)

    ## Single-model graph - loss:
    graph_loss_fname = fr"{os.path.basename(__file__).replace('.py', '')}"
    graph_loss_fname += fr"_v{VERSION}_{model_dict['no']}_{model_dict['name']}_loss_graph.png"

    graph_loss_fig, graph_loss_ax = plt.subplots()
    graph_loss_ax.plot(epochs, loss, 'bo', label='Training loss')
    graph_loss_ax.plot(epochs, val_loss, 'b', label='Validation loss')
    graph_loss_ax.legend()
    graph_loss_fig.suptitle("Training and validation loss")
    graph_loss_fig.savefig(graph_loss_fname)
    pylab.close(graph_loss_fig)

    ## Single-model graph - accuracy:
    graph_acc_fname = fr"{os.path.basename(__file__).replace('.py', '')}"
    graph_acc_fname += fr"_v{VERSION}_{model_dict['no']}_{model_dict['name']}_acc_graph.png"

    graph_acc_fig, graph_acc_ax = plt.subplots()
    graph_acc_ax.plot(epochs, acc, 'bo', label='Training accuracy')
    graph_acc_ax.plot(epochs, val_acc, 'b', label='Validation accuracy')
    graph_acc_ax.legend()
    graph_acc_fig.suptitle("Training and validation acc")
    graph_acc_fig.savefig(graph_acc_fname)
    pylab.close(graph_acc_fig)

    ## Position of axes on multi-model graph:
    i_row = i // graph_all_ncol
    i_col = i % graph_all_ncol

    ## Adding model metrics to multi-model graph - loss:
    graph_all_loss_axs[i_row, i_col].plot(epochs, loss, 'bo', label='Training loss')
    graph_all_loss_axs[i_row, i_col].plot(epochs, val_loss, 'b', label='Validation loss')
    graph_all_loss_axs[i_row, i_col].set_title(fr"{model_dict['no']}. {model_dict['name']}")

    ## Adding model metrics to multi-model graph - accuracy:
    graph_all_acc_axs[i_row, i_col].plot(epochs, acc, 'bo', label='Training acc')
    graph_all_acc_axs[i_row, i_col].plot(epochs, val_acc, 'b', label='Validation acc')
    graph_all_acc_axs[i_row, i_col].set_title(fr"{model_dict['no']}. {model_dict['name']}")

graph_all_loss_fig.suptitle(f"Loss - binary crossentropy [v{VERSION}]")
graph_all_acc_fig.suptitle(f"Accuracy [v{VERSION}]")

## Saving multi-model graphs:
graph_all_loss_fig.savefig(fr"{os.path.basename(__file__).replace('.py', '')}_ALL_v{VERSION}_loss_graph.png", dpi=400)
graph_all_acc_fig.savefig(fr"{os.path.basename(__file__).replace('.py', '')}_ALL_v{VERSION}_acc_graph.png", dpi=400)

## Saving multi-model graphs (SMALL):
graph_all_loss_fig.savefig(fr"{os.path.basename(__file__).replace('.py', '')}_ALL_v{VERSION}_loss_graph_SMALL.png", dpi=150)
graph_all_acc_fig.savefig(fr"{os.path.basename(__file__).replace('.py', '')}_ALL_v{VERSION}_acc_graph_SMALL.png", dpi=150)

Results

Graphs analogous to those in question: Loss - binary crossentropy, Accuracy

More on `keras` vs `tensorflow.keras`

As written in François Chollet's tweets (found here: https://stackoverflow.com/a/54117754) instead of stand-alone keras there will be tensorflow.keras (that is Keras as official API to TensorFlow) from now on. (I'm not completely sure if I'm 100% correct, feel free to correct me.)

I think it's better just to use tensorflow.keras instead of keras in future projects.