
I'm using tf.contrib.learn.Estimator to train a CNN with 20+ layers on a GTX 1080 (8 GB). My dataset is not large, but the GPU runs out of memory with a batch size greater than 32, so I train with a batch size of 16 and also evaluate the classifier in batches (the GPU runs out of memory during evaluation as well if batch_size is not specified).

  # Configure the accuracy metric for evaluation
  metrics = {
      "accuracy":
          learn.MetricSpec(
              metric_fn=tf.metrics.accuracy, prediction_key="classes"),
  }

  # Evaluate the model and print results
  eval_results = classifier.evaluate(
      x=X_test, y=y_test, metrics=metrics, batch_size=16)

Now the problem is that every 100 steps I only get the training loss printed on screen. I want the validation loss and accuracy printed as well, so I'm using a ValidationMonitor:

  validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
      X_test,
      y_test,
      every_n_steps=50)

  # Train the model
  classifier.fit(
      x=X_train,
      y=y_train,
      batch_size=8,
      steps=20000,
      monitors=[validation_monitor])

Actual problem: my code crashes (out of memory) when I use the ValidationMonitor. I think the problem would be solved if I could specify a batch size here as well, but I can't figure out how to do that. I want the ValidationMonitor to evaluate my validation data in batches, the way I do manually after training with classifier.evaluate. Is there a way to do that?


2 Answers

2
votes

The ValidationMonitor constructor accepts a batch_size argument that should do the trick.
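For example, passing the same batch size you already use in the manual classifier.evaluate call (classifier, X_test, y_test, X_train, and y_train are assumed from the question; this is a sketch against the deprecated tf.contrib.learn API):

```python
# Sketch: the same monitor as in the question, but with an explicit
# batch_size so the validation set is evaluated in chunks of 16
# instead of being fed to the GPU all at once.
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    X_test,
    y_test,
    batch_size=16,      # same size that worked for classifier.evaluate
    every_n_steps=50)

classifier.fit(
    x=X_train,
    y=y_train,
    batch_size=8,
    steps=20000,
    monitors=[validation_monitor])
```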

1
vote

You also need to add config=tf.contrib.learn.RunConfig(save_checkpoints_secs=save_checkpoints_secs) to your Estimator definition. The ValidationMonitor only runs evaluation when it finds a new checkpoint, so checkpoints have to be written at least as often as you want validation to run. You can use save_checkpoints_steps instead of save_checkpoints_secs, but not both.
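A minimal sketch of wiring this in (cnn_model_fn and the model_dir path are placeholders for the question's own model; the interval of 50 steps is chosen to match the monitor's every_n_steps=50):

```python
# Sketch: save a checkpoint every 50 steps so the ValidationMonitor
# (every_n_steps=50) finds a fresh checkpoint each time it fires.
# Setting save_checkpoints_secs=None explicitly avoids having both
# step-based and time-based saving configured at once.
run_config = tf.contrib.learn.RunConfig(
    save_checkpoints_steps=50,
    save_checkpoints_secs=None)

classifier = tf.contrib.learn.Estimator(
    model_fn=cnn_model_fn,        # placeholder: the question's 20+ layer CNN
    model_dir="/tmp/cnn_model",   # placeholder checkpoint directory
    config=run_config)
```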