Problem detecting large number of objects in single image with Tensorflow Object Detection API

Question

I need to detect large numbers of two classes of objects in a single image. I've had some success using the Tensorflow Object Detection API by retraining the faster_rcnn_inception_resnet_v2_atrous_coco network from the Object Detection Model Zoo using the following config file:

model {
  faster_rcnn {
    num_classes: 2
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_resnet_v2'
      first_stage_features_stride: 8
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 8
        width_stride: 8
      }
    }
    first_stage_atrous_rate: 2
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 2000
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 17
    maxpool_kernel_size: 1
    maxpool_stride: 1
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 1000
        max_total_detections: 1000
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/path/model.ckpt"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/path/train.record"
  }
  label_map_path: "/path/label_map.pbtxt"
}

eval_config: {
  num_examples: 8000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/path/val.record"
  }
  label_map_path: "/path/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

However, using an Nvidia M10 with 8 GB memory, I'm only able to get detections on (roughly) the top half of the image:

This pattern is consistent across many images, with some images having a few bounding boxes lower down on the image, but no images having bounding boxes accurately distributed throughout the image. My first thought was that it was a memory problem, so I tried running the detection on a GPU with more memory (Nvidia V100 with 32 GB memory). I changed the config file to raise the first_stage_max_proposals from 2000 to 4000 and the max_detections_per_class/max_total_detections from 1000 to 2000 (on the 8 GB GPU these settings led to an Aborted (core dumped) error). The results were only marginally better:

I tried raising the first_stage_max_proposals to 8000 and the max_detections_per_class/max_total_detections to 4000, but this led to an Aborted (core dumped) error on the 32 GB GPU.

My questions are:

1) Are these the best config settings for detecting large numbers of objects in a single image?

2) Is there a better network than faster_rcnn_inception_resnet_v2_atrous_coco for this specific task?

3) Is there an entirely different approach that's better suited to this problem?

I've considered splitting the image up into smaller images and running it on those, but if possible I'd like to keep it as one image, as accurate counts of the objects are important to my application and splitting the objects along some dividing line might lead to inaccurate counts.

Thanks!

The max number of detections per image would probably be around 800 total objects, with the average number of detections per image at around 350-400 objects. — cd_warman
The results are very similar with max_proposals and max_total_detections set to 800, with no objects detected in the bottom half of the example image and fewer objects detected on the bottom half in general. — cd_warman
I never found a good solution. I ended up splitting the images (and the training annotations) into smaller images (each image into 3 sub-images), then training and testing on the sub-images. For the final output, I recombined the sub-images, with some minor fixes for the bounding boxes along the edges. This worked well. I'm writing a paper that should be published soon, and I'll link it here once it's finished. — cd_warman
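For readers who want to try the splitting approach from the comment above, here is a minimal sketch of one way to do it. It is not the commenter's code: the "minor fixes" for boxes along the edges are not described, so this sketch swaps in overlapping horizontal bands plus a greedy per-class non-maximum suppression to drop duplicates. The function names, the box layout (ymin, xmin, ymax, xmax, score, label) in pixels, the split direction, the 64 px overlap and the 0.5 IoU threshold are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Assumed box layout: (ymin, xmin, ymax, xmax, score, label), in pixels.
Box = Tuple[float, float, float, float, float, int]


def tile_bounds(height: int, n_tiles: int = 3, overlap: int = 64) -> List[Tuple[int, int]]:
    """Split [0, height) into n_tiles horizontal bands; neighbours overlap by `overlap` px."""
    step = height / n_tiles
    bounds = []
    for i in range(n_tiles):
        y0 = max(0, int(round(i * step)) - overlap // 2)
        y1 = min(height, int(round((i + 1) * step)) + overlap // 2)
        bounds.append((y0, y1))
    return bounds


def to_full_image(boxes: List[Box], y_offset: int) -> List[Box]:
    """Shift tile-relative boxes back into full-image coordinates."""
    return [(ymin + y_offset, xmin, ymax + y_offset, xmax, score, label)
            for ymin, xmin, ymax, xmax, score, label in boxes]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ih = min(a[2], b[2]) - max(a[0], b[0])
    iw = min(a[3], b[3]) - max(a[1], b[1])
    if ih <= 0 or iw <= 0:
        return 0.0
    inter = ih * iw
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def merge_tiles(per_tile: List[List[Box]],
                bounds: List[Tuple[int, int]],
                iou_threshold: float = 0.5) -> List[Box]:
    """Recombine per-tile detections and drop same-class duplicates from the overlaps."""
    candidates: List[Box] = []
    for boxes, (y0, _) in zip(per_tile, bounds):
        candidates.extend(to_full_image(boxes, y0))
    # Greedy NMS: visit boxes from highest to lowest score and keep a box
    # only if it does not overlap an already kept box of the same class.
    candidates.sort(key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for box in candidates:
        if all(box[5] != k[5] or iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


if __name__ == "__main__":
    bounds = tile_bounds(900)
    # Hypothetical detector output per tile, in tile-relative pixel coordinates.
    per_tile = [[(280, 10, 320, 50, 0.9, 1)], [(12, 10, 52, 50, 0.8, 1)], []]
    print(len(merge_tiles(per_tile, bounds)))
```

Each band would be cropped from the image with its (y0, y1) bounds and passed to the detector separately. The overlap only helps for objects smaller than the overlap, which then appear whole in at least one band. A partial box for an object cut by a band edge can still survive the IoU test, so counts near the seams are worth checking by hand.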

Kashif Iqbal · Accepted Answer · 2021-07-26T17:11:40

I faced the same problem and solved it by adjusting the config file of the Faster R-CNN Inception ResNet V2 1024x1024 model from the Model Zoo:

first_stage_max_proposals: 1500
max_detections_per_class: 1500
max_total_detections: 1500

Add max_number_of_boxes: 1500 to the train_config, train_input_reader, and eval_input_reader blocks. I also added max_num_boxes_to_visualize: 1500 to the eval_config block.

This works fine for me: I now get detections for approximately 1500 objects in a single image.
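To show where these settings sit, here is a sketch of the relevant parts of a pipeline config, with all other fields left out. The nesting follows the layout of the config posted in the question. Placing max_number_of_boxes and max_num_boxes_to_visualize at the top level of their blocks is my reading of the answer, so check it against your own config file.

```
model {
  faster_rcnn {
    # ... other fields unchanged ...
    first_stage_max_proposals: 1500
    second_stage_post_processing {
      batch_non_max_suppression {
        # score_threshold and iou_threshold unchanged
        max_detections_per_class: 1500
        max_total_detections: 1500
      }
    }
  }
}

train_config {
  # ... other fields unchanged ...
  max_number_of_boxes: 1500
}

train_input_reader {
  # ... other fields unchanged ...
  max_number_of_boxes: 1500
}

eval_config {
  # ... other fields unchanged ...
  max_num_boxes_to_visualize: 1500
}

eval_input_reader {
  # ... other fields unchanged ...
  max_number_of_boxes: 1500
}
```

The value 1500 is the one used in this answer. For the dataset in the question (at most about 800 objects per image), a smaller limit may be enough and would use less memory.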
