I am following this LABEL DETECTION TUTORIAL.
The code below does the following (after getting the response back):
Our response will contain result within an AnnotateVideoResponse, which consists of a list of annotationResults, one for each video sent in the request. Because we sent only one video in the request, we take the first segmentLabelAnnotations of the results. We then loop through all the labels in segmentLabelAnnotations. For the purpose of this tutorial, we only display video-level annotations. To identify video-level annotations, we pull segment_label_annotations data from the results. Each segment label annotation includes a description (segment_label.description), a list of entity categories (category_entity.description) and where they occur in segments by start and end time offsets from the beginning of the video.
segment_labels = result.annotation_results[0].segment_label_annotations
for i, segment_label in enumerate(segment_labels):
    print('Video label description: {}'.format(
        segment_label.entity.description))
    for category_entity in segment_label.category_entities:
        print('\tLabel category description: {}'.format(
            category_entity.description))
    for i, segment in enumerate(segment_label.segments):
        start_time = (segment.segment.start_time_offset.seconds +
                      segment.segment.start_time_offset.nanos / 1e9)
        end_time = (segment.segment.end_time_offset.seconds +
                    segment.segment.end_time_offset.nanos / 1e9)
        positions = '{}s to {}s'.format(start_time, end_time)
        confidence = segment.confidence
        print('\tSegment {}: {}'.format(i, positions))
        print('\tConfidence: {}'.format(confidence))
    print('\n')
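To keep the structure straight in my head, this is the rough shape of the response that the snippet above walks, with the seconds+nanos conversion pulled out as a small helper (my own sketch based on the tutorial code, not the full schema):

# Rough shape of the response navigated above (my sketch, not the full schema):
#
#   AnnotateVideoResponse
#     annotation_results[]           -> one entry per video in the request
#       segment_label_annotations[]  -> video-level labels
#         entity.description         -> e.g. 'urban area'
#         category_entities[]        -> e.g. 'city'
#         segments[]
#           segment.start_time_offset.{seconds, nanos}
#           segment.end_time_offset.{seconds, nanos}
#           confidence

def offset_to_seconds(offset):
    """Convert a {seconds, nanos} time offset into float seconds."""
    return offset.seconds + offset.nanos / 1e9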
So, it says "Each segment label annotation includes a description (segment_label.description), a list of entity categories (category_entity.description) and where they occur in segments by start and end time offsets from the beginning of the video."
But in the output, all the labels (urban area, traffic, vehicle, ...) have the same start and end time offsets, which are basically the start and the end of the whole video.
$ python label_det.py gs://cloud-ml-sandbox/video/chicago.mp4
Operation us-west1.4757250774497581229 started: 2017-01-30T01:46:30.158989Z
Operation processing ...
The video has been successfully processed.
Video label description: urban area
	Label category description: city
	Segment 0: 0.0s to 38.752016s
	Confidence: 0.946980476379
Video label description: traffic
	Segment 0: 0.0s to 38.752016s
	Confidence: 0.94105899334
Video label description: vehicle
	Segment 0: 0.0s to 38.752016s
	Confidence: 0.919958174229
...
Why is this happening? Why does the API return these offsets for all the labels instead of the start and end time offsets of the segment where that particular label (entity) actually appears? (I suspect it has something to do with the video-level annotation, but I am not sure.)

How can I get the start and end time offsets of the segments where the labels actually appear?
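In case it helps to show what I have been looking at: the AnnotateVideoResponse reference also lists a shot_label_annotations field next to segment_label_annotations, so I assume shot-level labels (one segment per detected shot) would be read roughly like this; the field names come from the API reference, and I have not confirmed this is the intended approach:

# Sketch, assuming annotation_results also carries shot-level labels in
# shot_label_annotations (per the AnnotateVideoResponse reference); each
# segment here would be one detected shot rather than the whole video.
shot_labels = result.annotation_results[0].shot_label_annotations
for shot_label in shot_labels:
    print('Shot label description: {}'.format(shot_label.entity.description))
    for i, segment in enumerate(shot_label.segments):
        start_time = (segment.segment.start_time_offset.seconds +
                      segment.segment.start_time_offset.nanos / 1e9)
        end_time = (segment.segment.end_time_offset.seconds +
                    segment.segment.end_time_offset.nanos / 1e9)
        print('\tShot segment {}: {}s to {}s (confidence: {})'.format(
            i, start_time, end_time, segment.confidence))

I also noticed a LabelDetectionConfig.label_detection_mode setting (SHOT_MODE, FRAME_MODE, SHOT_AND_FRAME_MODE) in the reference, but I am not sure whether it has to be set explicitly for these shot-level labels to show up.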