3
votes

I have used Google's Vision OCR a lot and it is really very accurate. I was wondering if I can do the OCR on a video file or video stream. Say, I have some surveillance video and I want to get all the text throughout that video. In Google's Video intelligence API, I can only get the labels, which I am guessing is using label detection API of Google Vision. I think there might be challenges to OCR on every frame of video, but still wanted to try to start a discussion on how it can be done. There might not be a perfect solution, but even if we get 50% of it, it's better than nothing.

3
The best result will be achieved if we reconstruct text-containing surfaces first by connecting parts of them that were captured in different frames. Then combining several shots of the same surface segment we can get rid of mpeg artifacts, etc. - Nakilon
@Nakilon Can you please elaborate a bit more? I only got it partially. How can I connect parts of it captured in different frames? I can only get text containing parts. But tracking!? - Anmol Agrawal
You would need to find a solution for that. Google does not provide such service. - Nakilon
Alright. Will give it a try. - Anmol Agrawal

3 Answers

1
votes

Here is what I did:

  1. Go to this website an download this sample free Video: https://www.videvo.net/video/people-walking-past-the-911-memorial-sign-in-new-york/5283/

  2. Download and install VLC video player

  3. Follow the steps in this tutorial to extract the images from the video:

    a. Go to tools -> preferences. In the lower left cofner click the radio button 'All'.

    b. Click on the video category on the left in order to expand it. Click again in 'filters' in order to expand it.

    c. Select the 'scene filter' and choose the settings (see the image below).

    enter image description here

    d. Click the filters category and select the 'Scene video filter' checkbox(See image below)

    enter image description here

    e. After clicking 'Save'in the lower right corner, open the video you downloaded and play it. The images will be saved automatically.

    More details here.

  4. Go to this CLOUD VISION API page, and you can drag and drop any of the generated images to see a sample of the API capabilities.

1
votes

Here is an FFmpeg + Python approach to using Google Cloud Vision API for a video:

  1. Extract frames from the video to frames_path directory with FFmpeg:

    import os
    import subprocess
    
    def extract_frames_from_video(video_path, frames_path):
        subprocess.call("ffmpeg -r 1 -i {video_path} -r 1 {out_path}".format(
            video_path=video_path,
            out_path=os.path.join(frames_path, "frame_%06d.png")), shell=True)
    
  2. Call the Vision API for the extracted frames.


If you want to highlight the detections in the images and then reconstruct the video from the processed frames, the following approach can be used:

  1. Create soundless video from the frames:

    def convert_frames_to_video(frames_path, output_video_path, fps):
        subprocess.call(
            "ffmpeg -r {frame_rate} -f image2 " 
            "-i {frames_path} -vcodec libx264 -crf {quality} -pix_fmt yuv420p "
            "{out_path}".format(
                frame_rate=fps,
                frames_path=os.path.join(frames_path, "frame_%06d.png"),
                quality=15,  # Lower is better
                out_path=output_video_path), shell=True)
    
  2. Add sound from the input video to the final output video:

    def add_sound_from_video_to_video(sound_video_path, soundless_video_path,
                                      output_video_path):
        subprocess.call(
            "ffmpeg "
            "-i {video_path_without_audio} "
            "-i {video_path_with_audio} "
            "-c copy -map 0:0 -map 1:1 -shortest {output_video_path}".format(
                video_path_without_audio=soundless_video_path,
                video_path_with_audio=sound_video_path,
                output_video_path=output_video_path), shell=True)
    

Here is this whole pipeline that I programmed for face detection.

0
votes

Right now Google Cloud Video Intelligence API provides a OCR for videos. It aggregates the detections from multiple frames, which provides more consistent results compare to single frame OCR detection. You can check the feature in https://cloud.google.com/video-intelligence/docs/text-detection.