I have used Google's Vision OCR a lot, and it is very accurate. I was wondering whether I can run OCR on a video file or video stream. Say I have some surveillance video and I want to extract all the text that appears throughout that video. In Google's Video Intelligence API, I can only get labels, which I am guessing uses the label detection feature of Google Vision. I think there would be challenges in running OCR on every frame of a video, but I still wanted to start a discussion on how it can be done. There might not be a perfect solution, but even a 50% solution is better than nothing.
3 Answers
Here is what I did:
Go to this website and download this free sample video: https://www.videvo.net/video/people-walking-past-the-911-memorial-sign-in-new-york/5283/
Download and install VLC video player
Follow the steps in this tutorial to extract the images from the video:
a. Go to Tools -> Preferences. In the lower left corner, click the 'All' radio button.
b. Click the 'Video' category on the left to expand it. Click again on 'Filters' to expand it as well.
c. Select the 'Scene filter' and choose its settings (see the image below).
d. Click the 'Filters' category and select the 'Scene video filter' checkbox (see the image below).
e. After clicking 'Save' in the lower right corner, open the video you downloaded and play it. The images will be saved automatically.
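If you need to repeat this often, the GUI steps above can also be scripted. Here is a sketch, matching this answer's subprocess style; the flag names are assumed from VLC's scene video filter module, and build_vlc_scene_command is a hypothetical helper name:

```python
import subprocess


def build_vlc_scene_command(video_path, frames_path, ratio=25):
    # Command-line equivalent of the GUI steps above.
    # --scene-ratio=25 keeps one frame out of every 25;
    # vlc://quit makes VLC exit once playback ends.
    return ("vlc -I dummy {video} --video-filter=scene --scene-format=png "
            "--scene-ratio={ratio} --scene-path={path} vlc://quit").format(
                video=video_path, ratio=ratio, path=frames_path)


def extract_frames_with_vlc(video_path, frames_path, ratio=25):
    subprocess.call(build_vlc_scene_command(video_path, frames_path, ratio),
                    shell=True)
```

This is a sketch, not a definitive invocation; check `vlc --help` for the exact scene-filter options in your VLC version.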
Go to this Cloud Vision API page, where you can drag and drop any of the generated images to see a sample of the API's capabilities.
Here is an FFmpeg + Python approach to using Google Cloud Vision API for a video:
Extract frames from the video to a frames_path directory with FFmpeg:

import os
import subprocess

def extract_frames_from_video(video_path, frames_path):
    # -r 1 before and after -i samples one frame per second of video.
    subprocess.call(
        "ffmpeg -r 1 -i {video_path} -r 1 {out_path}".format(
            video_path=video_path,
            out_path=os.path.join(frames_path, "frame_%06d.png")),
        shell=True)
Call the Vision API for the extracted frames.
If you want to highlight the detections in the images and then reconstruct the video from the processed frames, the following approach can be used:
Create soundless video from the frames:
def convert_frames_to_video(frames_path, output_video_path, fps):
    subprocess.call(
        "ffmpeg -r {frame_rate} -f image2 "
        "-i {frames_path} -vcodec libx264 -crf {quality} -pix_fmt yuv420p "
        "{out_path}".format(
            frame_rate=fps,
            frames_path=os.path.join(frames_path, "frame_%06d.png"),
            quality=15,  # Lower is better
            out_path=output_video_path),
        shell=True)
Add sound from the input video to the final output video:
def add_sound_from_video_to_video(sound_video_path, soundless_video_path,
                                  output_video_path):
    subprocess.call(
        "ffmpeg "
        "-i {video_path_without_audio} "
        "-i {video_path_with_audio} "
        "-c copy -map 0:0 -map 1:1 -shortest {output_video_path}".format(
            video_path_without_audio=soundless_video_path,
            video_path_with_audio=sound_video_path,
            output_video_path=output_video_path),
        shell=True)
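The highlighting step itself is not shown above. Here is a minimal sketch, assuming Pillow is installed and that detections arrive as Vision-API-style bounding polygons (lists of {'x': ..., 'y': ...} vertex dicts); polygon_points and draw_detections are hypothetical helper names:

```python
def polygon_points(vertices):
    # Convert Vision-API-style vertex dicts into the (x, y) tuples
    # that Pillow's ImageDraw.polygon expects.
    return [(v["x"], v["y"]) for v in vertices]


def draw_detections(image_path, polygons, output_path):
    # Requires Pillow (pip install Pillow); draws a red outline around
    # each detected region and writes the annotated frame back out.
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for vertices in polygons:
        draw.polygon(polygon_points(vertices), outline=(255, 0, 0))
    image.save(output_path)
```

Running this over each frame before convert_frames_to_video gives you a video with the detections burned in.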
Here is the whole pipeline, which I programmed for face detection.
Right now, the Google Cloud Video Intelligence API provides OCR for videos. It aggregates detections across multiple frames, which gives more consistent results compared to single-frame OCR. You can check out the feature at https://cloud.google.com/video-intelligence/docs/text-detection.
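A minimal sketch of that feature, assuming the google-cloud-videointelligence package, valid credentials, and a video already uploaded to Cloud Storage (the gs:// path is a placeholder); detect_video_text and unique_texts are hypothetical helper names:

```python
def detect_video_text(input_uri):
    # Requires the google-cloud-videointelligence package; input_uri is a
    # Cloud Storage path, e.g. "gs://my-bucket/video.mp4" (placeholder).
    from google.cloud import videointelligence

    client = videointelligence.VideoIntelligenceServiceClient()
    operation = client.annotate_video(
        request={
            "features": [videointelligence.Feature.TEXT_DETECTION],
            "input_uri": input_uri,
        })
    result = operation.result(timeout=300)
    # One annotation per detected piece of text, each with time segments
    # telling you when in the video the text appears.
    return [annotation.text
            for annotation in result.annotation_results[0].text_annotations]


def unique_texts(texts):
    # Collapse duplicate strings while preserving first-seen order, in case
    # the same text is reported more than once.
    seen, out = set(), []
    for text in texts:
        if text not in seen:
            seen.add(text)
            out.append(text)
    return out
```

Unlike the frame-by-frame approach above, a single annotate_video call covers the whole video, and the per-annotation segments give you timestamps for free.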