How to use Google's Cloud Speech-to-Text REST API to transcribe a video

Question

I'd like to get a transcript of two people speaking in a video, but I get an empty response from the Cloud Speech-to-Text API.

Approach:

I have a 56-minute video file containing a conversation between two people. I would like a transcript of that conversation, and I would like to use Google's Cloud Speech-to-Text API to get it.

To save a little on Google Cloud Storage, I first converted the video to audio using ffmpeg.

First I tried to figure out the audio codec with the command below; it looks like AAC.
ffmpeg -i video.mp4

Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'videoplayback.mp4':
  Metadata:
    major_brand     : mp42
    minor_version   : 0
    compatible_brands: isommp42
    creation_time   : 2015-12-30T08:17:14.000000Z
  Duration: 00:56:03.99, start: 0.000000, bitrate: 362 kb/s
    Stream #0:0(und): Video: h264 (Constrained Baseline) (avc1 / 0x31637661), yuv420p, 490x360 [SAR 1:1 DAR 49:36], 264 kb/s,     29.97 fps, 29.97 tbr, 30k tbn, 59.94 tbc (default)
    Metadata:
      handler_name    : VideoHandler
    Stream #0:1(eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 96 kb/s (default)
    Metadata:
      creation_time   : 2015-12-30T08:17:31.000000Z
      handler_name    : IsoMedia File Produced by Google, 5-11-2011

So I extracted the audio stream from the video with:
ffmpeg -i video.mp4 -vn -acodec copy myaudio.aac

Details so far:
ffmpeg -i myaudio.aac
Outputs:

Input #0, aac, from 'myaudio.aac':
  Duration: 00:56:47.49, bitrate: 97 kb/s
    Stream #0:0: Audio: aac (LC), 44100 Hz, stereo, fltp, 97 kb/s

After that I converted it to Opus, because I was told that Opus is better:
ffmpeg -i myaudio.aac -acodec libopus -b:a 97k -vbr on -compression_level 10 myaudio.opus

Info so far:
opusinfo myaudio.opus

User comments section follows...
    encoder=Lavc58.18.100 libopus
Opus stream 1:
    Pre-skip: 312
    Playback gain: 0 dB
    Channels: 2
    Original sample rate: 48000Hz
    Packet duration:   20.0ms (max),   20.0ms (avg),   20.0ms (min)
    Page duration:   1000.0ms (max), 1000.0ms (avg), 1000.0ms (min)
    Total data length: 29956714 bytes (overhead: 0.872%)
    Playback length: 56m:03.990s
    Average bitrate: 71.24 kb/s, w/o overhead: 70.62 kb/s

At this point I uploaded myaudio.opus to Google Cloud Storage.

curl POST 1
I started the speech recognition by doing a POST with curl:

curl --request POST  --header "Content-Type: application/json" --url 'https://speech.googleapis.com/v1/speech:longrunningrecognize?fields=done%2Cerror%2Cmetadata%2Cname%2Cresponse&key={MY_API_KEY}' --data '{"audio": {"uri": "gs://{MY_BUCKET}/myaudio.opus"},"config": {"encoding": "OGG_OPUS", "sampleRateHertz": 48000, "languageCode": "en-US"}}'

Response: {"name": "123456789"} (123456789 is a placeholder, not the actual value.)
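
For reference, here is a minimal Python sketch of the same request, in case building the JSON by hand in curl gets error-prone. The endpoint and the config fields are the ones from the curl call above; the function name `build_recognize_request` and the bucket name in the test are made up for illustration. The function only builds the URL and body and does not send anything.

```python
import json
from urllib.parse import urlencode

SPEECH_ENDPOINT = "https://speech.googleapis.com/v1/speech:longrunningrecognize"


def build_recognize_request(gcs_uri, api_key, encoding="OGG_OPUS",
                            sample_rate_hertz=48000, language_code="en-US"):
    """Return (url, body_bytes) for a longrunningrecognize POST.

    Hypothetical helper: it only assembles the request, nothing is sent.
    The defaults mirror the values used in the curl call in the question.
    """
    if not gcs_uri.startswith("gs://"):
        raise ValueError("audio.uri must be a Cloud Storage URI (gs://...)")
    body = {
        "audio": {"uri": gcs_uri},
        "config": {
            "encoding": encoding,
            "sampleRateHertz": sample_rate_hertz,
            "languageCode": language_code,
        },
    }
    # The API key travels as a query parameter, as in the curl example.
    url = SPEECH_ENDPOINT + "?" + urlencode({"key": api_key})
    return url, json.dumps(body).encode("utf-8")
```

The returned pair can be passed to any HTTP client together with a "Content-Type: application/json" header.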

curl GET 1
Next I requested the results:

curl --request GET --url 'https://speech.googleapis.com/v1/operations/123456789?fields=done%2Cerror%2Cmetadata%2Cname%2Cresponse&key={MY_API_KEY}'

This gave me the error: "Unable to recognize speech, possible error in encoding or channel config. Please correct the config and retry the request."
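
A long-running operation usually is not finished on the first GET, so the request above may need repeating. A rough Python sketch of that polling loop follows. The `fetch` argument is a hypothetical callable that performs the GET and returns the decoded JSON as a dict; the function names, the 30-second interval and the poll limit are arbitrary choices for illustration, not anything the API prescribes.

```python
import time
from urllib.parse import urlencode

OPERATIONS_ENDPOINT = "https://speech.googleapis.com/v1/operations/"


def operation_url(name, api_key):
    """Build the same operations URL as the curl GET in the question."""
    query = urlencode({
        "fields": "done,error,metadata,name,response",
        "key": api_key,
    })
    return OPERATIONS_ENDPOINT + name + "?" + query


def wait_for_operation(fetch, name, api_key, interval_seconds=30.0, max_polls=240):
    """Call fetch(url) repeatedly until the operation reports "done": true.

    fetch is a hypothetical callable (url -> dict) supplied by the caller,
    which keeps this sketch independent of any particular HTTP client.
    """
    url = operation_url(name, api_key)
    for _ in range(max_polls):
        operation = fetch(url)
        if operation.get("done"):
            return operation
        time.sleep(interval_seconds)
    raise TimeoutError("operation %s did not finish in time" % name)
```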

So I updated the encoding configuration from OGG_OPUS to LINEAR16.

curl POST 2
I did the POST again:

curl --request POST  --header "Content-Type: application/json" --url 'https://speech.googleapis.com/v1/speech:longrunningrecognize?fields=done%2Cerror%2Cmetadata%2Cname%2Cresponse&key={MY_API_KEY}' --data '{"audio": {"uri": "gs://{MY_BUCKET}/myaudio.opus"},"config": {"encoding": "LINEAR16", "sampleRateHertz": 48000, "languageCode": "en-US"}}'

Response: {"name": "987654321"}

curl GET 2

curl --request GET --url 'https://speech.googleapis.com/v1/operations/987654321?fields=done%2Cerror%2Cmetadata%2Cname%2Cresponse&key={MY_API_KEY}'

Response:

{
  "name": "987654321",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2018-06-08T11:01:24.596504Z",
    "lastUpdateTime": "2018-06-08T11:01:51.825882Z"
  },
  "done": true
}

The problem is that I don't get the actual transcription. According to the documentation, there should be a response key in the response containing the data.
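
To make the expected shape concrete: in the v1 API the text sits under response.results[].alternatives[].transcript of the finished operation. Below is a small Python sketch that pulls it out; `extract_transcript` is a made-up helper name. When it is given a finished operation with no response key, like the one above, it returns an empty string. My assumption is that this case means the service recognized no speech at all.

```python
def extract_transcript(operation):
    """Join the top alternative of every result in a finished operation.

    operation is the decoded JSON of the operations GET. Returns "" when the
    operation finished without a "response" key or without any results.
    """
    if "error" in operation:
        message = operation["error"].get("message", "operation failed")
        raise RuntimeError(message)
    if not operation.get("done"):
        raise ValueError("operation is still running")

    results = operation.get("response", {}).get("results", [])
    pieces = []
    for result in results:
        alternatives = result.get("alternatives", [])
        if alternatives:
            # The first alternative is the most likely transcription.
            pieces.append(alternatives[0].get("transcript", "").strip())
    return " ".join(piece for piece in pieces if piece)
```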

Since I'm kinda stuck here I'd like to know if I'm doing something completely wrong. I don't have any technical or resource limitation so all suggestions are very welcome! Also happy to change my approach.

Thanks in advance! Cheers

adddog · Accepted Answer · 2018-07-24T19:44:48

Looks like for the moment only WAV and FLAC are supported. Using the gcloud command locally, I had success:

gcloud ml speech recognize-long-running gs://bucket-name/file.flac  --language-code en-US --include-word-time-offsets > my_transcription.json
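
To produce a FLAC file like the one passed to the command above, something along these lines might work. It is a sketch only: it assumes ffmpeg is on the PATH, and the mono downmix and 16 kHz sample rate are my own choices (the original audio was 44.1 kHz stereo, and the earlier error mentioned the channel config), not settings confirmed by this thread. The function and file names are hypothetical.

```python
import subprocess


def flac_command(source, target="myaudio.flac", sample_rate=16000):
    """Build an ffmpeg argument list that extracts mono FLAC audio."""
    return [
        "ffmpeg",
        "-i", source,             # input video or audio file
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to a single channel (assumption)
        "-ar", str(sample_rate),  # resample, 16 kHz by default (assumption)
        "-c:a", "flac",           # encode the audio as FLAC
        target,
    ]


def convert_to_flac(source, target="myaudio.flac"):
    """Run ffmpeg and return the target path; raises if ffmpeg fails."""
    subprocess.run(flac_command(source, target), check=True)
    return target
```

After uploading the result to the bucket, the matching encoding value for the REST request would be "FLAC".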

Had a byte limit error when using a local file. Says you can
