Google Cloud Speech to text returning empty result or error

Question

I have spent four days trying to get the Google Cloud Speech-to-Text API to work, with no success. I have searched online and read the documentation extensively, but nothing has helped.

Our site is bbsradio.com, and we are trying to automatically extract transcripts from our MP3 files using the Google Speech-to-Text API. The code is written in PHP and is an almost exact copy of this sample: https://github.com/GoogleCloudPlatform/php-docs-samples/blob/master/speech/src/transcribe_async.php The operation completes and execution gets past "$operation->pollUntilComplete();", but the check "if ($operation->operationSucceeded()) {" does not pass, and $operation->getError() does not return any error either.

I am converting the MP3 to a raw file like this:

ffmpeg -y -loglevel panic -i /public_html/sites/default/files/show-archives/audio-clips-9-23-2020/911freefall2020-05-24.mp3 -f s16le -acodec pcm_s16le -vn -ac 1 -ar 16000 -map_metadata -1 /home/mp3_to_raw/911freefall2020-05-24.raw

I also tried the FLAC format, but that did not work either. I played the converted FLAC file in Windows Media Player and can hear the conversation clearly. I checked the file: it is 16000 Hz, 1 channel, 16-bit. I can see that the file has been uploaded to Cloud Storage. I have also checked these pages:

https://cloud.google.com/speech-to-text/docs/troubleshooting and https://cloud.google.com/speech-to-text/docs/best-practices

There is a lot of discussion and documentation, but none of it has helped so far. If someone can help me find the issue, it would be really great!

Iñigo González · Accepted Answer · 2020-11-18T09:22:53

TL;DR: convert from MP3 to a 1-channel FLAC file with the same sample rate as your MP3 file.

Long explanation:

Since you're using MP3 files as your input, the MP3 compression artifacts are probably hurting you when you resample to 16 kHz (you cannot hear this, but the algorithm will).

To confirm this theory:

Execute ffprobe -hide_banner filename.mp3. It will output something like this:

  Metadata:
    ...
  Duration: 00:02:12.21, start: 0.025057, bitrate: 320 kb/s
    Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 320 kb/s
    Metadata:
      encoder         : LAME3.99r

In this case, the sample rate is fine for the Google Speech API. Just transcode the file without changing the sample rate (remove -ar 16000 from your ffmpeg command).
You might get into trouble if the original MP3 bitrate is low. 320 kb/s seems safe (unless the recording has a lot of noise).
Keep in mind that if there is some noise, voice recorded below 64 kb/s (ISDN line quality) can be understood only by humans.
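
The check and the conversion described above could be scripted roughly as follows. This is only a sketch in Python rather than the asker's PHP: the helper names parse_audio_stream and build_flac_command are hypothetical, the regular expression assumes the stream line looks like the ffprobe sample shown above, and the ffmpeg options are the ones from the question minus -ar 16000, with the FLAC encoder selected instead of raw PCM.

```python
import re

# Matches the audio stream line printed by ffprobe, for example:
#   Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 320 kb/s
# Assumption: the line has the "codec, rate Hz, layout" shape of the sample above.
STREAM_RE = re.compile(
    r"Audio:\s*(?P<codec>[^,]+),\s*(?P<rate>\d+) Hz,\s*(?P<layout>[^,\s]+)"
)


def parse_audio_stream(ffprobe_text):
    """Return (codec, sample_rate_hz, channel_layout) from ffprobe's text output."""
    match = STREAM_RE.search(ffprobe_text)
    if match is None:
        raise ValueError("no audio stream line found in ffprobe output")
    return match["codec"].strip(), int(match["rate"]), match["layout"]


def build_flac_command(src, dst):
    """Build an ffmpeg argument list for a mono FLAC file.

    There is deliberately no -ar option, so the source sample rate is kept.
    """
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vn",                    # drop any video/cover-art stream
        "-ac", "1",               # downmix to a single channel
        "-map_metadata", "-1",    # strip metadata, as in the question
        "-c:a", "flac",           # lossless FLAC instead of raw PCM
        dst,
    ]


if __name__ == "__main__":
    line = "Stream #0:0: Audio: mp3, 44100 Hz, stereo, s16p, 320 kb/s"
    codec, rate, layout = parse_audio_stream(line)
    print(codec, rate, layout)
    print(" ".join(build_flac_command("input.mp3", "output.flac")))
```

The argument list can be passed to subprocess.run to perform the conversion. The sample rate reported by parse_audio_stream is then the value to give the recognition config, since the file is no longer resampled to 16000 Hz.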
