
I created a project in the Google Cloud Console, enabled the Google Speech API in that project, and created credentials. I also used the transcribe.py sample recommended by Google:

https://cloud.google.com/speech/docs/samples

https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/speech

https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/api-client/transcribe.py

With an API key generated by the Google Cloud Console, I can successfully transcribe an audio file (30 seconds) into text, but not fully: only the first 2-3 seconds are transcribed. My account is currently on the free trial, so I suspect the account type (free trial) might be the cause.

The response from Google looks like: {"results": [{"alternatives": [{"confidence": 0.89569235, "transcript": "I've had a picnic in the forest and I'm going home so come on with me"}]}]}

The audio file is a WAV file with the following format (printed by ffprobe): Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
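As a sanity check, it is worth confirming that the WAV header really matches the config sent to the API (LINEAR16 expects 16-bit PCM, and the sample rate must agree). A minimal stdlib-only sketch, assuming a plain RIFF/WAV file (`wav_params` is a hypothetical helper, not part of the Google samples):

```python
import wave

def wav_params(path):
    """Read basic format info from a WAV header using only the stdlib."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_rate": w.getframerate(),
            "sample_width_bytes": w.getsampwidth(),  # 2 bytes = 16-bit PCM
            "duration_s": w.getnframes() / w.getframerate(),
        }
```

A mismatch between this header and the `encoding`/`sampleRate` fields in the request can cause partial or garbled transcriptions.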

The audio file has been uploaded to Google Drive, link here: https://drive.google.com/file/d/0B3koIsnLksOLQXhvQ1ljS0dDXzg/view?usp=sharing

Does anybody know what's wrong with the above process/steps, or is this a bug in the Google Speech Recognition API?


2 Answers


Using the Google APIs Explorer with the Cloud Speech API service, it was possible to isolate the following relevant speech recognition results by analyzing separate samples of your audio file:

  • Cut 1: 0 - 00'08"015, Result 9: "I've had a picnic in the forest and I'm going home so come on come with me"
  • Cut 2: 00'08"732 - 00'11"184, Result 2: "listen what's that"
  • Cut 3: 00'13"500 - till end, Result 2: "what is it look"

These results were obtained using the following configuration:

"config": {
    "encoding": "LINEAR16",
    "sampleRate": 16000,
    "maxAlternatives": 30
}
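For reference, here is a sketch of how such a request body can be assembled in Python for the JSON API (the field names mirror the configuration above; `build_request_body` is a hypothetical helper, not part of the Google samples, and the exact endpoint and field casing depend on the API version you call):

```python
import base64

def build_request_body(audio_bytes, sample_rate=16000, max_alternatives=30):
    """Assemble the JSON body for a Speech API recognize request.

    Note that maxAlternatives is an integer, not a string.
    """
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRate": sample_rate,
            "maxAlternatives": max_alternatives,
        },
        "audio": {
            # Raw audio bytes must be base64-encoded for the JSON API.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }
```

The assembled dict can then be POSTed, with the API key, to the recognize endpoint of the API version in use.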

In fact, there are known issues with the Speech API, which is currently in Beta, and these may prevent the transcription from working correctly (regardless of whether the account is paid or on a free trial). As described in the following best practices, there are two issues to consider in your case:

  1. Background music plays throughout the speech recording, which may create enough background noise to reduce the transcription accuracy. (Note that the Speech API was designed to transcribe the text of users dictating into an application's microphone.)
  2. As further advised, it is recommended to split the audio when it captures more than one speaker. In your case, the frog's sound may be detected as a different human voice, which can also impact the transcription accuracy.

Considering these two known issues, it would be important to remove any noise and process only the uniform speech of the protagonist of your recording. Alternatively, you can split the recording and transcribe each part containing the voice of a single character individually.
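The splitting step itself needs nothing beyond the stdlib `wave` module. A minimal sketch (`split_wav` is a hypothetical helper; the cut timestamps are illustrative, so substitute the boundaries of each speaker in your own recording):

```python
import wave

def split_wav(path, cut_points_s, out_prefix="segment"):
    """Split a WAV file at the given timestamps (in seconds) and write
    one file per segment: <out_prefix>_0.wav, <out_prefix>_1.wav, ..."""
    with wave.open(path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert cut points to frame offsets, bracketed by start and end.
        bounds = [0] + [int(t * rate) for t in cut_points_s] + [total]
        names = []
        for i, (start, end) in enumerate(zip(bounds, bounds[1:])):
            src.setpos(start)
            frames = src.readframes(end - start)
            name = f"{out_prefix}_{i}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            names.append(name)
    return names
```

Each resulting segment file can then be submitted to the API individually.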


I had a similar issue, but by using one of the enhanced models I was able to get the complete transcription:

config = {
  ...
  use_enhanced: true,
  model: "phone_call"
}

You can read more at: https://cloud.google.com/speech-to-text/docs/phone-model
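For the v1 JSON/REST API the same settings are spelled in camelCase; a sketch of the equivalent request config (field names `useEnhanced`, `model`, and `sampleRateHertz` as documented for v1; the language code is an assumption for this recording):

```python
# Hypothetical v1 REST config equivalent to the snippet above.
config = {
    "encoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "en-US",   # assumed language; required in v1
    "useEnhanced": True,       # opt in to an enhanced model
    "model": "phone_call",     # enhanced phone-call model
}
```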