I am trying to implement Speech-to-Text in my application using the Google Cloud Speech-to-Text API with Python. The transcription comes back correctly; however, the response contains only the transcript and confidence, not the individual words. If I try to access the words, I get an empty list.

For accessing the results, I use the following code:

best_alternative = result.alternatives[0]
words = best_alternative.words  # this comes back as an empty list
transcript = best_alternative.transcript
confidence = best_alternative.confidence
print(f'Transcript: {transcript}')
print(f'Confidence: {confidence:.0%}')

Printing out best_alternative.__dict__ shows the transcript and confidence, but not the words. Is there a particular way to access the words in the transcript, or am I missing something?

UPDATE: Initially, I was initializing the recognition configuration like this:

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=RATE,
    language_code=lan_code)
streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,
        enable_speaker_diarization=True)

With this configuration, the returned response didn't contain words, only the transcript and confidence. Then I changed my configuration to this:

from google.cloud import speech

config = speech.RecognitionConfig()
config.sample_rate_hertz = 16000
config.language_code = 'en-US'
config.encoding = speech.RecognitionConfig.AudioEncoding.LINEAR16
config.enable_speaker_diarization = True

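This eventually gave me the words alongside the transcript and confidence. The relevant difference appears to be that enable_speaker_diarization is now set on RecognitionConfig, where that field is defined, rather than on StreamingRecognitionConfig, which has no such field. The words can be accessed using: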

response.results[0].alternatives[0].words[i].word
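
For reference, here is a minimal sketch of where each option lives, assuming the google-cloud-speech v1 client: word-level options such as enable_word_time_offsets and enable_speaker_diarization are fields of RecognitionConfig, while StreamingRecognitionConfig only wraps that config together with streaming options like interim_results. The helper build_config_kwargs is hypothetical; it returns plain keyword dictionaries so the split is visible without the library installed.

```python
# Hypothetical helper: keeps RecognitionConfig fields apart from
# StreamingRecognitionConfig fields so that word-level options end up
# on the inner RecognitionConfig, not on the streaming wrapper.
def build_config_kwargs(rate, language_code, interim_results=True):
    recognition_kwargs = {
        "sample_rate_hertz": rate,
        "language_code": language_code,
        # Ask for per-word start/end times; the top alternative then
        # includes a list of words.
        "enable_word_time_offsets": True,
        # Diarization also returns words (tagged with speakers).
        "enable_speaker_diarization": True,
    }
    streaming_kwargs = {
        # Only streaming-specific options belong on the outer message.
        "interim_results": interim_results,
    }
    return recognition_kwargs, streaming_kwargs


recognition_kwargs, streaming_kwargs = build_config_kwargs(16000, "en-US")
print(recognition_kwargs)
print(streaming_kwargs)
```

With the library installed, the two dictionaries would be used roughly as `config = speech.RecognitionConfig(encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16, **recognition_kwargs)` followed by `speech.StreamingRecognitionConfig(config=config, **streaming_kwargs)`.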

1 Answer

According to the Cloud Speech-to-Text REST API documentation, the speech.recognize method returns a response whose results[] field contains one SpeechRecognitionResult per sequential portion of the transcribed audio. Each result in turn holds SpeechRecognitionAlternative objects, and each alternative carries the transcript, confidence and words[] for one particular hypothesis.
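
The same nesting is visible in the raw JSON returned by the REST method. A minimal sketch, assuming the response body has already been parsed into a Python dict (for example with json.loads); the sample body is made up for illustration, and the words array only appears when word-level information was requested.

```python
import json


def words_from_rest_response(body):
    """Return (word, startTime, endTime) tuples from a parsed recognize response."""
    words = []
    for result in body.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # Alternatives are ordered by likelihood; the first is the top hypothesis.
        for info in alternatives[0].get("words", []):
            words.append((info["word"], info.get("startTime"), info.get("endTime")))
    return words


# Illustrative response body, not real API output.
sample = json.loads("""
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "hello world",
          "confidence": 0.9,
          "words": [
            {"startTime": "0s", "endTime": "0.400s", "word": "hello"},
            {"startTime": "0.400s", "endTime": "0.900s", "word": "world"}
          ]
        }
      ]
    }
  ]
}
""")

print(words_from_rest_response(sample))
```

An alternative without a words key simply contributes nothing, which mirrors the empty list seen in the question.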

Looking through the implementation of the google-cloud-speech Python library, the SpeechRecognitionAlternative class exposes the same words field: a list of WordInfo messages holding word-specific information for each recognized word. For example, the first word can be printed with:

print("Words: {}".format(result.alternatives[0].words[0].word))
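
Note that indexing words[0] raises an IndexError when the list is empty, which is exactly the situation described in the question. A slightly more defensive sketch iterates over every WordInfo instead. It assumes the 2.x client, where start_time and end_time are exposed as datetime.timedelta values (with the older 1.x client they are protobuf Duration objects, so this exact code would need adjusting). The function describe_words is hypothetical, and the SimpleNamespace objects are stand-ins so the sketch runs without calling the API.

```python
from datetime import timedelta
from types import SimpleNamespace  # only used for the stand-in result below


def describe_words(result):
    """Return one line per recognized word of the top alternative.

    Assumes result.alternatives[0].words holds WordInfo-like objects with
    word, start_time and end_time (timedelta values in the 2.x client).
    """
    if not result.alternatives:
        return []
    lines = []
    # words is empty unless word-level information was requested.
    for info in result.alternatives[0].words:
        start = info.start_time.total_seconds()
        end = info.end_time.total_seconds()
        lines.append(f"{info.word}: {start:.2f}s-{end:.2f}s")
    return lines


# Stand-in for a real SpeechRecognitionResult.
fake_result = SimpleNamespace(alternatives=[SimpleNamespace(words=[
    SimpleNamespace(word="hello", start_time=timedelta(0),
                    end_time=timedelta(seconds=0.4)),
])])

for line in describe_words(fake_result):
    print(line)
```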