3
votes

I successfully obtained the transcript and alternatives for a 5-minute audio file using the Google Cloud Speech API (longrunningrecognize), but I'm not getting the full text of those 5 minutes, just a short transcript, as shown below:

{
  "name": "2340863807845687922",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2018-09-20T13:25:57.948053Z",
    "lastUpdateTime": "2018-09-20T13:28:18.406147Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      {
        "alternatives": [
          {
            "transcript": "I am recording it. I think",
            "confidence": 0.9223639
          }
        ]
      },
      {
        "alternatives": [
          {
            "transcript": "these techniques properly stated",
            "confidence": 0.9190353
          }
        ]
      }
    ]
  }
}

How do I get the full text generated by the transcription?


3 Answers

1
votes

The Google Speech API is painful to work with. Besides not being able to transcribe long files, it randomly skips large chunks of audio in the transcription. Possible solutions are:

  1. Split the audio into chunks with voice activity detection and transcribe each chunk separately.
  2. Use a more suitable service such as Speechmatics, which processes large files without issue and with better accuracy.
  3. Use an open-source speech recognizer such as Kaldi.
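
The first option can be sketched roughly as follows. This is only a minimal illustration, assuming a plain RMS energy threshold stands in for real voice activity detection. The function name `split_on_silence` and its parameters are hypothetical, and a dedicated VAD library would likely do better on noisy audio.

```python
import math
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[int],
    frame_len: int,
    threshold: float,
    min_silence_frames: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges whose frames exceed an RMS threshold.

    A segment is closed once `min_silence_frames` quiet frames follow it.
    """
    segments = []
    start = None          # first sample of the segment currently open
    last_loud_end = 0     # end of the most recent loud frame
    silent_run = 0        # consecutive quiet frames seen inside a segment

    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = offset
            last_loud_end = offset + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((start, last_loud_end))
                start = None
                silent_run = 0

    # Close a segment that runs to the end of the audio.
    if start is not None:
        segments.append((start, last_loud_end))
    return segments


# Synthetic signal: silence, tone, long silence, tone, silence.
samples = [0] * 100 + [1000] * 200 + [0] * 300 + [1000] * 100 + [0] * 50
segments = split_on_silence(samples, frame_len=50, threshold=100, min_silence_frames=2)
print(segments)  # [(100, 300), (600, 700)]
```

Each returned range could then be cut out of the original file and sent for recognition on its own.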
1
votes

I successfully solved this issue. I had to properly convert the file with ffmpeg:

$ ffmpeg -i /home/user/audio_test.wav -ac 1 -ab 8k audio_test2.wav

Then remove silence with sox:

sox audio_test2.wav audio_no_silence4.wav silence -l 1 0.1 1% -1 2.0 1%

Then I fixed my sync-request.json:

{
  "config": {
    "encoding": "MULAW",
    "sampleRateHertz": 8000,
    "languageCode": "pt-BR",
    "enableWordTimeOffsets": false,
    "enableAutomaticPunctuation": false,
    "enableSpeakerDiarization": true,
    "useEnhanced": true,
    "diarizationSpeakerCount": 2,
    "audioChannelCount": 1
  },
  "audio": {
    "uri": "gs://storage/audio_no_silence4.wav"
  }
}

After that I ran curl again, and it now works perfectly.
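
Since the fix came down to making the file and the request config agree, it may help to check the WAV header before uploading. The sketch below is an assumption-laden helper, not part of any Google library: `describe_wav` is a hypothetical name, and it only distinguishes 16-bit PCM (format tag 1) from mu-law (format tag 7). As a side note, ffmpeg's `-ab` sets a bitrate; the flag that sets the sample rate is `-ar`.

```python
import io
import struct
import wave


def describe_wav(stream) -> dict:
    """Read a RIFF/WAVE header and return the matching request config fields."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    # Walk the chunks until the 'fmt ' chunk is found.
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                "<HHIIHH", fmt[:16]
            )
            break
        # Chunks are padded to an even number of bytes.
        stream.seek(chunk_size + (chunk_size & 1), 1)

    if tag == 1 and bits == 16:
        encoding = "LINEAR16"
    elif tag == 7:
        encoding = "MULAW"
    else:
        encoding = None  # not covered by this sketch

    return {
        "encoding": encoding,
        "sampleRateHertz": rate,
        "audioChannelCount": channels,
    }


# Build a one-second, mono, 8 kHz, 16-bit PCM file in memory as a demo.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00\x00" * 8000)
buf.seek(0)

info = describe_wav(buf)
print(info)
# {'encoding': 'LINEAR16', 'sampleRateHertz': 8000, 'audioChannelCount': 1}
```

If the values printed for a real file differ from `encoding`, `sampleRateHertz`, or `audioChannelCount` in the request, either the file or the config needs to change.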

1
votes

Google Cloud Speech-to-Text provides very accurate results. For some long audio files it returns the transcript broken into chunks, as an array of results that each carry their own alternatives, as you observed. What I did was set MaxAlternatives = 1 in my recognition config and then concatenate the transcripts from that array to get the full text. My recognition config in C#, using Google.Cloud.Speech.V1, is given below:

var config = new RecognitionConfig()
{
    Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
    //SampleRateHertz = 16000,
    LanguageCode = "en",
    EnableWordTimeOffsets = true,
    MaxAlternatives = 1
};
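
The concatenation step itself is language-independent. As a minimal sketch in Python, assuming the operation JSON has the same shape as the response in the question (the helper name `full_transcript` is hypothetical), the first alternative of each result can be joined like this:

```python
import json


def full_transcript(operation: dict) -> str:
    """Join the first alternative of every result into one string."""
    results = operation.get("response", {}).get("results", [])
    parts = []
    for result in results:
        alternatives = result.get("alternatives", [])
        if alternatives:
            # Alternatives are ordered with the most probable one first.
            text = alternatives[0].get("transcript", "").strip()
            if text:
                parts.append(text)
    return " ".join(parts)


# Trimmed copy of the response shown in the question.
sample = json.loads("""
{
  "done": true,
  "response": {
    "results": [
      {"alternatives": [{"transcript": "I am recording it. I think",
                         "confidence": 0.9223639}]},
      {"alternatives": [{"transcript": "these techniques properly stated",
                         "confidence": 0.9190353}]}
    ]
  }
}
""")

print(full_transcript(sample))
# I am recording it. I think these techniques properly stated
```

This only stitches together what the API returned. If the response itself covers just part of the audio, the joined text will be just as short.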