
I am trying to get a final speech transcription/recognition result from a Fleck websocket audio stream. The OnOpen method executes code when the websocket connection is first established, and the OnBinary method executes code whenever binary data is received from the client. I have tested the websocket by echoing the voice: writing the same binary data back into the websocket at the same rate it arrives. That test worked, so I know the binary data is being sent correctly (640-byte messages with a 20 ms frame size, which is what 16 kHz, 16-bit LINEAR16 audio works out to: 16,000 samples/s × 2 bytes × 0.02 s = 640 bytes).
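For reference, the echo test was essentially this (a simplified sketch; the server setup is standard Fleck):

using Fleck;

var server = new WebSocketServer("ws://0.0.0.0:8181");
server.Start(socket =>
{
    // Echo test: write each received binary frame straight back to the client.
    socket.OnBinary = binary => socket.Send(binary);
});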

Therefore, it is my code that is failing, not the service. My aim is to do the following:

  1. When the websocket connection is created, send the initial audio config request to the API with SingleUtterance == true
  2. Run a background task that listens to the streaming results, waiting for isFinal == true
  3. Send each binary message received to the API for transcription
  4. When the background task detects isFinal == true, stop the current streaming request and create a new one, repeating steps 1 through 4

The context of this project is transcribing all single utterances in a live phone call.
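To make that concrete, here is a rough sketch of the flow I am aiming for (StartStreamAsync and the streamingCall field are illustrative names only; this assumes the client library's WriteCompleteAsync to end a request, and it ignores races between the restart and incoming audio):

SpeechClient.StreamingRecognizeStream streamingCall;

async Task StartStreamAsync(SpeechClient speech)
{
    // Step 1: open a new streaming call and send the config request first.
    streamingCall = speech.StreamingRecognize();
    await streamingCall.WriteAsync(new StreamingRecognizeRequest()
    {
        StreamingConfig = new StreamingRecognitionConfig()
        {
            Config = new RecognitionConfig()
            {
                Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
                SampleRateHertz = 16000,
                LanguageCode = "en",
            },
            SingleUtterance = true,
        }
    });

    // Step 2: background task that listens for IsFinal == true.
    _ = Task.Run(async () =>
    {
        while (await streamingCall.ResponseStream.MoveNext(default(CancellationToken)))
        {
            foreach (var result in streamingCall.ResponseStream.Current.Results)
            {
                if (result.IsFinal)
                {
                    // Step 4: stop the current request and start a new one.
                    await streamingCall.WriteCompleteAsync();
                    await StartStreamAsync(speech);
                    return;
                }
            }
        }
    });
}

// Step 3: forward every binary websocket message to the current call.
socket.OnBinary = async binary =>
{
    await streamingCall.WriteAsync(new StreamingRecognizeRequest()
    {
        AudioContent = Google.Protobuf.ByteString.CopyFrom(binary, 0, binary.Length)
    });
};

My current attempt, which does not work, is below: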

socket.OnOpen = () =>
{
    firstMessage = true;
};

socket.OnBinary = async binary =>
{
    var speech = SpeechClient.Create();
    var streamingCall = speech.StreamingRecognize();
    if (firstMessage == true)
    {
        await streamingCall.WriteAsync(new StreamingRecognizeRequest()
        {
            StreamingConfig = new StreamingRecognitionConfig()
            {
                Config = new RecognitionConfig()
                {
                    Encoding = RecognitionConfig.Types.AudioEncoding.Linear16,
                    SampleRateHertz = 16000,
                    LanguageCode = "en",
                },
                SingleUtterance = true,
            }
        });
        Task getUtterance = Task.Run(async () =>
        {
            while (await streamingCall.ResponseStream.MoveNext(
                default(CancellationToken)))
            {
                foreach (var result in streamingCall.ResponseStream.Current.Results)
                {
                    if (result.IsFinal == true)
                    {
                        Console.WriteLine("This test finally worked");
                    }
                }
            }
        });
        firstMessage = false;
    }
    else if (firstMessage == false)
    {
        streamingCall.WriteAsync(new StreamingRecognizeRequest()
        {
            AudioContent = Google.Protobuf.ByteString.CopyFrom(binary, 0, 640)
        }).Wait();
    }
};

1 Answer


The major problem is splitting the stream into pieces to send as Speech requests. I found the Google-Cloud-Speech-Node-Socket-Playground project, which could help you with websocket and Speech integration; take a look at the function that manages the Google Speech request:

function startRecognitionStream(client, data) {
    recognizeStream = speechClient.streamingRecognize(request)
        .on('error', console.error)
        .on('data', (data) => {
            process.stdout.write(
                (data.results[0] && data.results[0].alternatives[0])
                    ? `Transcription: ${data.results[0].alternatives[0].transcript}\n`
                    : `\n\nReached transcription time limit, press Ctrl+C\n`);
            client.emit('speechData', data);

            // if end of utterance, let's restart stream
            // this is a small hack. After 65 seconds of silence, the stream will still throw an error for speech length limit
            if (data.results[0] && data.results[0].isFinal) {
                stopRecognitionStream();
                startRecognitionStream(client);
                // console.log('restarted stream serverside');
            }
        });
}
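Translated to the C# client you are using, the same stop-and-restart step might look roughly like this (a sketch only; speech, streamingCall and streamingConfig stand for the objects already present in your question, and RestartStreamAsync is an illustrative name):

async Task RestartStreamAsync()
{
    // Stop the finished request by completing its write side.
    await streamingCall.WriteCompleteAsync();

    // Start a new request. The StreamingConfig has to be re-sent, because it
    // may only appear in the first message of each streaming call.
    streamingCall = speech.StreamingRecognize();
    await streamingCall.WriteAsync(new StreamingRecognizeRequest()
    {
        StreamingConfig = streamingConfig
    });
}

As in the snippet above, restarting per utterance also sidesteps the streaming time limit that would otherwise end the request with an error.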

Just keep in mind that bad audio quality will deliver bad results, so try to follow the Best Practices for audio.

Credit is due to the developer (Vinzenz Aubry), whose program works nicely!