I need to do real-time transcription of Twilio phone calls using the Google Speech-to-Text API, and I've followed a few demo apps showing how to set this up. My application is in .NET Core 3.1 and I am using webhooks with a Twilio-defined callback method. When the media is retrieved from Twilio through the callback, it is passed as raw audio encoded in base64, as you can see here:
https://www.twilio.com/docs/voice/twiml/stream
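For reference, a media frame on the stream looks roughly like the comment in the snippet below, and this is how the payload can be pulled out of it (a sketch only; the field names follow the Stream docs above, and ExtractPayload is just an illustrative helper, not my actual code):

using System.Text.Json;

// A "media" frame arrives over the websocket roughly as:
// { "event": "media", "streamSid": "...", "media": { "payload": "<base64 mulaw audio>" } }
static string ExtractPayload(string message)
{
    using JsonDocument doc = JsonDocument.Parse(message);
    JsonElement root = doc.RootElement;
    if (root.GetProperty("event").GetString() == "media")
    {
        return root.GetProperty("media").GetProperty("payload").GetString();
    }
    return null;
}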
I've also referenced this demo on Live Transcribing and am trying to mimic its case statement in C#. Everything connects correctly, and the media and payload are passed into my app from Twilio just fine.
The audio string is then converted to a byte[] to pass to the Task that transcribes the audio:
byte[] audioBytes = Convert.FromBase64String(info);
I am following the examples from the Google docs that stream either from a file or from an audio input (such as a microphone). Where my use case differs is that I already have the bytes for each chunk of audio. The examples I referenced can be seen here: Transcribing audio from streaming input
Below is my implementation of the latter, although using the raw audio bytes instead. This Task is hit when the Twilio websocket connection receives the media event, and I pass the payload directly into it. From my console logging I do reach the Print Responses hit... log, but execution will NOT get into the while (await responseStream.MoveNextAsync()) block and log the transcript to the console. I do not get any errors back (none that break the application, at least). Is this even possible to do? I have also tried loading the bytes into a MemoryStream object and passing them in the way the Google doc examples do.
static async Task<object> StreamingRecognizeAsync(byte[] audioBytes)
{
var speech = SpeechClient.Create();
var streamingCall = speech.StreamingRecognize();
// Write the initial request with the config.
await streamingCall.WriteAsync(
new StreamingRecognizeRequest()
{
StreamingConfig = new StreamingRecognitionConfig()
{
Config = new RecognitionConfig()
{
Encoding =
RecognitionConfig.Types.AudioEncoding.Mulaw,
SampleRateHertz = 8000,
LanguageCode = "en",
},
InterimResults = true,
SingleUtterance = true
}
});
// Print responses as they arrive.
Task printResponses = Task.Run(async () =>
{
Console.WriteLine("Print Responses hit...");
var responseStream = streamingCall.GetResponseStream();
while (await responseStream.MoveNextAsync())
{
StreamingRecognizeResponse response = responseStream.Current;
Console.WriteLine("Response stream moveNextAsync Hit...");
foreach (StreamingRecognitionResult result in response.Results)
{
foreach (SpeechRecognitionAlternative alternative in result.Alternatives)
{
Console.WriteLine("Google transcript " + alternative.Transcript);
}
}
}
});
//using (MemoryStream memStream = new MemoryStream(audioBytes))
//{
// var buffer = new byte[32 * 1024];
// int bytesRead;
// while ((bytesRead = await memStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
// {
// await streamingCall.WriteAsync(
// new StreamingRecognizeRequest()
// {
// AudioContent = Google.Protobuf.ByteString
// .CopyFrom(buffer, 0, bytesRead),
// });
// }
//}
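// Write this chunk of audio as a single request, then signal that no more audio is coming.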
await streamingCall.WriteAsync(
new StreamingRecognizeRequest()
{
AudioContent = Google.Protobuf.ByteString
.CopyFrom(audioBytes),
});
await streamingCall.WriteCompleteAsync();
await printResponses;
return 0;
}
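For completeness, here is roughly how the media case hands the payload to the Task above (a simplified sketch, not my exact handler; HandleMessageAsync is a placeholder name and the event names come from the Stream docs):

using System;
using System.Text.Json;
using System.Threading.Tasks;

static async Task HandleMessageAsync(string message)
{
    using JsonDocument doc = JsonDocument.Parse(message);
    JsonElement root = doc.RootElement;
    switch (root.GetProperty("event").GetString())
    {
        case "connected":
        case "start":
            // No audio yet, nothing to transcribe.
            break;
        case "media":
        {
            // Base64-encoded mulaw audio at 8000 Hz, per the Stream docs.
            string payload = root.GetProperty("media").GetProperty("payload").GetString();
            byte[] audioBytes = Convert.FromBase64String(payload);
            await StreamingRecognizeAsync(audioBytes);
            break;
        }
        case "stop":
            break;
    }
}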