4
votes

I am writing an application which should receive audio and send it to the Bing Speech Recognition API to get text. I used the Service Library and it works with a WAV file. So I wrote my own stream class to receive audio from a microphone or the network (RTP) and send it to the recognition API. When I add a WAV header in front of the audio stream, it works for a few seconds.

Debugging shows that the recognition API reads from the stream faster than it is filled by the audio source (16 kHz sample rate, 16 bit, mono).

So my question is: is there a way to use the recognition API with a real-time (continuous) audio stream?

I know there is an example with a microphone client, but it works with the microphone only, and I need it for different sources.

3
Do you just want to send audio in real time and get back results as someone speaks? Or do you want to send an arbitrarily long stream of audio? Maybe if you link to the microphone example your question will be clearer. – John Wiseman
I want to send audio in real time to get partial results during speaking, in principle like the microphone sample in the sample folder, but for different sources (e.g. RTP). But I hope I have found a solution (I have to do some more tests). If it works, I will create an answer with the description. – H.G. Sandhagen

3 Answers

1
votes

If you want to use sources other than a microphone, you can use the DataRecognitionClient class, obtained by calling SpeechRecognitionServiceFactory's CreateDataClient method. Once you have the client object, you can take audio from any source (microphone, network, a file, etc.) and send it to be processed with the client's SendAudio method. Each time you receive an audio buffer, you make a new call to SendAudio.

While you are in the process of sending audio with SendAudio, you will receive partial recognition results in real time (or close to it) via the client's OnPartialResponseReceived event.

When you are done sending audio, you signal to the client that you are ready for the final recognition result by calling EndAudio. You should then receive an OnResponseReceived event from the client containing the final recognition hypotheses.
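The flow above can be sketched roughly as follows. This is a hedged sketch from memory of the Project Oxford client library, not code from the answer: the exact namespace, factory overload, and event argument members may differ by SDK version, and subscriptionKey and packetBytes are placeholder names.

```csharp
using System;
using Microsoft.ProjectOxford.SpeechRecognition;

// Create a data client (mode, language, and key parameters are illustrative;
// check the overloads in your SDK version).
var client = SpeechRecognitionServiceFactory.CreateDataClient(
    SpeechRecognitionMode.LongDictation, "en-US", subscriptionKey);

// Partial hypotheses arrive while audio is still being sent.
client.OnPartialResponseReceived += (sender, e) =>
    Console.WriteLine("Partial: " + e.PartialResult);

// The final result arrives after EndAudio().
client.OnResponseReceived += (sender, e) => {
    foreach (var result in e.PhraseResponse.Results)
        Console.WriteLine("Final: " + result.DisplayText);
};

// Push buffers as they arrive, e.g. inside an RTP receive loop.
// For raw PCM, some SDK versions expect the stream to start with a WAV header.
// client.SendAudio(packetBytes, packetBytes.Length);

// Signal end of stream to trigger the final recognition result.
client.EndAudio();
```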

2
votes

I found a solution to my problem. I wrote a class AudioStream, derived from Stream, which buffers the input and makes the Read method wait when it is called while the buffer is empty. This prevents the recognizer from stopping, because the Read method always returns a value > 0. Here is the important part of the code of this class:

public class AudioStream : Stream {
    private AutoResetEvent _waitEvent = new AutoResetEvent(false);

    internal void AddData(byte[] buffer, int count) {
        _buffer.Add(buffer, count);
        // Enable Read
        _waitEvent.Set();
    }

    public override int Read(byte[] buffer, int offset, int count) {
        int readCount = 0;
        if (_buffer.Empty) {
            // Wait for input
            _waitEvent.WaitOne();
        }
        ......
        // Fill buffer from _buffer and set readCount

        _waitEvent.Reset();
        return readCount;
    }

    protected override void Dispose(bool disposing) {
        // Make sure that there is no waiting Read
        // Clear the buffer, dispose the wait event, etc.
    }

    ......
}

Because audio data is received continuously, the Read method will not "hang" for longer than a few milliseconds (e.g. RTP packets are received every 20 ms).

0
votes

Adding some supporting information on this topic: the stream implementation has to support concurrent read/write operations, and it has to block when it has no data.
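One way to get both properties is to back the stream with a thread-safe producer/consumer buffer such as BlockingCollection. This is a sketch under those assumptions, not the implementation from the answer above; the class and method names are illustrative:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

// AddData may be called from the network/microphone thread while the
// recognizer blocks inside Read on another thread.
public class BlockingAudioStream : Stream {
    private readonly BlockingCollection<byte[]> _chunks = new BlockingCollection<byte[]>();
    private byte[] _current;  // chunk currently being drained
    private int _pos;         // read offset inside _current

    public void AddData(byte[] buffer, int count) {
        var chunk = new byte[count];
        Buffer.BlockCopy(buffer, 0, chunk, 0, count);
        _chunks.Add(chunk);   // thread-safe; wakes up a blocked Read
    }

    public override int Read(byte[] buffer, int offset, int count) {
        if (_current == null || _pos == _current.Length) {
            // Blocks until data arrives; returns false only after Complete().
            if (!_chunks.TryTake(out _current, Timeout.Infinite))
                return 0;     // end of stream
            _pos = 0;
        }
        int n = Math.Min(count, _current.Length - _pos);
        Buffer.BlockCopy(_current, _pos, buffer, offset, n);
        _pos += n;
        return n;
    }

    // Call when no more audio will arrive, so Read can return 0.
    public void Complete() => _chunks.CompleteAdding();

    // Required Stream plumbing for a forward-only readable stream.
    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

BlockingCollection handles the locking and the wake-up internally, so the stream needs no manual AutoResetEvent handling, and a Read that arrives while the buffer is empty simply blocks until the next packet is added.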