14 votes

I've been working with Python speech recognition for the better part of a month now, building a JARVIS-like assistant. I've used the SpeechRecognition module with both the Google Speech API and Pocketsphinx, and I've also used Pocketsphinx directly without a wrapper. While the recognition is accurate, I've had a hard time with how long these packages take to process speech. They seem to work by recording from one point of silence to the next and then passing the recording to the STT engine. While the recording is being processed, no other sound can be captured for recognition, which is a problem when I try to issue multiple complex commands in series.
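For context, the blocking pattern I'm describing looks roughly like this with SpeechRecognition (a minimal sketch; the choice of recognize_sphinx vs. recognize_google is incidental):

    import speech_recognition as sr

    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)  # calibrate the energy threshold
        while True:
            # listen() blocks until it has captured a phrase bounded by silence
            audio = r.listen(source)
            # recognition of the captured phrase also blocks; nothing is
            # recorded while this call runs
            try:
                print(r.recognize_sphinx(audio))  # or r.recognize_google(audio)
            except sr.UnknownValueError:
                pass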

When looking at Google Assistant's voice recognition, Alexa's voice recognition, or macOS High Sierra's offline recognition, I see words being recognized as I say them, without any pause in the recording. I've seen this called realtime recognition, streaming recognition, and word-by-word recognition. Is there any way to do this in Python, preferably offline and without relying on a cloud client?

I tried (unsuccessfully) to accomplish this by changing the pause, speaking, and non-speaking thresholds on the SpeechRecognition recognizer, but that just caused the audio to segment strangely, and the recognizer still needed a second after each recognition before it could record again.
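For reference, the settings I mean are attributes on the Recognizer object; assuming they map to pause_threshold, phrase_threshold, and non_speaking_duration, the tweaks looked something like this (the values are only illustrative):

    import speech_recognition as sr

    r = sr.Recognizer()
    r.pause_threshold = 0.5        # seconds of silence that end a phrase (illustrative)
    r.phrase_threshold = 0.3       # minimum speech length to count as a phrase (illustrative)
    r.non_speaking_duration = 0.3  # leading/trailing silence kept around a phrase (illustrative)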

Do you have any update on this one? Maybe you found the answer; I'm looking for resources or existing solutions to write something like that. Thanks in advance. – kolboc
Possibly... use parallel processes? @kolboc, any progress? – Kyle Swanson

1 Answer

5 votes

Pocketsphinx can process streams; see here:

Python pocketsphinx recognition from the microphone
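A rough sketch of that approach with the older pocketsphinx-python bindings and PyAudio (the pocketsphinx 5.x API differs, and the model paths below assume the bundled en-us model). Calling hyp() while the utterance is still open returns partial, word-by-word hypotheses:

    import os
    import pyaudio
    from pocketsphinx import Decoder, get_model_path

    model_path = get_model_path()
    config = Decoder.default_config()
    config.set_string('-hmm', os.path.join(model_path, 'en-us'))
    config.set_string('-lm', os.path.join(model_path, 'en-us.lm.bin'))
    config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))
    decoder = Decoder(config)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=1024)

    decoder.start_utt()
    while True:
        buf = stream.read(1024, exception_on_overflow=False)
        decoder.process_raw(buf, False, False)
        hyp = decoder.hyp()
        if hyp is not None:
            # partial hypothesis, available while the utterance is still open
            print(hyp.hypstr)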

Kaldi can process streams too (and is more accurate than Pocketsphinx):

https://github.com/alphacep/kaldi-websocket-python/blob/master/test_local.py
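The same authors (Alpha Cephei) later packaged this as the Vosk library, which wraps Kaldi and runs fully offline. A minimal sketch, assuming you have installed vosk and pyaudio and downloaded a Vosk model into a local "model" directory:

    import json
    import pyaudio
    from vosk import Model, KaldiRecognizer

    model = Model("model")             # path to a downloaded Vosk model directory
    rec = KaldiRecognizer(model, 16000)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                     input=True, frames_per_buffer=4000)

    while True:
        data = stream.read(4000, exception_on_overflow=False)
        if rec.AcceptWaveform(data):
            # a full utterance has been finalized
            print(json.loads(rec.Result())["text"])
        else:
            # partial (word-by-word) hypothesis while you are still speaking
            print(json.loads(rec.PartialResult())["partial"])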

The Google Speech API can also process streams; see here:

Google Streaming Speech Recognition on an Audio Stream Python
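A condensed sketch with the google-cloud-speech client (online only, and it needs credentials configured; interim_results=True is what gives word-by-word partial transcripts, and the exact streaming_recognize signature varies a little between client-library versions):

    import pyaudio
    from google.cloud import speech

    def mic_chunks(rate=16000, chunk=1600):
        # yield raw LINEAR16 audio chunks from the default microphone
        pa = pyaudio.PyAudio()
        stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                         input=True, frames_per_buffer=chunk)
        while True:
            yield stream.read(chunk, exception_on_overflow=False)

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(config=config,
                                                         interim_results=True)

    requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
                for chunk in mic_chunks())
    responses = client.streaming_recognize(streaming_config, requests)

    for response in responses:
        for result in response.results:
            # interim results (is_final == False) arrive while you are still speaking
            print(result.alternatives[0].transcript, result.is_final)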