0
votes

I am using pocketsphinx for speech recognition with a Spanish acoustic model and a JSGF grammar, with decent results so far.

However, I'm getting erroneous recognition results with audio files that, at least to my ear, seem perfectly intelligible (little background noise, sampling frequency and bit depth matching the acoustic model parameters, etc.).

Also, the audio files that are not correctly recognized do not seem to differ much from the ones that are (in fact, they sound pretty much the same to me).

So I'm guessing there is something in the audio that makes it more difficult to recognize, perhaps some noise frequencies or other artifacts that need to be filtered out (background noise, "pop" sounds in speech, frequencies outside the band of the human voice, etc.)?

In short, do you know if pocketsphinx already does any of this and, if not, do you know of any best-practice filter or transformation to apply to an audio file in order to improve speech recognition results?

Thanks!

1
I cannot answer this question, but it looks like you have an XY problem here: meta.stackexchange.com/questions/66377/what-is-the-xy-problem Without seeing your code, it is impossible for anyone to say whether you need pre-processing or whether there is an error in your code somewhere. Be sure to share your code and, ideally, provide an MCVE: stackoverflow.com/help/mcve - bodangly
@bodangly I understand, but I am using pocketsphinx, which is a standard and widely used library for this. So my question is directed at other pocketsphinx users or developers with knowledge of its internals (meaning that, so far, I am not coding anything beyond the API calls to pocketsphinx, which are trivial). - jotadepicas
You might need to instrument the internals of the PocketSphinx code to determine exactly what causes the different output decisions. - hotpaw2

1 Answer

1
votes

No, any preprocessing is usually quite harmful for speech recognition accuracy.

Modern speech recognition algorithms are built in such a way that even slight preprocessing can make results much worse. The difference will not be easy to hear, since your own speech recognition capabilities are far superior to a computer's. Things like a slight echo added to improve naturalness, or a simple MP3 compression/decompression round trip, can reduce accuracy significantly.

The solution is to train the model on the same kind of audio you want to recognize: for example, train on audio that has been MP3-compressed and decompressed rather than on clean audio. The default model is trained on clean audio, which makes it not very robust to modifications of the sound. Such multi-style training has its own disadvantages, because it makes the training data very large, so it is still a subject of ongoing research.
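To make the multi-style idea a bit more concrete, here is a rough sketch of how the training set could be expanded with degraded copies of each clean utterance. This is not part of pocketsphinx or its training tools: the function names (`add_noise_at_snr`, `make_multistyle_set`) and the SNR levels are made up for illustration, and white noise is only a stand-in for whatever degradation your real audio has. Ideally you would pass the training audio through the same channel as the audio you want to recognize.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def add_noise_at_snr(samples, snr_db, rng=None):
    """Return a copy of `samples` with white Gaussian noise mixed in.

    The noise is scaled so that the signal-to-noise ratio is `snr_db`.
    (Hypothetical helper; white noise is an assumed degradation.)
    """
    rng = rng or random.Random()
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    # Noise level such that rms(signal) / rms(noise) == 10 ** (snr_db / 20).
    target_noise_rms = rms(samples) / (10 ** (snr_db / 20.0))
    scale = target_noise_rms / rms(noise)
    return [s + scale * n for s, n in zip(samples, noise)]


def make_multistyle_set(clean, snrs_db=(20, 10, 5), seed=0):
    """Build one clean copy plus one noisy copy per SNR level.

    Every extra style adds a full copy of the utterance, which is why
    multi-style training data grows so quickly.
    """
    rng = random.Random(seed)
    styles = {"clean": list(clean)}
    for snr in snrs_db:
        styles[f"snr_{snr}dB"] = add_noise_at_snr(clean, snr, rng)
    return styles
```

With three SNR levels, each utterance becomes four training samples, which illustrates the size problem mentioned above. The resulting samples would then be written out in the format your acoustic model expects and fed to the usual training or adaptation procedure.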