0
votes

I'm looking for a way to extract features from audio recordings of me saying a digit, for speech recognition of the digits 1-10 using a neural network trained with backpropagation (10 training samples per digit and 5 test samples per digit).

I tried feeding the network the raw audio data, the data after an FFT, and only the top ten frequencies, and all three attempts failed.

Can you suggest a way to extract audio features that will help the neural network achieve reasonable results? It's a simple project, so I'm not aiming for extremely high performance, just reasonable performance that demonstrates the ability of such a network to learn.

1 Answer

0
votes

Why don't you try MFCCs (mel-frequency cepstral coefficients)? MFCCs are the de facto standard features in ASR. They weren't designed with DNNs in mind, but they have proved to work with several other ASR implementations (most notably HMMs).
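To make the suggestion concrete, here is a minimal sketch of MFCC extraction in plain Python. Everything in it is an assumption chosen for illustration rather than a tuned recipe: the function names (`mfcc`, `utterance_vector`), the frame length of 256 samples, the hop of 128, the 20 mel filters and the 13 coefficients. The naive DFT is slow and only there to keep the example dependency-free; a real project would normally use an FFT routine or an existing MFCC library instead.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum of a real frame via a naive DFT (bins 0..N/2)."""
    n = len(frame)
    cos_t = [math.cos(2 * math.pi * i / n) for i in range(n)]
    sin_t = [math.sin(2 * math.pi * i / n) for i in range(n)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * cos_t[(k * t) % n] for t, x in enumerate(frame))
        im = sum(x * sin_t[(k * t) % n] for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):          # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):          # falling edge
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs MFCCs per frame of the signal."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]  # Hamming window
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame)
        # Log energy in each mel band (floored to avoid log(0)).
        log_e = [math.log(max(sum(f * s for f, s in zip(filt, spec)), 1e-10))
                 for filt in bank]
        # DCT-II of the log energies; keep the first n_coeffs terms.
        features.append([
            sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(n_coeffs)
        ])
    return features

def utterance_vector(signal, sample_rate):
    """Average the per-frame MFCCs into one fixed-length vector (an assumption,
    made so that every recording yields the same number of network inputs)."""
    frames = mfcc(signal, sample_rate)
    return [sum(col) / len(frames) for col in zip(*frames)]

# Usage with a synthetic tone standing in for a recorded digit.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(2000)]
vec = utterance_vector(tone, sr)
print(len(vec))
```

Averaging over frames throws away the order of the sounds within the word, so it is only the simplest way to get a fixed-size input. If that turns out to be too lossy for telling digits apart, one alternative is to resample each recording's frame sequence to a fixed number of frames and concatenate them.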