I'm looking for a way to extract features from recordings of myself saying a digit, for speech recognition of the numbers 1-10 using a neural network trained with backpropagation (10 training samples and 5 test samples for each digit).
I tried feeding the network the raw audio data, the FFT of the audio, and only the ten top frequencies from the FFT, and none of these worked.
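A minimal sketch of the "ten top frequencies" feature described above, assuming it means the 10 FFT bins with the largest magnitude over the whole recording. The function name `top_frequency_features`, the 8000 Hz sample rate and the synthetic test tone are illustrative assumptions, not taken from the original code:

```python
import numpy as np

def top_frequency_features(signal, sample_rate, k=10):
    """Hypothetical helper: the k frequencies (Hz) with the largest FFT magnitude, sorted ascending."""
    # Magnitude spectrum of a real-valued signal (non-negative frequencies only)
    magnitudes = np.abs(np.fft.rfft(signal))
    # Frequency in Hz of each rfft bin
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Indices of the k largest magnitudes
    top = np.argsort(magnitudes)[-k:]
    return np.sort(freqs[top])

# Synthetic 1-second stand-in for a recording: two tones plus a little noise
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
rng = np.random.default_rng(0)
signal = (np.sin(2 * np.pi * 440 * t)
          + 0.5 * np.sin(2 * np.pi * 1000 * t)
          + 0.01 * rng.standard_normal(len(t)))

features = top_frequency_features(signal, sample_rate)
print(features)
```

Note that this feature vector keeps only the positions of the strongest peaks for the whole clip: it drops their magnitudes and any information about how the spectrum changes over time.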
Can you suggest a way to extract audio features that will help the neural network get reasonable results? It's a simple project, so I'm not aiming for extremely high accuracy, just reasonable performance that demonstrates the ability of such a network to learn.