4
votes

I am trying to do a quick spectral analysis on streaming audio data to capture vowels (something like JLip-sync). I use PyAudio to capture the voice data in small chunks (1024 samples), each covering a short duration (0.0625 s). I use numpy.fft for the analysis and a numpy.hanning window to reduce leakage. My sampling rate is 4096*4 = 16384 Hz (not 44100 or 22050, though I am open to discussion; 4096*4 is the power of two nearest to 22050).

Considering the frequencies I am interested in (ranging from 300 Hz to 3000 Hz), how can the ideal window size be calculated from the data length and the min/max frequencies I am looking for?

Thanks.

Kadir

2

2 Answers

10
votes

@Kadir:

The purpose of windowing your data before processing it with a discrete Fourier transform (DFT or FFT) is to minimize spectral leakage, which happens when you Fourier-transform non-cyclical data, i.e. data that does not contain a whole number of periods within the sequence.

Windowing works by forcing your data smoothly to zero at exactly the start and end of the sequence, but not before. Shortening your window destroys information unnecessarily.

So your window length should match the length of your sample sequences. For instance, with 1024 samples, your window length should be 1024.
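As a minimal sketch of that rule, assuming the 16384 Hz rate and 1024-sample chunks from the question: the function name `windowed_spectrum` and the synthetic 1000 Hz test tone are made up for illustration, and a real chunk would come from PyAudio instead.

```python
import numpy as np

RATE = 4096 * 4   # sampling rate from the question, 16384 Hz
CHUNK = 1024      # samples per chunk, i.e. 62.5 ms at this rate

def windowed_spectrum(chunk, rate=RATE):
    """Return (frequencies in Hz, magnitudes) for one real-valued chunk."""
    chunk = np.asarray(chunk, dtype=float)
    window = np.hanning(len(chunk))            # window length == chunk length
    magnitudes = np.abs(np.fft.rfft(chunk * window))
    freqs = np.fft.rfftfreq(len(chunk), d=1.0 / rate)
    return freqs, magnitudes

# Synthetic stand-in for one captured chunk: a 1000 Hz tone.
t = np.arange(CHUNK) / RATE
freqs, mags = windowed_spectrum(np.sin(2 * np.pi * 1000.0 * t))
peak_hz = freqs[np.argmax(mags)]               # bin centre closest to the tone
```

The window is built from `len(chunk)` rather than a separate constant, so it cannot end up shorter or longer than the data it is applied to.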

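If the highest frequency you want to resolve is 3 kHz, your sampling rate must exceed 6 kHz (the Nyquist criterion), which your 16384 Hz rate already does. To separate closely spaced frequencies, use longer sequences: 8192 samples or more, such as 16384 or 32768 samples, and try various sampling rates.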

Also, try a different FFT algorithm, different sample lengths, and different windows, including the Hann (Hanning), but also other windows with better side lobe attenuation, such as the Blackman-Harris series, and the Kaiser-Bessel series, etc.

If your application is noisy, you may have to choose between the better noise suppression windows, and the higher spectral resolution windows. So it's a good idea to try different windows, so you can find the best one for your application.
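Here is a rough way to compare windows in numpy, as a sketch only. NumPy itself ships `hanning`, `blackman` and `kaiser`; the Blackman-Harris window is not in NumPy but is available in SciPy as `scipy.signal.windows.blackmanharris`. The test tone, the Kaiser `beta` of 14, the 10-bin guard band and the helper name `leakage_db` are all arbitrary choices for illustration.

```python
import numpy as np

RATE = 4096 * 4   # Hz, as in the question
N = 1024          # samples per chunk

# A tone that falls halfway between two FFT bins, the worst case for leakage.
t = np.arange(N) / RATE
tone = np.sin(2 * np.pi * 1000.0 * t)   # 1000 Hz / 16 Hz per bin = bin 62.5

windows = {
    "rectangular": np.ones(N),
    "hann": np.hanning(N),
    "blackman": np.blackman(N),
    "kaiser(beta=14)": np.kaiser(N, 14.0),
}

def leakage_db(window, guard=10):
    """Strongest bin more than `guard` bins from the peak, in dB relative to the peak."""
    mags = np.abs(np.fft.rfft(tone * window))
    peak = int(np.argmax(mags))
    far = np.ones(len(mags), dtype=bool)
    far[max(0, peak - guard):peak + guard + 1] = False   # mask out the main lobe
    return 20 * np.log10(mags[far].max() / mags.max())

results = {name: leakage_db(w) for name, w in windows.items()}
for name, db in results.items():
    print(f"{name:>16}: {db:7.1f} dB")
```

More negative numbers mean less energy leaking into distant bins. The price is a wider main lobe, which is the resolution trade-off mentioned above.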

Now, write down your results with each setup (i.e. with each window, sample length, sampling rate, etc.), and look for results that agree across multiple setups. You will learn much about your data, and very likely find the answer to your problem.

You can do this with Matlab: http://www.mathworks.com/help/techdoc/ref/fft.html

Or with this online FFT spectrum analyzer: http://www.sooeet.com/math/fft.php

And don't forget to post your results here.

6
votes

The critical factor is how much resolution you need in the frequency domain to discriminate between different vowels. Resolution is 1 / T, where T is the duration of your FFT window. So if you sample for 62.5 ms then your best possible resolution is 16 Hz (i.e. each FFT bin is 16 Hz wide), provided your FFT length equals the number of samples in that interval (1024 samples). If you use a smaller FFT then your resolution worsens proportionately, e.g. a 512-point FFT would only have a resolution of 32 Hz.
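To make the arithmetic concrete, here is a small sketch assuming the 16384 Hz sampling rate and the 300 Hz to 3000 Hz band from the question. The helper names `bin_width_hz` and `band_bins` are hypothetical.

```python
import math

RATE = 4096 * 4   # Hz, sampling rate from the question

def bin_width_hz(fft_size, rate=RATE):
    """Width of one FFT bin: rate / fft_size, which equals 1 / T."""
    return rate / fft_size

def band_bins(fft_size, f_lo=300.0, f_hi=3000.0, rate=RATE):
    """First and last bin index whose centre frequency lies in [f_lo, f_hi]."""
    width = bin_width_hz(fft_size, rate)
    return math.ceil(f_lo / width), math.floor(f_hi / width)

for size in (512, 1024, 2048):
    first, last = band_bins(size)
    print(f"N={size}: {bin_width_hz(size):g} Hz per bin, "
          f"bins {first}..{last} cover the band")
```

At 1024 samples the band of interest spans bins 19 to 187, so roughly 170 bins are available to tell vowels apart; halving the FFT size halves that count.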