7
votes

I am new to FFTs and signal processing, so hopefully this question makes sense and/or isn't stupid.

I would like to perform spectrum analysis on a live audio signal. My goal is to find a good tradeoff between responsiveness and frequency resolution, such that I can take a guess at the pitch of the incoming audio in near-realtime.

From what I've gathered about the math behind the Fourier transform, there is an inherent balance between sample size and frequency resolution. The bigger the sample, the better resolution. Since I am trying to minimize sample size (to attain the near-realtime requirement), this means my resolution suffers (each slot in the output buffer corresponds to a wide frequency range, which is undesirable).
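To put rough numbers on that tradeoff, here is a small sketch. The 44.1 kHz sample rate and the buffer sizes are assumptions picked for illustration, not values from my actual setup; the function names are made up too.

```python
def bin_width_hz(sample_rate_hz, n_samples):
    """Spacing between adjacent FFT output bins, in Hz."""
    return sample_rate_hz / n_samples


def buffer_latency_s(sample_rate_hz, n_samples):
    """Time needed to collect one full input buffer, in seconds."""
    return n_samples / sample_rate_hz


# Assumed example values: 44.1 kHz audio and a few candidate buffer sizes.
for n in (512, 2048, 8192):
    width = bin_width_hz(44100, n)
    latency_ms = buffer_latency_s(44100, n) * 1000
    print(n, "samples:", round(width, 2), "Hz per bin,",
          round(latency_ms, 1), "ms to fill")
```

The bin width is just the reciprocal of the buffer duration, so shrinking the buffer always widens the bins by the same factor.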

However, for my intended application, I don't care about most of the spectrum. I only need spectrum info for a narrow frequency range, say 100 Hz to 1600 Hz. Is there any way to modify an FFT implementation so that I can improve the resolution of the frequency-domain output while keeping the input buffer size constant (and small)? In other words, can I trade total output bandwidth for output resolution? If so, how is this done?

Although I have a weak grasp of the math at best, it seems that padding the input buffer with zeros might be interesting, no?

Thanks in advance for any help you can offer.

6
Padding the input buffer with zeros is the same as the signal suddenly slamming to silence - the abrupt transition will generate its own frequencies in the output. - Mark Ransom
That's true, but padding with zero won't have a net effect on the overall relative frequency distribution, right? And the increased buffer size would give me better output resolution too, right? It does feel like "getting something for nothing", though :/ - gga80
It is more useful to insert zeroes between input samples, rather than padding the input buffer (i.e. insert N zero samples between each real sample). This gives a higher resolution output spectrum, although no new information is actually gained - in effect it's just interpolation. - Paul R

6 Answers

8
votes

You can't get additional information from nowhere, but you can reduce latency by overlapping successive FFTs. For real-time power spectrum estimates it's common to overlap successive input windows by 50%.
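A minimal sketch of the overlap idea, in plain Python. The helper name `overlapping_frames` and the tiny frame size are made up for illustration; each returned frame would then be windowed and passed to whatever FFT routine you use.

```python
def overlapping_frames(samples, frame_size, overlap=0.5):
    """Split samples into frames that share `overlap` of their length."""
    hop = max(1, int(frame_size * (1.0 - overlap)))  # samples between frame starts
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(samples[start:start + frame_size])
    return frames


# Toy input: 16 samples, frames of 8 with 50% overlap, so a hop of 4.
signal = list(range(16))
frames = overlapping_frames(signal, frame_size=8, overlap=0.5)
for frame in frames:
    print(frame)
```

With 50% overlap a new spectrum becomes available every half buffer instead of every full buffer, while each spectrum still covers the full buffer length.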

Inserting zeroes between samples is another useful trick - it gives you more apparent resolution in the output bins, but in reality all you are doing is interpolating, i.e. there is no additional information gained (of course). You might find this technique useful though, in addition to the overlap suggestion above.

3
votes

As Mark says, adding zeros will introduce harmonics (unwanted frequencies).

Also, when you say "bigger the sample", do you mean more samples, or a higher sample rate? A higher sample rate will result in more samples per unit of time, but it seemed like you meant more samples at a fixed sample rate (i.e. analyzing larger chunks of time).

You mentioned an upper frequency of 1600 Hz, so you will need a sample rate of AT LEAST 3200 Hz, i.e. double (the Nyquist rate).

As for the period of time to process at once: you will need to trade responsiveness (a 10 second buffer will take 10 s plus processing time before you get the result) against noise reduction. Smaller buffers are more likely to pick up spurious noise signals.

As an aside, thinking in the frequency domain can be challenging at first. What helped me most was not the various applied maths classes I took at university, but a crystallography class. A crystal diffraction pattern is merely a 2D Fourier transform. Getting a handle on how a diffraction pattern visually relates to the crystal structure proved very useful when it came to working with FFTs of seismic data in my first job.

3
votes

I don't think there is a 'trick' to outperform the FFT. "Adding zeros" can also mean oversampling the signal. To get rid of the harmonics, the signal would have to be filtered (which will most certainly introduce extra noise). Then you would do a longer FFT, but after that the overall resolution will still be the same.

Also your windowing function will broaden the frequency peaks in your results.

OTOH, if a frequency falls between two FFT bins, it is possible to get a better resolution by looking at the ratio of the neighboring bins: http://www.tedknowlton.com/resume/FFT_Bin_Interp.html

But this does not work for more complex signals (with many simultaneous frequencies).
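As a rough illustration of refining a peak from its neighboring bins, here is a sketch that uses parabolic interpolation on log magnitudes. Note that this is a common stand-in, not necessarily the exact ratio formula from the linked page. The slow direct DFT, the 8000 Hz sample rate and the 450 Hz test tone are all assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of bins 0..N/2 using a direct (slow) DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for i, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * i / n)
        mags.append(abs(acc))
    return mags


def interpolated_peak_hz(samples, sample_rate_hz):
    """Estimate a single tone's frequency from the peak bin and its neighbors."""
    n = len(samples)
    # Hann window, so the main lobe is smooth enough to fit a parabola to.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(samples)]
    mags = dft_magnitudes(windowed)
    k = max(range(1, len(mags) - 1), key=lambda i: mags[i])
    # Fit a parabola through the log magnitudes of bins k-1, k, k+1.
    a, b, c = (math.log(mags[j] + 1e-12) for j in (k - 1, k, k + 1))
    offset = 0.5 * (a - c) / (a - 2 * b + c)  # fraction of a bin
    return (k + offset) * sample_rate_hz / n


# Assumed test signal: a 450 Hz sine that falls between two 31.25 Hz bins.
rate, size = 8000.0, 256
tone = [math.sin(2 * math.pi * 450.0 * i / rate) for i in range(size)]
estimate = interpolated_peak_hz(tone, rate)
print("interpolated estimate in Hz:", estimate)
```

This only makes sense when one clean peak dominates the bins around it, which is the limitation mentioned above.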

  • If you want to know if certain frequencies are present, I would look into filters and correlation.

  • If you want to nail down one certain frequency, you can first filter it out and then detect the zero-crossings. There are many parameters when designing a filter, so filter length is only one parameter that leads to a certain filter (step-) response time. You can do this for more than one frequency, one after the other...
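A minimal sketch of the zero-crossing step from the second bullet. It assumes the signal has already been filtered down to a single dominant tone (the filter itself is not shown), and the function name and test tone are made up for illustration.

```python
import math


def zero_crossing_pitch_hz(samples, sample_rate_hz):
    """Estimate frequency from the timing of rising zero crossings."""
    crossings = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev < 0.0 <= cur:
            # Linear interpolation for the sub-sample crossing position.
            frac = -prev / (cur - prev)
            crossings.append((i - 1 + frac) / sample_rate_hz)
    if len(crossings) < 2:
        return None  # not enough crossings to measure a period
    periods = len(crossings) - 1
    return periods / (crossings[-1] - crossings[0])


# Assumed test signal: 0.1 s of a 440 Hz sine sampled at 8 kHz.
rate = 8000.0
tone = [math.sin(2 * math.pi * 440.0 * i / rate + 0.3) for i in range(800)]
pitch = zero_crossing_pitch_hz(tone, rate)
print("zero-crossing estimate in Hz:", pitch)
```

On real audio the result depends heavily on how well the preceding filter isolates the tone, since noise and harmonics add extra crossings.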

Addition: Some intuition:

  1. Because the FFT is sufficient for reconstruction, there are in principle infinitely many higher-resolution spectra that lead to the same sample vector, and none is more correct than the others. The bin interpolation essentially calculates another ('better fitting') representation than the evenly spaced bins of the Fast Fourier Transform.

  2. In the discrete, quantized case, e.g. 8-bit, think about two frequencies that are very close. If the difference is small enough, they would yield the same, say 256, samples. But looking at more samples (maybe 1024) you would notice that the difference becomes big enough to be seen.

PS: The filtering for oversampling can also be done after the FFT by simply ignoring the higher bins.

2
votes

You could low-pass filter the data at 1600 Hz (or somewhat higher, say 2k), and then resample to a lower samplerate (twice the filter frequency e.g. 4k) to reduce the number of samples. Then use zero-padding to increase the frequency resolution.
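A rough sketch of the filter-then-resample step, using a windowed-sinc FIR low-pass in plain Python. The 32 kHz input rate, the 1800 Hz cutoff, the 129 taps and the decimation factor of 8 are all assumptions for illustration; in practice a signal-processing library would do this more efficiently. The zero-padding step is not shown here.

```python
import math


def lowpass_taps(cutoff_hz, sample_rate_hz, n_taps=129):
    """Hamming-windowed sinc low-pass filter with unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    mid = (n_taps - 1) / 2
    taps = []
    for i in range(n_taps):
        t = i - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def filter_and_decimate(samples, taps, factor):
    """Apply the filter, keeping only every `factor`-th output sample."""
    out = []
    for start in range(0, len(samples) - len(taps) + 1, factor):
        block = samples[start:start + len(taps)]
        out.append(sum(s * t for s, t in zip(block, taps)))
    return out


def tone(freq_hz, sample_rate_hz, n):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


# Assumed setup: 32 kHz input reduced to 4 kHz (factor 8), cutoff at 1800 Hz.
taps = lowpass_taps(1800.0, 32000.0)
kept = filter_and_decimate(tone(440.0, 32000.0, 2048), taps, 8)
removed = filter_and_decimate(tone(6000.0, 32000.0, 2048), taps, 8)
print("440 Hz RMS after filtering:", rms(kept))
print("6000 Hz RMS after filtering:", rms(removed))
```

The in-band tone passes almost unchanged while the out-of-band tone is strongly attenuated, so it cannot alias into the band of interest after the rate reduction.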

0
votes

Your stated goal is incompatible with your question. The pitch of audio is not the same as the resolved frequency peak. Please read the vast literature on vocal and musical pitch estimation (which applies to many other types of sounds that have a perceived pitch). Adaptive/incremental/sliding time-domain techniques may give you a lower latency than frequency-domain block-based techniques.

Zero padding of the audio sample vector is nearly identical with interpolation of the frequency domain data. If there is little noise or nearby interference, you may find a more accurate (higher "resolution") frequency peak position. But you won't get any better rejection of nearby spectral peaks (separation resolution) or noise.
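A small sketch that shows the interpolation view directly: the DFT of a zero-padded frame contains the original bins at every second position, with the new bins filled in between. The 16-sample frame and the slow direct DFT are assumptions chosen to keep it self-contained.

```python
import cmath
import math


def dft(samples):
    """Direct (slow) DFT, adequate for a tiny demonstration."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(samples))
            for k in range(n)]


# Assumed test frame: 3.3 cycles of a sine in 16 samples (between bins).
frame = [math.sin(2 * math.pi * 3.3 * i / 16) for i in range(16)]
plain = dft(frame)
padded = dft(frame + [0.0] * 16)  # same data, zero-padded to twice the length

# Every even bin of the padded spectrum matches a bin of the plain spectrum.
worst = max(abs(padded[2 * k] - plain[k]) for k in range(16))
print("largest mismatch at shared bins:", worst)
```

The odd bins of the padded result are samples of the same underlying spectrum taken between the original bins, which is why the peak position can be read more finely but nearby peaks are separated no better.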

Windowing the data (von Hann, etc.) before your FFT may help remove some of the noise caused by frequencies that are nearby but more than a bin or two away from the peak.
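A sketch of that effect, comparing leakage into a distant bin with and without a von Hann window. The 64-sample frame, the tone at 10.5 bins and the choice of bin 25 as the "distant" bin are assumptions for illustration.

```python
import cmath
import math


def magnitude_at_bin(samples, k):
    """Magnitude of a single DFT bin, computed directly."""
    n = len(samples)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(samples)))


n = 64
# Worst case for leakage: the tone sits exactly halfway between two bins.
tone = [math.sin(2 * math.pi * 10.5 * i / n) for i in range(n)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
windowed = [x * w for x, w in zip(tone, hann)]

# Leakage at bin 25, relative to the level at the bin nearest the tone.
rect_leak = magnitude_at_bin(tone, 25) / magnitude_at_bin(tone, 10)
hann_leak = magnitude_at_bin(windowed, 25) / magnitude_at_bin(windowed, 10)
print("relative leakage without window:", rect_leak)
print("relative leakage with Hann window:", hann_leak)
```

The window costs some main-lobe width (the peak spreads over more adjacent bins) in exchange for far less energy smeared into bins further away.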

Added: unless your after-sampling low-pass filter is nearly perfect and phase linear, you could actually lose frequency resolution near the edges of your desired frequency band. Filtering doesn't add any actual information into the band of interest, so is of no help in increasing "resolution". Windowing is more likely to reduce interference from other frequencies.

0
votes

You might want to look into Compressed Sensing. You can sample (and store) what is essentially a pre-compressed signal which you can reconstruct later. As long as the signal sparsity is high (which will probably be the case in your situation), the Nyquist-Shannon constraint can be bent somewhat. The downside is that the post-processing to recreate the original signal can be computationally expensive. Also, you're probably going to have to develop your own device drivers to manage whatever hardware you're using to sample your signal, since the factory drivers probably assume you're interested in adhering to the Nyquist-Shannon constraint. More information can be found here.