There have been countless discussions on Stackoverflow and beyond about FFT and pitch detection.
It's generally accepted that FFT, while fast, is not very accurate for a lot of applications but it's not often explained why.
I'd like to explain my understanding of why this is the case, in the hope that someone smarter than me can correct me and fill in the blanks where I cannot.
FFT converts input data from the time domain to the frequency domain.
Initially, we start with a series of data which, if we were to plot on a graph, would have the amplitude of the sound at a given point in time on the Y axis, and time along the X axis. This is in the time domain.
FFT converts these values of amplitudes at points in time, to amplitudes at different frequencies.
The number of data output from the FFT is the SAME as the number of data input
If we input the amplitudes for 10 points in time (10 samples) FFT will output the amplitudes for 10 different frequencies within those samples (after multiplying the sqrt of the imaginary and real numbers).
Which frequencies is determined by the following:
We call the output from an FFT a bin, the width of each bin is calculated by dividing the sample rate by the number of samples in the FFT:
bin width = Sample Rate(Hz)/FFT Length (n samples)
With some real values, that could be:
bin_width = 44100 / 512 = 86.132
So we have 512 bins from our FFT (remember it's the same no. of data in as out), and each one spans 86.132 Hz.
So for a given bin, we can calculate which frequency it represents by:
Bin Freq (Hz) = Bin number (n) * bin width (Hz)
Using the values from above, the 3rd bin in the FFT output would represent the amplitude at 258.398Hz:
Bin Freq (Hz) = 3 * 86.132 = 258.396Hz
This means that with the given sample rate and buffer size, the FFT output cannot be any more accurate than ± 86.132Hz.
If you require greater accuracy (say 1Hz) you'd have to either reduce the sample rate, or increase the buffer size (or both).
desired bin width: 1Hz = 44100 / 44100 # A buffer size of 44100 would work in this instance
As the buffer size gets closer to the sample rate, the issue of latency becomes more of a problem.
FFT Results per second = Sample Rate / Buffer Size = 44100/44100 = 1 FFT per second
(44100 samples per second, to fill a 44100 sample buffer = 1 full buffer per second).
I realise that there's more to FFT than merely calculating the fundamental frequency (the bin with the highest amplitude), but is my understanding of FFT in Pitch Detection correct so far?
Are there any ways of improving the accuracy of an FFT without sacrificing latency?
N
equal chunks. Depending on the implementation, different choices ofN
are allowed but2
is virtually always allowed.3
and5
are less common.4
is redundant (that's just 2x2), as is6
(2x3). A factor7
starts to become less useful as you can pad up to 8. – MSalters