7
votes

There have been countless discussions on Stackoverflow and beyond about FFT and pitch detection.

It's generally accepted that FFT, while fast, is not very accurate for a lot of applications but it's not often explained why.

I'd like to explain my understanding of why this is the case, in the hope that someone smarter than me can correct me and fill in the blanks where I cannot.


FFT converts input data from the time domain to the frequency domain.

Initially, we start with a series of data which, if we were to plot on a graph, would have the amplitude of the sound at a given point in time on the Y axis, and time along the X axis. This is in the time domain.

FFT converts these values of amplitudes at points in time, to amplitudes at different frequencies.

The number of data output from the FFT is the SAME as the number of data input

If we input the amplitudes for 10 points in time (10 samples) FFT will output the amplitudes for 10 different frequencies within those samples (after multiplying the sqrt of the imaginary and real numbers).

Which frequencies is determined by the following:

We call the output from an FFT a bin, the width of each bin is calculated by dividing the sample rate by the number of samples in the FFT:

bin width = Sample Rate(Hz)/FFT Length (n samples)

With some real values, that could be:

bin_width = 44100 / 512 = 86.132

So we have 512 bins from our FFT (remember it's the same no. of data in as out), and each one spans 86.132 Hz.

So for a given bin, we can calculate which frequency it represents by:

Bin Freq (Hz) = Bin number (n) * bin width (Hz)

Using the values from above, the 3rd bin in the FFT output would represent the amplitude at 258.398Hz:

Bin Freq (Hz) = 3 * 86.132 = 258.396Hz

This means that with the given sample rate and buffer size, the FFT output cannot be any more accurate than ± 86.132Hz.

If you require greater accuracy (say 1Hz) you'd have to either reduce the sample rate, or increase the buffer size (or both).

desired bin width: 1Hz = 44100 / 44100  # A buffer size of 44100 would work in this instance 

As the buffer size gets closer to the sample rate, the issue of latency becomes more of a problem.

FFT Results per second = Sample Rate / Buffer Size = 44100/44100 = 1 FFT per second

(44100 samples per second, to fill a 44100 sample buffer = 1 full buffer per second).

I realise that there's more to FFT than merely calculating the fundamental frequency (the bin with the highest amplitude), but is my understanding of FFT in Pitch Detection correct so far?

Are there any ways of improving the accuracy of an FFT without sacrificing latency?

4
Note that FFT works the best with buffer of size of a power of two.nalply
@nalply: That's the case for some FFT implementations, but certainly not all. FFT implementations work recursively, and require that at each step the remaining data is partitioned in N equal chunks. Depending on the implementation, different choices of N are allowed but 2 is virtually always allowed. 3 and 5 are less common. 4 is redundant (that's just 2x2), as is 6 (2x3). A factor 7 starts to become less useful as you can pad up to 8.MSalters

4 Answers

3
votes

In addition to the nice answer by @HartmutPfitzinger recommending zero padding and interpolation, it's worth pointing out that there are important fundamental limits on the information you can get from the Fourier transform of a time-limited signal extract.

Think about the limiting case of zero padding -- e.g., taking a single sample, then padding it out to 1 sec duration in order to take a Fourier transform with 1 Hz resolution. It's clear that a very short fragment of signal simply doesn't contain the information about periodicity. Intuitively, we need a fragment longer than the period in question to be able to say anything about whether the signal really repeats at that period.

We can do better if we have constraints on the shape of our periodic signal. For instance, if we are only looking for single sinusoids (i.e., we know our signal is s(t) = A*cos(w*t + phi)), then we can solve for unknown amplitude A, frequency w, and phase phi using as few as three samples of s(t). However, it's pretty rare that we're looking at signals that exactly fit that formulation. At the very least we expect added noise, but most often we have lots of harmonics, i.e., an unknown, non-sinusoidal periodic waveform.

If you try implement interpolated peak picking and/or zero-padding suggested above, and then look at the results you get as you make your signal excerpt shorter (while keeping the FFT length the same), you'll see the uncertainty (error) growing as the fragment gets shorter - to the point where you'll likely get useless results when the fragment is shorter than around twice the cycle length of the periodicity you're trying to measure.

This illustrates a somewhat counter-intuitive but very fundamental limit: It's hard to decide the frequency of a signal to better than 1/T Hz based an observation shorter than T seconds. This is sometimes called the uncertainty principle, and mathematically it's the same as the Heisenberg uncertainty principle from quantum mechanics.

Finally, one other technique I've used to improve the resolution of discrete Fourier transforms is Instantaneous Frequency, as described in:

Toshihiko Abe, Takao Kobayashi, Satoshi Imai: Robust pitch estimation with harmonics enhancement in noisy environments based on instantaneous frequency. ICSLP 1996 (you can find the PDF online, I've run out of my link allowance).

Frequency is just the derivative of phase with respect to time; it turns out you can use "derivative by parts" to directly calculate the instantaneous frequency in each FFT bin by combining the real and imaginary parts of two FFTs using different window functions (one the derivative of the other). For a Matlab implementation, see

http://labrosa.ee.columbia.edu/matlab/chroma-ansyn/ifgram.m

or in Python:

https://github.com/bmcfee/librosa/blob/master/librosa/core.py#L343

3
votes

Concerning your 1st question ("is my understanding of FFT in Pitch Detection correct so far?") I would say yes, but I would like to point to a pitfall:

Using the values from above, the 3rd bin in the FFT output would represent the amplitude at 258.398Hz:

Bin Freq (Hz) = 3 * 86.132 = 258.396Hz

Be aware that the 0th bin represents 0 Hz. This means that the bin representing 3 * 86.132 = 258.396Hz, is in the 4th position of the resulting array.

And to complete this index pitfall, if you have an FFT of 512 points (=fftsize), the index value 256 represents the Nyquist frequency (= sample frequency / 2). This means you always get fftsize/2+1 bins representing the real frequency spectrum, i.e. in your case 257 bins.

Concerning your 2nd question, there are two wide-spread and simple methods to increase the frequency detection precision:

  1. Zero-padding (See e.g. some answers why zero-padding is used)

  2. Parabolic Interpolation (See e.g. the 1st answer)

Finally, a not-asked answer: It is absolutely recommended to apply a window function, not only because of it is the premise for parabolic interpolation but also because it reduces the amplitude of significant artificial side lobes.

3
votes

You need to first understand what 'pitch' really is. When a single note is made on a guitar or piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ).

Below, at GitHub.com, is the C++ source code for an unusual two-stage algorithm that I devised which can do Realtime Pitch Detection on polyphonic MP3 files while being played on Windows. This free application (PitchScope Player, available on web) is frequently used to detect the notes of a guitar or saxophone solo upon a MP3 recording. You could download the executable for Windows to see my algorithm at work on a mp3 file of your choosing. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a MP3 or WAV music file. Note onsets are accurately inferred by a change in the most dominant pitch (a musical note) at any given moment during the MP3 recording.

I use a modified DFT Logarithmic Transform (similar to a FFT) to first detect these possible Harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a Windowing Function to the signal, nor do add and overlap. And I have created the DFT so its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.

My Pitch Detection Algorithm is actually a two stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.

https://github.com/CreativeDetectors/PitchScope_Player

https://en.wikipedia.org/wiki/Transcription_(music)#Pitch_detection

enter image description here

1
votes

Generally, increasing the length of the FFT not only increases the latency but can also make it difficult to detect the frequency since it is probably not constant.
A common practice it to use overlapping and windowing, take a look at this: http://en.wikipedia.org/wiki/Spectral_density_estimation

There are a few method of improving the accuracy of the estimated frequency, after the peak detection. For example, by interpolation on the Fourier coefficients.
Take a look at section 2 here or section 1.3 here