4
votes

I am working on a tool to compare two wave files for similarity in their waveforms. Ex, I have a wave file of duration 1min and i make another wave file using the first one but have made each 5sec data at an interval of 5sec to 0. Now my software will tell that there is waveform difference at time interval 5sec to 10sec, 15sec to 20sec, 25sec to 30 sec and so on...

As of now, with initial development, this is working fine. Following are 3 test sets:

  1. I have two wave files with sampling rate of 960Hz, mono, with no of data samples as 138551 (arnd 1min 12sec files). I am using 128 point FFT (splitting file in 128 samples chunk) and results are good.

  2. When I use the same algorithm on wave files of sampling rate 48KHz, 2-channel with no of data samples 6927361 for each channel (arnd 2min 24 sec file), the process becomes too slow. When I use 4096 point FFT, the process is better.

  3. But, the 4096 point FFT on files of 22050Hz, 2-channels with number of data samples 55776 for each channel (arnd 0.6sec file) gives very poor results. In this case 128 point FFT gives good result.

So, I am confused on how to decide the length of FFT so that my results are good in each case.

I guess the length should depend on number of samples and sampling rate. Please give your inputs on this.

Thanks

2
There are 100 things that you have to learn before you attempt something like this, to name the few - FFT windowing, sound resampling (sample rate changing), current techniques for audio fingerprinting, ... Determining correct ALGORITHM for sound matching should be FIRST and it should come from the REQUIREMENTS of your sound matching.Daniel Mošmondor
@DanielMošmondor Right now, I am not worried about different sampling rates of files as one of my input file is made using the other one. So, sampling rate is least of my concern. I am just comparing the waveforms and trying to find how similar they are by taking FFT and comparing the magnitude of the freq components. The problem I am facing is deciding in the length of FFT, to keep it fixed or dependent on sampling rate and total no of samples (or length of wav file)Garfield

2 Answers

5
votes

The length of the FFT, N, will determine the resolution in the frequency domain:

resolution (Hz) = sample_rate (Hz) / N

So for example in case (1) you have resolution = 960 / 128 = 7.5 Hz. SO each bin in the resulting FFT (or presumably the power spectrum derived from this) will be 7.5 Hz wide, and you will be able to differentiate between frequency components which are at least this far apart.

Since you don't say what kind of waveforms these are, or what the purpose of your application is, it's hard to know what kind of resolution you need.

One important further point - many people using FFT for the first time are unaware that in general you need to apply a window function prior to the FFT to avoid spectral leakage.

1
votes

I have to say I have found your question very cryptic. I think you should look into Short-time Fourier transform. The reason I say this is because you are looking at quite a large amount of samples if you use a sampling frequency of 44.1KhZ over 2mins with 2 channels. One fft across the entire amount will take quite a while indeed, not to mention the estimate will be biased as the signals mean and variance will change drastically over the whole duration. To avoid this you want to frame the time-domain signal first, these frames can be as small as 20ms-40ms (commonly used for speech) and often overlapping (Welch method of Spectral Estimation). Then you apply a window function such as Hamming or Hanning window to reduce spectral leakage and calculate an N-Point fft for each frame. Where N is the next power of two above the number of samples in that frame. For example:

  1. Fs = 8Khz, single channel;
  2. time = 120sec;
  3. no_samples = time * Fs = 960000 ;
  4. frame length T_length= 20ms;
  5. frame length in samples N_length = 160;
  6. frame overlap T_overlap= 10ms;
  7. frame overlap in samples N_overlap= 80;
  8. Num of frames N_frames = (no_samples - (N_length-N_overlap))/N_overlap = 11999;
  9. FFT length = 256;

So you will be processing 11999 frames in total, but your FFT length will be small. You will only need an FFT length of 256 (next power of two above frame length 160). Most algorithms that implement the fft require the signal length and fft length to be the same. All you have to do is append zeros to your framed signal up until 256. So pad each frame with x amount of zeros, where x = FFT_length-N_length. My latest android app does this on recorded speech and uses the short-time FFT data to display the Spectrogram of speech and also performs various spectral modification and filtering, its called Speech Enhancement for Android