4
votes

I want to reliably convert both recorded audio (captured through a microphone) and processed audio (read from a WAV file) to the same discretized representation in Python using specgram.

My process is as follows:

  1. get raw samples (read from file or stream from mic)
  2. perform some normalization (???)
  3. perform an FFT with windowing to generate a spectrogram (plotting frequency vs. time with amplitude peaks)
  4. discretize the peaks in the audio, then store them (see the sketch after this list)
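
For reference, here is a minimal sketch of steps 1 and 3 for the WAV-file path, assuming matplotlib's `mlab.specgram` and a hypothetical file `song.wav`; microphone capture would feed its raw samples through the same function:

```python
import numpy as np
from scipy.io import wavfile
from matplotlib import mlab

def wav_to_spectrogram(path, nfft=4096, noverlap=2048):
    """Steps 1-3: read samples, (optionally) normalize, compute a spectrogram."""
    rate, samples = wavfile.read(path)     # step 1: raw samples from file
    if samples.ndim > 1:                   # mix stereo down to mono
        samples = samples.mean(axis=1)
    samples = samples.astype(np.float64)
    # step 2: normalization would go here (see the options below)
    spec, freqs, times = mlab.specgram(samples, NFFT=nfft, Fs=rate,
                                       noverlap=noverlap)  # step 3
    return spec, freqs, times

spec, freqs, times = wav_to_spectrogram("song.wav")
# step 4: find peaks in `spec` and discretize their (time, frequency) coordinates
```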

Basically, by the time I reach the final discretization step, I want to arrive, as reliably as possible, at the same values in frequency/time/amplitude space for the same song.

My problem is: how do I account for the volume (i.e., the amplitudes of the samples) being different between recorded audio and WAV-read audio?

My options for normalization (maybe?), sketched in code after this list:

  • Divide all samples in the window by the mean before the FFT
  • Detrend all samples in the window before the FFT
  • Divide all samples in the window by the maximum-amplitude sample value (sensitive to noise and outliers) before the FFT
  • Divide all amplitudes in the spectrogram by the mean
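
To make the options concrete, here is a rough sketch of the three pre-FFT candidates; I interpret "divide by mean" as dividing by the mean absolute sample value so a near-zero DC mean doesn't blow the scaling up. The fourth option would simply be `spec / spec.mean()` on the resulting spectrogram.

```python
import numpy as np
from scipy.signal import detrend

def normalize_window(window, method="mean"):
    """Illustrative pre-FFT normalizations for one window of samples."""
    window = np.asarray(window, dtype=np.float64)
    if method == "mean":        # divide by the mean absolute sample value
        return window / np.mean(np.abs(window))
    if method == "detrend":     # remove a linear trend from the window
        return detrend(window)
    if method == "max":         # divide by the peak sample (outlier-sensitive)
        return window / np.max(np.abs(window))
    raise ValueError("unknown method: %s" % method)
```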

How should I tackle this problem? I have almost no signal processing knowledge or experience.

1
Since the noise will probably never have the highest amplitudes, you can 1) divide each sample by its respective maximum amplitude before the FFT; 2) then multiply by the target common amplitude, also before the FFT; 3) perform the FFT on the normalized samples - Saullo G. P. Castro
I am not sure I understand the question, but if all you care about is the peak locations in the spectrogram, there is no need to normalize amplitude. - Bjorn Roche
@SaulloCastro: what is a "target common amplitude"? - lollercoaster
@BjornRoche: My understanding is that the spectrogram represents the amplitude of the signal (which is tied to the volume) as a function of the window's time and frequency. If that thinking is correct, wouldn't it make sense to normalize amplitude to account for different scalar multiples of volume? - lollercoaster
@lollercoaster By target common amplitude I meant the common amplitude that you want to have at the end, the common volume... - Saullo G. P. Castro

1 Answer

3
votes

The spectra of the WAV file and the recorded audio are never going to have exactly the same shape, because the audio data from the microphone source undergoes additional disturbances on its way to your computer. These disturbances could be equalized out, but that's probably more work than you want to do.

As far as normalization goes, I'd recommend scaling the microphone signal's spectrum so that its energy matches that of the WAV file's spectrum (where "energy" is the sum of the squared magnitudes of the FFT coefficients).
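
As a rough sketch of that idea (the array names here are assumptions, not an established API), assuming `mic_spec` and `wav_spec` are FFT coefficient arrays for corresponding segments:

```python
import numpy as np

def match_energy(mic_spec, wav_spec):
    """Scale the mic spectrum so its energy (sum of squared magnitudes of the
    FFT coefficients) matches the energy of the WAV file's spectrum."""
    mic_energy = np.sum(np.abs(mic_spec) ** 2)
    wav_energy = np.sum(np.abs(wav_spec) ** 2)
    return mic_spec * np.sqrt(wav_energy / mic_energy)
```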

Now, you mentioned that you want the signals' spectrograms to be as similar as possible. Since a spectrogram is a plot of a signal's spectrum over time, you might want to experiment with renormalizing at each time interval vs. just normalizing once over the entire audio recording.
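
A sketch of the two choices, assuming `spec` is a magnitude spectrogram with shape (frequencies, time frames):

```python
import numpy as np

def normalize_per_frame(spec, eps=1e-12):
    """Rescale each time column to unit energy (renormalize per interval)."""
    energy = np.sum(np.abs(spec) ** 2, axis=0, keepdims=True)
    return spec / np.sqrt(energy + eps)

def normalize_global(spec, eps=1e-12):
    """Rescale the whole spectrogram to unit total energy (normalize once)."""
    return spec / np.sqrt(np.sum(np.abs(spec) ** 2) + eps)
```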