I want to reliably convert both recorded audio (captured through a microphone) and processed audio (read from a WAV file) to the same discretized representation in Python using specgram.
My process is as follows:
- get raw samples (read from a file or streamed from the mic)
- perform some normalization (???)
- perform a windowed FFT to generate a spectrogram (frequency vs. time, with amplitude peaks)
- discretize the peaks in the audio, then store them
Basically, by the time I reach the final discretization step, I want to arrive, as reliably as possible, at the same values in frequency/time/amplitude space for the same song.
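For reference, here is a minimal sketch of what I mean by that pipeline (file-based path only; the mic path would just replace the `wavfile.read` call with streamed samples). The file name, the NFFT/overlap values, and the crude one-peak-per-frame picking are placeholders, not decisions I'm committed to:

```python
import numpy as np
from scipy.io import wavfile
from matplotlib import mlab

# Hypothetical file name; mic capture would provide the samples instead.
rate, samples = wavfile.read("song.wav")
if samples.ndim > 1:             # mix stereo down to mono
    samples = samples.mean(axis=1)
samples = samples.astype(np.float64)

# Windowed FFT: mlab.specgram returns power binned over (frequency, time).
spec, freqs, times = mlab.specgram(
    samples,
    NFFT=4096,                   # window length (placeholder value)
    Fs=rate,
    noverlap=2048,               # 50% overlap between windows
    window=mlab.window_hanning,
)

# Work in log space so a constant gain change only shifts the whole
# spectrogram by a constant offset instead of rescaling it.
log_spec = 10 * np.log10(spec + 1e-10)

# Crude peak picking: keep the strongest frequency bin per time frame.
# A real fingerprinter would look for local maxima in both axes.
peak_bins = log_spec.argmax(axis=0)
peaks = [(times[t], freqs[f], log_spec[f, t]) for t, f in enumerate(peak_bins)]
print(peaks[:5])
```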
My problem is: how do I account for the volume (i.e., the amplitudes of the samples) being different between recorded and WAV-read audio?
My options for normalization (maybe?), each sketched in code after this list:
- Divide all samples in the window by the mean before the FFT
- Detrend all samples in the window before the FFT
- Divide all samples in the window by the maximum-amplitude sample value (sensitive to noise and outliers) before the FFT
- Divide all amplitudes in the spectrogram by the mean
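To make those options concrete, here is roughly how I imagine each one in numpy/scipy terms; the function names, the small epsilon guards, and the use of absolute values in the mean/max options are my guesses about what's actually meant, not settled choices:

```python
import numpy as np
from scipy.signal import detrend

def normalize_window(window, method):
    """Sketch of the per-window options, applied to raw samples before the FFT."""
    window = window.astype(np.float64)
    if method == "mean":
        # Option 1: divide by the mean level of the window. The signed mean of
        # audio is ~0, so I assume the mean of absolute values is what's intended.
        return window / (np.mean(np.abs(window)) + 1e-12)
    if method == "detrend":
        # Option 2: remove the DC offset / linear trend before the FFT.
        return detrend(window)
    if method == "peak":
        # Option 3: divide by the max-amplitude sample (noise/outlier sensitive).
        return window / (np.max(np.abs(window)) + 1e-12)
    raise ValueError(f"unknown method: {method}")

def normalize_spectrogram(spec):
    # Option 4: normalize the spectrogram itself rather than the samples,
    # e.g. divide every bin by the mean amplitude of the whole spectrogram.
    return spec / (spec.mean() + 1e-12)
```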
How should I tackle this problem? I have almost no signal processing knowledge or experience.