0
votes

I am looking to understand various spectrograms for audio analysis. I want to convert an audio file into 10 second chunks, generate spectrograms for each and use a CNN model to train on top of those images to see if they are good or bad.

I have looked at linear, log, mel, etc and read somewhere that mel based spectrogram is best to be used for this. But with no proper verifiable information. I have used the simple following code to generate mel spectrogram.

y,sr= librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))

My question is which spectrogram best represents features of an audio file for training with CNN? I have used linear but some audio files the linear spectrogram seems to be the same

2
You will need to specify the task in greater detail. "to see if they are good or bad" is not a task. What are your target labels? Tempo? Beats? Chords? Key? General tags? Audio quality?... Depending on the task you'd choose an appropriate spectrogram.Hendrik
Well how about Noise. Continuous, Intermittent, Impulsive and such others. Popping, Static, Crackling are few more. I am trying to get samples of those individual noise samples so i can augment and create more samples for myself.Teja8484
I'm not convinced a spectrogram is the best signal representation for this kind of problem, as the noise you describe is not harmonic in nature. Instead, you might be better off using Conv1D with large strides on a downsampled signal (time domain, not frequency domain). After all, you are looking for outliers (cracks), not harmonic/percussive patterns. Just an idea. That said, if you go the spectrum route, cracks show up as vertical lines (over all freq bands), because they are percussive in nature, so Mel or linear does not matter—hop-size (time resolution) is the deciding factor. Good luck.Hendrik
Can you upload two of the problematic spectrograms?Jon Nordby

2 Answers

3
votes

To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.

For their data, they achieved nearly identical classification accuracy between simple STFTs and melspectrograms. So melspectrograms seem to be the clear winner for dimension reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a db scale) improves accuracy. You can easily do this with Librosa (using your code) like this:

y,sr= librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.core.power_to_db(S)

As for normalization after db-scaling, that seems hit or miss depending on your data. From the paper above, the authors found nearly no difference using various normalization techniques for their data.

One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization. I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases seems to do as well as or better than logmelspectrograms. You can implement it in Librosa like this:

y,sr= librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)

Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.

1
votes

Log-scaled mel-spectrograms is the current "standard" for use with Convolutional Neural Networks. It was the most commonly used in Audio Event Detection and Audio Scene Classification literature between 2015-2018.

To be more invariant to amplitude changes, normalized is usually applied. Either to entire clips or the windows being classified. Mean/std normalization works fine, generally.

But from the perspective of a CNN, there is relatively small difference between the different spectrometer variations. So this is unlikely to fix your issue if two or more spectrograms are basically the same.