I'm developing a way to compare two spectrograms and score their similarity. I have been thinking for a long time how to do so, how to pick the whole model/approach. Audioclips I'm using to make spectrograms are recordings from android phone, i convert them from .m4a to .wav and then process them to plot the spectrogram, all in python.
All audio recordings have same length
Thats something that really help because all the data can then be represented in the same dimensional space.
I filtered the audio using Butterworth Bandpass Filter, which is commonly used in voice filtering thanks to its steady behavior in the persisted part of signal. As cutoff freq i used 400Hz
and 3500Hz
After this procedure the output looks like this
My first idea was to find region of interest using OpenCV on that spectrogram, so i filtered color and get this output, which can be roughly use to get the limits of the signal, but that will make every clip different lenght and i perhaps dont want that to happen
Now to get to my question - i was thinking about embedding those spectrograms to multidimensional points and simply score their accuracy as the distance to the most accurate sample, which would be visualisable thanks to dimensionality reduction in some cluster-like space. But thats seems to plain, doesnt involve training and thus making it hard to verify. SO
Is there any possibility to use Convolution Neural Network, or combination of networks like CNN -> delayed NN to embed this spectogram to multidim point and thus making it possible to not compare them directly but comparing output of the network?
If there is anything i missed in this question please comment, i would fix that right away, thank you very much for your time.
Josef K.
EDIT:
After tip from Nikolay Shmyrev i switched to using the Mel spectrogram:
That looks much more promising, but my question remains almost the same, can i use pretrained CNN models, like VGG16 to embed those spectrograms to tensors and thus being able to compare them ?? And if so, how? Just remove last fully connected layer and Flatten it instead?