3 votes

I'm developing a way to compare two spectrograms and score their similarity. I have been thinking for a long time about how to do this and how to pick the overall model/approach. The audio clips I use to make the spectrograms are recordings from an Android phone; I convert them from .m4a to .wav and then process them to plot the spectrogram, all in Python.
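For context, the conversion and plotting step looks roughly like this (a minimal sketch, assuming pydub with ffmpeg installed plus scipy and matplotlib; `clip.m4a` is a placeholder file name):

```python
# Minimal sketch of the conversion + spectrogram step (pydub needs ffmpeg
# on the PATH; "clip.m4a" is a placeholder name).
import numpy as np
import matplotlib.pyplot as plt
from pydub import AudioSegment
from scipy import signal
from scipy.io import wavfile

AudioSegment.from_file("clip.m4a", format="m4a").export("clip.wav", format="wav")

rate, samples = wavfile.read("clip.wav")
if samples.ndim > 1:                      # mix stereo down to mono
    samples = samples.mean(axis=1)

f, t, Sxx = signal.spectrogram(samples, fs=rate, nperseg=1024)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))   # power in dB
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.show()
```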

All audio recordings have the same length.

That really helps, because all the data can then be represented in the same dimensional space.

I filtered the audio using a Butterworth bandpass filter, which is commonly used in voice filtering thanks to its flat response in the passband. As cutoff frequencies I used 400 Hz and 3500 Hz.
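The bandpass step can be implemented in scipy like this (a sketch; the filter order and the zero-phase filtering are my choices, not fixed requirements):

```python
# Butterworth bandpass, 400-3500 Hz (sketch; order=5 is an assumption).
from scipy.signal import butter, sosfiltfilt

def bandpass(samples, rate, low=400.0, high=3500.0, order=5):
    sos = butter(order, [low, high], btype="bandpass", fs=rate, output="sos")
    return sosfiltfilt(sos, samples)      # zero-phase: no phase distortion
```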

After this procedure the output looks like this: (figure: filtered spectrogram)

My first idea was to find the region of interest using OpenCV on that spectrogram, so I filtered by color and got this output (figure: mask), which can roughly be used to find the limits of the signal. But that would make every clip a different length, and I probably don't want that to happen.
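The color-filtering step was roughly this (a sketch; the HSV bounds are hypothetical and depend on the colormap used to render the spectrogram):

```python
# Rough ROI extraction by color thresholding (HSV bounds are placeholders).
import cv2
import numpy as np

img = cv2.imread("spectrogram.png")
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([20, 100, 100]), np.array([35, 255, 255]))

# bounding box of the masked pixels ~ rough limits of the signal
x, y, w, h = cv2.boundingRect(cv2.findNonZero(mask))
```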

Now to get to my question: I was thinking about embedding those spectrograms as multidimensional points and simply scoring their accuracy as the distance to the most accurate sample, which would be visualizable thanks to dimensionality reduction in some cluster-like space (a sketch of what I mean is below). But that seems too plain; it doesn't involve training, which makes it hard to verify.
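To make the idea concrete, a minimal sketch of that plain approach (`spectrograms` and `reference` are hypothetical arrays of identical shape):

```python
# Flatten each spectrogram, score by distance to a reference clip,
# and project with PCA for a cluster-like visualization.
import numpy as np
from sklearn.decomposition import PCA

vectors = np.stack([s.ravel() for s in spectrograms])         # (n_clips, n_features)
scores = np.linalg.norm(vectors - reference.ravel(), axis=1)  # lower = closer

points_2d = PCA(n_components=2).fit_transform(vectors)        # 2-D view of the clips
```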

So, is there any possibility of using a convolutional neural network, or a combination of networks like CNN -> time-delay NN, to embed this spectrogram as a multidimensional point, making it possible not to compare the spectrograms directly but to compare the outputs of the network?

If there is anything I missed in this question, please comment; I will fix it right away. Thank you very much for your time.

Josef K.

EDIT:

After a tip from Nikolay Shmyrev I switched to using the Mel spectrogram: (figure: Mel spectrogram)
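For completeness, the Mel spectrogram can be computed with librosa like this (a sketch; `n_mels=128` is a common default, not a tuned value):

```python
# Mel spectrogram with librosa (n_mels and fmax are assumptions).
import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=3500)
mel_db = librosa.power_to_db(mel, ref=np.max)        # log-scaled power
```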

That looks much more promising, but my question remains almost the same: can I use pretrained CNN models, like VGG16, to embed those spectrograms as tensors and thus be able to compare them? And if so, how? Just remove the last fully connected layer and flatten instead?
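What I have in mind is something like this (a sketch, assuming Keras; resizing the spectrogram image to 224x224x3 is my assumption, and `pooling="avg"` stands in for removing the fully connected head):

```python
# VGG16 without the classifier head; global average pooling yields a
# 512-dimensional embedding per image (sketch, not a verified recipe).
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input

model = VGG16(weights="imagenet", include_top=False, pooling="avg",
              input_shape=(224, 224, 3))

def embed(img):                               # img: (224, 224, 3) array
    x = preprocess_input(img[np.newaxis].astype("float32"))
    return model.predict(x)[0]                # shape (512,)
```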


2 Answers

1 vote

In my opinion, and according to Yann LeCun, when you target speech recognition with a deep neural network, you have two obligations:

  • You need to use a recurrent neural network in order to have memory (memory is really important for speech recognition...)

and

  • you will need a lot of training data.

You may try to use an RNN in TensorFlow, but you definitely need a lot of training data.
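A minimal sketch of what that could look like in Keras (the shapes and layer sizes are arbitrary placeholders):

```python
# RNN over spectrogram frames; the final state acts as a clip embedding.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(None, 128)),   # (time_steps, mel_bins)
    layers.LSTM(64),                   # sequence memory
    layers.Dense(32),                  # fixed-size embedding to compare
])
```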

If you don't want to (or can't) find or generate a lot of training data, you will have to forget about deep learning for this problem...

In that case (forgetting deep learning), you may take a look at how Shazam works (it is based on a fingerprinting algorithm); a toy illustration is below.
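A toy version of the fingerprinting idea (real systems, e.g. the one in Wang's 2003 Shazam paper, are far more elaborate; the peak-picking parameters here are arbitrary):

```python
# Pick local spectrogram peaks and hash pairs of nearby peaks.
import numpy as np
from scipy.ndimage import maximum_filter

def fingerprint(Sxx, size=20, fanout=5):
    peaks = (Sxx == maximum_filter(Sxx, size=size)) & (Sxx > Sxx.mean())
    coords = np.argwhere(peaks)                    # (freq_bin, time_bin) pairs
    hashes = set()
    for i, (f1, t1) in enumerate(coords):
        for f2, t2 in coords[i + 1:i + 1 + fanout]:
            hashes.add((f1, f2, t2 - t1))          # (freq, freq, time-delta)
    return hashes

# similarity ~ size of the overlap between two hash sets:
# score = len(fingerprint(A) & fingerprint(B))
```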

1 vote

You can use a CNN, of course; TensorFlow has special classes for that, for example, as do many other frameworks. You simply convert your image to a tensor, apply the network, and as a result you get a lower-dimensional vector you can compare.
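For instance, comparing two such vectors with cosine similarity (a common choice; the vectors `a` and `b` come from whatever network you apply):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = identical direction, 0.0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```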

You can train your own CNN too.

For best accuracy it is better to expand the lower frequencies (the bottom part of the picture) and compress the higher frequencies, since lower frequencies carry more importance for speech. You can read about the Mel scale for more information.
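For reference, the widely used HTK-style conversion from Hz to Mel (one of several conventions):

```python
import numpy as np

def hz_to_mel(f):
    # HTK convention: m = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)
```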