I am trying to learn deep learning, specifically convolutional neural networks (CNNs), and I'd like to apply a simple network to some audio data. As far as I understand, CNNs are mostly used for image and object recognition, so when working with audio people usually feed the network a spectrogram (typically a mel-spectrogram) instead of the time-domain signal. My question is: is it better to use an image of the spectrogram (i.e. its RGB or greyscale pixel values) as the input to the network, or should I use the 2D magnitude values of the spectrogram directly? Does it even make a difference?
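To make the two options concrete, here is a minimal sketch of what I mean. It is only an illustration under my own assumptions: it uses a plain NumPy STFT on a synthetic tone rather than a real mel-spectrogram, and the helper names `log_spectrogram` and `to_grayscale` are made up for this example. Option 1 keeps the float dB values; option 2 scales and rounds them to 8-bit greyscale, the way saving the spectrogram as an image would.

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128):
    """Log-magnitude spectrogram (dB) from a plain NumPy STFT (not mel-scaled)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, frames)
    return 20.0 * np.log10(magnitude + 1e-10)          # small offset avoids log(0)

def to_grayscale(spec_db):
    """Min-max scale to 0..255 and round, as an 8-bit greyscale image would store it."""
    lo, hi = spec_db.min(), spec_db.max()
    return np.round(255.0 * (spec_db - lo) / (hi - lo)).astype(np.uint8)

# Synthetic test signal (assumption): one second of a 440 Hz tone plus light noise
sr = 8000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(sr)

spec_db = log_spectrogram(signal)   # option 1: float 2D array fed directly to the CNN
spec_img = to_grayscale(spec_db)    # option 2: 8-bit greyscale "image" of the same data

# Map the pixel values back to dB to see how much the 8-bit rounding changed them
lo, hi = spec_db.min(), spec_db.max()
restored = spec_img.astype(np.float64) / 255.0 * (hi - lo) + lo
max_error_db = np.abs(restored - spec_db).max()
step_db = (hi - lo) / 255.0         # dB covered by one grey level

print(spec_db.shape, spec_db.dtype, spec_img.dtype)
print("max rounding error (dB):", max_error_db, "grey-level step (dB):", step_db)
```

In this sketch the greyscale version holds the same 2D array, only rescaled to 0..255 and rounded, so the two differ by at most half a grey level. What I can't tell is whether that difference (or the extra colormap step for an RGB image) matters to the network in practice.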
Thank you.