Input data for convolutional neural network

Question

I am trying to learn deep learning, specifically convolutional neural networks, and I'd like to apply a simple network to some audio data. As far as I understand, CNNs are often used for image and object recognition, so when working with audio, people often use the spectrogram (specifically the mel-spectrogram) instead of the time-domain signal. My question is: is it better to use an image of the spectrogram (i.e. RGB or greyscale values) as the input to the network, or should I use the 2D magnitude values of the spectrogram directly? Does it even make a difference?

Thank you.

You might find this helpful: Convolutional Neural Network (CNN) for Audio. — rrao
Thanks @rrao, I have seen this already and it doesn't really answer my question. I also disagree with the answer you referenced: the only thing that spectrograms "throw away" is the phase information. — nevos

Prune · Accepted Answer · 2016-06-20T19:23:19

The spectrogram image is a lovely representation, especially for describing the process to a human. Functionally, rendering the spectrogram as an image is merely a simplification of the input data: it adds no information and loses a smidgen of accuracy, which probably doesn't matter. That extra preprocessing step doesn't buy you anything, so just use the 2D magnitude values directly and let the CNN take things from there.
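To make the "smidgen of accuracy" concrete, here is a minimal sketch using only NumPy. It is an illustration under stated assumptions, not a recommended pipeline: the test signal, the frame length and hop size, the plain (non-mel) STFT, the 8-bit greyscale quantization, and the `(batch, channel, freq, time)` input layout are all choices made for this example, and the helper names `log_spectrogram` and `to_greyscale` are hypothetical.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-magnitude spectrogram in dB, computed with a plain NumPy STFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, freq_bins)
    # Small offset avoids log(0); transpose to (freq_bins, n_frames).
    return 20.0 * np.log10(magnitude.T + 1e-10)

def to_greyscale(spec_db):
    """Quantize a spectrogram to 8-bit pixels, as saving it as an image would."""
    lo, hi = spec_db.min(), spec_db.max()
    scaled = (spec_db - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8), lo, hi

# Synthetic one-second test signal (assumed): two sine tones at 8 kHz sampling.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spec_db = log_spectrogram(signal)       # the 2D values themselves
image, lo, hi = to_greyscale(spec_db)   # the "image of the spectrogram"

# Map the pixels back to dB and measure what the image round trip lost.
restored = image.astype(np.float64) / 255 * (hi - lo) + lo
max_err = np.abs(restored - spec_db).max()
print("spectrogram shape:", spec_db.shape)
print("max quantization error (dB):", max_err)

# Feeding the values directly: add batch and channel axes (layout assumed).
cnn_input = spec_db[np.newaxis, np.newaxis, :, :].astype(np.float32)
print("CNN input shape:", cnn_input.shape)
```

The greyscale image carries the same 2D array, only rounded to 256 levels, so the network sees nothing new. An RGB rendering through a colormap would spread that single channel over three without adding information either.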
