Generally we will insert max-pooling layers between convolution layers. The main idea is to "summarize" the features in conv. layers. But it's hard to decide when to insert. I have some questions behind this:
how to decide how many conv. layers until we insert a max-pooling. and what's the effect of too many/few conv. layers
as max-pooling will reduce the size. so if we want to use very deep network, we can not do many maxpooling otherwise the size is too small. For example, the MNIST only have 28x28 input, but I do see some people use very deep network to experiment with it, so they might end up with very small size? actually when the size is too small (extreme case, 1x1), its' just like a fully-connected layer, and it seems doing convolution on them doesn't make any sense.
I know there is no golden role but I just want to figure the basic intuition behind this, so that I can make reasonable choice when implement a network


