I think I understand how a fully-connected layer is converted to a convolutional layer, according to cs231n:
FC->CONV conversion. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we’ll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layers with 1000 neurons that compute the class scores. We can convert each of these three FC layers to CONV layers as described above: ...
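To check my understanding of the quoted conversion, here is a minimal NumPy sketch. The sizes are scaled down (7x7x8 and 16 outputs stand in for 7x7x512 and 4096), and `conv_valid` is a helper I wrote for illustration, not a library function. It reshapes the FC weight matrix into filters that cover the whole input volume and compares the two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins for the 7x7x512 volume and the 4096-unit FC layer.
H, W, C, K = 7, 7, 8, 16

x = rng.standard_normal((H, W, C))          # activation volume, channels last
W_fc = rng.standard_normal((K, H * W * C))  # FC weights: one row per output neuron
b = rng.standard_normal(K)

# Fully-connected layer: flatten the volume, then a matrix-vector product.
fc_out = W_fc @ x.reshape(-1) + b           # shape (K,)

# The same weights viewed as K filters, each of size H x W x C.
filters = W_fc.reshape(K, H, W, C)

def conv_valid(x, filters, b):
    """Naive convolution with no padding and stride 1 (illustrative helper)."""
    K, fh, fw, _ = filters.shape
    out_h = x.shape[0] - fh + 1
    out_w = x.shape[1] - fw + 1
    out = np.empty((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + fh, j:j + fw, :]
            # Contract the filter height, width and channel axes with the patch.
            out[i, j] = np.tensordot(filters, patch, axes=3) + b
    return out

conv_out = conv_valid(x, filters, b)        # shape (1, 1, K)
print(conv_out.shape, np.allclose(conv_out[0, 0], fc_out))

# On a larger input the same filters slide, so the output is no longer 1x1.
x_big = rng.standard_normal((H + 2, W + 2, C))
big_out = conv_valid(x_big, filters, b)
print(big_out.shape)
```

With the filter as large as the input, there is exactly one position for it, which is why the output is 1x1xK here.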
However, I was reading a paper that uses a fully convolutional regression network to predict a density map. In the description of the architecture, the authors claim that the middle layer (e.g. in the top row; A and B are just two different models), which goes from 12x12x128 to 12x12x512, is fully connected but implemented as a convolution:

What I don't understand is this: according to cs231n, the output of an FC layer implemented as a convolution should be a vector-like volume such as 1x1x4096. How can the paper's FC-as-convolution layer have an output of size 12x12x512?
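To make the mismatch concrete, here is a small shape check. The kernel sizes and padding are my own guesses for illustration, not values taken from the paper, and `out_size` is a hypothetical helper using the standard output-size formula. A kernel covering the whole 12x12 input gives the 1x1 output I expected from cs231n, while a small kernel (for example 1x1, which acts like the same 128-to-512 FC layer applied at every spatial position) keeps the 12x12 grid.

```python
import numpy as np

def out_size(n, k, pad=0, stride=1):
    """Spatial output size of a convolution (standard formula)."""
    return (n + 2 * pad - k) // stride + 1

# Kernel as large as the input: the cs231n-style conversion, output is 1x1.
full = out_size(12, 12)
# Assumed small kernels that preserve the spatial size.
one_by_one = out_size(12, 1)
three_padded = out_size(12, 3, pad=1)
print(full, one_by_one, three_padded)

# A 1x1 convolution written as one matrix product per pixel.
rng = np.random.default_rng(0)
x = rng.standard_normal((12, 12, 128))      # input volume, channels last
W = rng.standard_normal((512, 128))         # assumed 128 -> 512 weights
y = np.einsum('hwc,kc->hwk', x, W)          # shape (12, 12, 512)
print(y.shape)

# Each output pixel is an FC layer applied to that pixel's 128 channels.
print(np.allclose(y[3, 5], W @ x[3, 5]))
```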