Good and efficient deep architectures are introduced quite frequently, so I suppose any answer you'll get here will have a short expiration date.
Regarding "feature extraction time", I suppose you mean the duration of the forward pass only - you are not interested in training time. This time will also depend on the layer which you want to extract: the deeper the layer, the longer the time (for the same net) since it requires more computations to get deeper into any specific net. However, for different nets it often takes different times to reach the same "depth" since the computations at each depth are different.
Nevertheless, roughly around the time the Oxford VGG lab introduced VGG_CNN_S, Google came up with GoogLeNet: a very deep architecture for recognition, designed with an extra effort to keep the computational burden within reason. It's worth giving it a try.
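If you want to compare the forward-pass cost of the two families yourself, a quick sketch like the one below works. VGG_CNN_S itself is not shipped with torchvision, so vgg16 stands in for the Oxford models here; that substitution is my assumption, not part of the original answer:

    import time
    import torch
    import torchvision.models as models

    def forward_time(model, repeats=10):
        """Average time of a full forward pass on one image."""
        model.eval()
        x = torch.randn(1, 3, 224, 224)
        with torch.no_grad():
            model(x)                            # warm-up
            start = time.perf_counter()
            for _ in range(repeats):
                model(x)
        return (time.perf_counter() - start) / repeats

    # vgg16 is a stand-in for the Oxford VGG nets; GoogLeNet is the Inception v1 model
    for name, net in [("vgg16", models.vgg16(weights=None)),
                      ("googlenet", models.googlenet(weights=None, aux_logits=False))]:
        print(f"{name}: {forward_time(net) * 1e3:.1f} ms per forward pass")

On most hardware GoogLeNet's forward pass is noticeably cheaper than the VGG-style nets, which is the point about keeping the computational burden within reason.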