The docs say numpy's FFT is based on FFTPACK.
In the FFTPACK docs I find the following:
subroutine rffti(n,wsave)
subroutine rffti initializes the array wsave which is used in both
rfftf and rfftb. the prime factorization of n together with a
tabulation of the trigonometric functions are computed and stored in
wsave.
The standard Cooley-Tukey algorithm is "radix-2 with decimation in time", which recursively reduces the computation of an FFT of size 2*n into 2 FFTs of size n, plus n FFTs of size 2. There is a general factorization version of the same algorithm that turns an FFT of size m*ninto n FFTs of size m plus m FFTs of size n. The fact that the preparation routines in FFTPACK compute the prime factorization of the input size, seems to indicate this is what they are doing. So unless you go for a prime number of elements, or your number of elements has a very large prime factor, you should still get a pretty good speed-up.
A few years back I blogged about both the radix-2 and general factorization versions of Cooley-Tukey's algorithmn. Reading those may help understand what seems to be going on inside NumPy. The following picture, taken from there, depicts a CT FFT:
