Definitely get rid of pow(): pow(x, 2) goes through a general-purpose library call, while x * x is a single multiply. Beyond that, how much you can optimize depends on how you are using this. Are you computing it once for a very large N and it takes too long? Or, more likely, are you computing it many times in a tight loop?
If N is very large (>1000 or so), there are highly optimized numerical libraries that can do this for you. BLAS, for example, has an *nrm2 function that computes the Euclidean norm (snrm2, dnrm2, scnrm2, or dznrm2, depending on the data type: single, double, complex single, or complex double). GotoBLAS is probably the fastest out there for certain processor architectures. MKL features Intel's hand-tuned BLAS implementation, but it's not free. Finally, ATLAS is a self-tuning library implementing BLAS.
If you have a tight loop with small or moderately sized N, then you may have to do some hand-tuning to get it faster. You could turn on auto-vectorization with GCC's -O3 or -ftree-vectorize flags. You could also vectorize by hand with intrinsics, but that takes some effort to learn.
You could do loop unrolling (that is, divide N up into chunks of, say, 4 and explicitly write out the computation for 4 successive values inside the loop body). This nudges the compiler into using more registers for intermediate results, and registers are the fastest form of memory you have to work with. You may also be able to take advantage of prefetching (pulling a stretch of data into cache ahead of when it is needed), since the access pattern over each chunk is predictable.
Another thing to try in this situation is overwriting one of your inputs. That is, maybe you could get away with writing the output into p or q. This helps because the elements of p you just read are still in the cache when you are ready to write. A write-back cache often won't push data out to memory until it absolutely has to, for example when that cache line must be evicted to make room for another, and by writing into one of your inputs you use fewer cache lines overall.
There are a half million other things to try, but I think I'll stop here. Good luck!