The reason for the relatively poor performance of CUDA shared memory when using larger arrays may have to do with the fact that each multiprocessor has a limited amount of available shared memory.
Each multiprocessor hosts several processors; for modern devices, typically 32, the number of threads in a warp. This means that, in the absence of divergence or memory stalls, the average processing rate is 32 instructions per cycle (latency is high due to pipelining).
CUDA schedules several blocks to a multiprocessor. Each block consists of several warps. When a warp stalls on a global memory access (even coalesced accesses have high latency), other warps are processed. This effectively hides the latency, which is why high-latency global memory is acceptable in GPUs. To effectively hide latency, you need enough extra warps to execute until the stalled warp can continue. If all warps stall on memory accesses, you can no longer hide the latency.
Shared memory is allocated to blocks in CUDA, and stored on a singly multiprocessor on the GPU device. Each multiprocessor has a relatively small, fixed amount of shared memory space. CUDA cannot schedule more blocks to multiprocessors than the multiprocessors can support in terms of shared memory and register usage. In other words, if the amount of shared memory on a multiprocessor is X and each block requires Y shared memory, CUDA will schedule no more than floor(X/Y) blocks at a time to each multiprocessor (it might be less since there are other constraints, such as register usage).
Ergo, by increasing shared memory usage of a block, you might be reducing the number of active warps - the occupancy - of your kernel, thereby hurting performance. You should look into your kernel code by compiling with the -Xptxas="-v" flag; this should give you register and shared & constant memory usage for each kernel. Use this data and your kernel launch parameters, as well as other required information, in the most recent version of the CUDA Occupancy Calculator to determine whether you might be affected by occupancy.
EDIT:
To address the other part of your question, assuming no shared memory bank conflicts and perfect coalescing of global memory accesses... there are two dimensions to this answer: latency and bandwidth. The latency of shared memory will be lower than that of global memory, since shared memory is on-chip. The bandwidth will be much the same. Ergo, if you are able to hide global memory access latency through coalescing, there is no penalty (note: the access pattern is important here, in that shared memory allows for potentially more diverse access patterns with little to no performance loss, so there can be benefits to using shared memory even if you can hide all the global memory latency).