No. You can (and should) have multiple work-groups per CU and more than one thread per processing element. Each CU can hold up to 40 wavefronts of 64 threads each, so the maximum number of parallel threads is 44*40*64=112640. However, you often cannot use all of these threads, because other resources can limit the number of threads per CU. For example, there is only a limited number of registers per CU, so if each wavefront uses too many of them, the maximum number of parallel wavefronts is lower.
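As a rough illustration of the register limit, here is a minimal sketch in Python. The 44 CUs, 40 wavefronts per CU and 64 threads per wavefront come from the numbers above. The split into 4 SIMD units per CU with 256 vector registers per thread lane is an assumption about the hardware, and `waves_per_cu` and `resident_threads` are hypothetical helper names, so check the figures against the documentation for your GPU.

```python
# Figures taken from the answer above.
NUM_CUS = 44
MAX_WAVES_PER_CU = 40
WAVEFRONT_SIZE = 64

# Assumptions (not stated above): 4 SIMD units per CU, each offering
# 256 vector registers per thread lane.
SIMDS_PER_CU = 4
VGPRS_PER_LANE = 256


def waves_per_cu(vgprs_per_thread: int) -> int:
    """Wavefronts one CU can hold when only vector registers are limiting."""
    max_waves_per_simd = MAX_WAVES_PER_CU // SIMDS_PER_CU
    by_registers = VGPRS_PER_LANE // vgprs_per_thread
    return SIMDS_PER_CU * min(max_waves_per_simd, by_registers)


def resident_threads(vgprs_per_thread: int) -> int:
    """Threads that can be resident on the whole GPU at once."""
    return NUM_CUS * waves_per_cu(vgprs_per_thread) * WAVEFRONT_SIZE


if __name__ == "__main__":
    for vgprs in (24, 64, 128):
        print(vgprs, waves_per_cu(vgprs), resident_threads(vgprs))
```

Under these assumptions a kernel that needs 64 registers per thread would fit only 16 of the 40 possible wavefronts on each CU.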
All wavefronts of a work-group are executed on the same CU, as this allows access to a shared memory (LDS) and easy synchronization between the different wavefronts of the work-group. You can choose the work-group size within certain limits. There is a hard limit (more doesn't work) of 256 threads per work-group and a soft limit (reduced performance if you use fewer) of one wavefront, i.e. 64 threads per work-group. Your work-group size should also be a multiple of the wavefront size, so 64, 128, 192 and 256 are the most common choices. Everything else reduces the potential peak performance. However, depending on your problem, a different work-group size might still be better than forcing the problem into one of these choices.
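To see why sizes that are not a multiple of 64 cost performance, here is a small sketch that counts how many wavefronts a work-group occupies and what fraction of their lanes do useful work. It assumes a work-group is padded up to whole wavefronts. `lane_utilization` is a hypothetical helper name.

```python
WAVEFRONT_SIZE = 64
MAX_WORK_GROUP_SIZE = 256  # hard limit mentioned above


def lane_utilization(work_group_size: int) -> tuple[int, float]:
    """Return (wavefronts occupied, fraction of their lanes doing work)."""
    if not 1 <= work_group_size <= MAX_WORK_GROUP_SIZE:
        raise ValueError("work-group size must be between 1 and 256")
    # Round up to whole wavefronts using integer arithmetic.
    wavefronts = (work_group_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE
    return wavefronts, work_group_size / (wavefronts * WAVEFRONT_SIZE)


if __name__ == "__main__":
    for size in (32, 64, 100, 192, 256):
        print(size, lane_utilization(size))
```

A work-group of 100 threads, for example, occupies two wavefronts (128 lanes) but only keeps about 78% of them busy.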
Because each work-group can use at most 256 threads, multiple work-groups can be executed on each CU in parallel. If you use the maximum work-group size of 256 threads, you need at least 112640/256=440 work-groups in order to use all threads of the GPU. If you have more work-groups, up to 440 of them will execute in parallel and the remaining groups will be executed as older groups finish. If you have fewer work-groups, not all threads will be occupied, which can lead to decreased performance. If you pick smaller work-groups, you will need more of them, e.g. 1760 work-groups with a work-group size of 64.
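The same arithmetic for other work-group sizes, as a short sketch. It assumes that wavefront slots are the only limit (no register or LDS pressure) and that a work-group cannot be split across CUs. The helper names are hypothetical.

```python
NUM_CUS = 44
MAX_WAVES_PER_CU = 40
WAVEFRONT_SIZE = 64


def groups_per_cu(work_group_size: int) -> int:
    """Work-groups that fit on one CU when only wavefront slots limit."""
    waves_per_group = (work_group_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE
    return MAX_WAVES_PER_CU // waves_per_group


def groups_to_fill_gpu(work_group_size: int) -> int:
    """Work-groups needed to keep every CU fully occupied."""
    return NUM_CUS * groups_per_cu(work_group_size)


def min_rounds(num_groups: int, work_group_size: int) -> int:
    """Rough lower bound on how many 'fills' of the GPU a launch needs.

    The hardware starts new groups as soon as old ones finish, so this
    is only an estimate, not a model of the scheduler.
    """
    capacity = groups_to_fill_gpu(work_group_size)
    return (num_groups + capacity - 1) // capacity


if __name__ == "__main__":
    for size in (64, 128, 192, 256):
        print(size, groups_to_fill_gpu(size))
```

Note that a size of 192 leaves one wavefront slot per CU unused in this model, because 40 is not a multiple of 3.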
Using too much of the shared memory (LDS) can also limit the number of work-groups per CU.
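A sketch of how the LDS limit could be estimated. The 64 KB of LDS per CU is an assumption, not something stated above, so look up the real value for your device. The helper name is hypothetical and the model ignores registers and any other per-CU limits.

```python
MAX_WAVES_PER_CU = 40
WAVEFRONT_SIZE = 64

# Assumption: 64 KB of LDS per CU; check this for your device.
LDS_BYTES_PER_CU = 64 * 1024


def groups_per_cu(work_group_size: int, lds_bytes_per_group: int) -> int:
    """Work-groups per CU, limited by wavefront slots and LDS usage."""
    waves_per_group = (work_group_size + WAVEFRONT_SIZE - 1) // WAVEFRONT_SIZE
    by_waves = MAX_WAVES_PER_CU // waves_per_group
    if lds_bytes_per_group == 0:
        return by_waves  # kernel uses no LDS, so only wavefront slots limit
    by_lds = LDS_BYTES_PER_CU // lds_bytes_per_group
    return min(by_waves, by_lds)


if __name__ == "__main__":
    print(groups_per_cu(256, 0))
    print(groups_per_cu(256, 16 * 1024))
    print(groups_per_cu(64, 4 * 1024))
```

With these numbers, a 256-thread work-group that allocates 16 KB of LDS would drop from 10 to 4 resident groups per CU.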
The processing elements execute the instructions. Under optimal conditions one instruction can be started per cycle.