Not sure I understand some of that. IOCP typically does not use the hEvent field in the OVL struct - I/O completion is signaled by queueing a completion message to the 'completion port', (ie. a queue). You seem to be using the hEvent field for some 'unusual' extra signaling to manage a single send data buffer and OVL block.
Obviously, I don't have the whole story from your post, but it looks to me like you are making heavy work for yourself on the tx side, and serialising the sends will strangle performance:)
Do you HAVE to use the same OVL/buffer object for successive sends? What I usually do is use a different OVL/buffer for each send and just queue it up immediately. The kernel will send the buffers in sequence and return a completion message for each one. There is no problem with multiple IOCP tx requests on a socket - that's what the OVL block is for - to link them together inside the kernel stack.
There is an issue with having multiple IOCP receive requests outstanding on a socket - two pool threads can get completion packets for the same socket at the same time, possibly resulting in out-of-order processing. Fixing that issue 'properly' requires something like an incrementing sequence-number in each rx buffer/OVL object issued, plus a critical-section and buffer-list in each socket object to 'save up' out-of-order buffers until all the earlier ones have been processed. I have a suspicion that many IOCP servers just dodge this issue by only having one rx IOCP request in at a time, (probably at the expense of performance).
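To make the 'save up' idea concrete, here is a minimal sketch of the per-socket reordering, with the Winsock parts stripped out so it stands alone. `RxBuffer`, `SocketReorderer` and the field names are all made up for illustration - the real thing would live inside your socket object, with the sequence number stamped into each rx buffer/OVL as it is issued:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// One received buffer; seq is stamped when the rx request was issued.
struct RxBuffer {
    uint64_t seq;
    std::string data;
};

class SocketReorderer {
public:
    // Called from a pool thread with a completed rx buffer.
    // Returns whatever is now deliverable in sequence order.
    std::vector<RxBuffer> onCompletion(RxBuffer buf) {
        std::lock_guard<std::mutex> lock(mtx_);
        pending_.emplace(buf.seq, std::move(buf));
        std::vector<RxBuffer> ready;
        // Drain consecutive sequence numbers starting at nextSeq_;
        // anything else stays 'saved up' in the map.
        for (auto it = pending_.find(nextSeq_); it != pending_.end();
             it = pending_.find(nextSeq_)) {
            ready.push_back(std::move(it->second));
            pending_.erase(it);
            ++nextSeq_;
        }
        return ready;
    }

private:
    std::mutex mtx_;                       // the 'critical-section'
    uint64_t nextSeq_ = 0;
    std::map<uint64_t, RxBuffer> pending_; // out-of-order buffer-list
};
```

If seq 1 completes before seq 0, the first call returns nothing and the second call returns both buffers in order.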
Getting through a lot of buffers in this way could be somewhat taxing if they are being continually constructed and destroyed, so I don't normally bother and just create a few thousand of them at startup and push them, (OK, pointers to them), onto a producer-consumer 'pool queue', popping them off when a tx or rx is required and pushing them back on again. In the case of tx, this would happen when a send completion message is picked up by one of the IOCP pool threads. In the case of rx, it would happen when a pool thread, (or some other thread that has had the object queued to it by a pool thread), has processed it and no longer needs it.
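A cut-down sketch of that pool queue, again with hypothetical names (`Buffer`, `BufferPool`) and a plain condition variable standing in for whatever producer-consumer queue you already have. Pointers go on and off the queue; the buffers themselves live for the life of the process:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <vector>

struct Buffer { char data[4096]; };   // size is a placeholder

class BufferPool {
public:
    explicit BufferPool(std::size_t n) {
        storage_.resize(n);
        for (auto& b : storage_) free_.push(&b);  // push pointers, not copies
    }
    // Pop a buffer for a tx or rx; blocks if the pool is empty.
    Buffer* pop() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !free_.empty(); });
        Buffer* b = free_.front();
        free_.pop();
        return b;
    }
    // Repool a buffer once a completion has been fully processed.
    void push(Buffer* b) {
        { std::lock_guard<std::mutex> lock(mtx_); free_.push(b); }
        cv_.notify_one();
    }

private:
    std::vector<Buffer> storage_;  // owns all buffers, created once at startup
    std::queue<Buffer*> free_;
    std::mutex mtx_;
    std::condition_variable cv_;
};
```

The nice side-effect is flow control for free: if every buffer is in flight, pop() blocks until completions start repooling them.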
Ahh.. you want to send exactly the same content to the list of sockets - like a chat server type thingy.
OK. So how about one buffer and multiple OVL blocks? I have not tried it, but don't see why it would not work. In the single buffer object, keep an atomic reference count of how many overlapped send requests you have sent out in your 'send to all clients' loop. When you get the buffers back in the completion packets, decrement the refCount towards zero and delete/repool the buffer when you get down to 0.
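The refcount part of that, minus the Winsock calls, might look something like this - `SharedSendBuffer`, `addRef` and `release` are invented names, and `delete` stands in for pushing the buffer back onto the pool queue:

```cpp
#include <atomic>
#include <string>

// One payload shared by every client's overlapped send.
struct SharedSendBuffer {
    std::string payload;
    std::atomic<int> refCount{0};
};

// Call once per client, BEFORE issuing that client's overlapped send.
void addRef(SharedSendBuffer* b) {
    b->refCount.fetch_add(1, std::memory_order_relaxed);
}

// Call from the completion handler when a send's OVL comes back.
// Returns true if this was the last outstanding send and the buffer
// has been freed/repooled.
bool release(SharedSendBuffer* b) {
    if (b->refCount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete b;   // or push back onto the pool queue
        return true;
    }
    return false;
}
```

One wrinkle to watch: if you increment as you go round the 'send to all clients' loop, an early completion could see the count hit zero before the loop has finished. Safest is to set refCount to the full client count (or add one extra ref you release after the loop) before issuing any sends.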
I think that should work, (?).