GPU Gems 2 Chapter 46 describes sorting on the GPU.
Consider how you might use the bubble sort algorithm:
If each output pixel reads all N pixels of the column being sorted, the shader will be very slow: N outputs each reading N inputs means N^2 texture accesses per pass (heavily affected by caching), and that's a lot of memory traffic. But it will work, and for small N this code may actually make the most sense.
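Here is a minimal CPU sketch in Python of that brute-force scheme (not actual shader code; the function name is mine): each output position independently scans the whole column and takes the value whose rank matches its own index, exactly the kind of gather a fragment shader can do.

```python
def rank_sort_gather(column):
    """Toy model of the brute-force shader: each output pixel reads the
    entire input column and emits the value whose rank equals its index.
    Ties are broken by original position so duplicates are handled."""
    n = len(column)
    out = []
    for i in range(n):                      # one loop body per output pixel
        for j, v in enumerate(column):      # scan every input pixel
            rank = sum(1 for k, w in enumerate(column)
                       if w < v or (w == v and k < j))
            if rank == i:                   # this value belongs at position i
                out.append(v)
                break
    return out
```

Since every output is computed independently from read-only inputs, this maps directly onto one render pass, which is why the naive version is so easy to write.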
If you access only 2 input pixels per output pixel, you can effectively "swap" a pair of adjacent pixels on each pass. This may require up to N passes to move a pixel N steps. (You can count swaps and break early by using an occlusion query.)
By combining the two techniques you can sample k pixels at a time, allowing a pixel to move up to k steps per pass, thus requiring roughly N/k passes and making more efficient use of resources.
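One way to picture the combined scheme (a sketch under my own assumptions, not the only way to do it): each pass locally sorts windows of k elements, alternating the window offset by k/2 so values can cross window boundaries. With k = 2 this degenerates to the adjacent-swap scheme above; larger k lets an element travel further per pass.

```python
def windowed_sort_passes(a, k):
    """Sketch of the k-samples-per-output idea: each pass sorts disjoint
    windows of k elements, with the window grid shifted by k//2 on
    alternate passes.  Returns the sorted list and the pass count."""
    a = list(a)
    n = len(a)
    passes = 0
    offset = 0
    while a != sorted(a):               # harness-only termination check
        if offset:
            a[0:offset] = sorted(a[0:offset])   # leading partial window
        for i in range(offset, n, k):
            a[i:i + k] = sorted(a[i:i + k])     # sort each k-window locally
        passes += 1
        offset = k // 2 - offset        # toggle grid offset: 0 <-> k//2
    return a, passes
```

Every adjacent pair of positions falls strictly inside some window in at least one of the two grid phases, so inversions keep shrinking until the array is sorted.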
But the real high-performance solutions come from implementing smarter algorithms than bubble sort. The GPU Gems article suggests a bitonic merge sort; even heapsort and quicksort are possible with enough cleverness.
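For a feel of why bitonic merge sort fits the GPU so well, here is a CPU sketch of the classic power-of-two bitonic network: it runs in O(log^2 N) fixed passes, and in each pass every element is compare-exchanged with exactly one partner at a fixed stride, so every output pixel depends on exactly two inputs.

```python
def bitonic_sort(a):
    """CPU sketch of a bitonic sorting network.  Each inner pass is data-
    independent (partner index is i XOR stride), which is what makes it map
    cleanly onto render passes.  Length must be a power of two."""
    a = list(a)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "power-of-two length assumed (pad otherwise)"
    k = 2
    while k <= n:                       # size of bitonic runs being merged
        j = k // 2
        while j > 0:                    # compare-exchange stride this pass
            for i in range(n):
                partner = i ^ j
                if partner > i:         # visit each pair once
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The pass schedule (k, j) is known in advance and identical for every element, so the shader needs no data-dependent branching at all.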
But there's a lot to know to craft an efficient GPU implementation, more than can be covered in general terms here. You may want to move to a CUDA implementation rather than GLSL, but then there are bank conflicts, warps, caching, sampling, and swizzling to contend with. There's no quick answer.
I suggest you give it a try and then ask a more specific question when you get stuck.