This actually sounds like two separate issues (at least your proposed solution does), so I will give a couple of pointers on each, since it is not entirely clear what you are trying to accomplish.
1. Redundant State Changes / Draw Calls
You can always queue up your render commands and then sort them (do not worry, sorting makes it sound more complicated/expensive than it actually is) by texture / shader / other expensive state before you actually do the drawing.
What you would really do is create different categories (buckets) to put drawing commands into, depending on which texture they require, whether they are translucent or opaque, etc., and then run through the categories in a systematic order once you have received all of the drawing commands needed to complete your frame. The only real sorting occurs at insertion time, and because each bucket is relatively small, it is far less expensive than sorting a random mess of commands when it comes time to draw.
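The bucketing described above can be sketched roughly as follows. This is a minimal illustration, not any particular engine's API: `DrawCmd`, `RenderQueue`, and the field names are all hypothetical, and the sort key assumes shader switches cost more than texture switches.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical draw-command record; the fields are assumptions for
// illustration, not a specific engine's API.
struct DrawCmd {
    uint32_t textureId;   // expensive state: group commands that share it
    uint32_t shaderId;    // even more expensive: sort on it first
    bool     translucent; // translucent draws must keep back-to-front order
    float    depth;       // view-space depth, used to order translucents
    uint32_t meshId;      // what to actually draw
};

class RenderQueue {
public:
    void submit(const DrawCmd& cmd) {
        if (cmd.translucent) {
            translucent_.push_back(cmd); // sorted once, by depth, at flush
        } else {
            // The key groups commands sharing expensive state, so the only
            // "sorting" is an ordered-map insertion into a small bucket.
            uint64_t key = (uint64_t(cmd.shaderId) << 32) | cmd.textureId;
            opaque_[key].push_back(cmd);
        }
    }

    // Walk buckets in key order: each shader/texture pair is bound once.
    template <typename DrawFn>
    void flush(DrawFn draw) {
        for (auto& [key, bucket] : opaque_)
            for (auto& cmd : bucket) draw(cmd);
        // Translucent pass last, back to front.
        std::sort(translucent_.begin(), translucent_.end(),
                  [](const DrawCmd& a, const DrawCmd& b) { return a.depth > b.depth; });
        for (auto& cmd : translucent_) draw(cmd);
        opaque_.clear();
        translucent_.clear();
    }

private:
    std::map<uint64_t, std::vector<DrawCmd>> opaque_;
    std::vector<DrawCmd> translucent_;
};
```

In a real engine the `draw` callback would bind state only when the key changes; the point here is just that commands arrive in arbitrary order and come back out grouped.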
This is how high-performance game engines have worked virtually since the dawn of Quake. The idea is to minimize texture changes and draw calls as much as possible. In the old days, draw calls themselves carried a lot of expense (requiring vertex array memory to be copied from the CPU to the GPU, and kernel-mode context switches in some APIs); they are still expensive today, but for different reasons. If you can combine as many order-independent draw operations as possible into single calls, you will often dramatically improve performance.
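Combining order-independent draws into a single call usually looks like a sprite/quad batcher: consecutive quads that share a texture get appended to one vertex buffer and issued together. A bare sketch, with `Vertex`, `SpriteBatch`, and the quad layout all being assumptions for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vertex { float x, y, u, v; };

// Hypothetical sprite batcher: consecutive sprites sharing a texture are
// appended to one vertex buffer, so each batch becomes a single draw call.
class SpriteBatch {
public:
    void add(uint32_t textureId, const Vertex quad[4]) {
        // Start a new batch only when the texture actually changes.
        if (batches_.empty() || batches_.back().textureId != textureId)
            batches_.push_back({textureId, {}});
        auto& verts = batches_.back().vertices;
        // Expand the quad into two triangles: (0,1,2) and (0,2,3).
        const int idx[6] = {0, 1, 2, 0, 2, 3};
        for (int i : idx) verts.push_back(quad[i]);
    }

    // One draw call per batch at flush time.
    size_t drawCallCount() const { return batches_.size(); }

private:
    struct Batch { uint32_t textureId; std::vector<Vertex> vertices; };
    std::vector<Batch> batches_;
};
```

Ten thousand sprites drawn from three textures become three draw calls instead of ten thousand, provided their order does not matter.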
In fact, PowerVR does something similar to this at the hardware level. It waits for all draw commands and then divides the screen up into tiles, where it can go about determining which commands are redundant (e.g. hidden surfaces) and cull them before it has to shade anything. This reduces memory bandwidth and power consumption, as long as the draw operations are not order-dependent (e.g. alpha blended).
2. Inefficient / Non-Persistent use of GPU Storage
In the worst case, you can always consider packing your textures into an atlas. This way you do not have to break draw calls apart in order to swap bound textures; you just have to compute your texture coordinates more intelligently.
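The coordinate computation is just a pixel-rectangle-to-normalized-UV mapping. A tiny sketch (the `UVRect` struct and `atlasUV` helper are made up for illustration):

```cpp
struct UVRect { float u0, v0, u1, v1; };

// Map a sprite's pixel rectangle inside an atlas to normalized [0,1]
// texture coordinates; every sprite in the atlas then draws with the
// same bound texture, only with different UVs.
UVRect atlasUV(int x, int y, int w, int h, int atlasW, int atlasH) {
    return { float(x)     / atlasW, float(y)     / atlasH,
             float(x + w) / atlasW, float(y + h) / atlasH };
}
```

In practice you would also inset the rectangle by half a texel to avoid bleeding from neighboring sprites when filtering is enabled.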
Along the same lines, GUIs often draw the same text across many frames. You can easily write your software to cache rendered strings / formatted paragraphs / etc. as textures. You can extend this to entire GUI windows if you are clever, and only re-pack the portion of the texture that stores a rendered window when something in it has to be re-drawn.
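A string cache can be as simple as a map from text to texture, so the expensive rasterize-and-upload step runs at most once per unique string. A minimal sketch, where `TextCache` and `rasterizeToTexture` are hypothetical stand-ins for whatever text renderer you actually use:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical string-to-texture cache. rasterizeToTexture() stands in for
// the real (expensive) CPU rasterization + GPU upload path.
class TextCache {
public:
    uint32_t textureFor(const std::string& text) {
        auto it = cache_.find(text);
        if (it != cache_.end()) return it->second; // reused across frames
        uint32_t tex = rasterizeToTexture(text);   // expensive: only on miss
        cache_.emplace(text, tex);
        return tex;
    }

    // Call when the text content changes, so it gets re-rendered next frame.
    void invalidate(const std::string& text) { cache_.erase(text); }

    size_t rasterizations = 0; // exposed here only to demonstrate cache hits

private:
    uint32_t rasterizeToTexture(const std::string&) {
        return uint32_t(++rasterizations); // fake texture id for the sketch
    }
    std::unordered_map<std::string, uint32_t> cache_;
};
```

A production version would key on font and size as well, evict least-recently-used entries, and allocate regions out of a shared atlas rather than one texture per string.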