First off, interrupts are not needed for this sort of thing. You can poll the timer; an interrupt is overkill here. Yes, there are reasons why those examples use interrupts, but that doesn't mean it is the only way to use a timer.
Guy Sirton's answer is sound, but I prefer assembler because I can control the delay exactly, to the clock cycle (so long as no interrupts or other activity get in the way). A timer is usually easier, though, because the code is a bit more portable: change the processor clock frequency and you have to re-tune a software loop, whereas with a timer sometimes all you have to do is change the init code to use a different prescaler, or change the one line that checks for the computed count. A timer also tolerates interrupts and other activity in the system.
In this case, though, you are talking about 12 MHz and one microsecond, which is 12 clock cycles, or roughly 12 single-cycle instructions, yes? Put in 12 nops. Or branch to some assembler with 10 nops or 8, whatever it comes out to once you compensate for the pipeline flush on the two branches (the call and the return). A timer with interrupts is going to burn more than 12 clock cycles in overhead. Even polling the timer in a loop is going to be sloppy at this scale. A counter loop would work too, but you need to understand the branch costs and tune for them:
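To make the polling idea concrete, here is a minimal sketch in C of waiting on a free-running timer without interrupts. Everything named here is an assumption for illustration: `read_ticks_fn` stands in for whatever reads your timer's count register, the timer is assumed to be 32 bits wide and counting up, and `fake_read_ticks` exists only so the sketch can run on a PC. It suits delays much longer than a microsecond, where the loop overhead does not matter.

```c
#include <stdint.h>

/* Hypothetical hook: returns the current value of a free-running,
   up-counting 32-bit timer. On real hardware this would read a timer
   register; the name and signature are assumptions for this sketch. */
typedef uint32_t (*read_ticks_fn)(void);

/* Busy-wait until at least `ticks` timer counts have elapsed.
   Unsigned subtraction keeps the comparison correct across one
   counter rollover. */
void delay_ticks(read_ticks_fn read_ticks, uint32_t ticks)
{
    uint32_t start = read_ticks();
    while ((uint32_t)(read_ticks() - start) < ticks) {
        /* keep polling */
    }
}

/* Timer counts for a delay of `us` microseconds, given the timer input
   clock in Hz and the prescaler divisor. Changing the clock or the
   prescaler only changes these arguments, not the polling loop. */
uint32_t ticks_for_us(uint32_t timer_hz, uint32_t prescale, uint32_t us)
{
    return (uint32_t)(((uint64_t)timer_hz * us) /
                      ((uint64_t)prescale * 1000000u));
}

/* Host-side stand-in for a hardware counter (not real hardware):
   every read advances the count by 3. */
uint32_t fake_counter;
uint32_t fake_read_ticks(void)
{
    fake_counter += 3;
    return fake_counter;
}
```

On a target you would pass a function that reads the real count register instead of `fake_read_ticks`, for example `delay_ticks(my_timer_read, ticks_for_us(12000000u, 12u, 1000u))` for about one millisecond, where `my_timer_read` is again a hypothetical name.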
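delay_one_ms: @ despite the name, this targets one microsecond
    mov r0,#3
wait:
    sub r0,#1 @ cortex-m3 means thumb/thumb2 and gas complains about subs; in gas's default (non-unified) thumb syntax this 16-bit sub still sets the flags
    bne wait
    nop @ might need some nops to tune the loop accurately
    nop
    bx lr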
Call this function, nominally 30 million times, in a loop that toggles a GPIO LED or sends a UART character at the start and end of each run, then use a stopwatch to check that the blinks are 30 seconds apart.
    ldr r4,=uart_tx_register_address
    mov r5,#0x55
again:
    ldr r6,=24000000 @ number of delay calls per run
    str r5,[r4] @ mark the start of the run
top:
    bl delay_one_ms
    sub r6,#1
    bne top
    str r5,[r4] @ mark the end of the run
    b again
Actually, since I assumed 2 clocks per branch, the test loop itself costs 3 clocks (the sub plus the bne), and the delay is assumed to take 12 clocks in total, so each pass is 15 clocks. 30 seconds is 30,000,000 microseconds, ideally 30 million loops, but I needed 12/15ths of that number, 24,000,000, to compensate. This is far easier if you have an oscilloscope whose timebase is reasonably accurate, or at least as accurate as you want this delay to be.
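As a quick check of that arithmetic, here is a small C sketch that computes how many test-loop passes fill a given run time. The function name is made up for illustration, and the 15 clocks per pass is the assumption from above (12 for the delay plus 3 for the sub and bne), not a measured figure.

```c
#include <stdint.h>

/* Number of test-loop passes needed so that one run lasts `seconds`,
   given the CPU clock in Hz and the assumed clocks per pass
   (delay call plus loop overhead). */
uint32_t test_loop_iterations(uint32_t cpu_hz, uint32_t seconds,
                              uint32_t clocks_per_pass)
{
    return (uint32_t)(((uint64_t)cpu_hz * seconds) / clocks_per_pass);
}
```

With a 12 MHz clock, 30 seconds and 15 clocks per pass this gives 24,000,000; with no loop overhead (12 clocks per pass) it gives the ideal 30,000,000.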
I have not studied ARM's branch costs myself, otherwise I would comment on that. It is likely two or three clocks. So the mov is one clock, the sub is one clock times the number of loops, and the bne is, let's say, two clocks times the number of loops. Add two for the branch to get here and two for the return. 5+(3*loops)+nops=12, so (3*loops)+nops=7: loops is 2 and nops is 1, yes? (The count of 3 and the two nops in the loop version above are only a starting point to tune from.) I think stringing a number of nops together is far easier:
delay_one_ms: @ 8 nops + 2 clocks to branch here + 2 to return = 12, assuming 2 clocks per branch
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    nop
    bx lr
You might have to burn a few more instructions temporarily disabling interrupts, if you use them. If you are looking for "at least" one microsecond, then don't worry about it.
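The cycle budgeting used for both versions (fixed cost, plus cost per loop, plus nops, equals the budget) can be written down as a small C helper. This is only a sketch under the assumptions already made here: 2 clocks per branch, 1 clock for everything else. The struct and function names are made up for illustration, and real branch costs should be measured.

```c
#include <stdint.h>

/* Result of the tuning: loop count and number of padding nops. */
struct delay_tuning {
    uint32_t loops;
    uint32_t nops;
};

/* Solve fixed + per_loop * loops + nops = budget, using as many loop
   passes as fit and padding the remainder with nops. A per_loop of 0
   means a nop-only delay. Returns 0 on success, -1 if the fixed cost
   alone already exceeds the budget. */
int tune_delay(uint32_t budget, uint32_t fixed, uint32_t per_loop,
               struct delay_tuning *out)
{
    uint32_t remaining;

    if (budget < fixed) {
        return -1;
    }
    remaining = budget - fixed;
    if (per_loop == 0) {
        out->loops = 0;
        out->nops = remaining;
    } else {
        out->loops = remaining / per_loop;
        out->nops = remaining % per_loop;
    }
    return 0;
}
```

For the counter loop, `tune_delay(12, 5, 3, &t)` yields 2 loops and 1 nop; for the nop-only version, with a fixed cost of 4 clocks for the call and return, `tune_delay(12, 4, 0, &t)` yields 8 nops.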