How does burst-mode DMA speed up data transfer between main memory and I/O devices?
There is no "speed up" as you allege, nor is any "speed up" typically necessary/possible. The data transfer is not going to occur any faster than the slower of the source or destination.
The DMA controller will consolidate several individual memory requests into occasional burst requests, so the benefit of burst mode is reduced memory contention due to a reduction in the number of memory arbitrations.
Burst mode combined with a wide memory word improves memory bandwidth utilization. For example, with a 32-bit wide memory, four sequential byte reads consolidated into a single burst could result in only one memory access cycle.
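To make the arbitration and bandwidth argument concrete, here is a small back-of-envelope sketch. The word width, burst length, and block size are assumptions chosen purely for illustration, not figures from any particular controller.

```c
#include <stdio.h>

/* Illustrative parameters only (not from any specific controller):
 * 32-bit (4-byte) wide memory, bursts of 8 words per arbitration.  */
#define BYTES_PER_WORD   4u
#define WORDS_PER_BURST  8u

int main(void)
{
    unsigned nbytes = 4096;          /* a hypothetical 4 KiB block   */

    /* Byte-at-a-time requests: every byte costs one bus arbitration
     * and one memory access.                                         */
    unsigned single_arbitrations = nbytes;
    unsigned single_accesses     = nbytes;

    /* Burst mode over a 32-bit memory: one access moves a whole word,
     * and one arbitration covers a whole burst of words.              */
    unsigned burst_accesses      = nbytes / BYTES_PER_WORD;
    unsigned burst_arbitrations  = burst_accesses / WORDS_PER_BURST;

    printf("byte-at-a-time: %u arbitrations, %u accesses\n",
           single_arbitrations, single_accesses);
    printf("burst mode:     %u arbitrations, %u accesses\n",
           burst_arbitrations, burst_accesses);
    return 0;
}
```

Under these assumed numbers the same 4 KiB block costs 4096 arbitrations and 4096 accesses byte-at-a-time, versus 128 arbitrations and 1024 accesses with word-wide bursts.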
Before the transfer completes, CPU tasks that need the bus will be suspended.
The concept of "task" does not exist at this level of operations. There is no "suspension" of anything. At most the CPU has to wait (i.e. insertion of wait states) to gain access to memory.
However, in each instruction cycle, the fetch cycle has to reference the main memory.
Not true. A hit in the instruction cache makes a main-memory access unnecessary.
Therefore, during the transfer, the CPU will be idle doing no work, which is essentially the same as being occupied by the transferring work, under interrupt-driven IO.
Faulty assumption for every cache hit.
Apparently you are misusing the term "interrupt-driven IO" when you really mean programmed I/O using interrupts.
Equating a wait cycle or two to the execution of numerous instructions of an interrupt service routine for programmed I/O is a ridiculous exaggeration.
And "interrupt-driven IO" (in its proper meaning) does not exclude the use of DMA.
In my understanding, the cycle stealing mode is essentially the same.
Then your understanding is incorrect: in cycle-stealing mode the DMA controller transfers a single word per bus grant and releases the bus between transfers, whereas in burst mode it holds the bus for the entire block.
If the benefits of DMA were as minuscule or nonexistent as you allege, how would you explain the existence of DMA controllers and the preference for DMA over programmed I/O?
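The practical answer is that a DMA block transfer costs the CPU one setup and one completion interrupt, regardless of block size. The sketch below assumes a purely hypothetical memory-mapped controller (the DMA_SRC/DMA_DST/DMA_LEN/DMA_CTRL names and addresses are invented for illustration, not taken from a real device).

```c
#include <stdint.h>

/* Hypothetical register map, invented purely for illustration. */
#define DMA_BASE  0x40001000u
#define DMA_SRC   (*(volatile uint32_t *)(DMA_BASE + 0x0))  /* source address      */
#define DMA_DST   (*(volatile uint32_t *)(DMA_BASE + 0x4))  /* destination address */
#define DMA_LEN   (*(volatile uint32_t *)(DMA_BASE + 0x8))  /* byte count          */
#define DMA_CTRL  (*(volatile uint32_t *)(DMA_BASE + 0xC))  /* control/status      */

#define DMA_CTRL_START  (1u << 0)   /* begin the transfer        */
#define DMA_CTRL_BURST  (1u << 1)   /* use burst mode on the bus */
#define DMA_CTRL_IRQ    (1u << 2)   /* interrupt on completion   */

/* Program one block transfer: the CPU touches the controller once here
 * and once in the completion ISR; it does not move any of the data.   */
void dma_start_block(uint32_t device_port, uint8_t *buffer, uint32_t nbytes)
{
    DMA_SRC  = device_port;
    DMA_DST  = (uint32_t)(uintptr_t)buffer;
    DMA_LEN  = nbytes;
    DMA_CTRL = DMA_CTRL_BURST | DMA_CTRL_IRQ | DMA_CTRL_START;
}

/* Completion handler: one interrupt per block, not one per byte/word. */
void dma_complete_isr(void)
{
    /* acknowledge the controller, hand the filled buffer to the driver ... */
}
```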
Does burst mode DMA make a difference by skipping the fetch and decoding cycles needed when using interrupt-driven I/O and thus accomplish one transfer per clock cycle instead of one instruction cycle and thus speed the process
Comparing DMA to "interrupt-driven I/O" is illogical: DMA is a data-transfer mechanism, whereas an interrupt is a notification mechanism, and the two are routinely used together. See this.
Programmed I/O using interrupts requires a lot more than just the one instruction that you allege.
I'm unfamiliar with any CPU that can read a device port, write that value to main memory, bump the write pointer, and check whether the block transfer is complete, all in a single instruction.
And you're completely ignoring the ISR code (e.g. saving and then restoring processor state) that must execute for every interrupt the device issues to request a transfer.
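For contrast, here is a rough sketch of the per-interrupt work just described for programmed I/O using interrupts. The device register address and the transfer-state structure are hypothetical, and the processor-state save/restore surrounding every interrupt entry and exit comes on top of these instructions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical device data register, invented for illustration. */
#define DEV_DATA  (*(volatile uint8_t *)0x40002000u)

/* Driver-side transfer state (also illustrative). */
struct pio_xfer {
    uint8_t  *buf;        /* destination buffer in main memory */
    uint32_t  next;       /* write pointer                     */
    uint32_t  len;        /* total bytes expected              */
    bool      done;
};

static struct pio_xfer xfer;

/* Executed once per byte the device delivers. */
void device_rx_isr(void)
{
    xfer.buf[xfer.next] = DEV_DATA;      /* read the port, store to memory */
    xfer.next++;                         /* bump the write pointer         */
    if (xfer.next >= xfer.len) {         /* check for end of block         */
        xfer.done = true;                /* e.g. signal the waiting driver */
    }
    /* acknowledge/clear the device interrupt here ...                     */
}
```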