Locality matters even for DRAM itself, even discounting caching. A burst write of 64 contiguous bytes for a dirty cache line is much faster than 16 separate 4B writes to 16 different addresses. Or to put it another way, writing back an entire cache line is not much slower than writing back just a few changed bytes in a cache line.
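To make that concrete, here's a minimal sketch (the function names are just for illustration): both loops store 16 x 4B = 64B in total, but the first dirties a single cache line while the second, with a stride of a whole line or more, dirties 16 separate lines that all have to be written back.

```cpp
#include <cstdint>
#include <cstddef>

// Both loops store 16 x 4B = 64B, but with very different write-back cost.
void write_contiguous(uint32_t* buf) {
    for (int i = 0; i < 16; ++i)
        buf[i] = i;              // all 64B land in one cache line (if buf is 64B-aligned)
}

void write_scattered(uint32_t* buf, size_t stride) {
    for (int i = 0; i < 16; ++i)
        buf[i * stride] = i;     // with stride >= 16 (64B), each store dirties a different line
}
```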
What Every Programmer Should Know About Memory, by Ulrich Drepper, explains a lot about avoiding memory bottlenecks when programming, including some details of DRAM addressing: DRAM controllers have to select a row and then a column, so accesses that jump between rows pay extra latency. Touching another virtual memory page can also cause a TLB miss.
DRAM does have a burst-transfer command for transferring a sequential chunk of data. (Obviously designed for the benefit of CPUs writing back cache lines). The memory system in modern computers is optimized for the usage-pattern of writing whole cache lines, because that's what almost always happens.
Cache lines are the unit at which CPUs track dirty-or-not. It would be possible to track dirtiness at a finer granularity than the present-or-not cache lines, but that would take extra transistors and isn't worth it. The multiple levels of cache are set up to transfer whole cache lines around, so they can be as fast as possible when a whole cache line needs to be read.
There are so-called non-temporal reads/writes (movnti/movntdqa) that bypass the cache. These are for use with data that won't be touched again until it would have been evicted from the cache anyway (hence "non-temporal"). They are a bad idea for data that could benefit from caching, but they would let you write just 4 bytes to memory rather than a whole cache line. Depending on the MTRR for that memory range, the write might or might not be subject to write-combining. (This is relevant for memory-mapped I/O regions, where two adjacent 4B writes aren't the same as one 8B write.)
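For example, a 4-byte non-temporal store looks like this with intrinsics (just a sketch, not something you'd want for data that could stay hot in cache):

```cpp
#include <immintrin.h>

// MOVNTI via intrinsic: a 4-byte store that bypasses the cache.
// NT stores are weakly ordered, so fence before anything that must
// observe the data (e.g. publishing a flag for another thread).
void nt_store_int(int* dst, int value) {
    _mm_stream_si32(dst, value);   // non-temporal 4B store
    _mm_sfence();                  // order it before later stores
}
```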
The algorithm that only touches two cache lines certainly has the advantage on that score, unless it takes a lot more computation, or especially branching, to figure out which memory to write. Maybe ask a different question if you want help deciding. (See the links at https://stackoverflow.com/tags/x86/info, especially Agner Fog's guides, for info that will help you decide for yourself.)
See Cornstalks' answer for warnings about the dangers of having multiple threads on different CPUs touching the same memory. That can lead to much bigger slowdowns than the extra writes would cost a single-threaded program.
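If you do go multi-threaded, the usual mitigation is to keep each thread's hot data on its own cache line. A rough sketch (assuming 64B lines, which is what current x86 CPUs use):

```cpp
#include <atomic>
#include <thread>

// Without the alignas(64), both counters would typically share one cache line,
// and the two threads would bounce that line between cores (false sharing)
// even though neither ever touches the other's bytes.
struct Counters {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

int main() {
    Counters c;
    std::thread t1([&] { for (int i = 0; i < 1000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 1000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
}
```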