2 votes

I am benchmarking the following copy function (not fancy!) with a high size argument (~1GB):

void copy(unsigned char* dst, unsigned char* src, int count)
{
    for (int i = 0; i < count; ++i)
    { 
         dst[i] = src[i];
    }
}

I built this code with GCC 6.2, with -O3 -march=native -mtune=native, on a Xeon E5-2697 v2.

So you can see what gcc generates on my machine, here is the assembly for the inner loop:

movzx ecx, byte ptr [rsi+rax*1]
mov byte ptr [rdi+rax*1], cl
add rax, 0x1
cmp rdx, rax
jnz 0xffffffffffffffea

Now, as my LLC is ~25MB and I am copying ~1GB, it makes sense that this code is memory bound. perf confirms this with a high number of stalled frontend cycles:

        6914888857      cycles                    #    2,994 GHz                    
        4846745064      stalled-cycles-frontend   #   70,09% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
        8025064266      instructions              #    1,16  insns per cycle        
                                                  #    0,60  stalled cycles per insn

My first question is about the 0.60 stalled cycles per instruction. This seems like a very low number for code that accesses the LLC/DRAM all the time, since the data is not cached. As the LLC latency is about 30 cycles and main memory around 100 cycles, how is this achieved?

My second question is related; it seems that the prefetcher is doing a relatively good job (not surprising, it's an array, but still): we hit the LLC instead of going to DRAM 60% of the time. Still, what is the reason for it to fail the other times? Which bandwidth limit or part of the uncore makes the prefetcher fail to accomplish its task?

          83788617      LLC-loads                                                    [50,03%]
          50635539      LLC-load-misses           #   60,43% of all LL-cache hits    [50,04%]
          27288251      LLC-prefetches                                               [49,99%]
          24735951      LLC-prefetch-misses                                          [49,97%]

Last, but not least: I know that Intel can pipeline instructions; is that also the case for such mov instructions with memory operands?

Thank you very much!

For question one: memory latency plays a lessened role when you're writing/reading that much sequential data; the throughput of the RAM is what's important. – IGarFieldI

I think you need restrict for vectorization to be able to kick in (but with restrict, my gcc recognizes this as memcpy and generates a jump to that function; see the sketch after these comments). The prefetch misses are probably due to crossing page boundaries. – PSkocik

@PSkocik: I am very surprised that gcc would recognize this function as equivalent to memcpy since the count argument has type int instead of size_t. It should at least generate a test such as if (count <= 0) return; or some other code to account for negative size values. – chqrlie

@chqrlie: yes, of course compilers have to emit a test/branch in case count is signed negative, and on 64-bit ISAs, zero-extend it to size_t. PSkocik just meant that they don't actually auto-vectorize the loop, they use memcpy for the real work instead. godbolt.org/z/NGJi1e has gcc and clang output for x86-64, ARM, AArch64 and MIPS. – Peter Cordes

@AntiClimacus: how did you allocate your src and dst? If they've never been touched, all src pages will be copy-on-write mapped to the same physical zero page. And whether they're static or dynamic can maybe affect whether the OS uses transparent hugepages. – Peter Cordes
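
For illustration only (an editorial sketch, not code from the thread), here is the restrict-qualified, size_t-count variant the comments above describe. Whether a given gcc version turns it into a call to memcpy or an auto-vectorized loop depends on the compiler version and flags.

#include <stddef.h>

/* Sketch: restrict promises the buffers don't overlap, and size_t removes
   the negative-count case, so the compiler is free to emit a memcpy call
   or a vectorized loop instead of a byte-at-a-time copy. */
void copy(unsigned char* restrict dst, const unsigned char* restrict src, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        dst[i] = src[i];
    }
}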

2 Answers

3 votes

TL;DR: The loop body is 5 uops per iteration at issue/allocation, because the indexed store un-laminates (See: Micro fusion and addressing modes). The loop stream detector on Ivy Bridge cannot allocate uops across loop-iteration boundaries (See: Is performance reduced when executing loops whose uop count is not a multiple of processor width?), so it takes two cycles to allocate one iteration. The loop actually runs at 2.3c/iter on a dual-socket Xeon E5-2680 v2 (10 cores per socket vs. your 12), so that is close to the best that can be done given the front-end bottleneck.

The prefetchers have performed very well, and most of the time the loop is not memory-bound. Copying 1 byte per 2 cycles is very slow. (gcc did a poor job, and should have given you a loop that could run at 1 iteration per clock. Without profile-guided optimization, even -O3 doesn't enable -funroll-loops, but there are tricks it could have used (like counting a negative index up toward zero, or indexing the load relative to the store and incrementing a destination pointer) that would have brought the loop down to 4 uops.)
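
For illustration (an editorial sketch, not code from the answer), the "counting a negative index up toward zero" idea can be expressed at the source level like this. The hope is that the compiler emits just a load, a store, and an add-and-branch on the index (the add sets the zero flag, so no separate cmp is needed); whether a particular gcc version actually produces that 4-uop loop is not guaranteed.

#include <stddef.h>

/* Sketch: copy using a negative index that counts up toward zero. */
void copy_negidx(unsigned char* dst, const unsigned char* src, size_t count)
{
    unsigned char* dst_end = dst + count;        /* one past the end of dst */
    const unsigned char* src_end = src + count;  /* one past the end of src */

    /* i runs from -count up to 0, so dst_end[i] walks forward through dst. */
    for (ptrdiff_t i = -(ptrdiff_t)count; i != 0; ++i)
        dst_end[i] = src_end[i];
}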

The extra 0.3 cycles per iteration beyond the front-end bottleneck, on average, probably comes from stalls when prefetching fails (maybe at page boundaries), or possibly from page faults and TLB misses in this test, which runs over statically initialized memory in the .data section.


There are two data dependencies in the loop. First, the store instruction (specifically the STD uop) depends on the result of the load instruction. Second, both the store and load instructions depend on add rax, 0x1. In fact, add rax, 0x1 depends on itself as well. Since the latency of add rax, 0x1 is one cycle, an upper bound on performance of the loop is 1 cycle per iteration.

Since the store (STD) depends on the load, it cannot be dispatched from the RS until the load completes, which takes at least 4 cycles (in case of an L1 hit). In addition, there is only one port that can accept STD uops, yet up to two loads can complete per cycle on Ivy Bridge (provided the two loads are to lines resident in the L1 cache and no bank conflict occurs), resulting in additional contention. However, RESOURCE_STALLS.ANY shows that the RS in fact never gets full. IDQ_UOPS_NOT_DELIVERED.CORE counts the number of issue slots that were not utilized; this is equal to 36% of all slots, which is consistent with the 5-uop loop body: 5 of the 8 issue slots available in 2 cycles are used, leaving about 3/8 = 37.5% unused. The LSD.CYCLES_ACTIVE event shows that the LSD is used to deliver uops most of the time. However, LSD.CYCLES_4_UOPS/LSD.CYCLES_ACTIVE =~ 50% shows that in about 50% of the active cycles, fewer than 4 uops are delivered to the RS. The RS will not get full because of this sub-optimal allocation throughput.

The stalled-cycles-frontend count corresponds to UOPS_ISSUED.STALL_CYCLES, which counts allocation stalls due to both frontend stalls and backend stalls. I don't understand how UOPS_ISSUED.STALL_CYCLES is related to the number of cycles and other events.

The LLC-loads count includes:

  • All demand load requests to the L3 irrespective of whether the request hits or misses in the L3 and, in case of a miss, irrespective of the source of data. This also includes demand load requests from the page walking hardware. It's not clear to me whether load requests from the next-page prefetcher are counted.
  • All hardware prefetch data read requests generated by an L2 prefetcher where the target line is to be placed in the L3 (i.e., in the L3, or in both the L3 and the L2, but not only in the L2). Hardware L2 prefetcher data read requests where the line is to be placed only in the L2 are not included. Note that the L1 prefetchers' requests go to the L2 and may trigger the L2 prefetchers, i.e., they don't skip the L2.

LLC-load-misses is a subset of LLC-loads and includes only those events that missed in the L3. Both are counted per core.

There is an important difference between counting requests (cache-line granularity) and counting load instructions or load uops (using MEM_LOAD_UOPS_RETIRED.*). Both the L1 and L2 caches squash load requests to the same cache line, so multiple misses in the L1 may result in a single request to the L3.

Optimal performance can be achieved if all stores and loads hit in the L1 cache. Since the size of the buffer you used is 1GB, the loop can cause a maximum of 1GB/64 =~ 17M L3 demand load requests. However, your LLC-loads measurement, 83M, is much larger, probably due to code other than the loop you've shown in the question. Another possible reason is that you forgot to use the :u suffix to count only user-mode events.

My measurements on both IvB and HSW show that LLC-loads:u is negligible compared to 17M. However, most of the L3 loads are misses (i.e., LLC-loads:u =~ LLC-loads-misses:u). CYCLE_ACTIVITY.STALLS_LDM_PENDING shows that the overall impact of loads on performance is negligible. In addition, my measurements show that the loop runs at 2.3c/iter on IvB (vs. 1.5c/iter on HSW), which suggests that one load is issued every 2 cycles. I think that the sub-optimal allocation throughput is the main reason for this. Note that 4K aliasing conditions (LD_BLOCKS_PARTIAL.ADDRESS_ALIAS) are almost non-existent. All of this means that the prefetchers have done a pretty good job at hiding the memory access latency for most loads.


Counters on IvB that can be used to evaluate the performance of hardware prefetchers:

Your processor has two L1 data prefetchers and two L2 data prefetchers (one of which can prefetch into the L2 and/or the L3). A prefetcher may not be effective for the following reasons:

  • A triggering condition has not been satisfied. This is typically because an access pattern has not been recognized yet.
  • The prefetcher has been triggered but the prefetch was to a useless line.
  • The prefetcher has been triggered to a useful line but the line got replaced before being used.
  • The prefetcher has been triggered to a useful line but the demand requests have already reached the cache and missed. This means that the demand requests were issued faster than the prefetcher could react in time. This can happen in your case.
  • The prefetcher has been triggered to a useful line (that doesn't exist in the cache), but the request had to be dropped because no MSHR was available to hold the request. This can happen in your case.

The numbers of demand misses at the L1, L2, and L3 are good indicators of how well the prefetchers have performed. All L3 misses (as counted by LLC-load-misses) are necessarily also L2 misses, so the number of L2 misses is larger than LLC-load-misses. Also, all demand L2 misses are necessarily L1 misses.

On Ivy Bridge, you can use the LOAD_HIT_PRE.HW_PF and CYCLE_ACTIVITY.CYCLES_* performance events (in addition to the miss events) to know more about how the prefetchers have performed and evaluate their impact on performance. It's important to measure CYCLE_ACTIVITY.CYCLES_* events because even if the miss counts were seemingly high, that doesn't necessarily mean that misses are the main cause of performance degradation.

Note that the L1 prefetchers cannot issue speculative RFO requests. Therefore, most writes that reach the L1 will actually miss, requiring the allocation of an LFB per cache line at the L1 and potentially at other levels.


The code I used follows.

; Standalone test for Linux x86-64 (NASM syntax).
; COUNT must be defined on the nasm command line, e.g. with -dCOUNT=1073741824.
BITS 64
DEFAULT REL

section .data
bufdest:    times COUNT db 1            ; statically initialized destination buffer
bufsrc:     times COUNT db 1            ; statically initialized source buffer

section .text
global _start
_start:
    lea rdi, [bufdest]                  ; dst
    lea rsi, [bufsrc]                   ; src

    mov rdx, COUNT                      ; byte count
    mov rax, 0                          ; index

.loop:                                  ; same loop body as the gcc output above
    movzx ecx, byte [rsi+rax*1]
    mov byte [rdi+rax*1], cl
    add rax, 1
    cmp rdx, rax
    jnz .loop

    xor edi, edi                        ; exit status 0
    mov eax, 231                        ; exit_group system call number
    syscall

2 votes

My first question is about the 0.60 stalled cycles per instruction. This seems like a very low number for code that accesses the LLC/DRAM all the time, since the data is not cached. As the LLC latency is about 30 cycles and main memory around 100 cycles, how is this achieved?

My second question is related; it seems that the prefetcher is doing a relatively good job (not surprising, it's an array, but still): we hit the LLC instead of going to DRAM 60% of the time. Still, what is the reason for it to fail the other times? Which bandwidth limit or part of the uncore makes the prefetcher fail to accomplish its task?

With prefetchers. Specifically, depending on which CPU it is, there may be a "TLB prefetcher" fetching virtual memory translations, plus a cache line prefetcher that's fetching data from RAM into L3, plus an L1 or L2 prefetcher fetching data from L3.

Note that caches (e.g. the L3) work on physical addresses, so their hardware prefetchers detect and prefetch sequential accesses to physical addresses; and because of virtual memory management/paging, physical accesses are "almost never" sequential across a page boundary. For this reason the prefetcher stops prefetching at page boundaries and probably takes three "non-prefetched" accesses to start prefetching from the next page.
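
As an illustrative sketch (not code from the answer): one way to compensate at the source level is GCC's __builtin_prefetch, touching the start of the next source page before the loop reaches it. The 4 KiB stride and the one-page prefetch distance are illustrative guesses (and assume src is page-aligned), not tuned values.

#include <stddef.h>

/* Sketch: hint the next source page before the copy loop reaches it, so the
   hardware prefetchers don't have to restart cold at the page boundary. */
void copy_pf(unsigned char* dst, const unsigned char* src, size_t count)
{
    for (size_t i = 0; i < count; ++i)
    {
        if ((i & 4095) == 0 && i + 4096 < count)
            __builtin_prefetch(src + i + 4096);  /* next 4 KiB page (if src is page-aligned) */
        dst[i] = src[i];
    }
}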

Also note that if RAM was slower (or the code was faster) the prefetcher wouldn't be able to keep up and you'd stall more. For modern multi-core machines, the RAM is often fast enough to keep up with one CPU, but can't keep up with all CPUs. What this means is that outside of "controlled testing conditions" (e.g. when the user is running 50 processes at the same time and all CPUs are pounding RAM) your benchmark will be completely wrong. There are also things like IRQs, task switches and page faults that can/will interfere (especially when the computer is under load).

Last, but not least: I know that Intel can pipeline instructions; is that also the case for such mov instructions with memory operands?

Yes; but a normal mov involving memory (e.g. mov byte ptr [rdi+rax*1], cl) will also be restricted by the "write ordered with store forwarding" memory ordering rules.

Note that there are many ways to speed up the copy, including using non-temporal stores (to deliberately break/bypass the memory ordering rules), using rep movs (which is specially optimised to work on whole cache lines where possible), using much larger pieces (e.g. AVX copying 32 bytes at a time), doing the prefetching yourself (especially at page boundaries), and doing cache flushing (so that caches still contain useful things after the copy has been done). A sketch of the non-temporal approach follows.
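
For illustration only (an editorial sketch, not code from the answer): a non-temporal copy using AVX intrinsics, which Ivy Bridge supports. It assumes dst is 32-byte aligned and count is a multiple of 32; a real implementation would need head/tail handling, and should be measured rather than assumed to be faster.

#include <immintrin.h>
#include <stddef.h>

/* Sketch: stream 32-byte chunks past the cache hierarchy.  Assumes dst is
   32-byte aligned and count is a multiple of 32.  Compile with -mavx. */
void copy_nt(unsigned char* dst, const unsigned char* src, size_t count)
{
    for (size_t i = 0; i < count; i += 32)
    {
        __m256i v = _mm256_loadu_si256((const __m256i*)(src + i));
        _mm256_stream_si256((__m256i*)(dst + i), v);  /* non-temporal store */
    }
    _mm_sfence();  /* order the weakly-ordered streaming stores before later stores */
}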

However, it's far better to do the opposite - deliberately make large copies very slow, so that the programmer notices that they suck and are "forced" to try to find a way to avoid doing the copy. It can cost 0 cycles to avoid copying 20 MiB, which is significantly faster than the "least worst" alternative.