High core-to-core latency from c2clat benchmark

Recently I’ve been trying to make sense of the c2clat (core-to-core latency) benchmark results from Jeff Geerling’s sbc-review of the DC ROMA AI PC, which has the EIC7702 SoC (basically a 2-die version of the EIC7700 found in the P550). I reran c2clat on my P550, and interestingly, the latency numbers are highly sensitive to the version of GCC used to build the binary. See my test results:

I tracked down the difference between GCC 13.2 and 13.3 to this set of patches: [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings.

And in particular, to this patch: [PATCH v5 09/11] RISC-V: Weaken mem_thread_fence. The patch is intended to relax memory fences, not strengthen them; e.g., fences in the generated code such as fence iorw,iorw are relaxed into fence r,rw. Take this code block as an example:

for (int n = 0; n < 100; ++n) {
  while (seq1.load(std::memory_order_acquire) != n)
    ;                                         // spin until the peer publishes n
  seq2.store(n, std::memory_order_release);   // reply with n
}

And the corresponding assembly:

.L169:
    ld     a4,56(s0)    # a4 <- &seq2
    fence  rw,w         # fence for the store_release below
    sw     a5,0(a4)     # seq2 = i
    addiw  a5,a5,1      # i++
    beq    a5,a0,.L151  # exit the loop when i == 100
.L146:
    ld     a4,48(s0)    # a4 <- &seq1
    lw     a4,0(a4)     # a4 = seq1
-   fence  iorw,iorw    # fence for the load_acquire above
+   fence  r,rw
    bne    a4,a5,.L146  # spin while (seq1 != i)
    j      .L169
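
The mapping change can also be seen in isolation. A minimal test case along these lines (my own sketch, not taken from c2clat; flag and load_acquire are made-up names) compiles down to a bare lw followed by the fence in question:

#include <atomic>

std::atomic<int> flag;

// Build with -O2 and inspect the fence emitted after the lw:
//   gcc 13.2 maps the acquire load to:         lw; fence iorw,iorw
//   gcc 13.3 (Table A.6 mappings) maps it to:  lw; fence r,rw
int load_acquire() {
  return flag.load(std::memory_order_acquire);
}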

Given that the fence has been relaxed, you’d think we would get better results, but in reality it’s 10x worse! Based on my testing, the io fence is not the issue: I can relax the fence iorw,iorw to fence rw,rw and get exactly the same latency numbers. It’s the relaxation from fence rw,rw to fence r,rw that causes the issue. The relaxation is correct, and it properly represents the load_acquire semantics, but it appears that the EIC7700 doesn’t play nice with this change.

My theory is that the store to seq2 sits in the store buffer for too long before it gets flushed, causing the high latency. When the fence includes w in its predecessor set, it helps speed up the flush. Does this have something to do with the core itself, or with the interconnect? I ran the exact same binary (statically compiled) on a StarFive JH7110, and there was zero difference before and after that patch was introduced; the latency numbers there are also overall a lot better than on the EIC7700.

Please check if there’s anything that can be done to mitigate this problem. Thanks.

Bo
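
P.S. For anyone who wants to reproduce the fence-variant comparison without rebuilding GCC: a relaxed load followed by an explicit inline-asm fence is equivalent to what the compiler emits, and lets you swap fence strengths directly. A rough sketch (my own names, not the actual c2clat source):

#include <atomic>

std::atomic<int> seq1{-1}, seq2{-1};

// Per the Table A.6 mappings, relaxed load + "fence r,rw" is a valid
// load_acquire; "fence rw,rw" matches the stronger pre-13.3 behavior
// minus the io bits (which made no difference in my testing).
#define FENCE_STR "rw,rw"   // swap in "r,rw" to reproduce the slow case

void pong_side() {
  for (int n = 0; n < 100; ++n) {
    while (seq1.load(std::memory_order_relaxed) != n)
      ;                                             // spin on seq1
    asm volatile("fence " FENCE_STR ::: "memory");  // hand-rolled acquire fence
    seq2.store(n, std::memory_order_release);       // publish the reply
  }
}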

Hi Bo - great job digging into the issue and providing this detailed analysis. To answer your question: yes, inside the P550 core, if nothing forces the store to drain, we are permitted to buffer it for a finite amount of time, which we do to allow for additional store-combining opportunities. In our more recent products after the P550, we simplified the fence variations so that they all behave the same, which is easier for synthesis timing at higher clock frequencies; this also recovers the performance issue you so eloquently pointed out.