High core-to-core latency from c2clat benchmark

Recently I’ve been trying to make sense of the c2clat (core-to-core latency) benchmark results from Jeff Geerling’s sbc-review of the DC ROMA AI PC, which has the EIC7702 SoC (basically the 2-die version of the EIC7700 in the P550). I reran c2clat on my P550 and, interestingly, the latency numbers are highly sensitive to the version of GCC used to build the binary. See my test results:

I tracked down the difference between gcc 13.2 and 13.3 to this set of patches: [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings

And in particular, this patch: [PATCH v5 09/11] RISC-V: Weaken mem_thread_fence. The patch is intended to relax memory fences, not strengthen them. E.g., the generated code has some fences, such as fence iorw,iorw, relaxed into fence r,rw. Take this code block as an example:

for (int n = 0; n < 100; ++n) {
  while (seq1.load(std::memory_order_acquire) != n)
  ;
  seq2.store(n, std::memory_order_release);
}

And the corresponding assembly:

.L169:
    ld     a4,56(s0)    # a4 <- &seq2
    fence  rw,w         # fence for store_release below
    sw     a5,0(a4)     # seq2 = i
    addiw  a5,a5,1      # i++
    beq    a5,a0,.L151  # if (i == 100)
.L146:
    ld     a4,48(s0)    # a4 <- &seq1
    lw     a4,0(a4)     # a4 = seq1
-   fence  iorw,iorw    # fence for load_acquire above
+   fence  r,rw
    bne    a4,a5,.L146  # retry if (a4 != i)
    j      .L169
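
For anyone who wants to poke at this pattern in isolation, here is a minimal sketch of the two-thread ping-pong handshake that the loop above is one half of. It is not the c2clat source: the function name ping_pong_ns, the 64-byte alignment, and the way the result is averaged are assumptions for illustration, and unlike the real benchmark it does not pin the threads to specific cores, so the numbers will be noisier.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of c2clat): runs `rounds` round trips between
// two threads and returns the average one-way latency in nanoseconds.
double ping_pong_ns(int rounds) {
    assert(rounds > 0);

    // Keep the two flags on separate cache lines; 64 bytes is an assumption.
    alignas(64) std::atomic<int> seq1{-1};
    alignas(64) std::atomic<int> seq2{-1};

    // Responder: same shape as the loop quoted above.
    std::thread responder([&] {
        for (int n = 0; n < rounds; ++n) {
            while (seq1.load(std::memory_order_acquire) != n)
                ;  // spin until the initiator publishes n
            seq2.store(n, std::memory_order_release);
        }
    });

    // Initiator: publish n, then spin until the responder echoes it back.
    auto start = std::chrono::steady_clock::now();
    for (int n = 0; n < rounds; ++n) {
        seq1.store(n, std::memory_order_release);
        while (seq2.load(std::memory_order_acquire) != n)
            ;
    }
    auto stop = std::chrono::steady_clock::now();
    responder.join();

    // Each round is two one-way hops, so divide by 2 * rounds.
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / rounds / 2.0;
}
```

Calling ping_pong_ns with a large round count from a small main, built once with gcc 13.2 and once with 13.3 (for example with g++ -O2 -pthread), should be enough to compare the two code generations; going by the diff above, the fence after the acquire load in the spin loops is where they differ.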

Given that the fence has been relaxed, you’d think we would get better results, but in reality it’s 10x worse! Based on my testing, the io part of the fence is not the issue: I can relax fence iorw,iorw to fence rw,rw and get exactly the same latency numbers. It’s the relaxation from fence rw,rw to fence r,rw that causes the issue. The relaxation is correct, and it properly represents load_acquire semantics, but it appears that the EIC7700 doesn’t play nice with this change.

My theory is that the store to seq2 sits in the store buffer for too long before it gets flushed, causing the high latency. When the fence includes w in its predecessor set, it helps speed up the flush. Does this have something to do with the core itself? Or the interconnect? I ran the exact same binary (statically compiled) on a StarFive JH7110, and there’s zero difference before and after that patch was introduced, and the latency numbers are overall a lot better than on the EIC7700.

Please check if there’s anything that can be done to mitigate this problem. Thanks.

Bo

Hi Bo - great job digging into the issue and providing this detailed analysis. To answer your question: yes, inside the P550 core, if nothing causes the store to drain, then we are permitted to buffer it for a finite amount of time, which we do to allow for additional store-combining opportunities. For our more recent products after P550, we simplify the fence variations and make them all behave the same, which is easier for synthesis timing at higher clock frequencies, and this will also resolve the performance issue you so eloquently pointed out.

Update: the gcc patch series mentioned above, [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings, is necessary for correctness. Thus, you must use gcc 13.3+ or a distro gcc that has this patchset applied, even though it can negatively affect core-to-core latency. I had a separate discussion with Revy and other folks, and Revy reported hitting actual concurrency bugs in the field due to wrong locking code generated by older gcc.

The good news is that I’ve found a knob in the P550 core that forces all fence instructions to be upgraded to a full fence, i.e., fence iorw,iorw. The knob is documented as bit 14, forceFenceFullMatch, of CSR 0x7c2. Enabling this bit strengthens all fences and restores the core-to-core latency to be on par with code generated by gcc 13.2. I suspect that it may regress performance elsewhere, but hey, if your workload is that sensitive to c2c latency, it might be beneficial. You’ll need to benchmark the whole thing with and without the knob to make a final decision, but at least there’s a choice.

If you want to try it, just grab upstream OpenSBI (v1.8.1 as of today) and set ESWIN_EIC770X_FEAT1_CFG to 0x4080. See: opensbi/platform/generic/eswin/Kconfig at v1.8.1 · riscv-software-src/opensbi · GitHub
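
As a quick sanity check on that value, here is a small sketch showing that 0x4080 does have bit 14 set. The constant and function names are made up for illustration and do not come from OpenSBI. The value also has bit 7 set; what that bit controls is not covered in this thread, so the sketch only records that it is there.

```cpp
#include <cstdint>

// Bit position from the post above: forceFenceFullMatch is bit 14 of CSR 0x7c2.
constexpr std::uint64_t kForceFenceFullMatch = std::uint64_t{1} << 14;  // 0x4000

// Value suggested above for ESWIN_EIC770X_FEAT1_CFG.
constexpr std::uint64_t kSuggestedFeat1Cfg = 0x4080;

// Hypothetical helper: true if a config value enables the full-fence knob.
constexpr bool forces_full_fence(std::uint64_t cfg) {
    return (cfg & kForceFenceFullMatch) != 0;
}

// 0x4080 = 0x4000 (bit 14) | 0x0080 (bit 7, meaning not stated in this thread).
static_assert(forces_full_fence(kSuggestedFeat1Cfg), "bit 14 is set in 0x4080");
static_assert((kSuggestedFeat1Cfg & ~kForceFenceFullMatch) == 0x80,
              "the only other bit set is bit 7");
```

If you start from a different existing config value, OR-ing in 0x4000 is the part that matters for this knob.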