High core-to-core latency from c2clat benchmark

Recently I’ve been trying to make sense of the c2clat (core-to-core latency) benchmark results from Jeff Geerling’s sbc-review of the DC ROMA AI PC, which has the EIC7702 SoC (basically a 2-die version of the EIC7700 found in the P550). I reran c2clat on my P550, and interestingly, the latency numbers are highly sensitive to the version of GCC used to build the binary. See my test results:

I tracked down the difference between GCC 13.2 and 13.3 to this set of patches: [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings.

And in particular, to this patch: [PATCH v5 09/11] RISC-V: Weaken mem_thread_fence. The patch is intended to relax memory fences, not strengthen them; e.g., fences in the generated code such as fence iorw,iorw are relaxed into fence r,rw. Take this code block as an example:

for (int n = 0; n < 100; ++n) {
  while (seq1.load(std::memory_order_acquire) != n)
    ;                                         // spin until the peer publishes n
  seq2.store(n, std::memory_order_release);   // reply with n
}

And the corresponding assembly:

.L169:
    ld     a4,56(s0)    # a4 <- &seq2
    fence  rw,w         # fence for the store_release below
    sw     a5,0(a4)     # seq2 = i
    addiw  a5,a5,1      # i++
    beq    a5,a0,.L151  # exit the loop when i == 100
.L146:
    ld     a4,48(s0)    # a4 <- &seq1
    lw     a4,0(a4)     # a4 = seq1
-   fence  iorw,iorw    # fence for the load_acquire above
+   fence  r,rw
    bne    a4,a5,.L146  # spin while (seq1 != i)
    j      .L169
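
The mapping change can also be seen in isolation. A minimal test case along these lines (my own sketch, not taken from c2clat; flag and load_acquire are made-up names) compiles down to a bare lw followed by the fence in question:

#include <atomic>

std::atomic<int> flag;

// Build with -O2 and inspect the fence emitted after the lw:
//   gcc 13.2 maps the acquire load to:         lw; fence iorw,iorw
//   gcc 13.3 (Table A.6 mappings) maps it to:  lw; fence r,rw
int load_acquire() {
  return flag.load(std::memory_order_acquire);
}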

Given that the fence has been relaxed, you’d think we would get better results, but in reality it’s 10x worse! Based on my testing, the io fence is not the issue: I can relax the fence iorw,iorw to fence rw,rw and get exactly the same latency numbers. It’s the relaxation from fence rw,rw to fence r,rw that causes the issue. The relaxation is correct, and it properly represents the load_acquire semantics, but it appears that the EIC7700 doesn’t play nice with this change.

My theory is that the store to seq2 sits in the store buffer for too long before it gets flushed, causing the high latency. When the fence includes w in its predecessor set, it helps speed up the flush. Does this have something to do with the core itself, or with the interconnect? I ran the exact same binary (statically compiled) on a StarFive JH7110, and there was zero difference before and after that patch was introduced; the latency numbers there are also overall a lot better than on the EIC7700.

Please check if there’s anything that can be done to mitigate this problem. Thanks.

Bo
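
P.S. For anyone who wants to reproduce the fence-variant comparison without rebuilding GCC: a relaxed load followed by an explicit inline-asm fence is equivalent to what the compiler emits, and lets you swap fence strengths directly. A rough sketch (my own names, not the actual c2clat source):

#include <atomic>

std::atomic<int> seq1{-1}, seq2{-1};

// Per the Table A.6 mappings, relaxed load + "fence r,rw" is a valid
// load_acquire; "fence rw,rw" matches the stronger pre-13.3 behavior
// minus the io bits (which made no difference in my testing).
#define FENCE_STR "rw,rw"   // swap in "r,rw" to reproduce the slow case

void pong_side() {
  for (int n = 0; n < 100; ++n) {
    while (seq1.load(std::memory_order_relaxed) != n)
      ;                                             // spin on seq1
    asm volatile("fence " FENCE_STR ::: "memory");  // hand-rolled acquire fence
    seq2.store(n, std::memory_order_release);       // publish the reply
  }
}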

Hi Bo - great job digging into the issue and providing this detailed analysis. To answer your question: yes, inside the P550 core, if nothing forces the store to drain, we are permitted to buffer it for a finite amount of time, which we do to allow for additional store-combining opportunities. In our more recent products after the P550, we simplified the fence variations so that they all behave the same, which is easier for synthesis timing at higher clock frequencies; this also recovers the performance issue you so eloquently pointed out.