Recently I’ve been trying to make sense of the c2clat (core-to-core latency) benchmark results from Jeff Geerling’s sbc-review of the DC ROMA AI PC, which has the EIC7702 SoC (basically a 2-die version of the EIC7700 in the P550). I reran c2clat on my P550, and interestingly, the latency numbers are highly sensitive to the version of GCC used to build the binary. See my test results:

## Basic information
- Board URL (official): https://store.deepcomputing.io/products/dc-roma-ai-pc-risc-v-mainboard-ii-for-framework-laptop-13
- Board purchased from: Provided for review, by DeepComputing
- Board purchase date: 2025-10-09
- Board specs (as tested): 32GB RAM (16GB for CPU), 512GB SSD, 8-core 1.8 GHz SiFive P550 CPU
- Board price (as tested): $899 (standalone Mainboard starts at $349 for 32GB/No SSD)

## Linux/system information
```
# output of `screenfetch`
$ screenfetch
roma@roma
OS: Ubuntu 24.04 noble
Kernel: riscv64 Linux 6.6.92-eic7x-2025.07
Uptime: 9m
Packages: 1954
Shell: bash 5.2.21
Disk: 25G / 468G (6%)
CPU: Unknown @ 8x 1.8GHz
GPU:
RAM: 1529MiB / 14682MiB
# output of `uname -a`
Linux roma 6.6.92-eic7x-2025.07 #2025.09.25.06.45+ SMP Thu Sep 25 06:53:10 UTC 2025 riscv64 riscv64 riscv64 GNU/Linux
```
## Benchmark results
### CPU
- Geekbench 6: (174 single / 640 multi - https://browser.geekbench.com/v6/cpu/14421987)
- 17.759 Gflops at 32.5W, for **0.55 Gflops/W** ([geerlingguy/top500-benchmark](https://github.com/geerlingguy/top500-benchmark) [HPL result](https://github.com/geerlingguy/top500-benchmark/issues/77))
### Power
- Sleep power draw (at wall): 2.2 W
- Idle power draw (charging, battery 80%): 41.4 W
- Idle power draw (at wall, battery 100%): 25.1 W
- Maximum simulated power draw (`stress-ng --matrix 0`): 31.7 W
- During Geekbench multicore benchmark: 32.1 W
- During `top500` HPL benchmark: 32.9 W
### Disk
#### ZHITAI TiPlus7100 512GB
| Benchmark | Result |
| -------------------------- | ------ |
| iozone 4K random read | 60.31 MB/s |
| iozone 4K random write | 126.92 MB/s |
| iozone 1M random read | 1042.36 MB/s |
| iozone 1M random write | 1306.80 MB/s |
| iozone 1M sequential read | 1076.06 MB/s |
| iozone 1M sequential write | 1301.79 MB/s |
### Network
`iperf3` results:
#### WLAN (WiFi 6, built-in Intel AX200)
- `iperf3 -c $SERVER_IP`: 637 Mbps
- `iperf3 -c $SERVER_IP --reverse`: 293 Mbps
- `iperf3 -c $SERVER_IP --bidir`: 510 Mbps up, 141 Mbps down
## GPU
The device includes a `PowerVR A-Series AXM-8-256`, with some precompiled drivers for OpenGL and Vulkan, but compatibility seemed a little hit or miss...
### glmark2
`glmark2-es2-wayland` results:
```
=======================================================
glmark2 2023.01
=======================================================
OpenGL Information
GL_VENDOR: Imagination Technologies
GL_RENDERER: PowerVR A-Series AXM-8-256
GL_VERSION: OpenGL ES 3.2 build 24.2@6643903
Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
Surface Size: 800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 205 FrameTime: 4.884 ms
[build] use-vbo=true: FPS: 475 FrameTime: 2.106 ms
[texture] texture-filter=nearest: FPS: 525 FrameTime: 1.907 ms
[texture] texture-filter=linear: FPS: 542 FrameTime: 1.846 ms
[texture] texture-filter=mipmap: FPS: 534 FrameTime: 1.875 ms
[shading] shading=gouraud: FPS: 471 FrameTime: 2.126 ms
[shading] shading=blinn-phong-inf: FPS: 493 FrameTime: 2.029 ms
[shading] shading=phong: FPS: 513 FrameTime: 1.953 ms
[shading] shading=cel: FPS: 475 FrameTime: 2.107 ms
[bump] bump-render=high-poly: FPS: 546 FrameTime: 1.832 ms
[bump] bump-render=normals: FPS: 548 FrameTime: 1.825 ms
[bump] bump-render=height: FPS: 535 FrameTime: 1.871 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 492 FrameTime: 2.035 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 602 FrameTime: 1.661 ms
[pulsar] light=false:quads=5:texture=false: FPS: 536 FrameTime: 1.868 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 116 FrameTime: 8.670 ms
[desktop] effect=shadow:windows=4: FPS: 956 FrameTime: 1.046 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 243 FrameTime: 4.121 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 242 FrameTime: 4.144 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 362 FrameTime: 2.763 ms
[ideas] speed=duration: FPS: 693 FrameTime: 1.443 ms
[jellyfish] <default>: FPS: 1729 FrameTime: 0.579 ms
[terrain] <default>: FPS: 117 FrameTime: 8.606 ms
[shadow] <default>: FPS: 1505 FrameTime: 0.665 ms
[refract] <default>: FPS: 196 FrameTime: 5.122 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 2185 FrameTime: 0.458 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 2212 FrameTime: 0.452 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 1473 FrameTime: 0.679 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 2198 FrameTime: 0.455 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 2294 FrameTime: 0.436 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 2337 FrameTime: 0.428 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 2297 FrameTime: 0.435 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 2287 FrameTime: 0.437 ms
=======================================================
glmark2 Score: 936
=======================================================
```
### vkmark
`vkmark` results:
```
$ DISPLAY=:0 vkmark --debug
Debug: WindowSystemLoader: Looking in /usr/lib/riscv64-linux-gnu/vkmark for window system plugins
Debug: WindowSystemLoader: Loading options from /usr/lib/riscv64-linux-gnu/vkmark/kms.so... ok
Debug: WindowSystemLoader: Loading options from /usr/lib/riscv64-linux-gnu/vkmark/wayland.so... ok
Debug: WindowSystemLoader: Loading options from /usr/lib/riscv64-linux-gnu/vkmark/xcb.so... ok
Debug: WindowSystemLoader: Probing /usr/lib/riscv64-linux-gnu/vkmark/kms.so... succeeded with priority 255
Debug: WindowSystemLoader: Probing /usr/lib/riscv64-linux-gnu/vkmark/wayland.so... succeeded with priority 255
Authorization required, but no authorization protocol specified
Debug: WindowSystemLoader: Probing /usr/lib/riscv64-linux-gnu/vkmark/xcb.so... succeeded with priority 0
Debug: WindowSystemLoader: Selected window system plugin /usr/lib/riscv64-linux-gnu/vkmark/kms.so (best match)
Debug: KMSWindowSystemPlugin: Using legacy modesetting
Segmentation fault (core dumped)
```
With a version [compiled from source](https://github.com/geerlingguy/sbc-reviews/issues/76), I got:
```
Error: No suitable Vulkan physical devices found
```
Here is the vulkaninfo for the board:
<details>
<summary>Click to expand `vulkaninfo`</summary>
```
==========
VULKANINFO
==========
Vulkan Instance Version: 1.3.275
Instance Extensions: count = 23
-------------------------------
VK_EXT_acquire_drm_display : extension revision 1
VK_EXT_acquire_xlib_display : extension revision 1
VK_EXT_debug_report : extension revision 10
VK_EXT_debug_utils : extension revision 2
VK_EXT_direct_mode_display : extension revision 1
VK_EXT_display_surface_counter : extension revision 1
VK_EXT_surface_maintenance1 : extension revision 1
VK_EXT_swapchain_colorspace : extension revision 4
VK_KHR_device_group_creation : extension revision 1
VK_KHR_display : extension revision 23
VK_KHR_external_fence_capabilities : extension revision 1
VK_KHR_external_memory_capabilities : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2 : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2 : extension revision 1
VK_KHR_portability_enumeration : extension revision 1
VK_KHR_surface : extension revision 25
VK_KHR_surface_protected_capabilities : extension revision 1
VK_KHR_wayland_surface : extension revision 6
VK_KHR_xcb_surface : extension revision 6
VK_KHR_xlib_surface : extension revision 6
VK_LUNARG_direct_driver_loading : extension revision 1
Instance Layers: count = 2
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211 version 1
VK_LAYER_MESA_overlay Mesa Overlay layer 1.3.211 version 1
Devices:
========
GPU0:
apiVersion = 1.3.277
driverVersion = 1.598.191
vendorID = 0x1010
deviceID = 0x30010101
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = PowerVR A-Series AXM-8-256
driverID = DRIVER_ID_IMAGINATION_PROPRIETARY
driverName = PowerVR A-Series Vulkan Driver
driverInfo = 24.2@6643903
conformanceVersion = 1.3.8.1
deviceUUID = 33302033-2034-3038-2031-303100000000
driverUUID = 36363433-3930-3300-0000-000000000000
```
</details>
### GravityMark
GravityMark results:
```
1. Download the latest version of GravityMark: https://gravitymark.tellusim.com
2. Run `chmod +x [downloaded_filename].run`
3. Run `sudo ./[downloaded_filename].run` and press `y` to accept the terms.
4. Open the link it prints, and run the Benchmark defaults, changing to 720p resolution and 50,000 asteroids.
```
### AI / LLM Inference
`ollama` LLM model inference results:
#### NPU Inference
| System | CPU/GPU | Model | Eval Rate | Power (Peak) |
| :--- | :--- | :--- | :--- | :--- |
| DC-ROMA Mainboard II (8-core RISC-V) | NPU | deepseek-r1:7b | 4.9 Tokens/s | 38.9 W |
#### CPU Inference
| System | CPU/GPU | Model | Eval Rate | Power (Peak) |
| :--- | :--- | :--- | :--- | :--- |
| DC-ROMA Mainboard II (8-core RISC-V) | CPU | deepseek-r1:1.5b | 0.59 Tokens/s | 32.0 W |
| DC-ROMA Mainboard II (8-core RISC-V) | CPU | llama3.2:3b | 0.31 Tokens/s | 30.6 W |
More results: https://github.com/geerlingguy/ai-benchmarks/issues/28
## Memory
`tinymembench` results:
<details>
<summary>Click to expand memory benchmark result</summary>
```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
== ==
== Note 1: 1MB = 1000000 bytes ==
== Note 2: Results for 'copy' tests show how many bytes can be ==
== copied per second (adding together read and writen ==
== bytes would have provided twice higher numbers) ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
== to first fetch data into it, and only then write it to the ==
== destination (source -> L1 cache, L1 cache -> destination) ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in ==
== brackets ==
==========================================================================
C copy backwards : 4818.5 MB/s (0.2%)
C copy backwards (32 byte blocks) : 4826.0 MB/s
C copy backwards (64 byte blocks) : 4842.2 MB/s (0.2%)
C copy : 4800.1 MB/s
C copy prefetched (32 bytes step) : 4825.5 MB/s (32.0%)
C copy prefetched (64 bytes step) : 864.5 MB/s (0.2%)
C 2-pass copy : 717.7 MB/s
C 2-pass copy prefetched (32 bytes step) : 716.7 MB/s
C 2-pass copy prefetched (64 bytes step) : 716.8 MB/s
C fill : 7789.3 MB/s (28.7%)
C fill (shuffle within 16 byte blocks) : 7794.9 MB/s
C fill (shuffle within 32 byte blocks) : 7837.7 MB/s (0.4%)
C fill (shuffle within 64 byte blocks) : 7804.5 MB/s
---
standard memcpy : 4335.0 MB/s
standard memset : 7837.1 MB/s (0.2%)
==========================================================================
== Memory latency test ==
== ==
== Average time is measured for random memory accesses in the buffers ==
== of different sizes. The larger is the buffer, the more significant ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM ==
== accesses. For extremely large buffer sizes we are expecting to see ==
== page table walk with several requests to SDRAM for almost every ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest). ==
== ==
== Note 1: All the numbers are representing extra time, which needs to ==
== be added to L1 cache latency. The cycle timings for L1 cache ==
== latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
== two independent memory accesses at a time. In the case if ==
== the memory subsystem can't handle multiple outstanding ==
== requests, dual random read has the same timings as two ==
== single reads performed one after another. ==
==========================================================================
block size : single random read / dual random read, [MADV_NOHUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 2.5 ns / 3.9 ns
131072 : 3.8 ns / 5.2 ns
262144 : 7.3 ns / 10.5 ns
524288 : 15.8 ns / 22.3 ns
1048576 : 20.2 ns / 26.5 ns
2097152 : 22.8 ns / 28.3 ns
4194304 : 51.9 ns / 79.3 ns
8388608 : 117.8 ns / 165.3 ns
16777216 : 153.1 ns / 194.1 ns
33554432 : 172.3 ns / 208.8 ns
67108864 : 185.7 ns / 222.5 ns
block size : single random read / dual random read, [MADV_HUGEPAGE]
1024 : 0.0 ns / 0.0 ns
2048 : 0.0 ns / 0.0 ns
4096 : 0.0 ns / 0.0 ns
8192 : 0.0 ns / 0.0 ns
16384 : 0.0 ns / 0.0 ns
32768 : 0.0 ns / 0.0 ns
65536 : 2.5 ns / 3.9 ns
131072 : 3.8 ns / 5.2 ns
262144 : 4.7 ns / 6.1 ns
524288 : 12.1 ns / 16.4 ns
1048576 : 16.1 ns / 19.4 ns
2097152 : 17.7 ns / 20.3 ns
4194304 : 45.2 ns / 66.7 ns
8388608 : 106.8 ns / 146.5 ns
16777216 : 137.7 ns / 168.9 ns
33554432 : 153.1 ns / 175.6 ns
67108864 : 164.9 ns / 183.9 ns
```
</details>
### Core to Core Memory Latency
<img width="1920" height="1080" alt="Image" src="https://github.com/user-attachments/assets/693b859d-02a9-4646-86c4-d7425c416831" />
See discussion about [memory access improvements on this system](https://github.com/ThomasKaiser/sbc-bench/issues/125#issuecomment-3396615704).
## `sbc-bench` results
See: https://github.com/ThomasKaiser/sbc-bench/issues/125
## Phoronix Test Suite
Results from [pi-general-benchmark.sh](https://gist.github.com/geerlingguy/570e13f4f81a40a5395688667b1f79af):
- pts/encode-mp3: TODO sec
- pts/x264 4K: TODO fps
- pts/x264 1080p: TODO fps
- pts/phpbench: 104733
- pts/build-linux-kernel (defconfig): 2852.438 sec
I tracked down the difference between GCC 13.2 and 13.3 to this set of patches: [PATCH v5 00/11] RISC-V: Implement ISA Manual Table A.6 Mappings.
And in particular to this patch: [PATCH v5 09/11] RISC-V: Weaken mem_thread_fence. The patch is intended to relax the memory fences, not strengthen them; e.g., generated code that used to contain `fence iorw,iorw` now gets `fence r,rw` instead. Take this code block as an example:
```cpp
for (int n = 0; n < 100; ++n) {
    while (seq1.load(std::memory_order_acquire) != n)
        ;
    seq2.store(n, std::memory_order_release);
}
```
And the corresponding assembly:
```diff
 .L169:
 	ld    a4,56(s0)    # a4 <- &seq2
 	fence rw,w         # fence for the store_release below
 	sw    a5,0(a4)     # seq2 = n
 	addiw a5,a5,1      # n++
 	beq   a5,a0,.L151  # exit loop when n == 100
 .L146:
 	ld    a4,48(s0)    # a4 <- &seq1
 	lw    a4,0(a4)     # a4 = seq1
-	fence iorw,iorw    # fence for the load_acquire above
+	fence r,rw
 	bne   a4,a5,.L146  # spin while seq1 != n
 	j     .L169
```
Given that the fence has been relaxed, you’d expect better results, but in reality it’s 10x worse! Based on my testing, the io bits of the fence are not the issue: I can relax `fence iorw,iorw` to `fence rw,rw` and get exactly the same latency numbers. It’s the relaxation from `fence rw,rw` to `fence r,rw` that causes the problem. The relaxation is correct, and it properly represents the load-acquire semantics, but it appears the EIC7700 doesn’t play nice with this change.

My theory is that the store to seq2 sits in the store buffer for too long before it gets flushed, causing the high latency; when the fence includes the `w` bit, it speeds up the flush. Does this have something to do with the core itself, or with the interconnect? I ran the exact same binary (statically compiled) on a StarFive JH7110, and there’s zero difference before and after that patch was introduced, and the latency numbers are overall a lot better than on the EIC7700.

Please check if there’s anything that can be done to mitigate this problem. Thanks.
Bo