Low single-core STREAM bandwidth

Hi all,
I'm having some trouble with single-core STREAM bandwidth on the P550 Premier (EIC7700), and I don't know what I'm doing wrong.
Maybe somebody can help me?
One more disclaimer: I'm a newbie in the RISC-V and P550 world.

OS: default, Ubuntu 24.04.2 LTS
Kernel: default, 6.6.77-1-premier #4 SMP PREEMPT_DYNAMIC Thu Apr 10 00:15:20 UTC 2025
Compiler:

clang -v
Ubuntu clang version 18.1.3 (1ubuntu1)
Target: riscv64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/13
Found candidate GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/14
Selected GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/14

I got STREAM from GitHub (jeffhammond/STREAM).

Main parts of the Makefile:

CC = clang
CFLAGS = -march=rv64gc_zba_zbb -mabi=lp64d -mtune=sifive-u74 -mcmodel=medany -msmall-data-limit=8 -ffunction-sections -fdata-sections -fno-common -ftls-model=local-exec -O3 -falign-functions=4 -mllvm -unroll-count=8 -Wno-unknown-pragmas -Wno-unused-but-set-variable -fopenmp

stream_c.exe: stream.c
        $(CC) $(CFLAGS) stream.c -o stream_c.exe

Frequency settings

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
1400000
1400000
1400000
1400000
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
1400000
1400000
1400000
1400000

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance

Run output

make;  OMP_NUM_THREADS=1 taskset -c 1 ./stream_c.exe

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 92772 microseconds.
   (= 92772 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2071.9     0.088507     0.077225     0.093541
Scale:           2092.1     0.088172     0.076479     0.092426
Add:             1193.1     0.203638     0.201155     0.205095
Triad:           1179.1     0.204903     0.203539     0.206480
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Why is the bandwidth so low?
What am I doing wrong?
Is this the expected memory bandwidth for the P550 (EIC7700)?

You are not alone. These are the numbers I produced with gcc, using the compiler flag -fno-tree-loop-distribute-patterns to keep gcc from optimizing the Copy loop into a builtin memcpy:

This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 89715 microseconds.
   (= 89715 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1643.8     0.100063     0.097335     0.101276
Scale:           5806.4     0.035647     0.027556     0.046420
Add:             1729.7     0.173100     0.138755     0.187907
Triad:           1412.7     0.188869     0.169891     0.204228
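
For reference, here is roughly the pattern in play (a minimal standalone sketch, not the actual stream.c source). At -O3, GCC's loop-distribution pass recognizes the plain copy loop as a block copy and replaces it with a memcpy call; -fno-tree-loop-distribute-patterns suppresses that, keeping the explicit load/store loop that STREAM intends to measure.

#include <stddef.h>

#define N 10000000           /* same element count as the runs above */
static double a[N], c[N];

/* Without -fno-tree-loop-distribute-patterns, GCC at -O3 turns this
 * loop into a single call to memcpy(c, a, N * sizeof(double)). */
void copy_kernel(void)
{
    for (size_t j = 0; j < N; j++)
        c[j] = a[j];
}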

Interestingly, Scale is much faster than Copy. The relevant assembly code:
For Copy:

    134c:       231c                    fld     fa5,0(a4)        # load a[j]
    134e:       0721                    addi    a4,a4,8          # bump source pointer
    1350:       07a1                    addi    a5,a5,8          # bump destination pointer
    1352:       fef7bc27                fsd     fa5,-8(a5)       # store c[j]
    1356:       fed71be3                bne     a4,a3,134c <main._omp_fn.4+0x4c>

For Scale:

    12d4:       231c                    fld     fa5,0(a4)        # load c[j]
    12d6:       07a1                    addi    a5,a5,8          # bump destination pointer
    12d8:       0721                    addi    a4,a4,8          # bump source pointer
    12da:       12e7f7d3                fmul.d  fa5,fa5,fa4      # multiply by scalar (fa4)
    12de:       fef7bc27                fsd     fa5,-8(a5)       # store b[j]
    12e2:       fed719e3                bne     a4,a3,12d4 <main._omp_fn.5+0x54>

It doesn't make sense to me that a loop with an fmul.d is faster than one without.
Even more interestingly, if I change the simple assignment to a division by 1.0, so that Copy also executes an fmul.d for the divide-by-1.0 (with the compiler flags -frounding-math -fsignaling-nans so gcc doesn't optimize it away):

@@ -312,7 +320,7 @@ main()
 #else
 #pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
-           c[j] = a[j];
+           c[j] = a[j] / 1.0;
 #endif
        times[0][k] = mysecond() - times[0][k];    

The performance of Copy is then on par with Scale:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5819.9     0.033835     0.027492     0.039554
Scale:           5744.6     0.035851     0.027852     0.044526
Add:             1465.3     0.175302     0.163794     0.189863
Triad:           1324.8     0.191753     0.181161     0.200913

There's no unaligned access in this code, and it runs almost 100% of the time in user mode, except for periodic timer interrupts that go to firmware, then the kernel, and back. I confirmed this execution flow with a HW trace. So there must be something micro-architectural going on, maybe related to the prefetcher? I hope someone from SiFive can step in, explain it, and say whether there are any tweaks that would boost the numbers. To me it looks like a very noticeable HW performance issue.


@BlackS and @ganboing, thanks for bringing it to our attention and sorry about the late response. We are exploring this issue and will provide a detailed answer soon.


Hello @Raza,
do we have any updates?
Did you have a chance to look into this problem?
I still feel the pain of these unexpected STREAM results in my work.

Hello @BlackS, we have updated some settings and verified correct STREAM performance. The SW release with these changes is being tested, and we hope to publish it in a few days, pending successful test results.


Hello @BlackS

Below is the link to the latest release, which has the corrected STREAM performance. Please make sure to update the bootchain image.


Thank you. Will try it.

I can confirm a significant increase in the STREAM numbers:

-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10480.8     0.015288     0.015266     0.015328
Scale:           9877.8     0.016218     0.016198     0.016236
Add:            10560.7     0.022787     0.022726     0.022815
Triad:          10338.2     0.023263     0.023215     0.023313
-------------------------------------------------------------

Seems I guessed correctly. The issue is improper HW prefetcher settings:

CSR 0x7c3 controls the private L1 prefetcher and 0x7c4 controls the private L2 prefetcher. Anyone using this SoC probably wants to redo their benchmarks to see whether there's a performance uplift across the board, and whether any workload regresses.
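
Since 0x7c3/0x7c4 sit in the custom machine-mode CSR range, they can only be touched from M-mode firmware such as OpenSBI; a user-space csrr will trap. A minimal sketch of the accessors, assuming the code runs in M-mode (the function names are mine):

#include <stdint.h>

/* Sketch only: must execute in machine mode (e.g., from OpenSBI
 * platform code); in U- or S-mode these raise an illegal-instruction
 * exception. */
static inline uint64_t read_l1pf_csr(void)
{
    uint64_t v;
    __asm__ volatile("csrr %0, 0x7c3" : "=r"(v));   /* L1 prefetcher CSR */
    return v;
}

static inline void write_l1pf_csr(uint64_t v)
{
    __asm__ volatile("csrw 0x7c3, %0" : : "r"(v));  /* L1 prefetcher CSR */
}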


Backup material: this is what changed in the 2025.07 vendor OpenSBI release regarding the HW prefetcher settings.

@@ -1,17 +1,17 @@ L1 prefetcher settings
 {
   "name": "L1 Prefetcher CSR",
   "fields": {
     "l1pfEnable": 1,
-    "window": 36,
-    "initialDist": 12,
+    "window": 32,
+    "initialDist": 4,
     "maxAllowedDist": 31,
     "linToExpThrd": 3,
     "qFullnessThrdL1": 14,
-    "hitCacheThrdL1": 2,
-    "hitMSHRThrdL1": 0,
+    "hitCacheThrdL1": 10,
+    "hitMSHRThrdL1": 2,
     "issueBubble": 0,
-    "maxL1PFDist": 0,
+    "maxL1PFDist": 8,
     "forgiveThrd": 0,
-    "numL1PFIssQEnt": 0
+    "numL1PFIssQEnt": 2
   }
 }
@@ -1,10 +1,10 @@ L2 prefetcher settings
 {
   "name": "L1 Prefetcher CSR",
   "fields": {
     "l2pfEnable": 1,
     "qFullnessThrdL2": 15,
-    "hitCacheThrdL2": 20,
+    "hitCacheThrdL2": 31,
     "hitMSHRThrdL2": 4,
     "numL2PFIssQEnt": 2
   }
 }

Sorry, but one additional question.

Those zeroes in the l1d prefetcher configuration look suspicious.
Does this mean it was effectively disabled?

Just to close the loop.

Thanks a lot.
At first glance it looks good.
On the same binary I now get much better results:

OMP_NUM_THREADS=1 taskset -c 1 ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 14614 microseconds.
   (= 14614 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10439.1     0.015376     0.015327     0.015517
Scale:          10598.2     0.015137     0.015097     0.015283
Add:            10777.3     0.022395     0.022269     0.022641
Triad:          10677.2     0.022610     0.022478     0.022839
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Thanks a lot for your efforts

Quoting from the TRM, part 1:

3.4.1.3 Hardware Prefetcher
The Hardware Prefetcher (HWPF) is a region-based sequential stride prefetcher and is tightly integrated into the LSU. It enables memory-level parallelism (MLP) and hides memory latency. The HWPF relies on pattern detection of loads and sends hints that either go to the L1 MSHRs or directly to L2. It is composed of the HWPF module, with individual prefetch engines, and a HWPF issue queue.
The HWPF uses a single state machine that sends both L1 and L2 prefetches, depending on how far the tail pointer and respective demand pointer are separated.
The first L1 demand request allocates with the line address that consists of the region’s virtual address and offset. The second L1 demand request calculates the stride based on the current and previous addresses. The third L1 demand request confirms the stride and starts sending prefetches equal to the number programmed in the initialDist of the HWPF CSR, up to a maximum of 64 L1 prefetch requests. Each L1 prefetch request establishes an MSHR and sends an AcquireBlock.NtoB request to the downstream cache hierarchy. Each L2 prefetch request establishes an L2 HWPF queue and sends a Hint to L2.

I admit I can't make sense of any of that. If someone's familiar with SiFive caches and TileLink, feel free to step in.
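
That said, the allocate/train/confirm sequence it describes is the classic stride-prefetcher scheme. A rough software model of just that idea (initial_dist mirrors the initialDist field of CSR 0x7c3 above; everything else is my guess at the general technique, not the actual hardware):

#include <stdint.h>
#include <stdio.h>

/* The 1st demand access allocates an engine and records the address,
 * the 2nd computes a candidate stride, the 3rd confirms it and starts
 * issuing prefetches initial_dist lines ahead of the demand stream. */
struct pf_engine {
    uint64_t last_addr;
    int64_t  stride;
    int      hits;          /* demand accesses seen in this region */
    int      initial_dist;  /* cf. initialDist in CSR 0x7c3 */
};

static void on_demand_access(struct pf_engine *e, uint64_t addr)
{
    if (e->hits == 0) {
        /* 1st access: allocate the engine, just record the address */
    } else if (e->hits == 1 || (int64_t)(addr - e->last_addr) != e->stride) {
        /* 2nd access, or the stride broke: (re)train the candidate stride */
        e->stride = (int64_t)(addr - e->last_addr);
        e->hits = 1;
    } else {
        /* 3rd+ access with a matching stride: confirmed, issue prefetches */
        for (int i = 1; i <= e->initial_dist; i++)
            printf("prefetch 0x%llx\n",
                   (unsigned long long)(addr + (uint64_t)(i * e->stride)));
    }
    e->last_addr = addr;
    e->hits++;
}

int main(void)
{
    struct pf_engine e = { 0, 0, 0, 4 };   /* initialDist = 4, as in the new firmware */
    for (uint64_t a = 0x1000; a < 0x1200; a += 0x40)
        on_demand_access(&e, a);           /* sequential 64-byte cache-line stride */
    return 0;
}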


This change was a significant step backwards for my real-world application. The performance of an LDPC (Low Density Parity Check) encoder thread went from approximately 70% of a core to over 100% (which breaks the application).

Here’s the code:

In case anyone else wants to try this code, I’ve made a standalone version.

https://www.w6rz.net/ldpcatsc3.cc

https://www.w6rz.net/ldpcatsc3.h

Compile with g++ -O2 ldpcatsc3.cc -o ldpcatsc3

I found this doc from SiFive/StarFive that explains the L2 prefetcher of the U74-MC (the previous gen) in detail:

See Chapter 13.2.5.
We now have private L1/L2 and a shared L3, but I assume some of the terminology still applies. It's far more readable than the TRM released by ESWIN.

drmpeg's code suffered a significant perf regression – it took ~2x the time to finish. I did some tweaking and found that you don't need many changes to the L1/L2 prefetcher settings to boost the STREAM workload without penalizing drmpeg's LDPC workload. Starting from the original values of CSRs 0x7c3 and 0x7c4 before the patch (hifive-premier-p550: opensbi: Modify CSR registers · sifiveinc/meta-sifive@9759264 · GitHub), increasing maxL1PFDist a little is all you need to boost STREAM performance. I tried setting it to 2 or 3, got pretty good STREAM perf (on par with the new firmware release), and saw no noticeable regression in LDPC. ESWIN/SiFive need to do more testing to make the L1/L2 prefetcher settings fit a wider range of workloads. Even better, provide an SBI interface so they can be adjusted without flashing new firmware (a rough sketch of what that could look like follows the settings below).

FYI: My current setting:

0x7c3: 0x1005c1be649  {
  "reg": "0x7c3",
  "name": "L1 Prefetcher CSR",
  "fields": {
    "l1pfEnable": 1,
    "window": 36,
    "initialDist": 12,
    "maxAllowedDist": 31,
    "linToExpThrd": 3,
    "qFullnessThrdL1": 14,
    "hitCacheThrdL1": 2,
    "hitMSHRThrdL1": 0,
    "issueBubble": 0,
    "maxL1PFDist": 2,
    "forgiveThrd": 0,
    "numL1PFIssQEnt": 0
  }
}
0x7c4: 0x929f  {
  "reg": "0x7c4",
  "name": "L1 Prefetcher CSR",
  "fields": {
    "l2pfEnable": 1,
    "qFullnessThrdL2": 15,
    "hitCacheThrdL2": 20,
    "hitMSHRThrdL2": 4,
    "numL2PFIssQEnt": 2
  }
}
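
To illustrate the SBI idea: something along these lines, as a purely hypothetical vendor extension. The extension ID, function IDs, and all names are invented for this sketch; nothing like it exists in the released firmware. The S-mode (kernel) side could look like:

#include <stdint.h>

/* Hypothetical: an SBI vendor extension that lets the kernel ask M-mode
 * firmware to rewrite the prefetcher CSRs, since only M-mode can touch
 * CSR 0x7c3/0x7c4. IDs and names are made up for illustration; the
 * ecall calling convention below is the standard SBI one. */
#define SBI_EXT_VENDOR_PF  0x09000000UL  /* vendor extension ID space */
#define PF_FUNC_SET_L1     0UL
#define PF_FUNC_SET_L2     1UL

struct sbiret { long error; long value; };

static struct sbiret sbi_ecall1(unsigned long ext, unsigned long fid,
                                unsigned long arg0)
{
    register unsigned long a0 __asm__("a0") = arg0;
    register unsigned long a1 __asm__("a1") = 0;
    register unsigned long a6 __asm__("a6") = fid;
    register unsigned long a7 __asm__("a7") = ext;
    __asm__ volatile("ecall"
                     : "+r"(a0), "+r"(a1)
                     : "r"(a6), "r"(a7)
                     : "memory");
    return (struct sbiret){ (long)a0, (long)a1 };
}

/* e.g. hand the packed 0x7c3 value (like 0x1005c1be649 above) to M-mode */
static long set_l1_prefetcher(uint64_t packed)
{
    return sbi_ecall1(SBI_EXT_VENDOR_PF, PF_FUNC_SET_L1, packed).error;
}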

Have you done any PCIe-to-host data movement or shared DRAM buffer speed tests, to see whether this also has a significant impact on that kind of memory traffic?

I haven't done any PCIe-related benchmarks. This change only touches the HW prefetcher settings, so I don't expect anything about PCIe to change, at least not directly. One thing that could really hurt PCIe perf is that the EIC7700 doesn't use cache-coherent DMA on any peripheral. I think this issue has been discussed before.

In short, DMA from PCIe is not directly visible to the CPU. The Linux kernel/driver accesses the DMA buffer in one of two ways:

  1. After the DMA completes, flush/invalidate the cache lines covering the DMA buffer – essentially “pull” in the changes made by the device
  2. Use the uncached window to access the DMA buffer in memory directly (cache bypass)

I think for your workload you might want to see if 1 is better than 2, because with 2 the CPU is not permitted to do any caching or prefetching, so every read/write is literally a memory access. With 1 you pay the penalty up front, perhaps even a bit more, because for a large region it's inefficient to flush individual lines and you'd just flush the whole cache. However, it'll be much faster afterwards, as the cache and prefetcher can kick in.
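
In Linux driver terms the two options map roughly onto the streaming vs. coherent DMA APIs. A minimal sketch using the standard kernel API (the surrounding device setup is omitted; the function names are the real kernel ones, the scenario is illustrative):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/* Option 1: streaming mapping – the buffer stays cacheable and the DMA
 * API does cache maintenance around each transfer on a non-coherent SoC. */
static void rx_streaming(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    /* ... program the device to DMA into 'handle' and wait for it ... */

    /* invalidates the stale lines so the CPU sees the DMA'd data */
    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}

/* Option 2: coherent buffer – on a non-coherent SoC this is typically
 * an uncached mapping, so every CPU access goes straight to DRAM. */
static void *alloc_uncached(struct device *dev, size_t len, dma_addr_t *handle)
{
    return dma_alloc_coherent(dev, len, handle, GFP_KERNEL);
}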

You may need to dig into the Linux source to change the strategy (1 or 2). In general, I think this is a very noticeable shortcoming of the EIC770x SoC (P550/U84 core). Years ago, StarFive's JH7100 (U74 core) suffered from the exact same issue, and later StarFive released the 2nd-gen JH7110 (also U74 core) with cache-coherent high-speed peripherals (PCIe/GMAC). Despite all that past experience, we have to deal with it all over again. The problem isn't with SiFive, but with SoC vendors not doing this properly.