Low single-core STREAM bandwidth

Hi all,
I’m having trouble with single-core STREAM bandwidth on the P550 Premier (EIC7700), and I don’t know what I’m doing wrong.
Maybe somebody can help me?
One more disclaimer: I’m a newbie in the RISC-V and P550 world.

OS: default, Ubuntu 24.04.2 LTS
Kernel: default, 6.6.77-1-premier #4 SMP PREEMPT_DYNAMIC Thu Apr 10 00:15:20 UTC 2025
Compiler:

clang -v
Ubuntu clang version 18.1.3 (1ubuntu1)
Target: riscv64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/13
Found candidate GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/14
Selected GCC installation: /usr/bin/../lib/gcc/riscv64-linux-gnu/14

I got STREAM from the jeffhammond/STREAM benchmark repository on GitHub (https://github.com/jeffhammond/STREAM).

The main parts of the Makefile:

CC = clang
CFLAGS = -march=rv64gc_zba_zbb -mabi=lp64d -mtune=sifive-u74 -mcmodel=medany -msmall-data-limit=8 -ffunction-sections -fdata-sections -fno-common -ftls-model=local-exec -O3 -falign-functions=4 -mllvm -unroll-count=8 -Wno-unknown-pragmas -Wno-unused-but-set-variable -fopenmp

stream_c.exe: stream.c
        $(CC) $(CFLAGS) stream.c -o stream_c.exe

Freq

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
1400000
1400000
1400000
1400000
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
1400000
1400000
1400000
1400000

cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
performance
performance
performance

Run output

make;  OMP_NUM_THREADS=1 taskset -c 1 ./stream_c.exe

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 92772 microseconds.
   (= 92772 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            2071.9     0.088507     0.077225     0.093541
Scale:           2092.1     0.088172     0.076479     0.092426
Add:             1193.1     0.203638     0.201155     0.205095
Triad:           1179.1     0.204903     0.203539     0.206480
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
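
For reference, STREAM computes each rate as bytes moved divided by the best time: Copy and Scale touch two arrays per element (2 × 10,000,000 × 8 bytes = 160 MB per iteration), Add and Triad touch three (240 MB). A quick sanity check in C (my own snippet, not part of STREAM) reproduces the Copy figure above:

#include <stdio.h>

int main(void)
{
    const long   n     = 10000000;      /* STREAM_ARRAY_SIZE from the run   */
    const double bytes = 2.0 * n * 8;   /* Copy: one read + one write/elem  */
    const double tmin  = 0.077225;      /* best (Min) Copy time above       */

    /* 160e6 bytes / 0.077225 s ~= 2071.9 MB/s, matching the table. */
    printf("Copy rate: %.1f MB/s\n", bytes / tmin / 1.0e6);
    return 0;
}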

Why is the bandwidth so low?
What am I doing wrong?
Is this the expected memory bandwidth for the P550 (EIC7700)?

You are not alone. These are the numbers I produced with gcc, using the compiler flag -fno-tree-loop-distribute-patterns to keep gcc from optimizing the Copy loop into a builtin_memcpy call:

This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 89715 microseconds.
   (= 89715 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            1643.8     0.100063     0.097335     0.101276
Scale:           5806.4     0.035647     0.027556     0.046420
Add:             1729.7     0.173100     0.138755     0.187907
Triad:           1412.7     0.188869     0.169891     0.204228

Interestingly, Scale is much faster than Copy. The related assembly code:
For Copy:

    134c:       231c                    fld     fa5,0(a4)
    134e:       0721                    addi    a4,a4,8
    1350:       07a1                    addi    a5,a5,8
    1352:       fef7bc27                fsd     fa5,-8(a5)
    1356:       fed71be3                bne     a4,a3,134c <main._omp_fn.4+0x4c>

For Scale:

    12d4:       231c                    fld     fa5,0(a4)
    12d6:       07a1                    addi    a5,a5,8
    12d8:       0721                    addi    a4,a4,8
    12da:       12e7f7d3                fmul.d  fa5,fa5,fa4
    12de:       fef7bc27                fsd     fa5,-8(a5)
    12e2:       fed719e3                bne     a4,a3,12d4 <main._omp_fn.5+0x54>

It doesn’t make sense to me why a loop with an fmul.d is even faster than a loop without one.
More interestingly, I changed the simple assignment to a division by 1.0, so that Copy also uses an fmul.d for the division (with the compiler flags -frounding-math -fsignaling-nans so gcc doesn’t optimize it away):

@@ -312,7 +320,7 @@ main()
 #else
 #pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
-           c[j] = a[j];
+           c[j] = a[j] / 1.0;
 #endif
        times[0][k] = mysecond() - times[0][k];    

The performance of Copy is then on par with Scale:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5819.9     0.033835     0.027492     0.039554
Scale:           5744.6     0.035851     0.027852     0.044526
Add:             1465.3     0.175302     0.163794     0.189863
Triad:           1324.8     0.191753     0.181161     0.200913

There’s no unaligned access in this code, and it runs almost 100% of the time in user mode, except for periodic timer interrupts that go to firmware->kernel and back. I confirmed this execution flow with a HW trace. Thus, there must be something micro-architectural going on, maybe related to the prefetcher? I hope someone from SiFive can step in, explain it, and say whether there are any tweaks that would boost the numbers. To me it looks like a very noticeable HW performance issue.
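
In case anyone wants to poke at this without building all of STREAM, a minimal standalone sketch along these lines should reproduce the effect (my own code; the array size and the / 1.0 trick mirror the experiment above, and it needs the same -O3 -fno-tree-loop-distribute-patterns -frounding-math -fsignaling-nans flags so gcc keeps both loops as written):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *c = malloc(N * sizeof *c);

    /* Touch both arrays up front so first-touch page faults stay out of
       the timed loops. Each loop is timed only once here, unlike STREAM's
       best-of-10, so expect somewhat noisier numbers. */
    for (long j = 0; j < N; j++) { a[j] = 1.0; c[j] = 0.0; }

    double t = now();
    for (long j = 0; j < N; j++)
        c[j] = a[j];                 /* plain copy: fld + fsd           */
    printf("copy:      %.1f MB/s\n", 2.0 * N * 8 / (now() - t) / 1e6);

    t = now();
    for (long j = 0; j < N; j++)
        c[j] = a[j] / 1.0;           /* copy via fmul.d, as in the diff */
    printf("copy/1.0:  %.1f MB/s\n", 2.0 * N * 8 / (now() - t) / 1e6);

    free(a);
    free(c);
    return 0;
}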
