Coremark benchmark degradation on HiFive Unmatched

We ran the CoreMark benchmark on a HiFive Unmatched.
Details:
git clone https://github.com/eembc/coremark
cd coremark
make
./coremark.exe

Coremark/MHz = 2.63

Linux kernel: 5.10.41
Clock speed set : 1.5GHz

We were expecting a Coremark/MHz of approximately 5, as per the results listed at https://en.wikichip.org/wiki/coremark-mhz.

Can you please let us know if we are missing anything on our side?

Thanks

The output of that command here is

$ uname -r
5.13.9-00012-g71cf60a44907
$ ./coremark.exe 
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 13309
Total time (secs): 13.309000
Iterations/Sec   : 4508.227515
Iterations       : 60000
Compiler version : GCC10.3.0
Compiler flags   : -O2 -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0xbd59
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 4508.227515 / GCC10.3.0 -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

Mine is running at 1.4 GHz, so Coremark/MHz is 3.2. Still, that is not close to the 5.1 reported on that page. I wonder under what conditions it was measured.
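
For reference, the Coremark/MHz figures in this thread are just the reported Iterations/Sec divided by the core clock in MHz. A minimal sketch in C, where the function name coremark_per_mhz is made up for illustration and the clock value has to come from you, since the benchmark does not report it:

```c
#include <assert.h>

/* CoreMark/MHz = reported Iterations/Sec divided by the core clock in MHz.
   coremark_per_mhz is a hypothetical helper name, not part of CoreMark. */
double coremark_per_mhz(double iterations_per_sec, double clock_mhz)
{
    assert(clock_mhz > 0.0); /* the clock must be known and non-zero */
    return iterations_per_sec / clock_mhz;
}
```

With the run above, coremark_per_mhz(4508.227515, 1400.0) comes out at about 3.22.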

Please see instructions to build Coremark here:
https://sifive.cdn.prismic.io/sifive/05d149d5-967c-4ce3-a7b9-292e747e6582_hifive-unmatched-sw-reference-manual-v1p0.pdf#page=39

I was curious, so I did some runs at different CPU speeds. I used the Debian GCC, not SiFive’s as suggested in the document, but I used the same compile settings (in the second and third tables).

  • Default compile of CoreMark (upstream)
    GCC10.3.0 -O2 -DPERFORMANCE_RUN=1 -lrt / Heap
    Clock     Coremark       Coremark/MHz
    1.0 GHz   3231.539829    3.23
    1.1 GHz   3573.342862    3.24
    1.2 GHz   3915.937867    3.26
    1.3 GHz   4255.017375    3.27
    1.4 GHz   4507.888805    3.22
    1.5 GHz   4880.032534    3.25
  • Optimized compile of CoreMark (upstream)
    GCC10.3.0 -O2 -fno-common -funroll-loops -finline-functions -funroll-all-loops -falign-functions=8 -falign-jumps=8 -falign-loops=8 -finline-limit=1000 -mtune=sifive-7-series -ffast-math -fno-tree-loop-distribute-patterns --param fsm-scale-path-stmts=3 -DPERFORMANCE_RUN=1 -lrt / Heap
    Clock     Coremark       Coremark/MHz
    1.0 GHz   3684.145892    3.68
    1.1 GHz   4074.426185    3.70
    1.2 GHz   4463.621485    3.72
    1.3 GHz   4852.013586    3.73
    1.4 GHz   5142.109200    3.67
    1.5 GHz   5530.695359    3.69
  • Optimized compile of CoreMark (freedom-e-sdk)
    GCC10.3.0 -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 --param inline-min-speedup=10 -O2 -fno-common -funroll-loops -finline-functions -funroll-all-loops -falign-functions=8 -falign-jumps=8 -falign-loops=8 -finline-limit=1000 -mtune=sifive-7-series -ffast-math -fno-tree-loop-distribute-patterns --param fsm-scale-path-stmts=3 -DPERFORMANCE_RUN=1 -lrt / Heap
    Clock     Coremark       Coremark/MHz
    1.0 GHz   4418.587525    4.42
    1.1 GHz   4875.274234    4.43
    1.2 GHz   5341.361562    4.45
    1.3 GHz   5810.882198    4.47
    1.4 GHz   6162.119769    4.40
    1.5 GHz   6625.308679    4.42

Are you using the freedom-e-sdk version of coremark? This has a trick in the core_portme.h file which does
typedef signed int ee_u32;

There is a loop in coremark that uses an unsigned int as an iterator. Because RISC-V always represents a 32-bit value in a 64-bit register as sign-extended, that requires extra zero-extend instructions in a critical loop, which hurts performance. And zero-extending a 32-bit value to 64 bits requires two shift instructions.
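
To make the two-shift zero extension concrete, here is a small C model of it. This is only a sketch: the function names are invented for illustration, and the cast to int32_t relies on the usual two's-complement wrap-around that GCC and Clang provide for that conversion.

```c
#include <stdint.h>

/* Model of how RV64 holds a 32-bit value in a 64-bit register: bits 63..32
   are copies of bit 31, whether or not the C type is unsigned.
   (Hypothetical helper; the uint32_t -> int32_t cast is assumed to wrap.) */
uint64_t rv64_reg_from_u32(uint32_t v)
{
    return (uint64_t)(int64_t)(int32_t)v;
}

/* Zero-extend the low 32 bits without Zba/Zbb: shift left by 32 (slli),
   then logical shift right by 32 (srli). */
uint64_t zext32_two_shifts(uint64_t reg)
{
    return (reg << 32) >> 32;
}
```

For the value 0x80000000, the register model holds 0xFFFFFFFF80000000, and the two shifts bring it back to 0x0000000080000000 before it can be used as an unsigned 64-bit offset.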

There is also an issue that the GCC loop optimizer can’t prove that the iterator won’t overflow in this case (signed int has undefined overflow and unsigned int does not) and an important optimization doesn’t happen. This isn’t a RISC-V problem, this problem happens for all 64-bit targets. If you want good performance for a 64-bit target, don’t write code that uses an unsigned 32-bit int as a loop iterator.
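
As an illustration of the pattern, here are two versions of the same loop that differ only in the index type. This is not the actual CoreMark loop; the function names and the int16_t element type are assumptions for the sketch, and what the compiler really emits depends on its version and flags.

```c
#include <stdint.h>

/* Unsigned 32-bit indices: i * n + j is defined to wrap modulo 2^32, so a
   64-bit compiler has to preserve that wrap (zero-extend) before using the
   result as an address offset. */
int32_t sum_u32_index(const int16_t *a, uint32_t n)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < n; i++)
        for (uint32_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Signed 32-bit indices: overflow is undefined, so the compiler is allowed
   to assume it never happens and work with 64-bit values directly. */
int32_t sum_s32_index(const int16_t *a, int32_t n)
{
    int32_t sum = 0;
    for (int32_t i = 0; i < n; i++)
        for (int32_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}
```

Both functions return the same result for any valid input. Comparing the output of gcc -O2 -S for the two is a quick way to see whether your compiler treats them differently.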

The zba extension adds some new instructions to do arithmetic on 32-bit unsigned values, and zba+zbb adds the missing zero/sign extend instructions. With those two extensions we can generate much better code for coremark, and I think that this trick may not be required anymore for current cores. But Unmatched doesn’t have zba or zbb so still needs the trick.

The coremark rules changed after we started quoting U74/Unmatched coremark results, and this trick is no longer allowed. So we should be saying “best effort”, not “best legal”, when quoting these coremark values for Unmatched. We already do that for Dhrystone. But maybe we don’t do that on the web site because our current IP Cores have zba+zbb support and hence, I think, don’t need the trick.

I was using the upstream version, up to now. I’ve added a table with the results for the version from the freedom-e-sdk. It’s clearly faster.

Thanks. I’m surprised to see the type of a loop iterator make so much of a difference! (but also I guess microbenchmarks like this are kind of a special case)

There is no point in aligning everything to 4 and then later in the command line aligning them to 8. The latter one will take precedence.

The right setting for U74 is generally to align branch targets to 4. This ensures that two (compressed) instructions can be dispatched and eliminates what I think is a 1-cycle stall if the jump is to the middle of a 4-byte unit. More than that just wastes time decoding NOPs and wastes L1 cache space.
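
To put a number on the wasted space, the padding needed to reach a given alignment can be computed as below. This is a rough sketch with a made-up helper name, and it only counts bytes; it says nothing about how the U74 actually fetches or dispatches them.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of padding needed to bring addr up to the next multiple of
   alignment (a power of two). align_padding is a hypothetical helper. */
uint64_t align_padding(uint64_t addr, uint64_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (alignment - (addr & (alignment - 1))) & (alignment - 1);
}
```

A branch target that would otherwise land at 0x1002 needs 2 bytes of padding with an alignment of 4, but 6 bytes with an alignment of 8.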

There might be some benefit to aligning functions to the cache line size, but it would be smaller and can, again, waste cache space. On the other hand, if a hot function follows a cold function then it’s a waste of cache space to have code from the cold function in there too.

I haven’t personally found the -mtune=sifive-7-series to be helpful.

There are default options in the freedom-e-sdk coremark Makefile, and then there are options we suggest adding on top of the default options for best performance on the Unmatched. The default options are meant to be OK across a range of products. And the Unmatched specific ones are for Unmatched. So getting the multiple alignment options is just an accident from how the software is structured.

The best options will depend on the application. A choice that gives the best performance for one application may not give the best result for another. The options we are recommending here for coremark have been tested to be good for coremark on Unmatched.

I tried various flags one by one to determine which have the most effect. I stumbled upon a subset that seems to work well (this is at 1.5 GHz).

ubuntu@riscv64:~/xfer/coremark$ ./coremark.exe
2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 16308
Total time (secs): 16.308000
Iterations/Sec   : 6745.155752
Iterations       : 110000
Compiler version : GCC10.3.0
Compiler flags   : -O2 -fno-common -funroll-loops -mtune=sifive-7-series -fno-tree-loop-distribute-patterns -falign-functions=8 -falign-jumps=8 -falign-loops=8 -funroll-all-loops -DPERFORMANCE_RUN=1  -lrt
Memory location  : Please put data memory location here
			(e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
[0]crcmatrix     : 0x1fd7
[0]crcstate      : 0x8e3a
[0]crcfinal      : 0x33ff
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 6745.155752 / GCC10.3.0 -O2 -fno-common -funroll-loops -mtune=sifive-7-series -fno-tree-loop-distribute-patterns -falign-functions=8 -falign-jumps=8 -falign-loops=8 -funroll-all-loops -DPERFORMANCE_RUN=1  -lrt / Heap
ubuntu@riscv64:~/xfer/coremark$

This was done on upstream Coremark. Just mkdir linux64 and copy over the software/coremark/linux64 directory from freedom-e-sdk. Edit the Makefile to use linux64.

diff --git a/Makefile b/Makefile
index 3ac6bdd..c0219d1 100644
--- a/Makefile
+++ b/Makefile
@@ -37,7 +37,7 @@ ifneq (,$(findstring FreeBSD,$(UNAME)))
 PORT_DIR=freebsd
 endif
 ifneq (,$(findstring Linux,$(UNAME)))
-PORT_DIR=linux
+PORT_DIR=linux64
 endif
 endif
 ifndef PORT_DIR

Then change the PORT_CFLAGS line in linux64/core_portme.mak to:

PORT_CFLAGS = -O2 -fno-common -funroll-loops -mtune=sifive-7-series -fno-tree-loop-distribute-patterns -falign-functions=8 -falign-jumps=8 -falign-loops=8 -funroll-all-loops

and then build with just make.

This is far more complicated than you need (you don’t need to copy files from anywhere, you can just edit what’s there, and you don’t even need to edit it, just do make PORT_CFLAGS="..." to override it on the command line). This is the old copy-pasta way of doing things that I got fed up enough with and got fixed upstream to share code.

Yeah, I jumped the gun on that post. All that’s needed is to edit posix/core_portme.h

diff --git a/posix/core_portme.h b/posix/core_portme.h
index e49e474..abf9d6a 100644
--- a/posix/core_portme.h
+++ b/posix/core_portme.h
@@ -112,7 +112,7 @@ typedef unsigned short ee_u16;
 typedef signed int     ee_s32;
 typedef double         ee_f32;
 typedef unsigned char  ee_u8;
-typedef unsigned int   ee_u32;
+typedef signed int     ee_u32;
 typedef uintptr_t      ee_ptr_int;
 typedef size_t         ee_size_t;
 /* align an offset to point to a 32b value */

BTW, make PORT_CFLAGS=".." doesn’t work. You have to use make XCFLAGS="..."

make XCFLAGS="-fno-common -funroll-loops -mtune=sifive-7-series -fno-tree-loop-distribute-patterns -falign-functions=8 -falign-jumps=8 -falign-loops=8 -funroll-all-loops"

If you’re a purist, you can edit just the loops that are affected instead of changing the typedef.

diff --git a/core_matrix.c b/core_matrix.c
index 29fd8ab..3a8a969 100644
--- a/core_matrix.c
+++ b/core_matrix.c
@@ -35,13 +35,13 @@ zero). NxN Matrix C - used for the result.
         The actual values for A and B must be derived based on input that is not
 available at compile time.
 */
-ee_s16 matrix_test(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B, MATDAT val);
-ee_s16 matrix_sum(ee_u32 N, MATRES *C, MATDAT clipval);
-void   matrix_mul_const(ee_u32 N, MATRES *C, MATDAT *A, MATDAT val);
-void   matrix_mul_vect(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B);
-void   matrix_mul_matrix(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B);
-void   matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B);
-void   matrix_add_const(ee_u32 N, MATDAT *A, MATDAT val);
+ee_s16 matrix_test(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B, MATDAT val);
+ee_s16 matrix_sum(ee_s32 N, MATRES *C, MATDAT clipval);
+void   matrix_mul_const(ee_s32 N, MATRES *C, MATDAT *A, MATDAT val);
+void   matrix_mul_vect(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B);
+void   matrix_mul_matrix(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B);
+void   matrix_mul_matrix_bitextract(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B);
+void   matrix_add_const(ee_s32 N, MATDAT *A, MATDAT val);
 
 #define matrix_test_next(x)      (x + 1)
 #define matrix_clip(x, y)        ((y) ? (x)&0x0ff : (x)&0x0ffff)
@@ -91,7 +91,7 @@ printmatC(MATRES *C, ee_u32 N, char *name)
 ee_u16
 core_bench_matrix(mat_params *p, ee_s16 seed, ee_u16 crc)
 {
-    ee_u32  N   = p->N;
+    ee_s32  N   = p->N;
     MATRES *C   = p->C;
     MATDAT *A   = p->A;
     MATDAT *B   = p->B;
@@ -127,7 +127,7 @@ core_bench_matrix(mat_params *p, ee_s16 seed, ee_u16 crc)
         After the last step, matrix A is back to original contents.
 */
 ee_s16
-matrix_test(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B, MATDAT val)
+matrix_test(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B, MATDAT val)
 {
     ee_u16 crc     = 0;
     MATDAT clipval = matrix_big(val);
@@ -178,14 +178,14 @@ matrix_test(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B, MATDAT val)
    determined at compile time
 */
 ee_u32
-core_init_matrix(ee_u32 blksize, void *memblk, ee_s32 seed, mat_params *p)
+core_init_matrix(ee_s32 blksize, void *memblk, ee_s32 seed, mat_params *p)
 {
-    ee_u32  N = 0;
+    ee_s32  N = 0;
     MATDAT *A;
     MATDAT *B;
     ee_s32  order = 1;
     MATDAT  val;
-    ee_u32  i = 0, j = 0;
+    ee_s32  i = 0, j = 0;
     if (seed == 0)
         seed = 1;
     while (j < blksize)
@@ -235,11 +235,11 @@ core_init_matrix(ee_u32 blksize, void *memblk, ee_s32 seed, mat_params *p)
         Otherwise, reset the accumulator and add 10 to the result.
 */
 ee_s16
-matrix_sum(ee_u32 N, MATRES *C, MATDAT clipval)
+matrix_sum(ee_s32 N, MATRES *C, MATDAT clipval)
 {
     MATRES tmp = 0, prev = 0, cur = 0;
     ee_s16 ret = 0;
-    ee_u32 i, j;
+    ee_s32 i, j;
     for (i = 0; i < N; i++)
     {
         for (j = 0; j < N; j++)
@@ -266,9 +266,9 @@ matrix_sum(ee_u32 N, MATRES *C, MATDAT clipval)
         This could be used as a scaler for instance.
 */
 void
-matrix_mul_const(ee_u32 N, MATRES *C, MATDAT *A, MATDAT val)
+matrix_mul_const(ee_s32 N, MATRES *C, MATDAT *A, MATDAT val)
 {
-    ee_u32 i, j;
+    ee_s32 i, j;
     for (i = 0; i < N; i++)
     {
         for (j = 0; j < N; j++)
@@ -282,9 +282,9 @@ matrix_mul_const(ee_u32 N, MATRES *C, MATDAT *A, MATDAT val)
         Add a constant value to all elements of a matrix.
 */
 void
-matrix_add_const(ee_u32 N, MATDAT *A, MATDAT val)
+matrix_add_const(ee_s32 N, MATDAT *A, MATDAT val)
 {
-    ee_u32 i, j;
+    ee_s32 i, j;
     for (i = 0; i < N; i++)
     {
         for (j = 0; j < N; j++)
@@ -300,9 +300,9 @@ matrix_add_const(ee_u32 N, MATDAT *A, MATDAT val)
    coefficients is applied to the matrix.)
 */
 void
-matrix_mul_vect(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B)
+matrix_mul_vect(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B)
 {
-    ee_u32 i, j;
+    ee_s32 i, j;
     for (i = 0; i < N; i++)
     {
         C[i] = 0;
@@ -319,9 +319,9 @@ matrix_mul_vect(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B)
    scaling.
 */
 void
-matrix_mul_matrix(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B)
+matrix_mul_matrix(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B)
 {
-    ee_u32 i, j, k;
+    ee_s32 i, j, k;
     for (i = 0; i < N; i++)
     {
         for (j = 0; j < N; j++)
@@ -341,9 +341,9 @@ matrix_mul_matrix(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B)
    scaling.
 */
 void
-matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B)
+matrix_mul_matrix_bitextract(ee_s32 N, MATRES *C, MATDAT *A, MATDAT *B)
 {
-    ee_u32 i, j, k;
+    ee_s32 i, j, k;
     for (i = 0; i < N; i++)
     {
         for (j = 0; j < N; j++)
diff --git a/coremark.h b/coremark.h
index 9c5e406..49da159 100644
--- a/coremark.h
+++ b/coremark.h
@@ -176,7 +176,7 @@ ee_u16 core_bench_state(ee_u32 blksize,
                         ee_u16 crc);
 
 /* matrix benchmark functions */
-ee_u32 core_init_matrix(ee_u32      blksize,
+ee_u32 core_init_matrix(ee_s32      blksize,
                         void *      memblk,
                         ee_s32      seed,
                         mat_params *p);

Obviously, as discussed, this now renders the results invalid.

It should work, I see no reason why it wouldn’t, it’s just a standard assignment in the Makefile so can be overridden on the command line just fine. You’ll just need to give the full value of PORT_CFLAGS, i.e. include -O2. But yes, XCFLAGS is the preferred way to add additional flags.

Editing the source of the benchmark itself is expressly forbidden and invalidates the result, and that has always been true (unlike lying about ee_u32). make check will fail.

Yup, I neglected the -O2 with make PORT_CFLAGS="...".

According to the rules, yes.

Rules which basically explicitly say “we have deliberately used bad coding style in our benchmark to penalise 64 bit ISAs that don’t zero-extend 32 bit results”.

This is presumably originally a 32 bit benchmark, where it wouldn’t matter whether ee_u32 or ee_s32 was used for those variables. But size_t would have been better practice, and would have worked seamlessly on either flavour of 64 bit ISA.
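
As a sketch of what that would look like, here is a loop in the style of matrix_add_const with size_t indices. It is hypothetical: the function name and the int16_t element type are assumptions, it is not part of CoreMark, and using it in the benchmark would itself count as a source modification under the rules discussed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Add a constant to every element of an n x n matrix using size_t indices.
   size_t is as wide as a pointer on common 32- and 64-bit ABIs, so the index
   needs no zero or sign extension before it is used as an offset. */
void add_const_size_t(size_t n, int16_t *a, int16_t val)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] += val;
}
```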

If they were unbiased they would be willing to accept a patch to use size_t on all machines.

Whether or not you like them doesn’t change whether they exist. Changing the benchmark changes its characteristics and makes it no longer comparable to previous results. Don’t get me wrong, I dislike CoreMark for many reasons, but benchmark rules are rules, not guidelines, and any breach thereof has to be clearly disclosed otherwise the numbers become even more meaningless than these kinds of benchmarks already are. So there is no debate to be had here; of course when I say “renders the results invalid” I mean due to the rules, what else would it mean.

Hi,
I have a few questions.

  1. As per https://sifive.cdn.prismic.io/sifive/05d149d5-967c-4ce3-a7b9-292e747e6582_hifive-unmatched-sw-reference-manual-v1p0.pdf#page=39
    coremark/MHz is calculated for a frequency of 1 GHz. Is this calculation for the U74 or the U54? I ask because, as per the FU740-C000 manual, the U74 operates at 1.8 GHz.

  2. Also, I executed coremark with Upstream & Freedom-e-sdk source. The results I got are,

upstream: coremark/MHz= 4447/1500=2.96
freedom-e-sdk: coremark/MHz= 5020.76/1500=3.34

but as per https://en.wikichip.org/wiki/coremark-mhz , coremark/MHz for U74 should be 5.1. How can we find the cause of the degradation?

Thanks

The page you refer to says 1.2 GHz not 1.8 GHz.

But the machine you are using could be running at 1.0 or 1.4 or even 1.5 GHz (as mine is). If you don’t know how it is configured (by U-Boot) then you don’t know what to divide by.

A good way to check your clock rate is “perf stat /bin/ls” which will give a clock rate estimate on the cycles line. You might need to use sudo, depending on the distro.
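
As far as I understand it, that estimate is the cycle count divided by the time the task actually ran on the CPU. A minimal sketch of the arithmetic, with an invented helper name and with the inputs read by hand from the perf stat output:

```c
#include <assert.h>
#include <stdint.h>

/* Estimated clock in GHz = cycles / nanoseconds of task time.
   estimated_ghz is a hypothetical helper; task_clock_msec is assumed to be
   the task-clock figure in milliseconds. */
double estimated_ghz(uint64_t cycles, double task_clock_msec)
{
    assert(task_clock_msec > 0.0);
    return (double)cycles / (task_clock_msec * 1e6); /* msec -> nsec */
}
```

For example, 1,500,000 cycles over a task-clock of 1.0 msec (made-up numbers) works out to 1.5 GHz, so the Iterations/Sec figure would be divided by 1500.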

The numbers we publish in the Software Reference Manual are best effort numbers, and can only be reproduced if you do everything exactly the same way as they were originally done. You should not expect to be able to reproduce these numbers without some work. Unfortunately, the document doesn’t give all of the details to reproduce, e.g. the OS name and version, the compiler version, the board version, etc. You should be able to get ~4.5 with a bit of work though. One of the previous messages in this thread did get that.

This document was written a while ago, so it was probably an old OS release with an older compiler version. Compiler optimization options are sensitive to compiler versions, and the best options for one compiler version may not be good for the next compiler version. So using the stated optimizer options with the wrong compiler could reduce performance. You might need the SiFive freedom-u-sdk 2021.03 release to reproduce. This incidentally runs at 1 GHz.

This may have also been one of the pre-release boards. It is possible that the pre-release boards have a slightly different performance profile than the release boards.

Also note that the performance of a board is not the same as the performance of an IP Core. The u74 numbers we publish are generated via simulation and emulation (e.g. verilator and fpga). The speed of a board will depend on the SoC design and the board design, and the choice of peripherals on the SoC and board. The unmatched board isn’t designed to show the max performance of the u74 core. It is intended to show that the u74 works, and provide a good software development platform.