Low benchmarking scores

I ran both the Dhrystone and CoreMark benchmarks on my HiFive1 board using Freedom Studio v20180122 with GCC. For Dhrystone, I was getting an average frequency of 260.5 MHz, 714285.6 Dhrystones/sec, and 1.56 DMIPS/MHz, which is slightly lower than the 1.61 DMIPS/MHz reported in the docs for the HiFive1 board. For CoreMark, I had an average frequency of 262 MHz and 702 iterations/sec, so 2.68 CoreMarks/MHz, which is lower than the reported 2.73 CM/MHz. I was wondering if there is any specific reason why the benchmark results aren’t quite measuring up to the reported numbers, such as specific compiler optimizations that I may not be using. Any insight would be helpful, thank you!

For what it’s worth I just ran Dhrystone and got 735294.1 Dhrystones/sec at a reported 271490744 Hz, which works out to 1.54 DMIPS/MHz, so we’re both getting 3% - 4% slower than 1.61.

Your 2.68 Coremarks/MHz is about 1.8% slow.
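In case it helps anyone check their own numbers, here is a minimal sketch of the arithmetic behind the per-MHz figures quoted above. The function names are just illustrative. The 1757 divisor is the conventional VAX 11/780 Dhrystones/sec baseline that defines 1 DMIPS; I am assuming the documented 1.61 figure uses the same convention, which is consistent with the numbers in this thread.

```c
#include <assert.h>

/* Conventional baseline: a VAX 11/780 scores 1757 Dhrystones/sec = 1 DMIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Dhrystones/sec and measured clock (Hz) -> DMIPS/MHz. */
double dmips_per_mhz(double dhrystones_per_sec, double clock_hz)
{
    assert(clock_hz > 0.0);
    double dmips = dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
    return dmips / (clock_hz / 1e6);
}

/* CoreMark iterations/sec and measured clock (Hz) -> CoreMarks/MHz. */
double coremark_per_mhz(double iterations_per_sec, double clock_hz)
{
    assert(clock_hz > 0.0);
    return iterations_per_sec / (clock_hz / 1e6);
}
```

For example, dmips_per_mhz(735294.1, 271490744.0) comes out at roughly 1.54, and coremark_per_mhz(702.0, 262e6) at roughly 2.68.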

We’ve observed this too, and definitely got the claimed numbers with the gcc and newlib we were using at some point in the past, but we haven’t checked exactly what has changed. The change is so small that it’s probably not worse gcc code generation in an inner loop. Maybe it’s a slightly different layout in memory/cache due to different code or link order in newlib.

One thing we do know is that on the HiFive1, code compiled with the C extension (16-bit instructions) runs slightly slower than pure 32-bit code. Using the C extension is the default.

I just did the following in my freedom-e-sdk:

make software PROGRAM=dhrystone LINK_TARGET=dhrystone RISCV_ARCH=rv32im
make upload PROGRAM=dhrystone

I got 775193.8 at 269667860 Hz which is 1.636 DMIPS/MHz.

This is 1.6% higher than the advertised 1.61.

The hardware engineers have told me the instruction fetch/decode has been improved on the E51/U54 and production versions of the E31 so this several percent difference between C and non-C performance should be greatly reduced or eliminated.

(I previously had some unnecessary rebuilding in this message … too late on a Friday afternoon!)

Thanks for the feedback! I just ran the same commands in freedom-e-sdk for Dhrystone, and got 757575.7 Dhrystones/sec at 261413273 Hz, which I calculated to be about 1.65 DMIPS/MHz. I’d definitely be interested to know why the pure 32-bit code performs slightly better than code built with the default C extension. Do you know if it is possible to run the CoreMark benchmark with the 32-bit code to see if that improves performance as well?

I didn’t try CoreMark, but you can do that yourself if you want.

I gave all the information I have about why it happens in the previous message.

Purely as speculation you can imagine that in code such as…

if (foo) i++;

… you’re going to have one instruction in the body of the “if”, and this instruction can be compiled as a 16 bit instruction.

If that instruction is in the first half of a 32-bit word then if foo is false the branch to the first instruction after the if will be into the middle of the word. Depending on how the microarchitecture works this might cause a slight stall. (I really don’t know whether it does or not)

You could align the code following the if, but in this case the i++ will be followed by a NOP which, again, will in a simple microcontroller implementation take extra time to execute if foo is true.

A really smart assembler might decide not to compress the instruction for the i++, but we don’t currently have that smart assembler :slight_smile:
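To make the alignment case above concrete, here is a small sketch that only does the address arithmetic. The addresses and function names are made up for illustration, and it says nothing about whether a mid-word branch target actually costs a cycle on the HiFive1, which remains speculation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Address of the instruction that follows one of size_bytes at addr.
   With the C extension an instruction is either 2 or 4 bytes long. */
uint32_t next_pc(uint32_t addr, uint32_t size_bytes)
{
    assert(size_bytes == 2u || size_bytes == 4u);
    return addr + size_bytes;
}

/* True if addr lands in the second half of a 32-bit word. */
bool is_mid_word(uint32_t addr)
{
    return (addr & 0x3u) == 0x2u;
}
```

With a hypothetical i++ at 0x104, is_mid_word(next_pc(0x104, 2)) is true when the increment is compressed (the branch target is 0x106) and false when it is left as a 4-byte instruction (the target is 0x108).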

Is getting 25% - 30% smaller code worth an occasional couple of percent slower code? Only you can decide. You could always compile just your most critical functions with 16 bit instructions disabled if you want.

I’ll note that a prominent competitor’s compressed instruction set is also often a little slower than their uncompressed one, for example because the uncompressed instruction set has predication built in to every instruction, while the compressed instruction set needs an extra instruction to signal predication on the following few instructions. :slight_smile:

Thank you, this helped a lot!!

I found a minute to try coremark.

rv32imac: 271469773 Hz, Iterations/Sec: 721.126761, CM/MHz: 2.66
rv32im:   270481490 Hz, Iterations/Sec: 360.563380, CM/MHz: 1.33

I was a little bit surprised and shocked that not using compressed instructions exactly halved the performance!

The only plausible reason I could see for this is that not using compressed instructions made the hot part of the code exceed the size of the 16 KB icache. The total code size increased from 72262 bytes to 97496 bytes (25234 bytes difference), so that’s certainly plausible.
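As a quick sanity check on those sizes, here is a sketch of the percentage arithmetic (the function names are mine). Note that these are total image sizes, both far larger than the icache, so they only hint at how much the hot part of the code grew.

```c
#include <assert.h>

/* Percent growth going from base_bytes to grown_bytes. */
double size_growth_percent(long base_bytes, long grown_bytes)
{
    assert(base_bytes > 0);
    return 100.0 * (double)(grown_bytes - base_bytes) / (double)base_bytes;
}

/* Percent saved going from large_bytes down to small_bytes. */
double size_saving_percent(long large_bytes, long small_bytes)
{
    assert(large_bytes > 0);
    return 100.0 * (double)(large_bytes - small_bytes) / (double)large_bytes;
}
```

size_growth_percent(72262, 97496) is a little under 35%, and size_saving_percent(97496, 72262) is a little under 26%, which fits the 25% - 30% code size saving mentioned earlier for compressed instructions.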

As an experiment, halving the speed of the SPI interface (used to load icache from the program code in flash) lowered the performance of the rv32im version to about 0.7 coremarks/MHz.

The conclusion is that 16 KB of icache is enough for CoreMark when using the compressed instruction set, but not without it.