Just got my HiFive1 up and running, and I’ve already played around with some blink Arduino sketches and the programs that come with the Freedom E SDK.
I’ve run the Dhrystone example program, and this is what I’m getting:
core freq at 269418496 Hz
Dhrystone Benchmark, Version 2.1 (Language: C)
variables redacted…
Microseconds for one run through Dhrystone: 1314.6
Dhrystones per Second: 760.6
This is extraordinarily slow. I checked this again with a Dhrystone Arduino sketch, and I’m getting better results: 38412.07 Dhrystones per second for the HiFive1 with the 256 MHz PLL. Still, this isn’t good.
Just for comparison, a 16MHz Arduino Micro runs about 18000 Dhrystones per second.
Obviously, the SiFive can do better, but what’s the solution? Is this a compiler optimization error, or is this deeper into the toolchain?
The Dhrystone program in the Freedom-E SDK does not correctly report its own performance because it makes an incorrect assumption about the timer’s frequency. I’ll look into this Tuesday and hope to resolve it shortly thereafter.
Internally we’ve measured around 1.6 DMIPS/MHz (i.e., about 1000x faster than is being reported).
Finally got all the details right to run the dhrystone example code in the SDK on my new HiFive1 (various issues building the toolchain and figuring out which udev rules were needed on Fedora 25). After increasing the number of runs by a factor of 100, to 150,000,000:
Microseconds for one run through Dhrystone: 39.5
Dhrystones per Second: 25303.6
Don’t know if the timer is properly calibrated, so this may be bogus.
Here is a quick patch.
--- software/dhrystone/dhry_stubs.c.org 2017-01-02 10:57:54.843417150 +0900
+++ software/dhrystone/dhry_stubs.c 2017-01-02 14:50:10.592062860 +0900
@@ -8,7 +8,7 @@
 {
   long t;
   asm volatile ("csrr %0, mcycle" : "=r" (t));
-  return t / (get_cpu_freq() / 1000);
+  return t / get_cpu_freq();
 }
 // set the number of dhrystone iterations
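For anyone wondering why that one-line change matters, here is a minimal host-side sketch. It assumes Dhrystone treats the return value of time() as whole seconds, and it passes the cycle count and clock frequency in as plain parameters instead of reading mcycle and calling get_cpu_freq(). The names time_before_patch and time_after_patch are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for time() before the patch.
 * Dividing by (cpu_hz / 1000) yields milliseconds, not seconds. */
long time_before_patch(long cycles, long cpu_hz)
{
    assert(cpu_hz >= 1000);            /* avoid dividing by zero */
    return cycles / (cpu_hz / 1000);
}

/* Hypothetical stand-in for time() after the patch.
 * Dividing by cpu_hz yields whole seconds. */
long time_after_patch(long cycles, long cpu_hz)
{
    assert(cpu_hz > 0);
    return cycles / cpu_hz;
}
```

With an assumed 256 MHz clock and 1,280,000,000 elapsed cycles (5 seconds), time_before_patch returns 5000 while time_after_patch returns 5. If the benchmark reads the first value as seconds, it reports a run time 1000 times too long.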
get_cpu_freq() is defined in bsp/env/freedom-e300-hifive1/init.c and returns a value measured over a relatively short loop against the RTC in the E31 core. I think using the RTC directly, as follows, is more straightforward and more accurate. (In practice it gives very stable Dhrystone results.)
#include <stdint.h>

uint32_t mtime_lo(void); // defined in bsp/env/init.c

// return the RTC count (32768 ticks per second) as whole seconds
long time(void)
{
  return mtime_lo() / 32768;
}
There is a tradeoff here. Using the RTC makes the dhrystones/sec figure more accurate, but using the cycle counter makes the DMIPS/MHz figure more accurate, since the clock frequency estimate cancels out in the arithmetic. Nevertheless, using the RTC seems like the “right” thing to do.
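The cancellation can be checked with a small host-side sketch. It assumes elapsed time is derived as cycles divided by an estimated clock frequency, and it uses 1757 Dhrystones per second as the 1 DMIPS baseline, the same divisor used elsewhere in this thread. The function names are hypothetical.

```c
#include <assert.h>

/* Dhrystones per second equal to 1 DMIPS. */
#define DHRYSTONES_PER_DMIPS 1757.0

/* Dhrystones/s when elapsed time comes from a cycle count and an
 * estimated clock frequency est_hz (hypothetical helper). */
double dhrystones_per_sec_from_cycles(double runs, double cycles, double est_hz)
{
    assert(cycles > 0.0 && est_hz > 0.0);
    double seconds = cycles / est_hz;
    return runs / seconds;
}

/* DMIPS/MHz from the same inputs. est_hz appears once in the numerator
 * (via seconds) and once in the denominator (MHz), so it cancels and the
 * result reduces to runs * 1e6 / (cycles * 1757). */
double dmips_per_mhz_from_cycles(double runs, double cycles, double est_hz)
{
    double dps = dhrystones_per_sec_from_cycles(runs, cycles, est_hz);
    return dps / DHRYSTONES_PER_DMIPS / (est_hz / 1e6);
}
```

Calling both functions with the same runs and cycles but two different frequency estimates (say 250e6 and 270e6) changes the Dhrystones/s figure and leaves DMIPS/MHz unchanged up to floating-point rounding.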
Running Dhrystone on my Arty, I am getting 183486.2 / 1757 / 65 = 1.61 DMIPS/MHz:
core freq at 65000000 Hz
Dhrystone Benchmark, Version 2.1 (Language: C)
Program compiled without ‘register’ attribute
Please give the number of runs through the benchmark:
Execution starts, 100000000 runs through Dhrystone
Execution ends
Final values of the variables used in the benchmark:
Int_Glob: 5
should be: 5
Bool_Glob: 1
should be: 1
Ch_1_Glob: A
should be: A
Ch_2_Glob: B
should be: B
Arr_1_Glob[8]: 7
should be: 7
Arr_2_Glob[8][7]: 100000010
should be: Number_Of_Runs + 10
Ptr_Glob->
Ptr_Comp: -2147472312
should be: (implementation-dependent)
Discr: 0
should be: 0
Enum_Comp: 2
should be: 2
Int_Comp: 17
should be: 17
Str_Comp: DHRYSTONE PROGRAM, SOME STRING
should be: DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
Ptr_Comp: -2147472312
should be: (implementation-dependent), same as above
Discr: 0
should be: 0
Enum_Comp: 1
should be: 1
Int_Comp: 18
should be: 18
Str_Comp: DHRYSTONE PROGRAM, SOME STRING
should be: DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: 5
should be: 5
Int_2_Loc: 13
should be: 13
Int_3_Loc: 7
should be: 7
Enum_Loc: 1
should be: 1
Str_1_Loc: DHRYSTONE PROGRAM, 1’ST STRING
should be: DHRYSTONE PROGRAM, 1’ST STRING
Str_2_Loc: DHRYSTONE PROGRAM, 2’ND STRING
should be: DHRYSTONE PROGRAM, 2’ND STRING
Microseconds for one run through Dhrystone: 5.4
Dhrystones per Second: 183486.2
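For converting a printed "Dhrystones per Second" figure into DMIPS/MHz, a small helper along these lines should do. It is only a sketch: the function name is hypothetical, and the core frequency has to be supplied in MHz from the "core freq" line of the output.

```c
#include <assert.h>

/* Convert a reported Dhrystones/s figure to DMIPS/MHz.
 * 1757 Dhrystones/s is taken as 1 DMIPS; core_mhz is the clock in MHz. */
double dmips_per_mhz_from_report(double dhrystones_per_sec, double core_mhz)
{
    assert(core_mhz > 0.0);
    return dhrystones_per_sec / 1757.0 / core_mhz;
}
```

For the Arty run above, dmips_per_mhz_from_report(183486.2, 65.0) comes out at roughly 1.61. For the first post's output (760.6 at 269.418496 MHz) it comes out at roughly 0.0016, which matches the factor of about 1000 mentioned earlier.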
I played with several compiler options on my HiFive1 board and would like to share the results.
I used the freedom-sdk checked out on Jan. 10th and changed only the optimization flags in the Makefiles,
so you should be able to reproduce the results easily.
I put my hypotheses in parentheses. I welcome your comments.
Thank you.
Dhrystone
CFLAGS := -Os -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac
text data bss dec hex filename
11672 1076 12296 25044 61d4 dhrystone
Dhrystones per Second: 38167.9
CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac
text data bss dec hex filename
11880 1076 12296 25252 62a4 dhrystone
Dhrystones per Second: 632911.3
CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac -falign-functions=4 -falign-jumps=4 -falign-loops=4
text data bss dec hex filename
11900 1076 12296 25272 62b8 dhrystone
Dhrystones per Second: 724637.6
CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac -falign-functions=4 -falign-jumps=4 -falign-loops=4 -funroll-loops -finline-functions --param max-inline-insns-auto=20
text data bss dec hex filename
12052 1076 12296 25424 6350 dhrystone
Dhrystones per Second: 714285.6
CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32ima
text data bss dec hex filename
12612 1076 12296 25984 6580 dhrystone
Dhrystones per Second: 746268.6
CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32ima -funroll-loops -finline-functions --param max-inline-insns-auto=20
text data bss dec hex filename
12028 1076 12296 25400 6338 dhrystone
Dhrystones per Second: 704225.3
-Os (-O also) gives about 20 times worse performance. (I cannot explain how this can happen…)
The alignment options help on RV32IMAC and bring the number close to RV32IMA’s. (Dhrystone has small basic blocks, so the ratio of branches is high, so the options help…?)
Loop unrolling and inlining do not help. (They just raise the I$ miss ratio…?)
CoreMark
CoreMark 1.0 : 728.100114 / GCC6.1.0 -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
text data bss dec hex filename
56480 2268 2148 60896 ede0 coremark
CoreMark 1.0 : 727.686185 / GCC6.1.0 -march=rv32imac -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
text data bss dec hex filename
56496 2268 2148 60912 edf0 coremark
CoreMark 1.0 : 726.035167 / GCC6.1.0 -march=rv32imac -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 / STACK
text data bss dec hex filename
56220 2268 2148 60636 ecdc coremark
CoreMark 1.0 : 603.773585 / GCC6.1.0 -march=rv32imac -O2 -fno-common -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
text data bss dec hex filename
43772 2268 2148 48188 bc3c coremark
CoreMark 1.0 : 600.656969 / GCC6.1.0 -march=rv32imac -O2 -fno-common / STACK
text data bss dec hex filename
43526 2268 2148 47942 bb46 coremark
CoreMark 1.0 : 45.428734 / GCC6.1.0 -march=rv32imac -O -fno-common / STACK
text data bss dec hex filename
42736 2268 2148 47152 b830 coremark
CoreMark 1.0 : 44.901252 / GCC6.1.0 -march=rv32imac -Os -fno-common / STACK
text data bss dec hex filename
42288 2268 2148 46704 b670 coremark
CoreMark 1.0 : 401.505646 / GCC6.1.0 -march=rv32ima -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
text data bss dec hex filename
61106 2268 2148 65522 fff2 coremark
CoreMark 1.0 : 401.379743 / GCC6.1.0 -march=rv32ima -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 / STACK
text data bss dec hex filename
61050 2268 2148 65466 ffba coremark
CoreMark 1.0 : 629.921260 / GCC6.1.0 -march=rv32ima -O2 -fno-common / STACK
text data bss dec hex filename
46758 2268 2148 51174 c7e6 coremark
Again, -Os (-O also) gives very poor performance. (About 13 times worse. Why?)
Unlike with Dhrystone, the alignment options do not help on RV32IMAC. (CoreMark has longer basic blocks than Dhrystone, so the options do not help…?)
Loop unrolling and inlining help on RV32IMAC but not on RV32IMA. (The code cannot fit in the I$ on RV32IMA…?)
I wonder what the interaction is between asking for inlining and then specifying -O or -Os?
One thing to keep in mind is that as soon as you get over 16 KB of code, there starts to be a real possibility that three hot functions scattered in random places will all map to the same place in the cache. Avoiding that is another benefit of inlining, one that is seldom a factor these days on x86, which I believe has had 8-way L1 caches since at least Nehalem.
(ARM also typically has only 2 ways on the L1 caches, same as the E310.)
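To make the mapping concrete, here is a rough sketch of how addresses land in a 16 KB, 2-way cache. The 32-byte line size is my assumption and is not stated in this thread. Each hot function is represented by a single address, which is a simplification, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 16 KiB total, 2 ways, 32-byte lines (line size is
 * an assumption). That gives 256 sets and 8 KiB per way. */
enum {
    CACHE_BYTES = 16 * 1024,
    CACHE_WAYS  = 2,
    LINE_BYTES  = 32,
    CACHE_SETS  = CACHE_BYTES / (CACHE_WAYS * LINE_BYTES)
};

/* Set index that an instruction address maps to. */
unsigned cache_set(uint32_t addr)
{
    return (addr / LINE_BYTES) % CACHE_SETS;
}

/* Returns 1 if more of the given addresses share one set than there are
 * ways, meaning they cannot all stay resident at the same time. */
int addresses_conflict(const uint32_t *addrs, int n)
{
    assert(addrs != 0 && n >= 0);
    for (int i = 0; i < n; i++) {
        int same = 0;
        for (int j = 0; j < n; j++)
            if (cache_set(addrs[j]) == cache_set(addrs[i]))
                same++;
        if (same > CACHE_WAYS)
            return 1;
    }
    return 0;
}
```

Under these assumptions, addresses that are a multiple of 8 KB apart share a set. Two such hot spots coexist in the two ways, but a third one evicts one of the others each time around the loop.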
I also suspected the I-cache first.
But Dhrystone is infamous for its small size, 12 KB in this case, so it fits in the 8 KB x 2 I-cache.
There must be another reason, as you wrote “… as soon as you get over 16 KB of code…”.
I took a look at the disassembled code but could not find any problem.
If I had an RTL simulation environment, I would want to see a waveform of the 2nd loop, where the I$ should hit.
I am really a dummy here, but don’t both -O and -Os skip the -O2 optimizations that “typically increase code size” and/or those that take “a great deal of compilation time”?
Just looking at the different file sizes in your performance listing, is it not obvious (or at least a reasonable speculation) that some “code size increasing” optimizations performed in -O2 are missing in the -O/-Os runs?
Though looking at the first two Dhrystone results, which differ only in -Os vs -O2, the difference is only 208 bytes.
We don’t know what percent change that is in the dhrystone code itself, as most of the size will be the standard library/runtime, which won’t have been recompiled or changed in size.