Poor Dhrystone performance


#1

Just got my HiFive1 up and running, and I’ve already played around with some blink Arduino sketches and the programs that come with the Freedom E SDK.

I’ve run the Dhrystone example program, and this is what I’m getting:

core freq at 269418496 Hz
Dhrystone Benchmark, Version 2.1 (Language: C)

variables redacted…

Microseconds for one run through Dhrystone: 1314.6
Dhrystones per Second: 760.6

This is extraordinarily slow. I checked again with a Dhrystone Arduino sketch and got better results: 38412.07 Dhrystones per second for the HiFive1 with the 256 MHz PLL. Still, this isn’t good.

Just for comparison, a 16MHz Arduino Micro runs about 18000 Dhrystones per second.

Obviously, the HiFive1 can do better, but what’s the solution? Is this a compiler optimization problem, or something deeper in the toolchain?


(Andrew Waterman) #2

Hi,

The Dhrystone program in the Freedom-E SDK does not correctly report its own performance because it makes an incorrect assumption about the timer’s frequency. I’ll look into this Tuesday and hope to resolve it shortly thereafter.

Internally we’ve measured around 1.6 DMIPS/MHz (i.e., about 1000x faster than is being reported).

Andrew


#3

Finally got all the details right to run the Dhrystone example code in the SDK on my new HiFive1 (after various issues building the toolchain and figuring out which udev rules were needed on Fedora 25). After increasing the number of runs by a factor of 100, to 150,000,000:

Microseconds for one run through Dhrystone: 39.5
Dhrystones per Second: 25303.6

Don’t know if the timer is properly calibrated, so this may be bogus.


#4
Here is a quick patch.

    --- software/dhrystone/dhry_stubs.c.org 2017-01-02 10:57:54.843417150 +0900
    +++ software/dhrystone/dhry_stubs.c     2017-01-02 14:50:10.592062860 +0900
    @@ -8,7 +8,7 @@
     {
       long t;
       asm volatile ("csrr %0, mcycle" : "=r" (t));
    -  return t / (get_cpu_freq() / 1000);
    +  return t / get_cpu_freq();
     }
     
     // set the number of dhrystone iterations
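
To make the unit error fixed by this patch concrete, here is a minimal host-side sketch. Dhrystone treats the return value of time() as whole seconds, so dividing the cycle count by (frequency / 1000) returns milliseconds and makes every run look 1000 times longer. The function names and the 64-bit cycle argument are assumptions for illustration; they are not part of the SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Original expression: cycles / (hz / 1000) is a count of milliseconds. */
static long cycles_to_time_original(uint64_t cycles, uint64_t hz)
{
  assert(hz >= 1000);  /* avoid dividing by zero in this sketch */
  return (long)(cycles / (hz / 1000));
}

/* Patched expression: cycles / hz is a count of whole seconds. */
static long cycles_to_time_patched(uint64_t cycles, uint64_t hz)
{
  assert(hz > 0);
  return (long)(cycles / hz);
}
```

With hz = 256000000 and ten seconds' worth of cycles, the original expression returns 10000 and the patched one returns 10. Note that on RV32, csrr mcycle reads only the low 32 bits of the counter, which wrap after roughly 16 seconds at about 260 MHz, so long runs need a wider counter or a slower clock source.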

get_cpu_freq() is defined in bsp/env/freedom-e300-hifive1/init.c; it returns a value measured over a relatively short loop using the RTC in the E31 core. I think using the RTC directly, as follows, is more straightforward and more accurate. (In practice it gives very stable Dhrystone results.)

    #include <stdint.h>

    uint32_t mtime_lo(void);  // defined in bsp/env/init.c

    // return elapsed whole seconds, derived from the RTC tick count (32768 ticks per second)
    long time(void)
    {
      return mtime_lo() / 32768;
    }
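
As a hedged companion to the RTC-based time() above, this sketch shows the arithmetic on tick counts. It assumes, as the division by 32768 implies, that the value from mtime_lo() advances at 32768 Hz; the helper names are hypothetical. Unsigned subtraction keeps the elapsed count correct across a single wrap of the 32-bit value, which at this rate happens every 131072 seconds (about 36 hours).

```c
#include <assert.h>
#include <stdint.h>

#define RTC_HZ 32768u  /* assumed tick rate of the value returned by mtime_lo() */

/* Elapsed ticks between two 32-bit readings; modular arithmetic handles one wrap. */
static uint32_t rtc_elapsed_ticks(uint32_t start, uint32_t end)
{
  return end - start;
}

/* Convert ticks to seconds, keeping the fractional part. */
static double rtc_ticks_to_seconds(uint32_t ticks)
{
  return (double)ticks / RTC_HZ;
}

/* Runs per second from a run count and an elapsed tick count. */
static double rtc_runs_per_second(uint32_t runs, uint32_t ticks)
{
  assert(ticks > 0);  /* a zero-length interval has no defined rate */
  return (double)runs * RTC_HZ / ticks;
}
```

Keeping the fractional part matters mostly for short runs, where a time() that returns whole seconds has only one-second resolution.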

Here is the result.

Dhrystones per Second:                      714285.6 

714285.6 / 1757 = 406.5 VAX MIPS (DMIPS)
714285.6 / 1757 / 261 = 1.56 DMIPS/MHz

This matches with Andrew’s measurement.
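
The conversion used above can be written as a small helper. This is a sketch of the same arithmetic: 1757 is the Dhrystones-per-second score of the VAX 11/780 that defines 1 DMIPS, and the function names are made up for illustration.

```c
#include <assert.h>

#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0  /* reference score for 1 DMIPS */

/* Dhrystones per second -> DMIPS (VAX MIPS). */
static double dmips(double dhrystones_per_sec)
{
  return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
}

/* Dhrystones per second and core clock in MHz -> DMIPS/MHz. */
static double dmips_per_mhz(double dhrystones_per_sec, double core_mhz)
{
  assert(core_mhz > 0.0);
  return dmips(dhrystones_per_sec) / core_mhz;
}
```

Calling dmips_per_mhz(714285.6, 261.0) gives about 1.56, the figure computed by hand above.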


FYI, here is the result with the C extension (rv32imac).

Dhrystones per Second:                      681818.1 
681818.1 / 1757 / 261 = 1.49 DMIPS/MHz

(Andrew Waterman) #5

There is a tradeoff here. Using the RTC makes the dhrystones/sec figure more accurate, but using the cycle counter makes the DMIPS/MHz figure more accurate, since the clock frequency estimate cancels out in the arithmetic. Nevertheless, using the RTC seems like the “right” thing to do.
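
The cancellation can be checked numerically. In this sketch (hypothetical function names, floating-point arithmetic, no rounding to whole seconds), the elapsed time is derived from a cycle count and an estimated frequency f_est_hz. DMIPS/MHz then reduces to runs * 1e6 / (1757 * cycles), so the estimate drops out, while Dhrystones per second scales directly with it.

```c
#include <assert.h>

/* Dhrystones per second when time is cycles / f_est_hz: depends on f_est_hz. */
static double dps_from_cycles(double runs, double cycles, double f_est_hz)
{
  assert(cycles > 0.0 && f_est_hz > 0.0);
  double seconds = cycles / f_est_hz;
  return runs / seconds;
}

/* DMIPS/MHz using the same estimate for both the time and the MHz divisor. */
static double dmips_per_mhz_from_cycles(double runs, double cycles, double f_est_hz)
{
  double dps = dps_from_cycles(runs, cycles, f_est_hz);
  return dps / 1757.0 / (f_est_hz / 1e6);
}
```

Evaluating dmips_per_mhz_from_cycles with the same runs and cycles but two different frequency estimates returns the same value up to rounding, whereas dps_from_cycles changes in proportion to the estimate.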


(Andrew Waterman) #6

I’ve PR’d a change to the SDK to use the RTC for timing. Thanks, folks.


(Donnie Agema) #7

Running Dhrystone on my Arty, I am getting 183486.2 / 1757 / 65 = 1.61 DMIPS/MHz:

core freq at 65000000 Hz

Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without ‘register’ attribute

Please give the number of runs through the benchmark:
Execution starts, 100000000 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob: 5
should be: 5
Bool_Glob: 1
should be: 1
Ch_1_Glob: A
should be: A
Ch_2_Glob: B
should be: B
Arr_1_Glob[8]: 7
should be: 7
Arr_2_Glob[8][7]: 100000010
should be: Number_Of_Runs + 10
Ptr_Glob->
Ptr_Comp: -2147472312
should be: (implementation-dependent)
Discr: 0
should be: 0
Enum_Comp: 2
should be: 2
Int_Comp: 17
should be: 17
Str_Comp: DHRYSTONE PROGRAM, SOME STRING
should be: DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
Ptr_Comp: -2147472312
should be: (implementation-dependent), same as above
Discr: 0
should be: 0
Enum_Comp: 1
should be: 1
Int_Comp: 18
should be: 18
Str_Comp: DHRYSTONE PROGRAM, SOME STRING
should be: DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: 5
should be: 5
Int_2_Loc: 13
should be: 13
Int_3_Loc: 7
should be: 7
Enum_Loc: 1
should be: 1
Str_1_Loc: DHRYSTONE PROGRAM, 1’ST STRING
should be: DHRYSTONE PROGRAM, 1’ST STRING
Str_2_Loc: DHRYSTONE PROGRAM, 2’ND STRING
should be: DHRYSTONE PROGRAM, 2’ND STRING

Microseconds for one run through Dhrystone: 5.4
Dhrystones per Second: 183486.2


(Donnie Agema) #8

I have run the CoreMark benchmark on the Arty, but the result attachment did not finish uploading.

Has anyone run the CoreMark benchmark on the HiFive1?


(Andrew Waterman) #9

I ran CoreMark on the HiFive1 a while back, and I measured just over 700 iterations/sec at 260 MHz.


#10

Hi,

I played with several compiler options on my HiFive1 board, and I’d like to share the results with you.

I used the freedom-e-sdk checked out on Jan. 10th. I only changed the optimization flags in the Makefiles,
so you should be able to reproduce the results easily.

I put my hypotheses in parens. I welcome your comments.

Thank you.

Dhrystone

CFLAGS := -Os -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac
   text       data        bss        dec        hex    filename
  11672       1076      12296      25044       61d4    dhrystone
Dhrystones per Second:                      38167.9 

CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac
   text       data        bss        dec        hex    filename
  11880       1076      12296      25252       62a4    dhrystone
Dhrystones per Second:                      632911.3 

CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac -falign-functions=4 -falign-jumps=4 -falign-loops=4
   text       data        bss        dec        hex    filename
  11900       1076      12296      25272       62b8    dhrystone
Dhrystones per Second:                      724637.6 

CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32imac -falign-functions=4 -falign-jumps=4 -falign-loops=4 -funroll-loops -finline-functions --param max-inline-insns-auto=20
   text       data        bss        dec        hex    filename
  12052       1076      12296      25424       6350    dhrystone
Dhrystones per Second:                      714285.6 

CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32ima
   text       data        bss        dec        hex    filename
  12612       1076      12296      25984       6580    dhrystone
Dhrystones per Second:                      746268.6 

CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32ima -funroll-loops -finline-functions --param max-inline-insns-auto=20
   text       data        bss        dec        hex    filename
  12028       1076      12296      25400       6338    dhrystone
Dhrystones per Second:                      704225.3 
  • -Os (and also -O) gives roughly 17 times worse performance than -O2 (38167.9 vs. 632911.3). (I cannot explain how this can happen…)
  • The alignment options help RV32IMAC and bring its number close to RV32IMA. (Dhrystone has small basic blocks, so the ratio of branches is high, so the options help…?)
  • Loop unrolling and inlining do not help. (They just raise the I$ miss ratio…?)

CoreMark

CoreMark 1.0 : 728.100114 / GCC6.1.0 -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
   text       data        bss        dec        hex    filename
  56480       2268       2148      60896       ede0    coremark

CoreMark 1.0 : 727.686185 / GCC6.1.0 -march=rv32imac -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
   text       data        bss        dec        hex    filename
  56496       2268       2148      60912       edf0    coremark

CoreMark 1.0 : 726.035167 / GCC6.1.0 -march=rv32imac -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20  / STACK
   text       data        bss        dec        hex    filename
  56220       2268       2148      60636       ecdc    coremark

CoreMark 1.0 : 603.773585 / GCC6.1.0 -march=rv32imac -O2 -fno-common -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
   text       data        bss        dec        hex    filename
  43772       2268       2148      48188       bc3c    coremark

CoreMark 1.0 : 600.656969 / GCC6.1.0 -march=rv32imac -O2 -fno-common  / STACK
   text       data        bss        dec        hex    filename
  43526       2268       2148      47942       bb46    coremark

CoreMark 1.0 : 45.428734 / GCC6.1.0 -march=rv32imac -O -fno-common / STACK
   text       data        bss        dec        hex    filename
  42736       2268       2148      47152       b830    coremark

CoreMark 1.0 : 44.901252 / GCC6.1.0 -march=rv32imac -Os -fno-common / STACK
   text       data        bss        dec        hex    filename
  42288       2268       2148      46704       b670    coremark

CoreMark 1.0 : 401.505646 / GCC6.1.0 -march=rv32ima -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20 -falign-functions=4 -falign-jumps=4 -falign-loops=4 / STACK
   text       data        bss        dec        hex    filename
  61106       2268       2148      65522       fff2    coremark

CoreMark 1.0 : 401.379743 / GCC6.1.0 -march=rv32ima -O2 -fno-common -funroll-loops -finline-functions --param max-inline-insns-auto=20  / STACK
   text       data        bss        dec        hex    filename
  61050       2268       2148      65466       ffba    coremark

CoreMark 1.0 : 629.921260 / GCC6.1.0 -march=rv32ima -O2 -fno-common  / STACK
   text       data        bss        dec        hex    filename
  46758       2268       2148      51174       c7e6    coremark
  • Again, -Os (and also -O) gives very poor performance, about 13 times worse. (Why?)
  • The alignment options do not help RV32IMAC here the way they did for Dhrystone. (CoreMark has longer basic blocks than Dhrystone, so the options do not help…?)
  • Loop unrolling and inlining help RV32IMAC but not RV32IMA. (The code cannot fit in the I$ on RV32IMA…?)

(Bruce Hoult) #11

Very interesting.

I wonder how asking for inlining interacts with then specifying -O or -Os.

One thing to keep in mind is that as soon as you get over 16 KB of code, there starts to be a real possibility that three hot functions scattered in random places will all map to the same place in the cache. Avoiding that is another benefit of inlining, one that is seldom a factor these days on x86, which has had (I believe) 8-way L1 caches since at least Nehalem.

(ARM also typically has only 2 ways in its L1 caches, the same as the E310.)
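
To illustrate the conflict scenario, here is a sketch of how addresses map to sets in a set-associative cache. The 16 KB size and 2 ways come from this thread; the 32-byte line size and the example geometry constant are assumptions for illustration, so check the core's manual before relying on them. With these numbers there are 256 sets, and any two addresses 8 KB apart land in the same set, so three hot lines at that spacing compete for two ways.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint32_t size_bytes;  /* total capacity */
  uint32_t ways;        /* associativity */
  uint32_t line_bytes;  /* line size (assumed, not taken from the thread) */
} cache_geometry;

/* Hypothetical geometry: 16 KB, 2-way, 32-byte lines. */
static const cache_geometry EXAMPLE_ICACHE = { 16384u, 2u, 32u };

/* Number of sets = capacity / (ways * line size). */
static uint32_t cache_num_sets(cache_geometry g)
{
  assert(g.ways > 0 && g.line_bytes > 0);
  return g.size_bytes / (g.ways * g.line_bytes);
}

/* Set index of the cache line that contains addr. */
static uint32_t cache_set_index(cache_geometry g, uint32_t addr)
{
  return (addr / g.line_bytes) % cache_num_sets(g);
}
```

Under these assumptions, code at offsets 0, 8 KB, and 16 KB from the same base shares one set, which is only possible once the code spans more than 16 KB.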


#12

Bruce,

Thank you for your comment.

I also suspected the I-cache first.
But Dhrystone is infamous for its small size, 12KB in this case, so it fits in the 8KB x 2 I-cache.
There must be another reason, since, as you wrote, “… as soon as you get over 16 KB of code…”.

I took a look at the disassembled code but could not find any problem.
If I had an RTL simulation environment, I would want to see a waveform of the 2nd loop iteration, where the I$ should hit.


(Donnie Agema) #13

I am really a dummy here, but don’t both -O and -Os skip any -O2 optimizations which “typically increase code size” and/or those that take “a great deal of compilation time”?


#14

Donnie Agema,

I guess you are referring to the GCC manual. Don’t miss the following sentence:

If you use multiple -O options, with or without level numbers, the last such option is the one that is effective.

From: https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Optimize-Options.html#Optimize-Options

From dhrystone/Makefile

DHRY_CFLAGS := -O2 -DTIME -fno-inline -fno-builtin-printf -Wno-implicit -march=rv32ima
...
CFLAGS := -Os -fno-common
...
$(DHRY_OBJS): %.o: %.c $(HEADERS)
        $(CC) $(CFLAGS) $(DHRY_CFLAGS) -c -o $@ $<

-Os is ignored in this case, because the -O2 in DHRY_CFLAGS comes after it on the command line.
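
The rule quoted from the GCC manual can be modeled in a few lines. This is only a sketch of the "last -O option wins" behavior, not how GCC actually parses its command line; the function name is hypothetical, and any argument starting with -O is treated as an optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the last argument that starts with "-O", or NULL if there is none. */
static const char *effective_opt_flag(const char **args, size_t count)
{
  assert(args != NULL || count == 0);
  const char *last = NULL;
  for (size_t i = 0; i < count; i++) {
    if (strncmp(args[i], "-O", 2) == 0) {
      last = args[i];  /* a later -O option overrides any earlier one */
    }
  }
  return last;
}
```

In the Makefile rule, $(CFLAGS) $(DHRY_CFLAGS) expands to -Os ... -O2 ..., so the flag this function would pick for that order is -O2.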


(Donnie Agema) #15

Just looking at the different file sizes in your performance listing, is it not obvious (or at least a fair speculation) that some “code size increasing” optimizations performed in -O2 are missing from the -O/-Os runs?


(Bruce Hoult) #16

Naturally, though looking at the first two dhrystone results, which differ only in -Os vs -O2, it’s only 208 bytes difference.

We don’t know what percent change that is in the dhrystone code itself, as most of the size will be the standard library/runtime, which won’t have been recompiled or changed in size.


(Donnie Agema) #17

I don’t understand where there are multiple -O options there. I can recognize only one, but again, I’m a dummy here.

Ah! I see it now. Sorry.


#18

Having read your post, I now understand what Donnie meant. Sorry for my pointless answer.

We don’t know what percent change that is in the dhrystone code itself…

I measured the sizes but lost the exact numbers. Dhrystone (dhry_1.o + dhry_2.o) was about 3KB and dhry_printf.o was about 1KB.

BTW, why is dhry_printf.c included in this SDK? CoreMark uses the printf from newlib.


#19

Hi Andrew,

Do you use gcc-6.1.0 from the SiFive riscv-gnu-toolchain?
Which options do you apply? The same as CFLAGS in the Makefile (https://github.com/sifive/freedom-e-sdk/blob/master/software/coremark/Makefile)?

Thanks!


#20

Hi sflin,

Thanks for your Dhrystone and CoreMark results with several compiler options.
What frequency is your Freedom E300 board configured for?
Thanks!