Poor Dhrystone performance

jkilbride · June 5, 2017, 4:23pm

Yes, I appreciate that. But since I have cloned the “standard” code from GitHub, built the code using the provided source and make files and reflashed the board, I was expecting to get the same results as everyone else.

Am I missing something here?

mwachs5 · June 5, 2017, 4:25pm

You should be able to use everything “out of the box”. We’re re-testing the current versions on our end to see if something got out of sync.

Also can you clarify if you are using HiFive1 or one of the Arty dev images?

jkilbride · June 5, 2017, 4:26pm

Thanks, I am using HiFive1

dagema · June 5, 2017, 5:13pm

I think what sflin meant was to use -O2, not remove it. With -O2 you should see better performance than with -Os.

jkilbride · June 6, 2017, 7:36am

Hi Donnie, sorry, badly worded message on my part. I tried various combinations of these flags with no real difference. I note that Megan has offered to try an out of the box test.

mwachs5 · June 6, 2017, 8:35am

Hi @jkilbride, we are seeing the same timing as you. It’s likely related to the GCC version bump; we will update this thread once we root-cause the issue.

brucehoult · June 6, 2017, 8:50am

If the clock speed is correct, then I think a 16x slowdown like this can only come from data loads from flash.

It’s the real “glass jaw” of this design with no data cache.

zhenbohu · June 18, 2017, 12:33pm

Hi, Megan

I noticed that a new version of Freedom Studio has been released, and that the GCC version has been bumped as well. Is the poor performance issue resolved in this new version?

Thanks
Bob

DrewatSiFive · June 23, 2017, 6:56pm

Hi zhenbohu,

The new release of GCC does not fix Dhrystone performance. I will update this thread when we have more information. Thanks for your patience.

brucehoult · June 23, 2017, 7:56pm

Poor Dhrystone performance of that magnitude (16x slowdown) is almost certainly because the more recent compiler has optimised something into a read-only memory section that used to (with the same C source code) be copied into a read/write section on startup.

That is a good thing on almost every computer, but on the HiFive1 with no data cache and slow SPI flash for program memory it is a very bad thing.

Once this optimisation is performed, I would not expect a newer version of gcc to reverse it.
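
To make the mechanism concrete, here is a minimal C sketch. It is not taken from the Dhrystone source, and the names (`table_rodata`, `table_data`, `sum_table`) are made up for illustration. It assumes a typical embedded setup where a `const` object with static storage is emitted into `.rodata`, which the linker script leaves in SPI flash, while an initialised non-const object is emitted into `.data`, which startup code copies into RAM. The actual placement depends on the compiler and linker script, so treat this as an assumption rather than a statement of what GCC did here.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_LEN 4

/* const + static storage: normally emitted into .rodata.
 * Assumption: the linker script leaves .rodata in SPI flash. */
static const int table_rodata[TABLE_LEN] = {1, 2, 3, 4};

/* Same contents without const: normally emitted into .data,
 * which startup code copies from flash into RAM. */
static int table_data[TABLE_LEN] = {1, 2, 3, 4};

/* Sum the first n entries of a table; every element read is a data load. */
static int sum_table(const int *table, size_t n)
{
    assert(n <= TABLE_LEN);
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += table[i];
    return sum;
}

/* Loads come from wherever .rodata lives (flash, under the assumption above). */
int sum_from_rodata(void) { return sum_table(table_rodata, TABLE_LEN); }

/* Loads come from RAM, after the startup copy of .data. */
int sum_from_data(void) { return sum_table(table_data, TABLE_LEN); }
```

Both functions return the same value; the only difference is which memory the loads hit. On a part with no data cache, that difference is the whole story.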

zhenbohu · June 25, 2017, 1:29pm

Hi, Drew

Thanks very much for your reply. FYI, I just reported this issue on GitHub: https://github.com/riscv/riscv-gnu-toolchain/issues/249. I hope it can be resolved as soon as possible.

Let me copy my text here to share my discovery:

"
Hi,

I was using a RISC-V GCC toolchain built several months ago (based on GCC 6.1.0). Recently I upgraded my database to the latest toolchain build (based on GCC 7.1.0). Unfortunately, after switching to the new version, my Dhrystone benchmark number dropped considerably (from around 1.3 DMIPS/MHz to 1.0 DMIPS/MHz, a gap of about 30%, which is really big).

Since I am a hardware guy and not a compiler expert, I cannot identify the root cause of this degradation. But I built the ELF with both versions and diffed their dump files, and I found some obvious defects in the code generated by the new toolchain (7.1.0). Please see the snippets below (the same function generated by the two toolchain versions):

Code generated by the old toolchain (GCC 6.1.0), which has better performance:
800007de <Proc_3>:
800007de: 10000617 auipc a2,0x10000
800007e2: c9262603 lw a2,-878(a2) # 90000470 <Ptr_Glob>
800007e6: c619 beqz a2,800007f4 <Proc_3+0x16>
800007e8: 421c lw a5,0(a2)
800007ea: c11c sw a5,0(a0)
800007ec: 10000617 auipc a2,0x10000
800007f0: c8462603 lw a2,-892(a2) # 90000470 <Ptr_Glob>
800007f4: 0631 addi a2,a2,12
800007f6: 10000597 auipc a1,0x10000
800007fa: c6e5a583 lw a1,-914(a1) # 90000464 <Int_Glob>
800007fe: 4529 li a0,10
80000800: a201 j 80000900 <Proc_7>

Code generated by the new toolchain (GCC 7.1.0), which has much worse performance:
80000746 <Proc_3>:
80000746: 10000797 auipc a5,0x10000
8000074a: d2a78793 addi a5,a5,-726 # 90000470 <Ptr_Glob>
8000074e: 4390 lw a2,0(a5)
80000750: c601 beqz a2,80000758 <Proc_3+0x12>
80000752: 4218 lw a4,0(a2)
80000754: c118 sw a4,0(a0)
80000756: 4390 lw a2,0(a5)
80000758: 10000797 auipc a5,0x10000
8000075c: d0c78793 addi a5,a5,-756 # 90000464 <Int_Glob>
80000760: 438c lw a1,0(a5)
80000762: 0631 addi a2,a2,12
80000764: 4529 li a0,10
80000766: a86d j 80000820 <Proc_7>

We can see very obvious defects in the GCC 7.1.0 generated code, summarized below:

*** Problem (1): it uses three instructions instead of two to load a word from an address. In the code below, the address is first built with AUIPC and ADDI, and then LW is used with register a5 plus a zero offset. This pattern appears throughout the entire dhrystone.dump file, at very high frequency. By contrast, it does not appear in the GCC 6.1.0 generated code. I guess this is one of the main causes of the poor performance.
80000746: 10000797 auipc a5,0x10000
8000074a: d2a78793 addi a5,a5,-726 # 90000470 <Ptr_Glob>
8000074e: 4390 lw a2,0(a5)
…
80000758: 10000797 auipc a5,0x10000
8000075c: d0c78793 addi a5,a5,-756 # 90000464 <Int_Glob>
80000760: 438c lw a1,0(a5)

*** Problem (2): a redundant instruction is inserted, see the snippet below. This instruction is obviously not needed there, but it was inserted for no apparent reason. By contrast, this redundant instruction does not appear in the GCC 6.1.0 generated code. I guess this is another cause of the poor performance.
80000756: 4390 lw a2,0(a5)

Since the data I have shows a very significant performance degradation, I don't think this is a minor issue. Could you help to identify and resolve it? I am not sure if I reported this issue in the right place.

”

Thanks
Bob
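
As a cross-check on the two dumps above, the address arithmetic can be written out in C. This is only a sketch: `pcrel_target` is a hypothetical helper, not part of any toolchain. It relies on the fact that `auipc rd, imm` adds `imm << 12` to the address of the `auipc` itself, and that the following `lw` or `addi` contributes a signed 12-bit offset. Plugging in the values printed in the dumps shows that the GCC 6.1.0 sequence (`auipc` + `lw` with the offset folded in) and the GCC 7.1.0 sequence (`auipc` + `addi`, then `lw` with offset 0) reach the same addresses for `Ptr_Glob` and `Int_Glob`.

```c
#include <assert.h>
#include <stdint.h>

/* Address formed by an auipc followed by a 12-bit signed offset
 * (carried either by an addi or by the lw itself).
 *   pc   : address of the auipc instruction
 *   hi20 : auipc immediate as printed in the dump (e.g. 0x10000)
 *   lo12 : signed offset as printed in the dump (e.g. -878)
 * Hypothetical helper, for illustration only. */
uint32_t pcrel_target(uint32_t pc, uint32_t hi20, int32_t lo12)
{
    assert(lo12 >= -2048 && lo12 <= 2047);  /* must fit a 12-bit signed field */
    return pc + (hi20 << 12) + (uint32_t)lo12;
}
```

For example, `pcrel_target(0x800007de, 0x10000, -878)` (GCC 6.1.0) and `pcrel_target(0x80000746, 0x10000, -726)` (GCC 7.1.0) both give 0x90000470, the address of `Ptr_Glob`.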

DrewatSiFive · November 20, 2017, 4:52pm

Hello All,

We have recently made a few updates to Freedom-E-SDK to address the Dhrystone performance issues described in this thread.

The 3 major updates to improve Dhrystone were as follows:

1. Freedom-E-SDK was updated to use newer versions of the RISC-V Toolchain. Among other improvements, the implementation of memcpy has been improved.
2. We created a new linker file for Dhrystone. Specifically, this linker file moves the read-only data from external SPI flash to the DTIM. This was the major issue resulting in extremely poor Dhrystone results.
3. The Dhrystone compilation flags have been updated.

With the updates we are back up around 1.55DMIPS/MHz on the HiFive1 with the possibility of a few more improvements down the road.

Details of the updates can be found in Palmer’s Pull Request:

After updating your Freedom-E-SDK repository to the latest version, including the latest version of the toolchain submodules, compile Dhrystone as follows:
> make software BOARD=freedom-e300-hifive1 PROGRAM=dhrystone LINK_TARGET=dhrystone
> make upload BOARD=freedom-e300-hifive1 PROGRAM=dhrystone

Let us know if you have any feedback or questions.

brucehoult · November 20, 2017, 6:06pm

Nailed it (msgs 38, 47, 50)

On the HiFive1 this should be done wherever possible, perhaps even by default, as it is on AVR Arduinos.

It would be better for beginner’s programs and for small quickly ported programs such as benchmarks.

If the user runs out of SRAM as a result then maybe gcc could be taught to recognise the same “progmem” keyword as used on AVR, and put such constants in a different section?

(of course, unlike on the AVR, that doesn’t make you need different instructions to access it)

bruce · May 20, 2018, 4:33am

Just an update.

With the current contents of freedom-e-sdk and a HiFive1 I get 704225.3, which is 400.8 VAX MIPS or 1.57 DMIPS/MHz using:

make software PROGRAM=dhrystone LINK_TARGET=dhrystone

With the following to disable the C extension …

make software PROGRAM=dhrystone LINK_TARGET=dhrystone RISCV_ARCH=rv32im

… I get mostly 751879.6 which is 427.93 VAX MIPS or 1.67 DMIPS/MHz. The result is not entirely stable with 100000000 iterations, and I get 746268.6 almost as often, which is 424.74 VAX MIPS or 1.66 DMIPS/MHz.

Note: I tried to use rv32ima but it fails in linking. It seems not to be one of the newlib multilib variants that is built.
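
For anyone wanting to reproduce the conversions above, here is a small C sketch. The divisor 1757 is the conventional Dhrystones/second figure for the VAX 11/780 (the 1 DMIPS reference). The clock frequency is an assumption: it is not stated in this post, but the quoted figures imply something close to 256 MHz (400.8 / 1.57 is about 255). The function names are made up for illustration.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, the conventional 1 DMIPS reference. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a raw Dhrystones/second score to VAX MIPS (DMIPS). */
double vax_mips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalise to DMIPS/MHz.
 * clock_mhz is an assumption supplied by the caller; the post above
 * does not state the core clock. */
double dmips_per_mhz(double dhrystones_per_sec, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return vax_mips(dhrystones_per_sec) / clock_mhz;
}
```

With an assumed 256 MHz clock, `dmips_per_mhz(704225.3, 256.0)` comes out at roughly 1.57 and `dmips_per_mhz(751879.6, 256.0)` at roughly 1.67, in line with the numbers quoted.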
