Hello. I’m trying to get consistent benchmarks of assembly code on the HiFive1 board, but I’m running into some behavior that I can’t find an explanation for. Let me first show a minimal working example:
a.S:
//nevermind what this does exactly
#define doarithmetic(a,b,c,d, t) \
add a, a, b; \
xor d, d, a; \
slli t, d, 16; \
srli d, d, 16; \
xor d, d, t; \
add c, c, d; \
xor b, b, c; \
slli t, b, 12; \
srli b, b, 20; \
xor b, b, t; \
add a, a, b; \
xor d, d, a; \
slli t, d, 8; \
srli d, d, 24; \
xor d, d, t; \
add c, c, d; \
xor b, b, c; \
slli t, b, 7; \
srli b, b, 25; \
xor b, b, t
.globl getcycles
.align 2
getcycles:
csrr a1, mcycleh
csrr a0, mcycle
csrr a2, mcycleh
bne a1, a2, getcycles
ret
.globl somefunction
.align 2
somefunction:
// n repetitions unrolled
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
ret
a.c:
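#include <stdint.h>
#include <stdio.h>

// implemented in a.S
uint64_t getcycles(void);
void somefunction(void);

int main(void) {
    // warm up the instruction cache and branch predictors
    somefunction();
    somefunction();
    somefunction();
    somefunction();
    somefunction();
    uint64_t oldcount = getcycles();
    somefunction();
    uint64_t cyclecount = getcycles() - oldcount;
    printf("%u cycles\n", (unsigned int)cyclecount);
    return 0;
}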
These are the results that I get for n unrolled repetitions of doarithmetic:
2 repetitions: 66 cycles
3 repetitions: 91 cycles
4 repetitions: 875 cycles
5 repetitions: 895 cycles
6 repetitions: 146 cycles
7 repetitions: 171 cycles
8 repetitions: 955 cycles
9 repetitions: 975 cycles
10 repetitions: 226 cycles
For n%4 == 0 or 1, something is causing performance issues. I guess it’s related to alignment and instruction fetches from the QSPI flash. For example, for n=4, I can get rid of this performance penalty by adding 5 NOPs before the RET of somefunction. Alternatively, I can align somefunction to an 8-byte boundary and add 1 NOP. For n=5, 1 NOP seems to be the right value.
Some things I’ve tried:
- These arithmetic instructions have dependencies, but removing them completely does not affect this behavior.
- Removing the conditional branch in getcycles does not affect this behavior.
- The cycle counts are very stable. This was not a one-time effect.
- I disassembled the binaries to compare them, but couldn’t see obvious unaligned calls, returns or instructions.
- It doesn’t really matter where the NOPs are put. It looks like it’s not related to the start address of somefunction or getcycles.
- With .option norvc, I still get the performance penalty for n=4, but not for n=5. However, now there’s one for n=6.
- I filled the instruction cache and trained the branch predictors. This easily fits in the 16 KiB instruction cache.
- Still, I believe there’s an instruction cache miss going on. The cost of the performance penalty depends heavily on SCKDIV and the clock frequency gap between the main core and the QSPI flash controller.
Any suggestions for what causes this and how to avoid it?