Hello. I’m trying to get consistent benchmarks of assembly code on the HiFive1 board, and I’ve run into some behavior that I can’t explain. Let me first show a minimal working example:

a.S:

```
//nevermind what this does exactly
#define doarithmetic(a,b,c,d, t) \
add a, a, b; \
xor d, d, a; \
slli t, d, 16; \
srli d, d, 16; \
xor d, d, t; \
add c, c, d; \
xor b, b, c; \
slli t, b, 12; \
srli b, b, 20; \
xor b, b, t; \
add a, a, b; \
xor d, d, a; \
slli t, d, 8; \
srli d, d, 24; \
xor d, d, t; \
add c, c, d; \
xor b, b, c; \
slli t, b, 7; \
srli b, b, 25; \
xor b, b, t
.globl getcycles
.align 2
getcycles:
csrr a1, mcycleh
csrr a0, mcycle
csrr a2, mcycleh
bne a1, a2, getcycles
ret
.globl somefunction
.align 2
somefunction:
// n repetitions unrolled
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
doarithmetic(t0,t1,t2,t3,t4)
ret
```
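For readers less familiar with the pattern in `getcycles`: it reads the high half, the low half, then the high half again, and retries if the high half changed in between. A minimal C sketch of the same idea follows. The callback type, `read_counter64`, and the simulated counter are hypothetical names added for illustration, so the logic can be tried off-target; on the board the two reads are the `csrr` instructions above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback type: returns one 32-bit half of a 64-bit counter. */
typedef uint32_t (*read_half_fn)(void);

/* Read a 64-bit counter exposed as two 32-bit halves. If the high half
 * changed between the two high reads, the low half wrapped in between,
 * so the sample is discarded and the read is retried. */
uint64_t read_counter64(read_half_fn read_hi, read_half_fn read_lo) {
    assert(read_hi != 0 && read_lo != 0);
    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi();   /* csrr a1, mcycleh */
        lo  = read_lo();   /* csrr a0, mcycle  */
        hi2 = read_hi();   /* csrr a2, mcycleh */
    } while (hi != hi2);   /* bne a1, a2, getcycles */
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter for host-side experiments (an assumption, not part of
 * the original code): it advances by one tick on every read. */
uint64_t sim_counter;

uint32_t sim_read_hi(void) {
    uint32_t hi = (uint32_t)(sim_counter >> 32);
    sim_counter++;
    return hi;
}

uint32_t sim_read_lo(void) {
    uint32_t lo = (uint32_t)sim_counter;
    sim_counter++;
    return lo;
}
```

Starting the simulated counter just below a 32-bit wrap exercises the retry path; starting it anywhere else returns after a single pass.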

a.c:

```
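#include <stdint.h>
#include <stdio.h>

// implemented in a.S
uint64_t getcycles(void);
void somefunction(void);

int main(void) {
    // warm up the instruction cache and branch predictors
    somefunction();
    somefunction();
    somefunction();
    somefunction();
    somefunction();
    uint64_t oldcount = getcycles();
    somefunction();
    uint64_t cyclecount = getcycles() - oldcount;
    printf("%u cycles\n", (unsigned int)cyclecount);
    return 0;
}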
```

These are the results that I get for n unrolled repetitions of `doarithmetic`:

2 repetitions: 66 cycles

3 repetitions: 91 cycles

4 repetitions: 875 cycles

5 repetitions: 895 cycles

6 repetitions: 146 cycles

7 repetitions: 171 cycles

8 repetitions: 955 cycles

9 repetitions: 975 cycles

10 repetitions: 226 cycles
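One way to read these numbers, as a rough sketch: comparing measurements that are four repetitions apart gives the same slope of 20 cycles per repetition in both the fast and the slow cases, which matches the 20 instructions in `doarithmetic`. That suggests the slowdown behaves like a fixed offset rather than a per-instruction cost. The struct and function names below are made up for illustration; the data is just the list above.

```c
#include <assert.h>

/* One measurement: n unrolled repetitions took this many cycles. */
struct sample { int n; int cycles; };

/* The measurements listed above, in order of n. */
const struct sample samples[] = {
    {2, 66}, {3, 91}, {4, 875}, {5, 895}, {6, 146},
    {7, 171}, {8, 955}, {9, 975}, {10, 226},
};

/* Cycles per additional repetition between two measurements. */
int cycles_per_rep(struct sample a, struct sample b) {
    assert(a.n != b.n);
    return (b.cycles - a.cycles) / (b.n - a.n);
}

/* Change in cycle count from samples[i] to samples[i + 1]. */
int step(int i) {
    return samples[i + 1].cycles - samples[i].cycles;
}
```

For instance, `cycles_per_rep(samples[0], samples[4])` compares n=2 with n=6, and `step(1)` gives the jump from n=3 to n=4.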

For `n%4 == 0` or `1`, something is causing performance issues. I guess it’s related to alignment and instruction fetches from the QSPI flash. For example, for `n=4`, I can get rid of this performance penalty by adding 5 NOPs before the RET of `somefunction`. Alternatively, I can align `somefunction` to an 8-byte boundary and have 1 NOP. For `n=5`, 1 NOP seems to be the right value.

Some things I’ve tried:

- These arithmetic instructions have dependencies, but removing them completely does not affect this behavior.
- Removing the conditional branch in `getcycles` does not affect this behavior.
- The cycle counts are very stable. This was not a one-time effect.
- I disassembled the binaries to compare them, but couldn’t see obvious unaligned calls, returns or instructions.
- It doesn’t really matter where the NOPs are put. It looks like it’s not related to the start address of `somefunction` or `getcycles`.
- With `.option norvc`, I still get the performance penalty for `n=4`, but not for `n=5`. However, now there’s one for `n=6`.
- I filled the instruction cache and trained the branch predictors. This easily fits in the 16 KiB instruction cache.
- Still I believe there’s an instruction cache miss going on. The cost of the performance penalty depends heavily on `SCKDIV` and the clock frequency gap between the main core and the QSPI flash controller.
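To make the measurement procedure explicit, here is a hedged sketch of a harness that does the warm-up calls and then times several trials, reporting the minimum and maximum. `bench`, the callback types, and the fake clock are hypothetical names for illustration; on the board the callbacks would be `getcycles` and `somefunction`. As in `a.c`, the elapsed time includes the overhead of the timer read itself.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*clock_fn)(void);  /* e.g. getcycles */
typedef void (*work_fn)(void);       /* e.g. somefunction */

struct bench_result { uint64_t min, max; };

/* Run `warmup` untimed calls, then `trials` timed calls of `work`. */
struct bench_result bench(clock_fn now, work_fn work, int warmup, int trials) {
    assert(now != 0 && work != 0 && trials > 0);
    struct bench_result r = { UINT64_MAX, 0 };
    for (int i = 0; i < warmup; i++) {
        work();  /* fill the instruction cache, train the predictors */
    }
    for (int i = 0; i < trials; i++) {
        uint64_t start = now();
        work();
        uint64_t elapsed = now() - start;
        if (elapsed < r.min) r.min = elapsed;
        if (elapsed > r.max) r.max = elapsed;
    }
    return r;
}

/* Host-side stand-ins (assumptions): a clock that only advances when the
 * fake workload runs, by a fixed 66 ticks per call. */
uint64_t fake_time;
uint64_t fake_clock(void) { return fake_time; }
void fake_work(void) { fake_time += 66; }
```

If the minimum and maximum agree across trials, as they do for the stable counts described above, the effect is systematic rather than measurement noise.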

Any suggestions for what causes this and how to avoid it?