Alignment, instruction cache and cycle counts

Hello. I’m trying to get consistent benchmarks of assembly code on the HiFive1 board, and I’ve run into some behavior that I can’t explain. Let me first show a minimal working example:

a.S:

//nevermind what this does exactly
#define doarithmetic(a,b,c,d, t) \
    add     a, a, b;    \
    xor     d, d, a;    \
    slli    t, d, 16;   \
    srli    d, d, 16;   \
    xor     d, d, t;    \
    add     c, c, d;    \
    xor     b, b, c;    \
    slli    t, b, 12;   \
    srli    b, b, 20;   \
    xor     b, b, t;    \
    add     a, a, b;    \
    xor     d, d, a;    \
    slli    t, d, 8;    \
    srli    d, d, 24;   \
    xor     d, d, t;    \
    add     c, c, d;    \
    xor     b, b, c;    \
    slli    t, b, 7;    \
    srli    b, b, 25;   \
    xor     b, b, t

.globl getcycles
.align 2
getcycles:
    csrr a1, mcycleh
    csrr a0, mcycle
    csrr a2, mcycleh
    bne a1, a2, getcycles
    ret

.globl somefunction
.align 2
somefunction:
    // n repetitions unrolled
    doarithmetic(t0,t1,t2,t3,t4)
    doarithmetic(t0,t1,t2,t3,t4)
    doarithmetic(t0,t1,t2,t3,t4)
    doarithmetic(t0,t1,t2,t3,t4)
    doarithmetic(t0,t1,t2,t3,t4)
    ret

a.c:

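#include <stdint.h>
#include <stdio.h>

// both implemented in a.S
uint64_t getcycles(void);
void somefunction(void);

int main(void) {
    somefunction();
    somefunction();
    somefunction();
    somefunction();
    somefunction();
    uint64_t oldcount = getcycles();
    somefunction();
    uint64_t cyclecount = getcycles()-oldcount;
    printf("%u cycles\n", (unsigned int)cyclecount);
    return 0;
}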

These are the results that I get for n unrolled repetitions of doarithmetic:

2 repetitions: 66 cycles
3 repetitions: 91 cycles
4 repetitions: 875 cycles
5 repetitions: 895 cycles
6 repetitions: 146 cycles
7 repetitions: 171 cycles
8 repetitions: 955 cycles
9 repetitions: 975 cycles
10 repetitions: 226 cycles

For n%4 == 0 or 1 something is causing performance issues. I guess it’s related to alignment and instruction fetches from the QSPI flash. For example, for n=4, I can get rid of this performance penalty by adding 5 NOPs before the RET of somefunction. Alternatively, I can align somefunction to an 8-byte boundary and have 1 NOP. For n=5, 1 NOP seems to be the right value.

Some things I’ve tried:

  • These arithmetic instructions have dependencies, but removing them completely does not affect this behavior.
  • Removing the conditional branch in getcycles does not affect this behavior.
  • The cycle counts are very stable. This was not a one-time effect.
  • I disassembled the binaries to compare them, but couldn’t see obvious unaligned calls, returns or instructions.
  • It doesn’t really matter where the NOPs are put. It looks like it’s not related to the start address of somefunction or getcycles.
  • With .option norvc, I still get the performance penalty for n=4, but not for n=5. However, now there’s one for n=6.
  • I filled the instruction cache and trained the branch predictors. This easily fits in the 16 KiB instruction cache.
  • Still, I believe there’s an instruction cache miss going on. The size of the penalty depends heavily on SCKDIV and on the clock frequency gap between the main core and the QSPI flash controller.

Any suggestions for what causes this and how to avoid it?

The core in the HiFive1 is known to have some instruction fetch glitches that have been fixed in later iterations of the E31.

One well-known example is that Dhrystone runs significantly faster with C turned off than with it on. From memory it’s something like the difference between 1.56 and 1.61 DMIPS/MHz, i.e. a bit over 3%.

Something you obviously should be doing: call getcycles() and throw the result away once before the call on which you assign the result to oldcount.
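The warm-up advice above can be sketched in C as follows. This is a minimal sketch, not code from the thread: measure_warm, fake_getcycles and fake_work are hypothetical names, and the 800-cycle first-call penalty in the host stand-in is a made-up number that only exercises the pattern. On the board one would pass the thread's getcycles and somefunction instead.

```c
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);
typedef void (*work_fn)(void);

/* Time a single call of fn(). Both fn and the counter routine are called
   once beforehand, so their code has already been fetched when the timed
   region starts. */
static uint64_t measure_warm(counter_fn now, work_fn fn)
{
    fn();           /* warm up the code under test */
    (void)now();    /* warm up the counter routine; result thrown away */
    uint64_t start = now();
    fn();
    return now() - start;
}

/* Host-side stand-ins (assumptions, only for trying the pattern off-target).
   The first counter read samples the count and then pays a one-off,
   made-up 800-cycle penalty. */
static uint64_t fake_cycles = 0;
static int fake_counter_cold = 1;

static uint64_t fake_getcycles(void)
{
    uint64_t sample = fake_cycles;
    if (fake_counter_cold) {
        fake_cycles += 800;
        fake_counter_cold = 0;
    }
    return sample;
}

static void fake_work(void)
{
    fake_cycles += 100;   /* pretend the measured code costs 100 cycles */
}
```

With the stand-ins, measure_warm(fake_getcycles, fake_work) returns 100, because the one-off penalty is absorbed by the discarded counter read rather than landing inside the timed region.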

Also, for such a short test (obviously much shorter than the 2^32/256e6 ≈ 16.8 seconds rollover time for mcycle) there’s no need to do the mcycleh dance. Just get 32-bit counts and subtract them and the answer will be correct whether it’s rolled over or not. There’s no possibility it’s rolled over twice :)
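A small C sketch of why the 32-bit subtraction is safe, assuming the two samples are taken less than 2^32 cycles apart. The helper name elapsed32 is hypothetical, not something from the thread.

```c
#include <stdint.h>

/* Elapsed cycles from two 32-bit samples of mcycle. Unsigned subtraction
   is performed modulo 2^32, so a single rollover between the two samples
   cancels out. */
static uint32_t elapsed32(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For example, a start sample of 0xFFFFFFF0 and an end sample of 0x00000010 give 0x20 (32 cycles), even though the counter wrapped in between.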

The first suggestion is important. The second one isn’t.

Thanks for the quick reply.

Indeed, checking mcycleh wasn’t really necessary for this, but the idea is to apply the same thing to more code and then I might as well read cycle counts properly immediately. The overhead is not going to be significant anyway.
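For reference, the high/low/high read that getcycles performs in a.S can be written in C roughly like this. It is only a sketch: read_counter64, sim_hi, sim_lo and sim_counter are hypothetical names, and the simulated counter (which advances by one on every read) is a host-side stand-in for the real CSRs.

```c
#include <stdint.h>

typedef uint32_t (*read32_fn)(void);

/* Read a 64-bit counter that is only accessible as two 32-bit halves.
   If the high half changes between the two high reads, the low half may
   belong to either value, so the whole read is retried. */
static uint64_t read_counter64(read32_fn read_hi, read32_fn read_lo)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi();
        lo  = read_lo();
        hi2 = read_hi();
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}

/* Host-side stand-in (assumption): starts one tick before the low word
   rolls over and advances by one on every read. */
static uint64_t sim_counter = 0xFFFFFFFFull;

static uint32_t sim_hi(void) { return (uint32_t)(sim_counter++ >> 32); }
static uint32_t sim_lo(void) { return (uint32_t)(sim_counter++); }
```

With this stand-in, the first pass sees the high half change from 0 to 1 and retries, and read_counter64(sim_hi, sim_lo) returns 0x100000003. A single high-then-low read from the same starting state would have returned 0.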

And yes, I definitely should have called getcycles() before. That was stupid. With that added, I actually cannot reproduce this behavior anymore, so let’s call that a fix. :)

Is there more information somewhere on these instruction fetch glitches that have been fixed?

I don’t think so. Maybe it should be in the errata for that SoC, but it doesn’t cause any wrong results or even serious performance problems.

OK well thanks anyway!