Instructions per Cycle

Hi,
I’m using an E21 Standard Core Trial, programmed into an Arty 100T, and I recently got interested in its performance. Its documentation, the SiFive E21 Manual v19.05, says:

“The pipeline has a peak execution rate of one instruction per clock cycle.”

Although the word peak is used, I was still baffled by the results I got from reading mcycle and minstret.

One example: a memset of 128 bytes took 516 instructions and 210879 cycles (!!!). That means the core is executing 0.0024 instructions per cycle. That must be wrong! I tried scaling the tests up and down, but the result stayed consistent (no more than 0.003 IPC).

And finally, the technical details. I’m using the core’s hardware performance monitoring, which is also described in the manual:

The mcycle CSR holds a count of the number of clock cycles the hart has executed since some arbitrary time in the past. The minstret CSR holds a count of the number of instructions the hart has retired since some arbitrary time in the past.

At a high level I’m doing this:

write_csr(mcycleh, 0);
write_csr(mcycle, 0);
write_csr(minstreth, 0);
write_csr(minstret, 0);
[some stuff...]
num_cycle = read_csr(mcycle);
num_instr = read_csr(minstret);

Which looks to be correctly compiled to:

csrwi   mcycleh,0
csrwi   mcycle,0
csrwi   minstreth,0
csrwi   minstret,0
[some stuff...]
csrr    a0,mcycle
csrr    a1,minstret

Please, am I missing something? What should be the expected result here?

First of all, it’s not normal to write to those CSRs. Usually you’d read the initial value and then subtract it from the final value.

As for the IPC, that figure applies to computation instructions and to loads and stores that hit TIM or cache, while executing instructions that are themselves in a TIM or icache (the icache case generally doesn’t apply on the E21). For example, on a core without an icache it is vitally important for performance to make sure you are running code from the ITIM, not from SPI flash.

What is the exact code you are running in “[some stuff]”, where is the code located, and what memory are you writing to?

Hello Bruce,
Thanks for your response.

I was writing to the CSRs because I tried both ways (reading+subtracting vs resetting) and I saw no difference in the number of instructions/cycles.

But your suggestion on “TIM vs flash” must be the answer. :exploding_head:
Fetching the instructions from SPI might have caused this.

I don’t know how to use one of the TIMs for .text, but it looks like I’ll have to.
Do you know whether I can flash directly into one of the TIMs?
If not, I believe .init must copy .text from SPI to the TIM, just as .data is copied, and then jump into this new .text. Is that correct? Any examples would be very much appreciated. :slight_smile:

Thank you very much

I believe you could simply change the linker script to load .text into the ITIM instead of flash, but bear in mind that the ITIM is volatile, so you’ll lose the code when you power off the board. For production use you’d want to store the code in flash but, as you say, copy it to TIM on startup. You’d need to link it for the final execution address in TIM, but then download to SPI. I don’t know enough about linker options to know if this is easy to do or not. @jimw probably knows.

Hello again,

I have got the TIM initialization working, and it greatly improved performance. Thanks a lot @bruce for your continued support.
Unfortunately, though, it’s still “far” from the expected performance.

Here is an example of increment of a volatile int:

csrr    s3,minstret
csrr    a4,mcycle
lw      a5,8(sp)
addi    a5,a5,1
sw      a5,8(sp)
csrr    s4,minstret
csrr    a0,mcycle

Result when using SPI (.text, .rodata, etc) + TIM (.data, .bss, stack):
Instructions: 5
Cycles: 421084

Result when using TIM for everything (.text, .rodata, .data, .bss, stack):
Instructions: 5
Cycles: 426

So it is 1000 times better!!
But I still expected the numbers of cycles and instructions to be closer together (closer to 1 cycle per instruction).

Is there a “better” test to run?
Anything else that might be causing bottlenecks? Under “Machine Hardware Performance Monitor Event Register” I see that many events can be monitored. Would any of them give me a clue about the problem?

Thank you

What happens if you put your lw/addi/sw in a loop and run it 1000 times?

(if before the test you do lw a6,8(sp);addi a6,a6,1000 then you’d only need a single bne a5,a6 to implement the loop)

Wow! I finally reached the performance summit (kind of)!!
Thank you for the suggestion. When I ran this:

li    t0, 0
li    t1, 1000
csrr  s2, minstret
csrr  s4, mcycle
1:
addi  t0, t0, 1
bne   t0, t1, 1b
csrr  s3, minstret
csrr  s5, mcycle

I got 2002 instructions and 3001 cycles.
For a smaller number of iterations, it got even closer to the 1:1 ratio. :smiley:

Now I want to know what causes the performance to drop.
The first suspect to investigate, given the code above, is branch prediction.
Following with memory performance, cache, etc.

Thanks again @bruce !

edit:
My bad. It was predictable (pun intended) that a loop would not achieve 1:1 performance, given what is clearly stated in section 3.1:

Taken branches and unconditional jumps incur a one-cycle penalty

Yes, the E21 does not have any branch prediction. Instead it has a very short pipeline so that the penalty of flushing the pipeline and fetching the correct instruction after a branch is just one clock cycle.

It will be interesting to see what happens with your program once it includes loads and stores. If the stack is in DTIM then it should be fast (though not 1 IPC).

I’ve added an lw or sw to the loop above. I’m also monitoring events via mhpmevent3, so the instruction count increased by 1.

Running them 1000 times resulted in:

           Instr  Cycle
addi       2003,  3002
addi + lw  3003,  5002
addi + sw  3003,  4002

The load instruction takes one cycle more than the store instruction, and (as I found out by monitoring the events) this happens because of 1000 load-use interlocks. (I need to dig into the details to check whether that can be avoided.)

Finally, when running slightly more sophisticated tests again, the results were good as well. Below are the numbers for memory operations over a 1 KB buffer:

         Instr  Cycle
memset   4106,  5131
memmove  6154,  8203
memcmp   7180,  11278

Thanks @bruce !

That’s all exactly as expected.

I think if you can find another instruction to put between the load and using the result of the load then the interlock should disappear. At least it does on 3 and 5 series cores, which I am more familiar with.

It’s still a huge mystery how your initial results happened.