Difficulty obtaining metrics on the Freedom E300 Arty

Good morning,

I’m attempting to obtain total cycle counts for a few functions. To do so, I’ve been reading the mcycle CSR before a function call, reading the mcycle CSR after a function call, and calculating the difference between the two values.

The resulting cycle counts were greater than I expected. Therefore, I decided to try this method with a simple function:

#include <stdint.h>

volatile uint32_t a;
volatile uint32_t b;
volatile uint32_t time5;

int main(void)
{
    for(;;) {
        asm volatile ("csrr %0, mcycle" : "=r" (a));
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("nop");
        asm volatile ("csrr %0, mcycle" : "=r" (b));
        time5 = b - a;
        a = 0;
        b = 0;
        asm volatile ("nop");
    }

    return 0;
}

Assuming each nop takes one cycle to execute, this should yield time5 as 20 (plus the number of cycles to read mcycle). Rather than 20, the value yielded was 37.

I also repeated this method with 20000 nop statements (no loops). The results showed that the value in time5 was 2671265, not 20000.

I checked the validity of the cycle counter by replacing the timing and mcycle read statements with the following statements:

(*(volatile uint32_t *) (((0x10012000UL)) + ((0x0C)))) |=  (0x1 << 16) ;
(*(volatile uint32_t *) (((0x10012000UL)) + ((0x0C)))) &=  ~((0x1 << 16)) ; 

These statements pull pin 0 (marked IO0 on the board) high and low, respectively. With 20000 nop statements between them, I used an external oscilloscope to measure the length of each high pulse and found it to be approximately 41 milliseconds. In other words, 20000 nop instructions took 41 milliseconds to execute. Multiplying the inverse of this value, 24.4 Hz, by 2671265 cycles gives 65.2 MHz, which is approximately the expected clock speed.
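
The sanity check above can be written out as a small host-side calculation. This is only a sketch: the function name `estimate_clock_hz` is made up for illustration, and the pulse width is passed in microseconds so the arithmetic stays in integers.

```c
#include <stdint.h>

/* Estimate the core clock in Hz from an mcycle delta and the pulse width
 * measured on the oscilloscope (given in microseconds).
 * The name and signature are illustrative, not part of any SDK. */
static uint64_t estimate_clock_hz(uint64_t cycles, uint64_t pulse_us)
{
    /* cycles / (pulse_us * 1e-6 s) == cycles * 1e6 / pulse_us */
    return (cycles * 1000000ULL) / pulse_us;
}
```

Calling `estimate_clock_hz(2671265, 41000)` with the numbers from this post gives a little over 65.15 MHz, which matches the roughly 65.2 MHz figure above once the 24.4 Hz rounding is taken into account.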

As a result, I believe mcycle is returning valid cycle counts. Therefore, it would appear that something is wasting clock cycles, although I’m unsure what it could be.

Other things I tried:

  • Switched optimization levels from -O0 to -O3

  • Used gdb to step by instruction (via stepi)

  • Set a watchpoint in gdb on the $pc (program counter) register. It did not jump to any interrupts or traps.

  • Set breakpoints on functions within the following files: init.c, syscall.c, drivers_sifive/plic.c. None were tripped while within the main function.

I wonder if a consistent and repeatable pattern in the growth of wasted clock cycles can be observed when going from 20 nops to 200 nops, 2000 nops and 20000 nops. If so, that might give a clue as to what to consider next.

I like Donnie’s suggestion of trying more intermediate data points.

20 NOPs should take about 20 cycles, plus, as you said, the CSR reads and pipeline depth. So 37 cycles could be reasonable for 20 NOPs, but the overhead should remain constant as you increase the number of NOPs.

That being said, 20,000 NOP instructions will take at least 40kB (if they are compressed), which means they won’t fit in the FE310’s 16kB I-Cache. So the long delay you are seeing includes the time needed to reload the cache.
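
To make the size argument concrete, here is a rough sketch of the footprint arithmetic. It assumes a compressed nop occupies 2 bytes and an uncompressed one 4 bytes, takes the 16kB I-Cache figure from this thread, and ignores the rest of the loop body (the CSR reads and stores), so the real limit is slightly lower. The helper names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define ICACHE_BYTES (16u * 1024u)  /* 16kB I-Cache, as quoted in the thread */

/* Bytes occupied by a straight run of nops: 2 bytes each when the
 * compressed encoding is used, 4 bytes each otherwise. */
static uint32_t nop_footprint_bytes(uint32_t nops, bool compressed)
{
    return nops * (compressed ? 2u : 4u);
}

/* True if the nop run alone would fit in the instruction cache. */
static bool fits_in_icache(uint32_t nops, bool compressed)
{
    return nop_footprint_bytes(nops, compressed) <= ICACHE_BYTES;
}
```

Under these assumptions, 20000 compressed nops need 40000 bytes and do not fit, while 2000 nops fit comfortably in either encoding. The crossover would sit near 8192 compressed or 4096 uncompressed nops.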

My original post was slightly incorrect. Those cycle counts included the toggling of the pin. Without the pin toggling, I get the following values:

  • 20 nop: 24 cycles
  • 200 nop: 204 cycles
  • 2000 nop: 2004 cycles
  • 20000 nop: 2655824 cycles
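
A quick way to read these numbers is to subtract the nop count from each measurement and, for the large case, look at the average cost per nop. The sketch below does only that arithmetic on the host; the function names are invented for illustration.

```c
#include <stdint.h>

/* Cycles not accounted for by the nops themselves, assuming one cycle
 * per nop. Expects cycles >= nops. */
static uint32_t extra_cycles(uint32_t nops, uint32_t cycles)
{
    return cycles - nops;
}

/* Average cycles per nop, scaled by 100 to avoid floating point
 * (e.g. 150 means 1.50 cycles per nop). */
static uint64_t cycles_per_nop_x100(uint32_t nops, uint32_t cycles)
{
    return ((uint64_t)cycles * 100u) / nops;
}
```

With the values listed above, the first three cases each show a constant 4 extra cycles, while the 20000 nop case shows 2635824 extra cycles, or roughly 132.79 cycles per nop on average.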

So for small numbers of nop instructions, the number of cycles makes sense. I’m going to investigate further and try to determine the point between 2000 and 20000 nops where the number of cycles starts to increase drastically.

Ah, so it’s the delay to fetch more instructions? Interesting. Would the best way to count cycles then be to just measure in smaller instruction increments?

It’s because you are running straight through the instruction memory, so you miss every time and have to go read the code from SPI Flash (the hardware does this). And by the time you get to the beginning of the loop, you’ve “evicted” the old code so you have to do it again. In your shorter loops, the actual code fits in a much smaller address space, so it all fits in the cache.
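
The eviction effect described above can be illustrated with a toy model that runs on a host machine. This is a simplified sketch, not a model of the actual FE310 hardware: it assumes a direct-mapped cache and a 32-byte line, both of which are guesses made for illustration, and it only counts line misses when code is fetched sequentially from address 0 on each pass of the loop.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_BYTES 16384u                    /* 16kB, as quoted in the thread */
#define LINE_BYTES  32u                       /* assumed line size */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

/* Simulate `passes` sequential passes over `code_bytes` of straight-line
 * code through a direct-mapped cache and return the number of line
 * misses seen in the final pass. */
static uint32_t misses_in_last_pass(uint32_t code_bytes, uint32_t passes)
{
    uint32_t tags[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    uint32_t misses = 0;

    memset(tags, 0, sizeof tags);
    memset(valid, 0, sizeof valid);

    for (uint32_t p = 0; p < passes; p++) {
        misses = 0;
        for (uint32_t addr = 0; addr < code_bytes; addr += LINE_BYTES) {
            uint32_t line = addr / LINE_BYTES;  /* global line number */
            uint32_t idx  = line % NUM_LINES;   /* cache slot it maps to */
            if (!valid[idx] || tags[idx] != line) {
                misses++;                       /* fetch from backing memory */
                valid[idx] = 1;
                tags[idx]  = line;
            }
        }
    }
    return misses;
}
```

In this model a 4000-byte loop body misses on all 125 of its lines in the first pass and on none afterwards, while a 40000-byte body misses on all 1250 lines in every pass, because each line has been overwritten by the time the loop comes back around to it.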

So the best way to count cycles is the way you’ve done it, but this shows the importance of understanding caching effects.
