Parameters increase benchmarking time

I have been trying to benchmark some assembly that I am writing. To do this, I first call the function multiple times to fill the instruction cache with its instructions. Then I get the cycle count using the following assembly: csrr a0, mcycle; ret.

Now here comes the weird part. If I declare my assembly as an external function in C without parameters, it takes approximately as long as expected (around 60 cycles). However, if I add parameters, it suddenly takes more than 2000 cycles. Comparing the objdump output of the two ELF files doesn't show anything worrying. The only difference is that the version with parameters loads values from the stack.

To be clear, I have included both versions below: the code that runs in the expected time and the code that takes longer.
This takes the expected 60 cycles:

  #include <stdint.h>
  #include <stdio.h>
  
  extern uint32_t getcycles();
  extern uint32_t dosomething();
  
  int main() {
    uint32_t oldcount, newcount, x;
    unsigned char a = 10;
    unsigned char b = 50;
    uint32_t l;
    getcycles();
    dosomething();
    getcycles();
    dosomething();
    getcycles();
    dosomething();
    getcycles();
    dosomething();
    getcycles();
    dosomething();
    getcycles();
    dosomething();
    oldcount = getcycles();
    l = dosomething();
    newcount = getcycles();
    printf("This took %u cycles\n",newcount-oldcount);
    return 0;
  }

And without changing the assembly this takes more than 2000 cycles:

  #include <stdint.h>
  #include <stdio.h>
  
  extern uint32_t getcycles();
  extern uint32_t dosomething(unsigned char a, unsigned char b);
  
  int main() {
    uint32_t oldcount, newcount, x;
    unsigned char a = 10;
    unsigned char b = 50;
    uint32_t l;
    getcycles();
    dosomething(a,b);
    getcycles();
    dosomething(a,b);
    getcycles();
    dosomething(a,b);
    getcycles();
    dosomething(a,b);
    getcycles();
    dosomething(a,b);
    getcycles();
    dosomething(a,b);
    oldcount = getcycles();
    l = dosomething(a,b);
    newcount = getcycles();
    printf("This took %u cycles\n",newcount-oldcount);
    return 0;
  }

Does anybody have any idea why this happens?

Kind regards,

mortalAmongstGods (mag)

When you do your final …

oldcount = getcycles();
l = dosomething(a,b);
newcount = getcycles();

… that is code you have never run before (28 bytes, I think, if no compressed "C" extension instructions are used). It is not in the instruction cache unless you got lucky and it sits entirely in the same 64-byte cache line as the preceding "dosomething(a,b);". If it does not, a new icache line has to be loaded from SPI flash at some point during the execution of those three lines, and with standard settings that takes something close to 1 µs per word. If you delete the lines of code before it one by one, you'll find a point at which it suddenly runs fast.

The solution? Put those three lines of code in a function and call that function to do your warm-up (once should be enough) and then call it again for the actual measurement.

Don’t run any new code in your actual measured test run – and that includes the driver/measurement code in main().
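
A minimal sketch of that suggestion, assuming the same getcycles() and dosomething(a, b) routines as in the question. The names measure(), benchmark() and the ON_TARGET macro are hypothetical and only there for illustration. Without ON_TARGET, simple host stand-ins replace the assembly so the sketch compiles anywhere. The noinline attribute is GCC/Clang specific and is there so the warm-up call and the measured call execute the same instructions at the same addresses.

```c
#include <stdint.h>
#include <stdio.h>

#ifdef ON_TARGET  /* hypothetical macro: define it when linking the real assembly */
extern uint32_t getcycles(void);
extern uint32_t dosomething(unsigned char a, unsigned char b);
#else             /* host stand-ins (assumptions) so the sketch builds off-target */
static uint32_t fake_counter;
static uint32_t getcycles(void) { return fake_counter += 60; }
static uint32_t dosomething(unsigned char a, unsigned char b) { return (uint32_t)a + b; }
#endif

/* The three lines being timed live in one function. noinline stops the
   compiler from duplicating them at each call site, which would put the
   measured copy at a different address from the warmed-up copy. */
__attribute__((noinline))
uint32_t measure(unsigned char a, unsigned char b, uint32_t *result) {
    uint32_t oldcount = getcycles();
    *result = dosomething(a, b);
    uint32_t newcount = getcycles();
    return newcount - oldcount;
}

/* Warm-up call first, then the call whose cycle count is reported. */
uint32_t benchmark(void) {
    uint32_t l;
    (void)measure(10, 50, &l);   /* warm-up: pulls measure() and its callees into the icache */
    return measure(10, 50, &l);  /* measured run: no code that has not run before */
}
```

From main(), this would be used as printf("This took %u cycles\n", (unsigned)benchmark());. Note that the printf call itself stays outside the timed region, so it does not matter whether it has run before.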


I just tested the solution you suggested and it worked perfectly. Thank you for your help.