Timer resolution

Hi,
We tried nanosleep() with a 1 ns delay on a HiFive Unleashed FU540 and measured an actual delay of approximately 100-130 µs.
1) What could be the cause of this delay?
2) What is the lowest clock resolution that can be achieved?
3) What is the lowest sleep time that can be set?

Thanks!

Hi Pankaj Joshi,

The CPU clock frequency on your HiFive Unleashed FU540 chip is probably close to 1GHz. 1GHz is 10^9 Hz. So the cycle period, being the reciprocal of the clock rate, is 1/10^9 seconds, or 1 ns. The minimum time period that you’re seeking to delay for, then, is probably roughly identical to a single CPU clock cycle. That means that using a library call like nanosleep() for this purpose – particularly one that will call into the kernel, and most likely involve an M-mode transition in the kernel – is probably not the best approach here.

Assuming you’re designing specifically for the FU540 at 1GHz, if you want to delay for a single CPU cycle, it’s probably best to simply use a single “nop” assembly language instruction.
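If you need a slightly longer but still sub-microsecond delay, a busy-wait on the cycle CSR avoids the kernel entirely. A minimal sketch, assuming a GCC-style toolchain and that user-mode reads of the cycle CSR are permitted (delay_cycles is just an illustrative helper, not an existing library function):

    #include <stdint.h>

    /* Spin until at least `n` CPU cycles have elapsed, by polling the cycle CSR.
       Accuracy is limited by the few-cycle cost of the csrr reads themselves. */
    static inline void delay_cycles(uint64_t n)
    {
        uint64_t start, now;
        asm volatile ("csrr %0, cycle" : "=r"(start));
        do {
            asm volatile ("csrr %0, cycle" : "=r"(now));
        } while (now - start < n);
    }

    int main(void)
    {
        asm volatile ("nop");   /* roughly one cycle of delay */
        delay_cycles(1000);     /* roughly 1 µs at 1 GHz */
        return 0;
    }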

That written, 100-130µs does seem surprisingly long, even for calling nanosleep(). My assumption would be that the benchmark timing harness that you’re using probably represents a significant fraction of this time. Other simultaneous load on the system could be another factor here.
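To see how much of the 100-130µs is nanosleep() itself versus your timing harness, one option is to bracket the call with clock_gettime(CLOCK_MONOTONIC) and print the elapsed time over a number of iterations. A rough, untested sketch:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec req = { .tv_sec = 0, .tv_nsec = 1 };  /* request a 1 ns sleep */

        for (int i = 0; i < 10; ++i) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            nanosleep(&req, NULL);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                    + (t1.tv_nsec - t0.tv_nsec);
            printf("nanosleep(1ns) took %ld ns\n", ns);
        }
        return 0;
    }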

See also: Cycle Counts on Unleashed Board Seem Innaccurate?

Hi,
I tried to check clock cycles on the HiFive Unleashed FU540. I am running the following code to read the cycle counter:

#include <stdio.h>

#define read_csr(reg) ({ unsigned long __tmp; \
    asm volatile ("csrr %0, " #reg : "=r"(__tmp)); \
    __tmp; })

#define write_csr(reg, val) ({ \
    asm volatile ("csrw " #reg ", %0" :: "rK"(val)); })

#define swap_csr(reg, val) ({ unsigned long __tmp; \
    asm volatile ("csrrw %0, " #reg ", %1" : "=r"(__tmp) : "rK"(val)); \
    __tmp; })

#define set_csr(reg, bit) ({ unsigned long __tmp; \
    asm volatile ("csrrs %0, " #reg ", %1" : "=r"(__tmp) : "rK"(bit)); \
    __tmp; })

#define clear_csr(reg, bit) ({ unsigned long __tmp; \
    asm volatile ("csrrc %0, " #reg ", %1" : "=r"(__tmp) : "rK"(bit)); \
    __tmp; })

#define rdtime()    read_csr(time)
#define rdcycle()   read_csr(cycle)
#define rdinstret() read_csr(instret)

int main()
{
    for (int i = 0; i < 10; ++i) {
        unsigned long t = rdcycle();

        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        asm("nop");
        ....
        ....
        ....

        unsigned long t1 = rdcycle();
        printf("cycles: %lu\n", t1 - t);
    }
}

#./a.out
cycles: 215
cycles: 142
cycles: 142
cycles: 112
cycles: 112
cycles: 127
cycles: 142
cycles: 142
cycles: 142
cycles: 127

I have included 100 nops, which should take 100 cycles plus a few more cycles for returning from the function (I would like to mention that with 1 nop I am getting 14 cycles), but I am not able to understand why it is taking more than that. Can someone help us rectify this?

Thanks!

Use “objdump -dr” to disassemble the code and see what the extra instructions are. I get similar results if I compile with -O0, and better results if I compile with -O2. The difference is that -O0 does not optimize away local variables, and does not allocate them to registers, to make debugging easier. So at -O0 each read_csr macro generates multiple load/store instructions for __tmp accesses, which accounts for the extra cycles. At -O2 the extra instructions go away, and I get 101 cycles after the cache is primed, that is probably one extra cycle for the rdcycle instruction, or maybe I got my count of nops wrong.
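For reference, the commands I mean are along these lines (the output file names are just examples):

#gcc -O0 -o cycle_O0 cycle_check.c
#objdump -dr cycle_O0 | less
#gcc -O2 -o cycle_O2 cycle_check.c
#objdump -dr cycle_O2 | less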

Thanks Jim for the response.
I have tried both -O0 and -O2, and I ran the same executables generated with -O0 and -O2 multiple times. I have observed that the cycle counts vary every time for the same program:
#gcc -O0 cycle_check.c
#./a.out
cycles: 167
cycles: 131
cycles: 112
cycles: 135
cycles: 112
cycles: 112
cycles: 127
cycles: 112
cycles: 112
cycles: 112
root@exaleapsemi:~# ./a.out
cycles: 157
cycles: 157
cycles: 112
cycles: 112
cycles: 112
cycles: 112
cycles: 112
cycles: 123
cycles: 112
cycles: 112

#gcc -O2 cycle_check.c
#./a.out
root@exaleapsemi:~# ./a.out
cycles: 150
cycles: 133
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
root@exaleapsemi:~# ./a.out
cycles: 148
cycles: 133
cycles: 118
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
root@exaleapsemi:~# ./a.out
cycles: 150
cycles: 148
cycles: 116
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101
cycles: 101

My assumption was that I should get the same cycle count if I ran the same executable multiple times, but that seems to be wrong. Why could that be?

Thanks!

Assuming you didn’t disable interrupts, this is likely due to an interrupt handler running in between (either a hardware interrupt or, e.g., MMU page faults). Cycle-exact timing is not generally possible from user space.
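One common user-space workaround is to repeat the measurement many times and take the minimum, since interrupts and page faults can only add cycles. A small self-contained sketch of that idea (untested, reusing the csrr-based cycle read from earlier in the thread):

    #include <stdio.h>

    static inline unsigned long rdcycle(void)
    {
        unsigned long c;
        asm volatile ("csrr %0, cycle" : "=r"(c));
        return c;
    }

    int main(void)
    {
        unsigned long best = (unsigned long)-1;

        /* Interrupts and page faults only ever inflate the count, so the
           minimum over many repetitions is closest to the undisturbed value. */
        for (int i = 0; i < 1000; ++i) {
            unsigned long t0 = rdcycle();
            asm("nop"); asm("nop"); asm("nop"); asm("nop"); asm("nop");
            unsigned long t1 = rdcycle();
            if (t1 - t0 < best)
                best = t1 - t0;
        }
        printf("min cycles: %lu\n", best);
        return 0;
    }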


The first few runs generate cache misses and the last ones don’t, so they are faster. Besides the already-mentioned interrupts that you can’t control, there are also other programs running which could cause a context switch in the middle of your program. If you have PIE and/or ASLR enabled, then you could have different run-time addresses, which could cause different cache behavior, e.g. different sets of cache lines in the program get cache conflicts depending on the run-time load address. If you have I/O, like disk I/O, that is another source of unpredictable timing. So for best results you need to boot in single-user mode, disable as many daemons as possible, disable cron jobs, ensure that there is no PIE/ASLR, maybe boot just one core instead of 4 to be safe, etc. And even then you will still see some variation in results, just not as much. It is simpler to just accept that timing will vary from one run to the next. This is why benchmark suites like SPEC do multiple runs.
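If you want to reduce (not eliminate) the variation without rebooting, pinning the process to a single hart and locking its pages is one option. A sketch using standard Linux APIs, assuming sufficient privileges; the measurement loop itself is elided:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        /* Pin this process to CPU 1 so the scheduler does not migrate it. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        /* Lock current and future pages in RAM to avoid page faults
           in the middle of a measurement. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* ... run the cycle-count loop from above here ... */
        return 0;
    }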
