Inaccurate cycles counting on HiFive1 Rev B using mcycle CSR

Dear SiFive community,

I am trying to write an accurate cycle counting function thanks to the mcycle and minstr CSRs.
Here is my assembly code to measure the instructions/cycles that are required by some code:

.text

.globl getcycles
.align 2
getcycles:
    csrr a1, mcycleh
    csrr a0, mcycle
    ret
.size getcycles,.-getcycles

.globl getinstret
.align 2
getinstret:
    csrr a1, minstreth
    csrr a0, minstret
    ret
.size getinstret,.-getinstret

By simply tweaking the Hello World sample program, I am testing my functions with a simple memset:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

uint64_t getcycles();
uint64_t getinstret();

int main() {
	uint8_t test[16];
    uint64_t oldcyccount = getcycles();
    uint64_t oldinstcount = getinstret();
	memset(test, 0x00, 16);
    uint64_t newcyccount = getcycles()-oldcyccount;
    uint64_t newinstcount = getinstret()-oldinstcount;
    printf("cycles = %d\n",(unsigned int)newcyccount);
    printf("instructions = %d\n", (unsigned int)newinstcount);
    printf("Hello, World !\n");
}

When running this program I get the following output:

cycles = 7271
instructions = 44
Hello World !

The value of newinstcount seems reasonable to me, but I don’t see why a simple memset would requires 7271 cycles? Is there something wrong with my assembly functions?

Thanks in advance.

Cheers

You are probably running code from flash which is slow. It will run faster if you use ITIM and DTIM. See for example Instruction per Cycle.

Thanks @jimw for considering my post.
Indeed running the code from flash might be the reason.
However I tried to run my test on the “example-itim” program in the software repository of the Freedom E SDK

#include <stdio.h>
#include <metal/itim.h>
#include <stdint.h>

__attribute__ ((noinline))
METAL_PLACE_IN_ITIM
uint64_t getcycles();

__attribute__ ((noinline))
METAL_PLACE_IN_ITIM
uint64_t getinstret();

__attribute__ ((noinline))
METAL_PLACE_IN_ITIM
int run_in_itim(int x)
{
	return x * x;
}

int main(void)
{
	int x = 2;
    uint64_t oldcyccount = getcycles();
	int y = run_in_itim(x);
    uint64_t newcyccount = getcycles()-oldcyccount;
    printf("cycles = %d\n",(unsigned int)newcyccount);
	if(y != 4) {
		puts("ITIM test failed\n");
		return 1;
	}
	puts("ITIM test passed\n");
	return 0;
}

which outputs
cycles = 6968
ITIM test passed

Am I doing something wrong? I am just compiling the program with a simple make and then flashing the .hex file using ./upload --hex src/debug/example-itim.hex --jlink JLinkExe

I don’t have a HiFive Rev B so I can’t help you with the details. I just know that you won’t get the right result unless you are using the *TIMs properly. And that others have managed to the right results with a bit of work.

I checked the disassembly file and the run_in_itim function is well running in the ITIM

Disassembly of section .itim:

08000000 <run_in_itim>:
 8000000:	1101                	addi	sp,sp,-32
 8000002:	ce22                	sw	s0,28(sp)
 8000004:	1000                	addi	s0,sp,32
 8000006:	fea42623          	sw	a0,-20(s0)
 800000a:	fec42703          	lw	a4,-20(s0)
 800000e:	fec42783          	lw	a5,-20(s0)
 8000012:	02f707b3          	mul	a5,a4,a5
 8000016:	853e                	mv	a0,a5
 8000018:	4472                	lw	s0,28(sp)
 800001a:	6105                	addi	sp,sp,32
 800001c:	8082                	ret

Disassembly of section .text:

20011000 <main>:
20011000:	7179                	addi	sp,sp,-48
20011002:	d606                	sw	ra,44(sp)
20011004:	d422                	sw	s0,40(sp)
20011006:	1800                	addi	s0,sp,48
20011008:	4789                	li	a5,2
2001100a:	fef42623          	sw	a5,-20(s0)
2001100e:	4791                	li	a5,4
20011010:	fef42423          	sw	a5,-24(s0)
20011014:	2045                	jal	200110b4 <getcycles>
20011016:	fea42023          	sw	a0,-32(s0)
2001101a:	feb42223          	sw	a1,-28(s0)
2001101e:	fec42503          	lw	a0,-20(s0)
20011022:	e7fef097          	auipc	ra,0xe7fef
20011026:	fde080e7          	jalr	-34(ra) # 8000000 <metal_itim_0_memory_start>
2001102a:	fea42423          	sw	a0,-24(s0)
2001102e:	2059                	jal	200110b4 <getcycles>
20011030:	86aa                	mv	a3,a0
20011032:	872e                	mv	a4,a1
20011034:	fe042583          	lw	a1,-32(s0)
20011038:	fe442603          	lw	a2,-28(s0)
2001103c:	40b687b3          	sub	a5,a3,a1
20011040:	853e                	mv	a0,a5
20011042:	00a6b533          	sltu	a0,a3,a0
20011046:	40c70833          	sub	a6,a4,a2
2001104a:	40a80733          	sub	a4,a6,a0
2001104e:	883a                	mv	a6,a4
20011050:	86be                	mv	a3,a5
20011052:	8742                	mv	a4,a6
20011054:	55e1                	li	a1,-8
20011056:	567d                	li	a2,-1
20011058:	00b687b3          	add	a5,a3,a1
2001105c:	853e                	mv	a0,a5
2001105e:	00d53533          	sltu	a0,a0,a3
20011062:	00c70833          	add	a6,a4,a2
20011066:	01050733          	add	a4,a0,a6
2001106a:	883a                	mv	a6,a4
2001106c:	fcf42c23          	sw	a5,-40(s0)
20011070:	fd042e23          	sw	a6,-36(s0)
20011074:	fd842783          	lw	a5,-40(s0)
20011078:	85be                	mv	a1,a5
2001107a:	200107b7          	lui	a5,0x20010
2001107e:	36078513          	addi	a0,a5,864 # 20010360 <use_hfxosc+0x2c>
20011082:	2835                	jal	200110be <iprintf>
20011084:	fe842703          	lw	a4,-24(s0)
20011088:	4791                	li	a5,4
2001108a:	00f70963          	beq	a4,a5,2001109c <main+0x9c>
2001108e:	200107b7          	lui	a5,0x20010
20011092:	37078513          	addi	a0,a5,880 # 20010370 <use_hfxosc+0x3c>
20011096:	2299                	jal	200111dc <puts>
20011098:	4785                	li	a5,1
2001109a:	a039                	j	200110a8 <main+0xa8>
2001109c:	200107b7          	lui	a5,0x20010
200110a0:	38478513          	addi	a0,a5,900 # 20010384 <use_hfxosc+0x50>
200110a4:	2a25                	jal	200111dc <puts>
200110a6:	4781                	li	a5,0
200110a8:	853e                	mv	a0,a5
200110aa:	50b2                	lw	ra,44(sp)
200110ac:	5422                	lw	s0,40(sp)
200110ae:	6145                	addi	sp,sp,48
200110b0:	8082                	ret
	...

200110b4 <getcycles>:
200110b4:	b80025f3          	csrr	a1,mcycleh
200110b8:	b0002573          	csrr	a0,mcycle
200110bc:	8082                	ret

So I really don’t see why so many clock cycles are spent for run_in_itim.
Am I missing something?

That’s weird, indeed, 6968 cycles for a few instructions is not usual.

Can you try to get the results of a series of getcycles() in an array (let’s say 32-64 values), to see how long it takes each call?

BTW, the getcycles() you are using fails when a carry from low to high word occurs between the two instructions. Not that it will make any difference in this case.

Thanks @ilg for the feedback.

I executed the following code:

#include <stdio.h>
#include <metal/itim.h>
#include <stdint.h>

uint64_t getcycles();

__attribute__ ((noinline))
METAL_PLACE_IN_ITIM
int run_in_itim(int x)
{
	return x * x;
}

int main(void)
{
	uint64_t oldcyccount, newcyccount;
	int x = 2;
	int y;
	printf("[");
	for(int i = 0 ; i < 32; i++) {
	    oldcyccount = getcycles();
		y = run_in_itim(x);
	    newcyccount = getcycles()-oldcyccount;
	    printf("%d, ",(unsigned int)newcyccount);
	}
	printf("]\n");
	return 0;
} 

and get [6968, 41, 40, 40, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, ..., 41].

As result the following calls only require 41 (sometimes 40?) cycles, which seems accurate.

How to explain that behavior? Is it due to the branch predictor?

It was not exactly what I suggested, but I’m glad it helped.

I suggested to calibrate your measurements with something like:

int cycles[32];
cycles[0] = getcycles();
cycles[1] = getcycles();
...
cycles[31] = getcycles();

to get an idea how long the measurement takes.

I have no idea why the first measurement is so large, I wouldn’t be surprised to know it is small bug in hardware.

Judging from your dump, it looks like the getcycles call itself is not on the itim. My guess is the attribute used in the c file didn’t apply since it referenced an external symbol exported by your assembly file.

The reason it may be correct in subsequent runs is due to the instruction cache (which is colocated in ITIM).

Try moving the getcycles routine inline in c with the itim decorator (or try placing it manually in the .itim section instead .text) and see if that helps.

Indeed @tincman is right, the attribute in the C file does not apply.

By replacing .text by .section .itim in get_cycles() I ensure that it runs in the ITIM as when disassembling the resulting .elf file I get

Disassembly of section .itim:

08000000 <run_in_itim>:
 8000000:	1101                	addi	sp,sp,-32
 8000002:	ce22                	sw	s0,28(sp)
 8000004:	1000                	addi	s0,sp,32
 8000006:	fea42623          	sw	a0,-20(s0)
 800000a:	fec42703          	lw	a4,-20(s0)
 800000e:	fec42783          	lw	a5,-20(s0)
 8000012:	02f707b3          	mul	a5,a4,a5
 8000016:	853e                	mv	a0,a5
 8000018:	4472                	lw	s0,28(sp)
 800001a:	6105                	addi	sp,sp,32
 800001c:	8082                	ret
	...

08000020 <getcycles>:
 8000020:	b80025f3          	csrr	a1,mcycleh
 8000024:	b0002573          	csrr	a0,mcycle
 8000028:	8082                	ret 

However I get the following output in this case: [2172, 46, 46, 46, 46, ..., 46]. So there is still a huge gap for the first measurement. Moreover I get 46 cycles instead of 40 when get_cycles() is not running in the ITIM.

By investigating further, it seems that the problem is not linked to ITIM or DTIM usage since when the functions are not put in the ITIM (i.e. by removing __attribute__ ((noinline)) METAL_PLACE_IN_ITIM and using the symbol .text for get_cycles()) I get the following disassembly file

Disassembly of section .text:

20010380 <run_in_itim>:
20010380:	1101                	addi	sp,sp,-32
20010382:	ce22                	sw	s0,28(sp)
20010384:	1000                	addi	s0,sp,32
20010386:	fea42623          	sw	a0,-20(s0)
2001038a:	fec42703          	lw	a4,-20(s0)
2001038e:	fec42783          	lw	a5,-20(s0)
20010392:	02f707b3          	mul	a5,a4,a5
20010396:	853e                	mv	a0,a5
20010398:	4472                	lw	s0,28(sp)
2001039a:	6105                	addi	sp,sp,32
2001039c:	8082                	ret

2001039e <main>:
2001039e:	7171                	addi	sp,sp,-176
200103a0:	d706                	sw	ra,172(sp)
200103a2:	d522                	sw	s0,168(sp)
200103a4:	1900                	addi	s0,sp,176
200103a6:	4789                	li	a5,2
200103a8:	fef42423          	sw	a5,-24(s0)
200103ac:	05b00513          	li	a0,91
200103b0:	20d9                	jal	20010476 <putchar>
200103b2:	fe042623          	sw	zero,-20(s0)
200103b6:	a899                	j	2001040c <main+0x6e>
200103b8:	2885                	jal	20010428 <getcycles>
200103ba:	fea42023          	sw	a0,-32(s0)
200103be:	feb42223          	sw	a1,-28(s0)
200103c2:	fe842503          	lw	a0,-24(s0)
200103c6:	3f6d                	jal	20010380 <run_in_itim>
200103c8:	fca42e23          	sw	a0,-36(s0)
200103cc:	28b1                	jal	20010428 <getcycles>
200103ce:	86aa                	mv	a3,a0
200103d0:	872e                	mv	a4,a1
200103d2:	fe042583          	lw	a1,-32(s0)
200103d6:	fe442603          	lw	a2,-28(s0)
200103da:	40b687b3          	sub	a5,a3,a1
200103de:	853e                	mv	a0,a5
200103e0:	00a6b533          	sltu	a0,a3,a0
200103e4:	40c70833          	sub	a6,a4,a2
200103e8:	40a80733          	sub	a4,a6,a0
200103ec:	883a                	mv	a6,a4
200103ee:	fcf42823          	sw	a5,-48(s0)
200103f2:	fd042a23          	sw	a6,-44(s0)
200103f6:	fd042783          	lw	a5,-48(s0)
200103fa:	85be                	mv	a1,a5
200103fc:	84818513          	addi	a0,gp,-1976 # 80000ae8 <__metal_driver_vtable_fixed_clock+0x8>
20010400:	280d                	jal	20010432 <iprintf>
20010402:	fec42783          	lw	a5,-20(s0)
20010406:	0785                	addi	a5,a5,1
20010408:	fef42623          	sw	a5,-20(s0)
2001040c:	fec42703          	lw	a4,-20(s0)
20010410:	47fd                	li	a5,31
20010412:	fae7d3e3          	bge	a5,a4,200103b8 <main+0x1a>
20010416:	85018513          	addi	a0,gp,-1968 # 80000af0 <__metal_driver_vtable_fixed_clock+0x10>
2001041a:	22f1                	jal	200105e6 <puts>
2001041c:	4781                	li	a5,0
2001041e:	853e                	mv	a0,a5
20010420:	50ba                	lw	ra,172(sp)
20010422:	542a                	lw	s0,168(sp)
20010424:	614d                	addi	sp,sp,176
20010426:	8082                	ret

20010428 <getcycles>:
20010428:	b80025f3          	csrr	a1,mcycleh
2001042c:	b0002573          	csrr	a0,mcycle
20010430:	8082                	ret

which returns [2446, 38, 38, 38, ..., 38].

@jimw I’m not sure that a proper configuration of the ITIM/DTIM is the reason here. Actually it seems that the code runs faster when not put into the ITIM: what could explain that behavior?

It doesn’t help if getcycles() is in ITIM (or cache) but the code that calls it is not.

All of this …

oldcyccount = getcycles();
y = run_in_itim(x);
newcyccount = getcycles()-oldcyccount;

… (except the actual subtraction, but you might as well include it) needs to be either 1) put in ITIM, or 2) run once before making an actual measurement to get it from flash into cache.

The difference between 46 and 38 is probably instruction alignment within a cache line or instruction prefetch buffer.

1 Like