HPM instruction cache miss counter

Hi,

I am currently benchmarking the HiFive1 Rev B with the TACLe benchmark collection. For measuring the performance I use the Hardware Performance Monitor (HPM) interface. This provides easy access to clock cycles, retired instructions and a couple of extra counters that can be activated.
For my benchmarking I set the instruction cache miss event in the mhpmevent3 register and successfully read values from mhpmcounter3 after each benchmark run. However, if cache misses occurred, the values I get from the mhpmcounter3 register are humongous.

For example:
clock cycle counter: 139104
instruction retired counter: 3859
instruction cache miss counter: 292057776196

I know that the huge difference between clock cycle counter and retired instructions is due to the instructions being read from flash storage. There was no cache warmup in this case.
What I would like to know is the unit or meaning of the instruction cache miss counter. The FE310-G002 manual doesn’t provide sufficient information about this. So this is perhaps some kind of common knowledge I don’t have yet…?

Thanks in advance.

292057776196 is an interesting number in hex: 0x0000004400000044

Are you sure you didn’t read the low word of the counter twice?
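
To make the hex observation concrete, here is a small host-side sketch that splits a 64-bit value into its two 32-bit words and checks whether they are identical. The helper names (high_word, low_word, halves_match) are made up for illustration and are not part of any SDK.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit value. */
uint32_t high_word(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

/* Lower 32 bits of a 64-bit value. */
uint32_t low_word(uint64_t v)
{
	return (uint32_t)(v & 0xFFFFFFFFu);
}

/* Returns 1 if both halves are identical, which hints at the same
 * 32-bit word having been read twice. */
int halves_match(uint64_t v)
{
	return high_word(v) == low_word(v);
}
```

For the reported value 292057776196, both words come out as 0x44 (decimal 68), so the actual miss count was most likely 68.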

Thanks for suggesting that. Indeed, a lot of my results look similar to this in hex. Very suspicious.
However I don’t see any reason why my code would behave like this. This is the function I use to gather the counter values:

unsigned long long hpm_read_counter(unsigned int counter)
{
	unsigned long long val = 0;
	unsigned long hi = 0, hi1 = 0, lo = 0;

	do{
		switch (counter) {
			case HPM_CLOCK_CYCLES:
				asm volatile ("csrr %0, mcycleh" : "=r"(hi));
				asm volatile ("csrr %0, mcycle" : "=r"(lo));
				asm volatile ("csrr %0, mcycleh" : "=r"(hi1));
				break;
			case HPM_INSTRUCTIONS:
				asm volatile ("csrr %0, minstreth" : "=r"(hi));
				asm volatile ("csrr %0, minstret" : "=r"(lo));
				asm volatile ("csrr %0, minstreth" : "=r"(hi1));
				break;
			case HPM_COUNTER_3:
				asm volatile ("csrr %0, mhpmcounter3h" : "=r"(hi));
				asm volatile ("csrr %0, mhpmcounter3" : "=r"(lo));
				asm volatile ("csrr %0, mhpmcounter3h" : "=r"(hi1));
				break;
			case HPM_COUNTER_4:
				asm volatile ("csrr %0, mhpmcounter4h" : "=r"(hi));
				asm volatile ("csrr %0, mhpmcounter4" : "=r"(lo));
				asm volatile ("csrr %0, mhpmcounter4h" : "=r"(hi1));
				break;
			default:
				break;
		}
	} while (hi != hi1);

	if (counter == HPM_COUNTER_3 || counter == HPM_COUNTER_4)
		hi &= 0xFF;

	val = ((unsigned long long)hi << 32) | lo;

	return val;
}

Any idea why this would produce such a result for the instruction cache miss counter?

Edit:
During debugging it seemed like mhpmcounter3h and mhpmcounter3 both return the same value. I even tested this with:
asm volatile ("csrr t0, mhpmcounter3h");
asm volatile ("csrr t1, mhpmcounter3");
… and got either the same value in both registers or values that differed only by 1.

I’m not an expert in this area, so I may be way off-base here. But I’m suspicious that you might have to jump through an extra hoop to enable these counters. For example, SiFive’s example program that uses Freedom Metal’s HPM API,
https://github.com/sifive/example-hpm/blob/master/example-hpm.c
features the following suspicious comment:

/* Note that mcycle, mtime, minstret are enabled by default,
  * hence do not need to be explicitly set like done above.  */

You might consider starting by modifying this example (using Freedom-E-SDK) and seeing if you can make progress.

To dig further, here is where the Freedom Metal HPM logic lives:
https://github.com/sifive/freedom-metal/blob/master/metal/hpm.h
https://github.com/sifive/freedom-metal/blob/master/src/hpm.c
Unfortunately, I didn’t see this code discussed in the Metal documentation.

I recently had a project that needed to collect HPM counters and ran into the same issue. Even though the docs say the counters for event3 and event4 are 40 bits wide, I found you can only really access the lower 32-bit register (e.g. hpmcounter3); the upper one seems to just be hardwired to the lower one, so combining it with the upper register resulted in the aforementioned behaviour. To tackle this I resorted to using only the lower 32 bits and clearing the counters after each sampling, and it works just fine (of course that depends on how often you collect the values, but if it's in seconds rather than minutes/hours you should be fine).
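
A minimal sketch of that workaround, assuming machine-mode access to mhpmcounter3 on the target. The function name hpm3_sample_and_clear and the host-only stand-in variable fake_mhpmcounter3 are hypothetical; the stand-in exists only so the logic can be compiled and tried off-target.

```c
#include <stdint.h>

#if !defined(__riscv)
/* Host-only stand-in for the CSR (hypothetical, not part of any SDK). */
uint32_t fake_mhpmcounter3 = 0;
#endif

/* Return the low 32 bits of mhpmcounter3 and reset the counter to zero.
 * The high CSR (mhpmcounter3h) is deliberately never read. */
uint32_t hpm3_sample_and_clear(void)
{
	uint32_t old;
#if defined(__riscv)
	/* csrrw with rs1 = zero writes 0 to the CSR and returns its old value. */
	__asm__ volatile ("csrrw %0, mhpmcounter3, zero" : "=r"(old));
#else
	old = fake_mhpmcounter3;
	fake_mhpmcounter3 = 0;
#endif
	return old;
}
```

Calling this once right before a benchmark run (discarding the result) and once right after gives the event count for that run, as long as fewer than 2^32 events occur in between.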

@nick.knight I made sure the corresponding event is set and the counter is being initialized before reading any values from it. Verifying this behavior with the official implementation is a good idea and I was already thinking about doing that. I will perhaps try this in a couple of days / weeks.

@MichalOlborski thank you for sharing this with me. This will be my solution for now, as I am in a bit of a hurry.

If, however, anybody knows more about this, please let me know.

OK, short story is that you should only trust the lower 32 bits of hpmcounter3 and hpmcounter4, i.e., use @MichalOlborski’s workaround. I haven’t checked, but I suspect the Freedom Metal example I cited is also problematic, so please ignore it. I, or someone from the CX team, will follow up with more details; sorry for the confusion.

Thanks again for reporting this issue. It is an erratum (CIP-127) in the FE310-G002. It was recently documented publicly here:
HiFive1 Rev B - SiFive (“Freedom E310-G002 Errata”, under Documentation & Support)

The relevant text says the following (no surprises here for the previous commenters in this thread):

CIP-127
Title: mhpmcounterXh registers can’t be read on RV32
Implication: The performance counters are documented as 40 bits wide. On RV32 systems, the upper 8 bits are in the h CSR and the lower 32 bits are in the usual CSR (on 64 bit systems there is only one CSR). However, in RV32 systems reading the h version of the CSRs is reading the lower 32 bits.
Workaround: Do not rely on the upper 8 bits of the register.
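
If a total wider than 32 bits is still needed, one possible approach (not taken from the errata, just a sketch) is to accumulate the differences between successive low-word readings in software. The struct and function names below are hypothetical, and the sketch assumes the counter is sampled at least once per 2^32 events so that at most one wrap happens between readings.

```c
#include <stdint.h>

/* Running 64-bit total built from successive 32-bit counter readings. */
struct hpm_accum {
	uint32_t last;   /* previous raw 32-bit reading */
	uint64_t total;  /* events accumulated since init */
};

/* Start accumulating from the given raw reading. */
void hpm_accum_init(struct hpm_accum *a, uint32_t first_reading)
{
	a->last = first_reading;
	a->total = 0;
}

/* Add the events since the previous reading and return the new total.
 * Storing the difference in a uint32_t keeps it modulo 2^32, so the
 * delta stays correct across a single wrap of the hardware counter. */
uint64_t hpm_accum_update(struct hpm_accum *a, uint32_t reading)
{
	uint32_t delta = reading - a->last;
	a->last = reading;
	a->total += delta;
	return a->total;
}
```

Each raw reading passed to hpm_accum_update would come from the low CSR only (e.g. a plain csrr of mhpmcounter3), never from the h register.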