Trap or count instruction cache misses?

I’m working on some timing-sensitive code (an interrupt-driven SPI driver) that’s suffering from glitches due to instruction cache fills. The first time my function runs, it pauses for ~13us, then for ~160us. The second time it runs fine. That sounds pretty clearly like an instruction cache fill. I can, of course, work with this…but it got me wondering.

Is there any way to count or trap I-cache misses in the FE310G? It’d be great to confirm my hunches when I see glitches like this, and down the road (when my code gets >16KB), I’m going to need to know when an unexpected I-cache miss glitches my routine.

While I’m at it, is there any better way to prime the I-cache than executing the code I want loaded? Disabling the routine’s side effects for a priming run would add code, which hurts performance.

I’m assuming that your code is running out of SPI Flash, not the Scratchpad RAM. Here are a few ideas:

  • If your routine is fairly small, you could allocate space for it in the SRAM and make sure it is copied there on startup (either manually or by declaring it as initialized data). Then even if it is evicted from the cache, you are dealing with a few cycles of cache-refill latency rather than read-the-SPI-flash latency.

  • Make sure the memory-mapped SPI flash parameters are set reasonably. If you are using the Freedom E SDK on the HiFive1, we conservatively reduce the SPI Flash read speed when we boost the clock speed. You can turn this back up significantly. Look for “SCK_DIV” in the SDK code to see what I mean.

  • To answer your actual question: you can’t explicitly trap on I-cache misses. You can try using the FE310’s breakpoint registers to raise a breakpoint exception when you execute at certain addresses or address ranges. You could create a match condition that corresponds to your routine getting evicted, but I’m not sure how you would make it NOT match on the routine you actually care about. The FE310-G000 has two of these registers. It doesn’t have any I-cache miss counters or trace logic.

For more info on the breakpoint registers, see Chapter 9, “Trigger Module”.

As for “priming” the instructions without executing them, it’s not really possible (see this thread for why).

If your code is small, you can get each 32-byte cache line loaded into the cache by executing one instruction in it. If the line contains a RET instruction, you can call directly to it. If it doesn’t, you can insert a C.RET after some unconditional branch. If there’s no unconditional branch in that 32 bytes, you can insert a C.RET and a branch around it.

Or, instead of RETs which you have to explicitly call in turn from code elsewhere, you could chain together C.J instructions from one cache line to the next.

Of course this is fiddly, has to be updated every time you change the code in question, and takes away up to 1/16th of your code space. But it does let you prime the cache without executing your actual code.

Hopefully your timing-critical code isn’t very big.

Cache prefetch instructions would be much better!!!

Another source of timing irregularities is branch prediction. In tests on my HiFive1 I’ve found that the branch prediction is excellent once the code is warmed up. And correctly-predicted branches take only one cycle.

But there’s really no way to warm up the branch prediction other than to actually execute the code :frowning:

Fortunately, of course, a branch prediction miss costs only about 3 ns @ 256 MHz, so it’s utterly trivial compared to your 13 us or more for an instruction cache fill from flash.

If you set a hardware breakpoint on an instruction, is the instruction actually fetched, or does the breakpoint happen as soon as the address is matched?

If the former, then that would be useful for prefetching. But not otherwise.