FE310-G002 clock and I-Cache performance

DanaK6JQ · December 28, 2019, 8:48pm

Tinkering a bit, I did a very simple benchmark sort of thing (excerpted code):

new_cpu_clock = metal_clock_set_rate_hz(&__metal_dt_clock_4.clock,
  clock_rate);

if (new_cpu_clock == last_cpu_clock) {
	printf("skipping %ld\n", new_cpu_clock);
	return;
}

last_cpu_clock = new_cpu_clock;

rtcd = metal_rtc_get_device(0);
t1 = metal_rtc_get_count(rtcd);

for (int32_t i = 0; i < 10000000L; i++) {
	// (void) metal_gpio_toggle_pin(gpd, 5);
	(void)mrand48();
}
t2 = metal_rtc_get_count(rtcd);

RTC is driven by an external 32.768kHz oscillator. My guess is that all of the code fits in the I-Cache (though I haven’t confirmed it). I found that mrand48() required consistently 96.2 clocks at every clock up to and including 288MHz. At 320MHz - the documented FMAX of the part - it required 103.7 clocks. If I’m doing this right, this suggests that the I-Cache runs out of steam between 288MHz and 320MHz.

My take-away is that 288MHz is the maximum practical clock then. Any comments?

tincman · May 1, 2020, 3:05pm

Fascinating! I’d definitely be curious about that.

But I do wonder if maybe the clocks aren’t as stable at 320 MHz or there’s a low voltage event (or something something handwaving).

I’d be curious if you also looked at mcycle (metal_cpu_get_timer should return that) to see if that matched up as well. There are also a handful of performance counters which have an itim busy counter that may shed some more light on this.
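For anyone wanting to cross-check against mcycle, here is a minimal sketch that reads the CSR directly rather than going through Metal. It assumes the code runs in machine mode on an RV32 core (so the counter is split across mcycle and mcycleh); the function names are made up for illustration, and the non-RISC-V branch is only a stand-in so the rollover logic can be exercised on a host.

```c
#include <stdint.h>

#if defined(__riscv) && (__riscv_xlen == 32)
/* Real CSR reads; assumes machine mode on an RV32 core. */
static inline uint32_t read_mcycle_lo(void)
{
	uint32_t v;
	__asm__ volatile("csrr %0, mcycle" : "=r"(v));
	return v;
}

static inline uint32_t read_mcycle_hi(void)
{
	uint32_t v;
	__asm__ volatile("csrr %0, mcycleh" : "=r"(v));
	return v;
}
#else
/* Host stand-in (hypothetical): a fake counter that advances on each low read. */
static uint64_t fake_cycles;

static inline uint32_t read_mcycle_lo(void)
{
	fake_cycles += 7;
	return (uint32_t)fake_cycles;
}

static inline uint32_t read_mcycle_hi(void)
{
	return (uint32_t)(fake_cycles >> 32);
}
#endif

/* Read the 64-bit cycle count; retry if the high half changed mid-read. */
static uint64_t read_mcycle64(void)
{
	uint32_t hi, lo, hi2;

	do {
		hi = read_mcycle_hi();
		lo = read_mcycle_lo();
		hi2 = read_mcycle_hi();
	} while (hi != hi2);

	return ((uint64_t)hi << 32) | lo;
}
```

Taking read_mcycle64() before and after the loop and dividing the difference by the loop count would give cycles per call without involving the RTC or the programmed clock rate at all.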

DanaK6JQ · May 1, 2020, 3:26pm

It’s been a while since I ran the test, but I recall the results being consistent. So I’d figured the clock is stable (it’s the output of a PLL, right? Is there a fault associated with unlock etc.?) and the relatively small difference (~8%) is the result of something deterministic. If the measurements weren’t consistent I’d suspect something from class:handwaving

Now I’m curious again and will have to re-run the experiment and look more deeply at the performance counters.

Thank you!

deadcommon · May 1, 2020, 8:38pm

regarding the rtc frequency - this depends upon which board you are using and if there is an external oscillator/crystal. The internal LFROSC oscillator does not come pretrimmed and is not trimmed in the bootloader unless you add that code. Also, the internal osc is quite temperature sensitive, so if you are running your HFOSC at 320 MHz, then self heating from the big bump in power dissipation will pull your LFROSC…

DanaK6JQ · May 1, 2020, 9:27pm

I’m using the SparkFun RED-V board, which comes with a 32kHz oscillator, and I believe I’ve correctly enabled it - meaning, I believe that AON_PSD_LFCLKSEL tied low selects the XO.

DanaK6JQ · May 1, 2020, 10:14pm

So I re-ran the test, with the following results for three trials:

32000000: 955124
320000000: 96885
288000000: 107641
32000000: 955128
320000000: 96877
288000000: 107641
32000000: 955135
320000000: 96885
288000000: 107641

where the first value is the programmed clock rate and the second value is the 32.768kHz RTC count. Loop count is 10e+6 here. I calculate clocks per call to mrand48() by:

(RTC_count [cycles] / 32768 [cycles/sec]) * CPU_clock [cycles/sec] / 10e+6 calls -> CPU cycles per call

32MHz: 93.28 cycles
288MHz: 94.6 cycles
320MHz: 94.6 cycles
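The same arithmetic as a small C helper, in case it is useful for re-checking the numbers. This is only a sketch: the function name and the RTC_HZ constant are introduced here for illustration, and it assumes the RTC really is ticking at 32768 Hz.

```c
#include <stdint.h>

#define RTC_HZ 32768.0	/* assumed RTC tick rate */

/* Convert an elapsed RTC tick count into CPU cycles per call. */
double cycles_per_call(uint64_t rtc_ticks, double cpu_hz, double calls)
{
	double elapsed_s = (double)rtc_ticks / RTC_HZ;

	return elapsed_s * cpu_hz / calls;
}
```

For example, cycles_per_call(96885, 320e6, 10e6) reproduces the 320MHz figure above to within rounding.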

Well, how about that? I apparently did some wrong math last December. My apologies.

Cheers,
Dana

DanaK6JQ · May 2, 2020, 1:08am

I previously ran this under the debugger; I didn’t think it would have a run-time impact. But once I exited the debugger and reset the MCU, I saw quite a change in the 32MHz result:

320000000: 96876 94.6 cycles
288000000: 107645 94.6 cycles
32000000: 948330 92.6 cycles

Makes me think the debugger halts the CPU periodically for a brief period. Also makes me think the flash is suffering a wait-state at higher clock rates.

tincman · May 2, 2020, 3:05am

Hmmm, or maybe it is heat related? Did you run each frequency trial back to back, or did you switch the frequency for each? I wonder if your previous result was from it running at 320 MHz for quite a while.

Also, you can ensure the instructions are “cached” by placing them explicitly in ITIM (decorate a function with METAL_PLACE_IN_ITIM).
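A rough sketch of what that decoration looks like. It assumes METAL_PLACE_IN_ITIM comes from metal/itim.h and falls back to a no-op when that header is not available, so the same file also builds on a host. The lcg_burn() function is a made-up stand-in workload, not part of the original test.

```c
#include <stdint.h>

/* Pick up the Freedom Metal macro if the header is available (assumed
 * to live in metal/itim.h); otherwise make the decoration a no-op. */
#if defined(__has_include)
#  if __has_include(<metal/itim.h>)
#    include <metal/itim.h>
#  endif
#endif
#ifndef METAL_PLACE_IN_ITIM
#  define METAL_PLACE_IN_ITIM
#endif

/* Hypothetical workload: a simple linear congruential generator step,
 * repeated 'iterations' times, placed in ITIM when the macro is real. */
METAL_PLACE_IN_ITIM
uint32_t lcg_burn(uint32_t seed, uint32_t iterations)
{
	uint32_t x = seed;

	for (uint32_t i = 0; i < iterations; i++)
		x = x * 1664525u + 1013904223u;

	return x;
}
```

Note that decorating a function only moves that function. A library routine it calls, such as mrand48(), stays wherever the linker put it, which is why the sketch uses a self-contained workload.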

Or, could it have been a spurious interrupt that had a higher probability of triggering at the higher clocks?

DanaK6JQ · May 2, 2020, 5:19pm

I believe I am finding some clarity. I tweaked the simple test to estimate the loop count for 30 seconds, so the 32kHz timer count would be approximately the same regardless of clock rate to reduce/eliminate the impact of timer granularity. This also means the test will ‘bake’ for 30 seconds at each clock rate, where before it spent 30 seconds at 32MHz, then ~3 seconds each at 288MHz and 320MHz. We’ll see the impact of that pretty readily.

Also added calculation of the clocks per iteration, so I don’t have to calculate that manually.

The test does a trial at 32MHz, 288MHz and 320MHz, in that order, and I see results like this, where the first number is the return value of metal_clock_set_rate_hz() and the second number is clock cycles per call to mrand48() * 10:

starting test
32000000: 946
288000000: 946
320000000: 945 : Consistent 94.5/94.6 clocks per loop. Awesome

32000000: 946
288000000: 946
-42450944: 0 : Error setting 320MHz; let’s hit the MCU with dust-off to cool it down

32000000: 946
288000000: 946
320000000: 946 : ah, that’s better, a little more dust-off

32000000: 946
288000000: 946
320000000: 946 : still good, no more dust-off

32000000: 946
288000000: 945
320000000: 945 : still good!

32000000: 946
288000000: 945
-42450944: 0 : whoops, warmed-up again (continued trials have the same result)

First of all, with a better test that eliminates granularity errors in the measurement, I get the same result, 94.6 clocks per call to mrand48(). Second, heating seems to make the PLL fail to lock at 320MHz.

I think I’ve cleared-up my own confusion here; sorry for the fire-drill.

Dana

Here’s the code for the test function:

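/* Freedom Metal headers (names assumed from the SDK layout) */
#include <metal/clock.h>
#include <metal/machine.h>	/* __metal_dt_clock_4 */
#include <metal/rtc.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>		/* mrand48() */

void
do_run(long clock_rate)
{
	struct metal_rtc *rtcd;
	long new_cpu_clock;
	uint64_t t1, t2;
	static long last_cpu_clock = 0;
	float cycles;
	int32_t loop_count;

	/* returns the rate actually set; a negative value indicates failure */
	new_cpu_clock = metal_clock_set_rate_hz(&__metal_dt_clock_4.clock,
	  clock_rate);

	last_cpu_clock = new_cpu_clock;

	/* estimate loop count for 30 seconds */
	loop_count = ((double)new_cpu_clock * 30.0) / 95.0;

	/* time the loop with the 32.768kHz RTC */
	rtcd = metal_rtc_get_device(0);
	t1 = metal_rtc_get_count(rtcd);
	for (int32_t i = 0; i < loop_count; i++) {
		// (void) metal_gpio_toggle_pin(gpd, 5);
		(void)mrand48();
	}
	t2 = metal_rtc_get_count(rtcd);

	/* RTC ticks -> seconds -> CPU cycles, divided by the number of calls */
	cycles = (((double)(t2 - t1) / 32768.0) * new_cpu_clock) / loop_count;
	printf("%ld: %d\n", new_cpu_clock, (int)(cycles * 10.0));
}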

DanaK6JQ · May 2, 2020, 5:25pm

@deadcommon I believe I’m seeing an error return when attempting to metal_clock_set_rate_hz() to 320MHz once the die has warmed-up. I haven’t dug into the FE310 spec and the Freedom Metal docs don’t really tell me what a negative return value means, but I’m thinking that’s an error when I see -42450944

Cheers,
Dana

deadcommon · May 2, 2020, 9:01pm

Not aware of that but thanks for the heads up. On the -G003 chip with the 64K DTIM the fmax craps out at a lot lower frequency: around 275 MHz typically in the few I have tested. It seems that it is better to “sneak up” on the desired frequency although that doesn’t necessarily make a lot of sense. Going from 16MHz (default) clock to > 300MHz is guaranteed to fail whereas small steps might get you a lock that stays locked. Self heating is going to be the issue - with > 150mA quoted on the 1.8V rail at 300 MHz I believe.
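The “sneak up” approach could be sketched roughly as below. Everything here is an assumption for illustration: the set_rate_fn callback stands in for a thin wrapper around metal_clock_set_rate_hz(), a non-positive return is treated as a failed lock (based only on the negative value seen earlier in this thread), and fake_set_rate() is a host-side stand-in for trying the logic without hardware.

```c
#include <stddef.h>

/* Hypothetical callback: requests a rate and returns the rate achieved,
 * or a non-positive value on failure (assumption, see lead-in). */
typedef long (*set_rate_fn)(long requested_hz, void *ctx);

/* Step from start_hz up to target_hz in step_hz increments.
 * Returns the last rate that was set successfully, or 0 if none was. */
long ramp_clock(set_rate_fn set_rate, void *ctx,
		long start_hz, long target_hz, long step_hz)
{
	long last_good = 0;
	long request = start_hz;

	if (step_hz <= 0 || start_hz <= 0 || target_hz < start_hz)
		return 0;

	for (;;) {
		long got = set_rate(request, ctx);

		if (got <= 0)
			break;		/* treat as a failed lock; stop climbing */
		last_good = got;
		if (request >= target_hz)
			break;		/* reached the target */
		/* clamp the final step so we never overshoot the target */
		if (target_hz - request < step_hz)
			request = target_hz;
		else
			request += step_hz;
	}
	return last_good;
}

/* Host-side stand-in: succeeds up to a ceiling passed through ctx. */
static long fake_set_rate(long requested_hz, void *ctx)
{
	long ceiling_hz = *(const long *)ctx;

	return (requested_hz <= ceiling_hz) ? requested_hz : -1;
}
```

After a failed step the actual clock state is unknown, so a caller would probably want to request the returned last-good rate again before carrying on.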

pds · May 4, 2020, 1:37am

Might not be entirely a thermal issue (i.e., upsetting a bandgap or vco in the pll). At high speed–with higher current and power consumption–a surge in current such as from a digital transition of programming low-speed to high-speed clock could momentarily, ever so briefly (such as µs), cause a droop in voltage which might partially or completely reset or otherwise disturb some logic circuits. On the 1.8V line there’s not much headroom or margin from droop. Supply regulation and filtering must be done pretty well…
Hint: put a scope on 1.8V and norm-trigger just under that level, to catch any droops or sags.
Then: plot 1.8V quiescent current (in mA) versus clock rate (in MHz); linear/cmos?

deadcommon · May 4, 2020, 5:59am

@pds That’s a nice idea but I really don’t think so in this case. I believe the clock systems are powered from the AON domain and this is from the 1.8V bus. There is only the CPU on this bus and it is well bypassed - if the system were that sensitive to tiny variations in voltage, it would barely be viable. This would be iatrogenic - where the chip is inducing instabilities of its own accord. But I do like the way you are thinking. I don’t think there was ever very much effort put into the characterization of operation at the different corners of the envelope - just as there was no binning done. The plan originally I think was to bolt on a 16MHz oscillator/crystal and set the PLL coefficients to run at 100 MHz, for best results. The extended operation out to 300 MHz was a bonus.

DanaK6JQ · May 4, 2020, 4:53pm

My -G002 will run all day at 288MHz (in a room temperature office), it’s just not so reliably happy above that. Like you point out, I don’t think the > 150MHz clock rate was a design goal. This particular experiment was initially just poking around and once I understood what I was seeing, the ITIM smokes along just fine. I was curious if there were a bottleneck with the QSPI path to external flash. None that impacts this simple test - the mrand48() function surely lives in the ITIM.
