Unpredictable execution time when executing code from U54 ITIM memory (data and stack are also in ITIM)

Hello everyone,
I’m executing a bare-metal task (matrix multiplication) from ITIM.
I placed the code, data and stack in the ITIM of a U54 application core. The code contains a task (matrix multiplication) running in a loop. No other task is executing in the system (all other cores are busy-waiting or in WFI).

while (1) {
    Before_time = gettime();
    Matrix_mul();
    After_time = gettime();
    printf("Execution time %ld\n", After_time - Before_time); // execution time is not constant
}

I noticed that the task's execution time changes on every iteration. Could anyone help me understand why this happens? Since the data and stack are already in ITIM, shouldn't every loop iteration take the same time?
Thanks in advance

Are you hitting a breakpoint anywhere? Are you using freedom-metal? What is your platform, FPGA, simulation, something else?

Thank you for your reply.
I’m running custom code on the HiFive Unleashed board; I’m not using freedom-metal. Code running on the E51 does minimal initialization (UART etc.), copies the code and data into the U54_1 ITIM, and sets the stack pointer to a location in the U54_1 ITIM. The remaining cores execute WFI. The E51 then sends an IPI to U54_1.
The U54 core executes code from ITIM (data and stack are in ITIM as well). There are no breakpoints. I print the time taken by the task (matrix multiplication) over UART, but I get a different execution time on every loop iteration. I’m not sure whether this is correct behavior.
The U54 user manual says that the ITIM gives deterministic execution time. Is that true only for code executing from ITIM, or does it also apply when data/stack are placed in ITIM as well?
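For anyone reproducing this setup, here is a minimal sketch of the E51-side sequence described above (copy the image into U54_1's ITIM, then wake U54_1 with an IPI). It assumes the FU540 CLINT layout, where hart h's msip register sits at CLINT base + 4*h; the ITIM and CLINT addresses and the name load_and_wake are assumptions for illustration, not the OP's code, so check them against the FU540-C000 manual.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed FU540-C000 addresses (verify against the manual). */
#define CLINT_MSIP_BASE  0x02000000UL /* msip for hart h at base + 4*h */
#define U54_1_ITIM_BASE  0x01808000UL /* assumed start of U54_1's ITIM */
#define U54_1_HART_ID    1

/* Copy the task image into the target ITIM, then raise a software
 * interrupt (IPI) on the target hart. The destinations are parameters so
 * the routine can also be exercised against ordinary buffers off-target. */
static void load_and_wake(uint8_t *itim, const uint8_t *image, size_t len,
                          volatile uint32_t *msip)
{
    memcpy(itim, image, len);
#if defined(__riscv)
    /* Order the ITIM stores before the store to the CLINT msip register. */
    __asm__ volatile ("fence" ::: "memory");
#endif
    /* The woken hart should execute fence.i before jumping into the copied code. */
    *msip = 1;
}
```

On the board it would be called as load_and_wake((uint8_t *)U54_1_ITIM_BASE, image, image_len, (volatile uint32_t *)(CLINT_MSIP_BASE + 4 * U54_1_HART_ID)), where image and image_len are hypothetical linker-provided symbols for the task's load image.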

Hi @sunb,

Do you have branches or jumps in the code you measure? (I suspect there are some)
If so, you also need to disable the branch predictor on the core running your code.
On the PolarFire SoC, I successfully disabled it by doing:
write_csr(0x7c0, 1)

See Section 7.5 of the U54-MC datasheet for more information on how to disable the branch predictor.
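A minimal sketch of the CSR write above wrapped as a C helper. It assumes bare-metal code running in M-mode on the hart being measured and a GCC/Clang RISC-V toolchain; the WRITE_CSR macro and disable_branch_predictor name are hypothetical, and on a non-RISC-V host the write is only recorded in a variable so the helper can be compiled and checked off-target.

```c
#include <stdint.h>

/* Write `val` to CSR number `csr`. `csr` must be a literal such as 0x7c0,
 * because it is pasted directly into the instruction text. */
#if defined(__riscv)
#define WRITE_CSR(csr, val) \
    __asm__ volatile ("csrw " #csr ", %0" :: "r"((uintptr_t)(val)) : "memory")
#else
/* Off-target build: record the write so the helper can be checked on a host. */
static uintptr_t fake_csr_7c0;
#define WRITE_CSR(csr, val) ((void)(csr), fake_csr_7c0 = (uintptr_t)(val))
#endif

/* Set bit 0 of CSR 0x7c0 on the calling hart, as in the post above.
 * Assumption: this switches branch-direction prediction to a static mode;
 * check the datasheet for your core revision for exactly what it affects. */
static inline void disable_branch_predictor(void)
{
    WRITE_CSR(0x7c0, 1);
}
```

Call it once on the U54 before entering the timing loop, so every measured iteration runs with the same predictor configuration.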

Hi @atroger,
Thank you for your reply.
I think a “for loop” counts as a conditional branch/jump (please correct me if I’m wrong). The matrix multiplication task has for loops in it; apart from those (and the function return) there are no branches or jumps in the task.
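To illustrate the point about for loops, here is a hypothetical matrix_mul with fixed dimensions (the OP's Matrix_mul is not shown, so N and the element type are assumptions). Each loop ends in a conditional branch back to its top, and because the trip counts do not depend on the matrix contents, the taken/not-taken sequence is the same on every call.

```c
#include <stddef.h>

#define N 4 /* hypothetical size; the original Matrix_mul is not shown */

/* Plain triple-loop matrix multiply, c = a * b.
 * Each `for` compiles to a conditional branch (typically bne/blt on RISC-V)
 * whose outcome depends only on the loop counters, never on the data. */
static void matrix_mul(int a[N][N], int b[N][N], int c[N][N])
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            int sum = 0;
            for (size_t k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}
```

Identical branch outcomes do not by themselves guarantee identical timing: the predictor state left behind by whatever ran in between (printf, for example) can differ from one call to the next.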

Thanks for the link to the datasheet and for pointing out the branch predictor; I will look into it.

The SiFive manuals state that executing from ITIM is deterministic (which I take to mean executing code from ITIM), but in my case I placed the data in ITIM as well. Since code and data are in the same memory (i.e. ITIM, and assuming a single path for code and data), could that cause contention or pipeline stalls that make the execution time vary?

In theory, is executing code from ITIM (with data and stack there too) deterministic?
By deterministic I mean it gives the same execution time every time I run the code. Or can it vary?

I think “more deterministic” would be a better description. If code and data are not in TIM, then an L1 cache miss can cause anything from a few cycles of delay to get it from the L2 cache, to hundreds of cycles of delay to get it from DRAM. With everything in TIM that extreme source of variability doesn’t exist, but there are still other, minor sources of variability.
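A toy cost model of the point above, using made-up latencies rather than measured FU540 numbers: every access that misses L1 pays an L2 latency, and every one that also misses L2 additionally pays a DRAM latency. Putting code, data and stack in a TIM drives both miss counts to zero, which removes that term but not the other, smaller sources of variation.

```c
#include <stdint.h>

/* Illustrative model only, not measured data. */
typedef struct {
    uint64_t base_cycles; /* cycles if every access were a hit */
    uint64_t l1_misses;   /* accesses served by L2 (includes the DRAM ones) */
    uint64_t l2_misses;   /* accesses that also had to go to DRAM */
} access_profile;

static uint64_t estimated_cycles(access_profile p,
                                 uint64_t l2_latency, uint64_t dram_latency)
{
    return p.base_cycles + p.l1_misses * l2_latency + p.l2_misses * dram_latency;
}
```

For example, with hypothetical latencies of 20 (L2) and 200 (DRAM) cycles, a run with 1000 base cycles, 10 L1 misses and 2 of those also missing L2 costs 1000 + 10*20 + 2*200 = 1600 cycles, versus 1000 with everything in TIM. Since the miss counts depend on what was evicted by earlier code, they are what varies from run to run in the cached case.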

The E51 core is much better suited to running code in deterministic time because it has a DTIM in addition to the ITIM, and also has features such as turning off the branch predictor (which makes all taken (?) branches execute in the maximum amount of time, instead of usually executing in 1 cycle but sometimes not).

@bruce: You can also disable the branch predictor on the U54 cores, not only on the E51 (correct me if I’m wrong, but from what I’ve tested, both worked).

@sunb: Memory accesses to ITIM/DTIM are deterministic, yes. However, if your code’s execution is not deterministic (e.g. branch prediction is enabled), your total execution time won’t be deterministic either.

You can also use the L2-LIM instead of the DTIM/ITIM. Access to the L2-LIM is also deterministic. (See Section 13.2.1 of the datasheet regarding the L2-LIM : “The L2 LIM is an uncacheable port into unused L2 SRAM and provides deterministic access time”)
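A minimal sketch of carving the matrices out of the L2-LIM with a simple bump allocator instead of a linker script. The base address is my assumption for the FU540 memory map and the usable size is left to the caller; as I understand it, the usable LIM shrinks as more L2 ways are enabled as cache, so check the manual for your configuration. The names region and region_alloc are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed FU540-C000 L2-LIM base address (verify against the manual). */
#define L2_LIM_BASE 0x08000000UL

/* A fixed memory range handed out front to back, never freed. */
typedef struct {
    uint8_t *next;
    uint8_t *end;
} region;

/* Return `size` bytes aligned to `align` (a power of two), or NULL if the
 * region does not have enough space left. */
static void *region_alloc(region *r, size_t size, size_t align)
{
    uintptr_t p = ((uintptr_t)r->next + (align - 1)) & ~(uintptr_t)(align - 1);
    if (p > (uintptr_t)r->end || size > (uintptr_t)r->end - p)
        return NULL;
    r->next = (uint8_t *)(p + size);
    return (void *)p;
}
```

On the board this would be set up as region lim = { (uint8_t *)L2_LIM_BASE, (uint8_t *)L2_LIM_BASE + lim_size }, where lim_size is a hypothetical value taken from the manual for your L2 configuration.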

You may also consider running something simple and deterministic first to see if you get a consistent (expected) value, and then adding more complex code afterwards. Also, use the debugger to compare the value you are trying to print with the value that actually gets printed. printf() cannot always be trusted! :slight_smile:
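A minimal sketch of that advice: time a trivial baseline function first, then the real task, using the RISC-V cycle counter instead of gettime(), and keep only the minimum and maximum over many runs so printf stays outside the measured region. It assumes RV64 bare-metal code running in M-mode (where the cycle counter is readable); measure, cycle_range and empty_task are hypothetical names, and off-target the counter is replaced by a fake that advances by 100 on every read.

```c
#include <stdint.h>

/* Read the cycle counter (rdcycle on RV64). */
static inline uint64_t read_cycles(void)
{
#if defined(__riscv) && __riscv_xlen == 64
    uint64_t c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));
    return c;
#else
    /* Off-target stand-in so the harness can be checked on a host. */
    static uint64_t fake;
    return fake += 100;
#endif
}

typedef struct {
    uint64_t min;
    uint64_t max;
} cycle_range;

/* Trivial baseline: any spread measured here comes from the measurement itself. */
static void empty_task(void) {}

/* Run `fn` `runs` times and record the fastest and slowest run. Nothing is
 * printed inside the loop, so UART output does not disturb the measurement. */
static cycle_range measure(void (*fn)(void), unsigned runs)
{
    cycle_range r = { UINT64_MAX, 0 };
    for (unsigned i = 0; i < runs; i++) {
        uint64_t t0 = read_cycles();
        fn();
        uint64_t dt = read_cycles() - t0;
        if (dt < r.min) r.min = dt;
        if (dt > r.max) r.max = dt;
    }
    return r;
}
```

On the board you would call measure(empty_task, 1000) and then measure(Matrix_mul, 1000) (assuming Matrix_mul takes no arguments and returns nothing), and print both ranges afterwards. If the baseline already shows a spread, the variation is coming from the measurement rather than from the task.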

That feature might be available in recent U54 cores, such as the one in the PolarFire SoC, but not in the 4-year-old U54 in the HiFive Unleashed that the OP is using. But I don’t know for sure.

Well, the PolarFire SoC’s U54-MC is a “pre-19.02” version (i.e. v1p0), the same as the HiFive Unleashed, I think (?). But yeah, like you, I can’t guarantee that it’s actually implemented on the Unleashed (even though the same version number should mean the same implementation).