Function calls seem to greatly slow down code

Hi,
While working on a project, I deciding to make execution time measurements by reading the mcycle register. The result was mediocre and after investigation it seems that function calls are the problem.

I created a small project to test that out :

/* Copyright 2019 SiFive, Inc */
/* SPDX-License-Identifier: Apache-2.0 */

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>     /* for uinptr_t */
#include <time.h>
#include <metal/cpu.h>
#include <metal/machine.h>
#include <metal/csr.h>

uintptr_t cycle_s, cycle_e;

int main() {
	int x =0;

	cycle_s=( unsigned int *)malloc(sizeof( unsigned int));
	cycle_e=( unsigned int *)malloc(sizeof( unsigned int));
	METAL_CPU_GET_CSR(mcycle, cycle_s);
	int b =fun(x);
	METAL_CPU_GET_CSR(mcycle, cycle_e);

	printf(" Elapsed time = %lu clock cycles or %lu microseconds \r\n",(unsigned )(cycle_e-cycle_s),(unsigned )(cycle_e-cycle_s)/16);

	free(cycle_s);
	free(cycle_e);
	return 0;
}

with fun a trivial function described in fun.c :

/*
 * fun.c
 *
 *  Created on: 21 Jul 2023
 *      Author: baf
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <metal/cpu.h>
#include <metal/machine.h>
#include <metal/timer.h>
#include <metal/led.h>
#include <metal/csr.h>
extern uintptr_t cycle_s, cycle_e;


int fun(int a ){
   // METAL_CPU_GET_CSR(mcycle, cycle_s);
	 a ++;
    //METAL_CPU_GET_CSR(mcycle, cycle_e);
	return a;
}

When measuring the elapsed time inside the fun(), it of course only take 4 clock cycles.
However measuring the elapsed time inside main gives 2414 clock cycles.

I did not see any extra instructions in disassembly nor a jump to the interrupt handler or anything when going step by step with the debugger. I have no idea what happens during these thousands of clock cycle

If anyone has already experimented this issue or has any ideas, I would be grateful.

It may be caused by the fact the chip reads code from SPI flash which is extremely slow. Consequent reads are a faster since they don’t interrupt a SPI read transaction. Maybe when everything is inside main some sort of a prefetch mechanism works which puts everything to a cache (and you pay this huge read cost earlier), but when the jump occurs the jump destination may be too far from the current read offset and therefore a new read transaction is started.