@bruce For part of my project, I need to benchmark the difference between running in machine mode versus user mode. It seems that there should be a performance hit from PMP checking while in u mode. My first thought was to run a sorting algorithm once in m mode and once in u mode and compare the instruction counts using the mhpmcounter and hpmcounter registers. Does this seem like a correct and simple way to compare the two?

Sorting could work, but it might be hard to find a reasonable algorithm that takes long enough to run with a maximum of 16 KB of data.

My own counting primes benchmark takes quite a bit of time without using a huge amount of RAM: http://hoult.org/primes.txt

I’m pretty sure I know what your results will be, but I won’t spoil your fun. It would certainly be interesting if I’m wrong.

Thanks for the link, I will port this concept to Rust and give it a shot.

I haven’t tested this on the HiFive1B yet, but it should do what I need. (I won’t be printing the values out.)

```
fn main() {
    // Candidate table: primes[i] holds i while i is still a candidate, 0 once eliminated.
    let mut primes: [usize; 1000] = [0; 1000];
    for i in 2..primes.len() - 1 {
        primes[i] = i;
    }
    // Use every surviving value as a factor and strike out its multiples.
    for i in 0..primes.len() {
        let factor = primes[i];
        if factor != 0 {
            sieve(&mut primes, factor);
        }
    }
    for i in 0..primes.len() {
        if primes[i] != 0 {
            println!("{}", primes[i]);
        }
    }
}

// Zeroes every entry that is a proper multiple of `factor`.
fn sieve(primes: &mut [usize], factor: usize) {
    for i in 0..primes.len() {
        let value = primes[i];
        if value != 0 && value != factor && value % factor == 0 {
            primes[i] = 0;
        }
    }
}
```

@bruce I ran a sieve on a 1000-element array with 100 iterations in user mode and in machine mode. Here are my results:

```
Total Instructions: 49176310
Avg Cycle Count M-Mode: 75530226
Avg Cycle Count U-Mode: 76465046
1.238% performance loss
```

What are you actually timing? Does it include the I/O? I don’t see any timing code in what you posted.

Also, the whole point of a “sieve” algorithm is that you don’t need to do any division operations.

I have never done this, so I wouldn’t be surprised if I did something wrong. I didn’t use the `mtime` or `time` registers, just the `mcycle/cycle` and `minstret/instret` registers. The counts are to complete the prime sieve only. No division operations? Do you mean dividing the cycle count by clock frequency? Also, I planned on timing the UART console printing as a separate test. The only difference being the system call process.

Division.

Haha, oh the modulus. Maybe not a true sieve, but it did the job. Are the results surprising to you, or is it what you expected? Again, this timing was around the memory accesses of the *sieve* only.
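For comparison, here is a minimal sketch of a sieve in the sense meant above, where composites are struck out by stepping through the table in increments of the factor, so no `%` is needed. The names `sieve_in_place` and `count_primes` and the `bool` table layout are illustrative assumptions, not the code that was benchmarked in this thread.

```rust
/// Marks composites in place: after the call, `is_prime[n]` is true
/// exactly when `n` is prime, for every index `n` of the slice.
/// (Hypothetical helper; the strike-out loop uses only addition.)
fn sieve_in_place(is_prime: &mut [bool]) {
    let len = is_prime.len();
    // 0 and 1 are not prime; every other index starts as a candidate.
    for (n, flag) in is_prime.iter_mut().enumerate() {
        *flag = n >= 2;
    }
    let mut factor = 2;
    while factor * factor < len {
        if is_prime[factor] {
            // Start at factor^2: smaller multiples were already struck
            // out while processing smaller factors.
            let mut multiple = factor * factor;
            while multiple < len {
                is_prime[multiple] = false;
                multiple += factor;
            }
        }
        factor += 1;
    }
}

/// Counts the entries still marked prime.
fn count_primes(is_prime: &[bool]) -> usize {
    is_prime.iter().filter(|&&p| p).count()
}
```

With a 1000-entry table, `count_primes` should report 168, the number of primes below 1000. Because the table is a caller-provided slice, the same function works on a fixed-size array without heap allocation.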

I don’t know any reason U mode would run slower than M mode for pure computation.

Would the PMP checks not affect it?

I would be shocked if PMP caused memory accesses to take extra clock cycles. Certainly a CPU designer could do that, but I’d expect they are striving not to.

I think there is a flaw in my methodology. I ran each test as a separate program. I’ve been told the timing of cache accesses could vary due to this. I got advice to run both the U-mode and M-mode tests in one binary and throw away the first iteration. I will post the updated results.

Results are in. Below is the average of 100 iterations, with the program running through the sieve once in each mode before cycle recording starts.

```
M-Mode Cycles: 6175442
U-Mode Cycles: 6172695
0.044% difference = negligible
```
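The measurement approach described above (one untimed warm-up run, then an average over the timed runs) could be sketched roughly as follows. The function `average_cycles` and its `read_counter` closure are hypothetical: `read_counter` stands in for whatever reads `mcycle` or `cycle` on the target, and this is not the harness that produced the numbers in this thread.

```rust
/// Runs `workload` once as a discarded warm-up, then `iterations` more
/// times, and returns the average counter delta per timed run.
/// `read_counter` is an assumed stand-in for reading `mcycle`/`cycle`.
fn average_cycles<C, W>(mut read_counter: C, mut workload: W, iterations: u64) -> u64
where
    C: FnMut() -> u64,
    W: FnMut(),
{
    assert!(iterations > 0, "need at least one timed iteration");

    // Warm-up: not timed, so the first load of the code is excluded.
    workload();

    let mut total: u64 = 0;
    for _ in 0..iterations {
        let start = read_counter();
        workload();
        let end = read_counter();
        // wrapping_sub keeps the delta correct if the counter wraps once.
        total += end.wrapping_sub(start);
    }
    total / iterations
}
```

Running the M-mode and U-mode workloads through the same function in one binary keeps the two measurements comparable, since both see the same warm-up treatment.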

Yes, you absolutely don’t want to count cycles waiting for code to be loaded from SPI flash the first time it is used.

So that explains why the cycles were close to twice the number of instructions executed before, which is pretty unusual.

But now I’m even more confused how 49176310 instructions can run in 6175442 cycles.

That’s because I made the array smaller in the compiled program.

So how many instructions now?

Total instructions: 3997027

CPI ~1.545
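The arithmetic behind the CPI figure and the percentages quoted earlier can be checked with a couple of small helpers. The function names `cpi` and `percent_diff` are illustrative; the inputs in the note below are the counts posted in this thread.

```rust
/// Cycles per instruction.
fn cpi(cycles: u64, instructions: u64) -> f64 {
    cycles as f64 / instructions as f64
}

/// Relative difference of `other` against `baseline`, in percent.
/// Positive means `other` took more cycles than `baseline`.
fn percent_diff(baseline: u64, other: u64) -> f64 {
    (other as f64 - baseline as f64) / baseline as f64 * 100.0
}
```

Here `cpi(6175442, 3997027)` comes out at about 1.545, `percent_diff(75530226, 76465046)` at about 1.238, and `percent_diff(6175442, 6172695)` at about -0.044, so in the second run U-mode was marginally faster rather than slower.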

EDIT:

I also just benchmarked the stock bootloader. I know that it sends out some commands to the ESP32 chip, but I’m not sure what else it may be doing. The average boot time was 3.157 seconds. To measure it, I set a GPIO pin high, then reset the board and measured how long it took before the pin went high again.

My u-mode system call setup takes 339 instructions more than m-mode calling functions directly (where m-level privilege is required).

So you weren’t timing just the computation, but also some “system calls”?