Benchmarking Security in the HiFive1

@bruce For part of my project, I need to benchmark the difference between running in machine mode versus user mode. It seems that there should be a performance hit from PMP checking while in u mode. My first thought was to run a sorting algorithm once in m mode and once in u mode and compare the instruction counts using the mhpmcounter and hpmcounter registers. Does this seem like a correct and simple way to compare the two?
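For reference, the HiFive1 is a 32-bit core, so each 64-bit counter is split across two CSRs (for example `mcycle`/`mcycleh` in machine mode and `cycle`/`cycleh` in user mode). A consistent read therefore needs a high/low/high retry loop. Below is a minimal sketch in Rust, not the code used in this thread: the function names are hypothetical, inline assembly is assumed to be available on the target, and user-mode access to `cycle` is assumed to have been enabled through `mcounteren`.

```rust
/// Read a 64-bit counter exposed as two 32-bit halves.
/// Retries when the high half changes between the two high reads,
/// which means the low half wrapped around in the meantime.
fn read_counter64(
    mut read_hi: impl FnMut() -> u32,
    mut read_lo: impl FnMut() -> u32,
) -> u64 {
    loop {
        let hi = read_hi();
        let lo = read_lo();
        if read_hi() == hi {
            return (u64::from(hi) << 32) | u64::from(lo);
        }
    }
}

/// Machine-mode cycle counter (hypothetical helper, RV32 targets only).
#[cfg(target_arch = "riscv32")]
fn read_mcycle() -> u64 {
    use core::arch::asm;
    read_counter64(
        || { let v: u32; unsafe { asm!("csrr {0}, mcycleh", out(reg) v) }; v },
        || { let v: u32; unsafe { asm!("csrr {0}, mcycle", out(reg) v) }; v },
    )
}

/// User-mode view of the same counter (hypothetical helper, RV32 targets only).
/// Assumes machine mode has set the CY bit in `mcounteren`; otherwise the
/// read traps as an illegal instruction.
#[cfg(target_arch = "riscv32")]
fn read_cycle() -> u64 {
    use core::arch::asm;
    read_counter64(
        || { let v: u32; unsafe { asm!("csrr {0}, cycleh", out(reg) v) }; v },
        || { let v: u32; unsafe { asm!("csrr {0}, cycle", out(reg) v) }; v },
    )
}
```

The same pattern would apply to `minstret`/`minstreth` and `instret`/`instreth` for instruction counts.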

Sorting could work, but it might be hard to find a reasonable algorithm that takes long enough to run with maximum 16 KB of data.

My own counting primes benchmark takes quite a bit of time without using a huge amount of RAM: http://hoult.org/primes.txt

I’m pretty sure I know what your results will be, but I won’t spoil your fun :slight_smile: It would certainly be interesting if I’m wrong.

Thanks for the link. I will port this concept to Rust and give it a shot.

I haven’t tested this on the HiFive1B yet, but it should do what I need. (I won’t be printing the values out)

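```rust
fn main() {
    // primes[i] holds i while i is still a prime candidate, 0 otherwise.
    let mut primes: [usize; 1000] = [0; 1000];

    // 0 and 1 are not prime, so candidates start at 2.
    for i in 2..primes.len() {
        primes[i] = i;
    }

    // For each surviving candidate, clear its multiples.
    for i in 0..primes.len() {
        let factor = primes[i];
        if factor != 0 {
            sieve(&mut primes, factor);
        }
    }

    // Print the survivors (skipped in the timed build).
    for i in 0..primes.len() {
        if primes[i] != 0 {
            println!("{}", primes[i]);
        }
    }
}

/// Zero every entry that is a multiple of `factor`, other than `factor` itself.
fn sieve(primes: &mut [usize], factor: usize) {
    for i in 0..primes.len() {
        let value = primes[i];
        if value != 0 && value != factor && value % factor == 0 {
            primes[i] = 0;
        }
    }
}
```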

@bruce I ran a sieve on a 1000-element array with 100 iterations in user mode and in machine mode. Here are my results:

Total Instructions: 49176310
Avg Cycle Count M-Mode: 75530226
Avg Cycle Count U-Mode: 76465046

1.238% performance loss

:slightly_smiling_face:

What are you actually timing? Does it include the I/O? I don’t see any timing code in what you posted.

Also, the whole point of a “sieve” algorithm is that you don’t need to do any division operations.
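To illustrate the point, here is a minimal sketch of a Sieve of Eratosthenes in Rust that crosses off multiples by stepping through the table with additions only. It is not the linked benchmark; the function name `count_primes` and the const-generic table size are assumptions chosen so that the table lives on the stack without an allocator.

```rust
/// Count the primes below `N` with a Sieve of Eratosthenes.
/// Multiples of each prime are marked by repeated addition, so the
/// inner loop contains no division or modulus.
fn count_primes<const N: usize>() -> usize {
    // is_composite[i] becomes true once i is known to have a smaller factor.
    let mut is_composite = [false; N];
    let mut count = 0;

    for n in 2..N {
        if is_composite[n] {
            continue;
        }
        count += 1;

        // Walk n+n, n+n+n, ... and mark each multiple as composite.
        let mut multiple = n + n;
        while multiple < N {
            is_composite[multiple] = true;
            multiple += n;
        }
    }
    count
}
```

For a 1000-entry table, `count_primes::<1000>()` returns 168, the number of primes below 1000.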

I have never done this, so I wouldn’t be surprised if I did something wrong. I didn’t use mtime or time registers, just the mcycle/cycle and minstret/instret. The counts are to complete the prime sieve only. No division operations? Do you mean dividing the cycle count by clock frequency? Also, I planned on timing the UART console printing as a separate test. The only difference being the system call process.

Division.

Haha, oh the modulus. Maybe not a true sieve, but it did the job. Are the results surprising to you, or is it what you expected? Again, this timing was around the memory accesses of the sieve only.

I don’t know any reason U mode would run slower than M mode for pure computation.

Would the pmp checks not affect it?

I would be shocked if PMP caused memory accesses to take extra clock cycles. Certainly a CPU designer could do that, but I’d expect they are striving not to.

I think there is a flaw in my methodology. I ran each test as a separate program, and I’ve been told that the timing of cache accesses could vary because of this. I got advice to run both the U-mode and M-mode tests in one binary and throw away the first iteration. I will post the updated results.

Results are in. Below is the average of 100 iterations, with the program running through the sieve once in each mode before cycle recording starts.
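The methodology described here (untimed warm-up run, then an average over the timed runs) could be sketched roughly as follows. This is an illustration under assumptions, not the code that produced the numbers in this thread: `average_ticks` is a hypothetical helper, and the counter is passed in as a closure so the same function could wrap either the machine-mode or the user-mode cycle CSR.

```rust
/// Average counter ticks per run over `runs` timed runs, after `warmup`
/// untimed runs that let the code and data get cached first.
fn average_ticks(
    warmup: u32,
    runs: u32,
    mut read_counter: impl FnMut() -> u64,
    mut workload: impl FnMut(),
) -> u64 {
    if runs == 0 {
        return 0;
    }

    // Warm-up runs are executed but deliberately not measured.
    for _ in 0..warmup {
        workload();
    }

    let mut total: u64 = 0;
    for _ in 0..runs {
        let start = read_counter();
        workload();
        // wrapping_sub keeps the delta correct if the counter wraps.
        total += read_counter().wrapping_sub(start);
    }
    total / u64::from(runs)
}
```

A call such as `average_ticks(1, 100, read_cycles, run_sieve)` would match the setup above, where `read_cycles` and `run_sieve` stand in for whatever counter read and workload the benchmark uses.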

M-Mode Cycles: 6175442
U-Mode Cycles: 6172695
0.044% difference = negligible
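As a check on the arithmetic, the percentages quoted in this thread can be reproduced with a small helper. The function name `percent_diff` is hypothetical, and the M-mode figure is assumed to be the baseline.

```rust
/// Difference of `measured` relative to `baseline`, in percent.
/// Positive means `measured` took more cycles than `baseline`.
fn percent_diff(baseline: u64, measured: u64) -> f64 {
    (measured as f64 - baseline as f64) / baseline as f64 * 100.0
}
```

With the first set of numbers, `percent_diff(75_530_226, 76_465_046)` comes out at roughly 1.238. With the second set, `percent_diff(6_175_442, 6_172_695)` is roughly -0.044, so the U-mode run used marginally fewer cycles than the M-mode run.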

Yes, you absolutely don’t want to count cycles waiting for code to be loaded from SPI flash the first time it is used.

So that explains why the cycles were close to twice the number of instructions executed before, which is pretty unusual.

But now I’m even more confused how 49176310 instructions can run in 6175442 cycles.

That’s because I made the array smaller in the compiled program.

So how many instructions now?

Total instructions: 3997027
CPI ~1.545
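The CPI figure is simply cycles divided by instructions retired. A minimal sketch of the calculation (the function name `cpi` is hypothetical):

```rust
/// Cycles per instruction from raw counter totals.
fn cpi(cycles: u64, instructions: u64) -> f64 {
    cycles as f64 / instructions as f64
}
```

Here `cpi(6_175_442, 3_997_027)` gives roughly 1.545. Applying the same formula to the earlier M-mode run, `cpi(75_530_226, 49_176_310)`, gives roughly 1.536.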

EDIT:
I also just benchmarked the stock bootloader. I know that it sends some commands to the ESP32 chip, but I’m not sure what else it may be doing. The average boot time was 3.157 seconds. To measure it, I set a GPIO pin high, then reset the board and measured how long it took before the pin went high again.

My u-mode system call setup takes 339 instructions more than m-mode calling functions directly (where m-level privilege is required).

So you weren’t timing just the computation, but also some “system calls”?