No, the other was computation only. I rewrote the program to check my syscall overhead. Next, I'm going to time a context switch, assuming that the PMP would need to be reconfigured.
My original test used the debug build, so I ran the sieve program again as a release build with optimizations.
cycles: 186673 instructions: 130662 CPI: 1.429
This is like 30 times as fast as before! I had no idea that using a release build (which takes longer to compile) would affect it that much. I need to test my system call again with a release build.
I haven’t been keeping up with this thread… are you talking about Freedom-E-SDK’s CONFIGURATIONs? (If not, please disregard the following.)
If so, IIRC, -O2. Consider trying -O3, or even -Ofast. Take a look at the GCC docs to get an idea what these mean.
omg. There’s never any good reason to use -O0. GCC makes just awful code with that, and it’s not even easier to debug. Always at least -O1!
I’m using the Rust toolchain (cargo), whose compiler is built on LLVM.
I got my system calls down to 314 instructions at 2290 cycles (~7.16us @ 320MHz). The context switch, counting only the re-configuration of the PMP registers, came in at about 5k cycles, which is ~16us @ 320MHz. I don’t really need a context switch in my programming, and a real switch would be more involved than just changing the PMP configs and addresses. I was just curious.
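For anyone following the arithmetic in this thread, here is a minimal sketch of the two conversions being used: cycles per instruction, and cycles to microseconds at a given core clock. The function names are made up for illustration and are not from any crate.

```rust
/// Cycles per instruction, from the deltas of the cycle and
/// retired-instruction counters over the measured region.
pub fn cpi(cycles: u64, instructions: u64) -> f64 {
    cycles as f64 / instructions as f64
}

/// Elapsed time in microseconds for `cycles` at a core clock of `clock_hz`.
pub fn cycles_to_us(cycles: u64, clock_hz: u64) -> f64 {
    cycles as f64 * 1_000_000.0 / clock_hz as f64
}
```

With the numbers above, cpi(186673, 130662) is about 1.43 for the sieve, while cpi(2290, 314) is about 7.3 for the system call, and cycles_to_us(2290, 320_000_000) is about 7.16.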
That seems like a surprising mismatch between instructions and cycles.
I wonder if the code is reading constants out of the SPI flash – including such things as virtual function dispatch tables, if Rust has those. It’s really really slow to read data from the flash as there is no dcache.
I don’t know how Rust sets things up … or Metal for that matter … but in the old sdk there is a setting at line 84 in https://github.com/sifive/freedom-e-sdk/blob/v1_0/bsp/env/freedom-e300-hifive1/init.c:
// Div = f_sck/2
SPI0_REG(SPI_REG_SCKDIV) = 8;
With the 8 setting and a 256 MHz main clock it runs the SPI at 32 MHz which is excessively slow for loading code from flash to icache, but more importantly for loading constant data. The flash spec says it can run at 133 MHz. For my own use at 256 MHz I change the setting to 2 (as in the comment) to run the flash at 128 MHz. Maybe for 320 MHz you’d need to use 3, giving 107 MHz. Also, I think the flash is quad SPI but is only being run at single.
If you can find where this is set up in Rust then maybe you can experiment with this and it might make your code much faster.
Or else make sure you’re not doing any data loads from the flash memory range by moving any frequently accessed constant data to RAM – you can get the linker to do this, or else simply declaring it non-constant will do it too.
I believe the cycle disparity is definitely flash/cache related, but really it is above my pay-grade at this point. There is a rust e310x and e310x-hal crate, but most of it looks like something that came from the Roswell crash site. https://github.com/riscv-rust/e310x/blob/master/src/common/qspi0/sckdiv.rs
I might get to that level one day. For now, it works the way I need it to.
I may not have run enough iterations when timing my system calls. I rewrote the program slightly and made sure I used the release profile. 1000 iterations came back as 134 instructions and 194 cycles. That makes much more sense.
I also added an I2C temp sensor, a PCT2075, and read the temperature several times in machine mode, then from U-mode with system calls. With optimizations, U-mode came back as only 100 cycles more per read.
That is much more in line with what I’d expect.
Measuring cycles is tricky. You have to make sure to pre-run all the code you’re measuring – including the measuring code itself.
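As a rough sketch of that advice (not anyone's actual test code from this thread), the harness below pre-runs both the code under test and the counter reads, estimates the cost of the measurement itself, and then averages over many iterations. The name measure and the read_counter closure are hypothetical. On real hardware the closure would wrap a read of the cycle counter (mcycle on RISC-V), and here it is passed in so the logic can be tried on a host machine.

```rust
/// Average cost of `work` in counter ticks over `iterations` runs.
/// `read_counter` is a hypothetical stand-in for reading a cycle counter.
pub fn measure<C, W>(mut read_counter: C, mut work: W, iterations: u64) -> u64
where
    C: FnMut() -> u64,
    W: FnMut(),
{
    assert!(iterations > 0, "need at least one iteration");

    // Warm-up: run the measured code and the measuring code once so
    // that both have been fetched before the timed region starts.
    work();
    let _ = read_counter();
    let _ = read_counter();

    // Estimate the overhead of the measurement itself as the
    // difference between two back-to-back counter reads.
    let t0 = read_counter();
    let t1 = read_counter();
    let overhead = t1.wrapping_sub(t0);

    // Timed region.
    let start = read_counter();
    for _ in 0..iterations {
        work();
    }
    let end = read_counter();

    // Subtract the read overhead once, then average per iteration.
    end.wrapping_sub(start).saturating_sub(overhead) / iterations
}
```

Loop overhead is still included in the result, so for very short pieces of code it may be worth measuring an empty loop as well and subtracting that.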