Load/Store pair

Has there been any consideration given to adding load/store (or push/pop) pair instructions?

No point, of course, if it needs a 32-bit instruction (because there’s already a 16-bit single-register load/store), but a 16-bit instruction would be great, even if restricted to an adjacent even/odd pair of registers stored to a naturally aligned address (so the access can’t cross a page boundary).
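To illustrate the alignment point, here is a minimal sketch, not taken from any spec. It assumes 4 KiB pages and a pair of 32-bit registers (8 bytes in total); the names SKETCH_PAGE_SIZE, is_aligned and crosses_page are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* True if addr is a multiple of size (size must be a power of two). */
static bool is_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}

/* True if the byte range [addr, addr + size) touches two different pages. */
static bool crosses_page(uintptr_t addr, size_t size)
{
    return (addr / SKETCH_PAGE_SIZE) != ((addr + size - 1) / SKETCH_PAGE_SIZE);
}
```

An 8-byte pair at 0xFF8 is aligned and stays inside one page, while the same pair at 0xFFC is only 4-byte aligned and straddles the boundary at 0x1000. Because the page size is a multiple of 8, any 8-byte-aligned pair ends on or before the last byte of its page.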

The ARM/Thumb push/pop multiple instructions are wonderfully handy, but I understand they cause a lot of microarchitectural issues, such as the need for a state machine and for saving state on an exception. That is why ARM is now deprecating them and replacing them with load/store pair across all the instruction sets (T32, A32, A64).

We discussed such instructions at length when designing the ISA. The load/store pair instructions are better than full-on load-multiple/store-multiple, but still cause a bunch of microarchitectural issues, so we did not include them. The save-restore optimization is actually invoked by -msave-restore, rather than -Os, as this is still an experimental ABI.

Ah, cool, thanks.

Now to find where the compiler flags are for newlib (etc) and rebuild…

I got spike going, though the instructions produce a 64-bit build without RVC support, so I had to change that. I was afraid it would be really slow, but it seems to be about the same as qemu emulating ARM. On the other hand, qemu riscv is nearly as fast as a real Raspberry Pi 2! I suspect that’s because there is no need for condition code emulation.

Testing on:

int fib(int n){
  return n<2 ? n : fib(n-2) + fib(n-1);
}

Default (with -O) is 72 bytes. With “-march=RV32IMC -msave-restore” it’s 30 bytes, and 45% slower on spike. It will be interesting to see the effect on real hardware when the HiFive1 arrives.

Just RVC without -msave-restore is 40 bytes, with no speed difference from the non-RVC build. So -msave-restore is a bit of a speed trade-off, but size is king in embedded. Thumb2 is 28 bytes using push/pop, so RVC with millicode is very close!
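For anyone who wants the relative savings rather than raw byte counts, here is a small sketch of the arithmetic. The sizes are the ones quoted above; the helper names percent_saved and print_size_summary are made up for this example.

```c
#include <stdio.h>

/* Percentage by which `size` is smaller than `baseline`. */
static double percent_saved(double baseline, double size)
{
    return 100.0 * (baseline - size) / baseline;
}

/* Print each build's size relative to the 72-byte default build. */
static void print_size_summary(void)
{
    const double baseline = 72.0; /* default build with -O */

    printf("RVC only:            40 bytes, %.1f%% smaller\n",
           percent_saved(baseline, 40.0));
    printf("RVC + -msave-restore: 30 bytes, %.1f%% smaller\n",
           percent_saved(baseline, 30.0));
}
```

That works out to roughly 44% smaller for RVC alone and roughly 58% smaller with -msave-restore added, while the 30-byte version is about 7% larger than the 28-byte Thumb2 one.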

This is a very extreme case, of course. Real code will see both smaller size savings and less slowdown.
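One way to see why it is so extreme (my reading, not something measured above): the function body is tiny and it calls itself constantly, so register save/restore is paid on almost every few instructions. A sketch that counts the calls, using a hypothetical fib_counted and a global counter fib_calls:

```c
/* Number of times fib_counted has been entered since the last reset. */
static unsigned long fib_calls = 0;

/* Same recursion as fib above, but counting every entry. */
static int fib_counted(int n)
{
    fib_calls++;
    return n < 2 ? n : fib_counted(n - 2) + fib_counted(n - 1);
}
```

With fib_calls reset to 0, fib_counted(10) returns 55 after 177 calls, and every one of those calls runs the prologue and epilogue. Code with larger function bodies spends proportionally less time there.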