Arduino performance (again ;o)


(Richard) #1

Hello

My company was nice enough to buy me some HiFive-1 boards for testing…and as they are supported under the Arduino IDE, I thought I would do a quick test and compare against an original Arduino Uno.

A simple:

void loop() {
    digitalWrite(LED_BUILTIN, HIGH); // turn the LED on (HIGH is the voltage level)
    digitalWrite(LED_BUILTIN, LOW);  // turn the LED off by making the voltage LOW
}

results in the following DSO capture (HiFive-1 also running at 16MHz):

Even running it at 320MHz, the GPIO switching is only around 2.5 times faster than the original Arduino Uno at 16MHz.

Is it just the HiFive Arduino implementation that is so slow?

In a similar, year-old thread only math libs were compared…but this is just plain GPIO toggling. I would have expected a much faster switching time.

thanks in advance
richard


(Dave) #2

I’m afraid I can’t comment on the Arduino stuff, but I’ve run similar experiments in the past using soft-loops and PWM and the GPIO does indeed run far faster. N.B.: the GPIO pins have a maximum speed of 100MHz, documented in one of the manuals, which I’ve observed using PWM as the source.


(Liviu Ionescu) #3

Hi Richard,

I did a similar test, but with the code generated by the Eclipse template, adjusted to run in a continuous loop:

  led red_led { BLINK_PORT_NUMBER, RED_LED_OFFSET, BLINK_ACTIVE_LOW };

  while (true) {
      red_led.turn_on ();
      red_led.turn_off ();
  }

With the board running at 16 MHz, I measured slightly less than 300 kHz on the red led. (actually no higher than 297, but sometimes as low as 249)

How does this compare to your measurements?


#4

I never saw a resolution to this situation, so let me dive in: I am seeing the same slow execution speeds in the Arduino environment.

I used objdump to look at the code generated by the sample code from the opening post to this thread. This is my first attempt at calculating execution times for RISC-V, but by my reckoning, the entire process through the Arduino loop (including invoking loop() from the Arduino main() program) should take 59 cycles. That number assumes that SW (store word) instructions take 1 cycle. If they take 2, then it would be 61 cycles per loop. At 59 cycles per loop and a 16 MHz clock, it means that the entire arduino loop should take about 3.687 uSec. From the scope trace above (and my own measurements), the arduino loop is taking roughly 50 uSec, or about 13.5 times slower than expected.

I then instrumented the toggle code to calculate the cycle count over the two digitalWrite() calls. As expected, the first time through, the cycle count was huge at 12006 cycles no doubt due to the code getting loaded from the SPI flash into the icache. Thereafter, the execution times were much smaller. However, those execution times were not consistent. Over 100000 iterations, the min execution time was 779 cycles and the max execution time was 816 cycles. Disregarding the unexpected inconsistency, it is clear that 800 cycles is a lot larger than 60 cycles. In fact, it is about 13.3 times larger, which corresponds basically exactly with the difference between the cycle count from the assembly code as compared to the time taken in the scope trace.
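For anyone wanting to repeat this kind of measurement, here is a minimal sketch of the bookkeeping, not the code actually used for the numbers above. It assumes the standard RISC-V `rdcycle` counter is readable from the running code; the names `read_cycles`, `cycle_stats`, `stats_add` and `time_call` are made up for illustration, and on a non-RISC-V host a fake counter stands in so the logic can still be compiled.

```c
#include <stdint.h>

/* Read the low 32 bits of the cycle counter. On RISC-V this uses the
   standard rdcycle pseudo-instruction; elsewhere a fake counter stands in
   so the bookkeeping can be exercised on a host machine. */
static inline uint32_t read_cycles(void) {
#if defined(__riscv)
    unsigned long c;
    __asm__ volatile ("rdcycle %0" : "=r"(c));
    return (uint32_t)c;
#else
    static uint32_t fake = 0;
    fake += 800;              /* placeholder increment, not a measurement */
    return fake;
#endif
}

/* Running minimum and maximum of the measured intervals. */
typedef struct {
    uint32_t min;
    uint32_t max;
    uint32_t samples;
} cycle_stats;

void stats_init(cycle_stats *s) {
    s->min = UINT32_MAX;
    s->max = 0;
    s->samples = 0;
}

void stats_add(cycle_stats *s, uint32_t elapsed) {
    if (elapsed < s->min) s->min = elapsed;
    if (elapsed > s->max) s->max = elapsed;
    s->samples++;
}

/* Time one call of fn(). Unsigned subtraction keeps the result correct
   across a single wrap of the 32-bit counter. */
uint32_t time_call(void (*fn)(void)) {
    uint32_t start = read_cycles();
    fn();
    return read_cycles() - start;
}
```

In a test loop, each pass would do `stats_add(&s, time_call(toggle_once))`, where `toggle_once` is a hypothetical function wrapping the two digitalWrite() calls. Discarding the first sample keeps the cold-cache outlier out of the min/max.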

Also note that if the 59-cycle count is accurate, it means that the Arduino loop should be toggling at 271 kHz. This result is squarely within the range of results generated by the Eclipse test in the post directly above.
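The cycle-to-time arithmetic in this post can be checked with a couple of small helpers. This is only a sketch; the function names are invented here, and the inputs are the figures quoted above (59 or roughly 800 cycles, 16 MHz clock).

```c
#include <stdint.h>

/* Duration in microseconds of `cycles` clock cycles at `clock_hz`. */
double cycles_to_us(uint32_t cycles, double clock_hz) {
    return (double)cycles * 1e6 / clock_hz;
}

/* Loop repetition rate in kHz when each pass takes `cycles` cycles. */
double loop_rate_khz(uint32_t cycles, double clock_hz) {
    return clock_hz / (double)cycles / 1e3;
}
```

With these, `cycles_to_us(59, 16e6)` gives about 3.69 and `loop_rate_khz(59, 16e6)` about 271, while `cycles_to_us(800, 16e6)` gives 50, which matches the scope trace.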

So: there is something weird with the arduino world, but not the Eclipse world. Also note, I ran my tests with interrupts disabled, so it can’t be interrupts using up cycles.

Last question: I can’t figure out how to add text formatted as code to a post. Can someone point me at how to do that?


#5

Update:

I instrumented the code to count the number of instructions retired during the two calls to digitalWrite(). It came out to 42 instructions, which is exactly what I expected from reading the assembly code. In conjunction with the timing info from above, those 42 instructions (plus a few more to perform the arduino loop() overhead) are taking about 800 cycles to execute, which is way out of line with expectations.
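To put a number on "way out of line", the two counters can be combined into cycles per instruction. The sketch below assumes the standard RISC-V `rdinstret` counter is accessible; `read_instret` and `cycles_per_instruction` are illustrative names, and the host fallback simply returns zero.

```c
#include <stdint.h>

/* Read the low 32 bits of the retired-instruction counter. */
static inline uint32_t read_instret(void) {
#if defined(__riscv)
    unsigned long n;
    __asm__ volatile ("rdinstret %0" : "=r"(n));
    return (uint32_t)n;
#else
    return 0;   /* no such counter on the host; placeholder only */
#endif
}

/* Average cycles spent per retired instruction over a measured region.
   Returns 0.0 when no instructions were counted. */
double cycles_per_instruction(uint32_t cycles, uint32_t instructions) {
    if (instructions == 0) return 0.0;
    return (double)cycles / (double)instructions;
}
```

Feeding in the figures from this thread, `cycles_per_instruction(800, 42)` comes out at roughly 19 cycles per instruction.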

I am trying to use the event counting mechanism to see what might be causing delays, but so far, I have not been able to get anything out of the event counter except zeroes.


(Liviu Ionescu) #6

Did you check the actual implementation of digitalWrite()? I bet it is more complicated than the implementation of the led class I used in the Eclipse test.


#7

Yes I did. As I mentioned above, my starting point was to disassemble the Arduino output and look at the result [If someone could tell me how to post a formatted code snippet here, I would post my annotated version of the Arduino code].

I used the hardware instruction counter mechanism to verify the instruction count involved in performing both digitalWrite() calls. The instruction counter hardware counts 42 instructions to execute both calls to digitalWrite(), which includes getting the params in place and making the calls. That agrees with counting the instructions in the disassembled code.

So why is it taking approximately 800 cycles to execute those 42 instructions? Something seems amiss…


(Bruce Hoult) #8

You can post code/preformatted monospaced text by hitting the “double quote” icon above the message box, or alternatively by entering 4 spaces before each line of code.

    this is code

(Bruce Hoult) #9

I doubt anyone has put work into making the Arduino code fast. If it does GPIO on the ms level that’s probably good enough for most people using it, let alone us. People who need faster speeds will be hitting the GPIO registers directly.
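As a rough idea of what "hitting the GPIO registers directly" could look like, here is a minimal sketch. The base address, the register offsets and the pin-13-to-GPIO-5 mapping are assumptions that should be checked against the FE310 manual, and the sketch leaves out pin-mux and input configuration. The functions take a pointer to the register block so they can be tried against an ordinary array on a host.

```c
#include <stdint.h>

/* Assumed FE310 GPIO register layout (verify against the chip manual). */
#define GPIO_BASE_ADDR   0x10012000UL  /* assumed base of the GPIO block  */
#define GPIO_OUTPUT_EN   (0x08 / 4)    /* word index of output enable     */
#define GPIO_OUTPUT_VAL  (0x0C / 4)    /* word index of output value      */
#define LED_GPIO_BIT     5u            /* assumed: Arduino pin 13 = GPIO 5 */

/* Configure one pin as an output. `gpio` points at the register block. */
void gpio_make_output(volatile uint32_t *gpio, unsigned bit) {
    gpio[GPIO_OUTPUT_EN] |= (1u << bit);
}

/* Drive the pin high with a read-modify-write of the output register. */
void gpio_set(volatile uint32_t *gpio, unsigned bit) {
    gpio[GPIO_OUTPUT_VAL] |= (1u << bit);
}

/* Drive the pin low. */
void gpio_clear(volatile uint32_t *gpio, unsigned bit) {
    gpio[GPIO_OUTPUT_VAL] &= ~(1u << bit);
}
```

On the board, the pointer would be `(volatile uint32_t *)GPIO_BASE_ADDR`, and the loop body becomes `gpio_set(gpio, LED_GPIO_BIT); gpio_clear(gpio, LED_GPIO_BIT);` with no pin lookup at all.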

I haven’t looked at the Arduino code, but one thing that makes code go very slowly on the HiFive1 is loads of constant data from program space. That takes on the order of 1 us for a load instruction, pretty much regardless of whether it’s byte, short, or int if the CPU is running at 256 MHz.

As the Arduino digitalWrite() code has to map Arduino logical pin numbers to physical pin numbers it may well be using a constant data table to do that.

But as I said I haven’t looked at the actual code recently. Look for register loads from 0x2000_0000 to 0x20FF_FFFF.
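If reading the disassembly is awkward, the same check can be made at run time on the address of a suspect table. A small sketch, using only the address window quoted above; the helper name `in_flash_window` is made up.

```c
#include <stdint.h>
#include <stdbool.h>

/* Address window given in the post above for program-space data. */
#define FLASH_START 0x20000000UL
#define FLASH_END   0x20FFFFFFUL

/* True when `addr` falls inside the memory-mapped flash window. */
bool in_flash_window(uintptr_t addr) {
    return addr >= FLASH_START && addr <= FLASH_END;
}
```

Calling `in_flash_window((uintptr_t)&some_table)` for a table of interest (the name `some_table` is hypothetical) shows whether loads from it go to flash.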


#10

Thanks for the response.

To be clear, the Arduino-ness of this all is immaterial. I am just using the Arduino IDE as a testbed to compile code. The real goal is to understand the performance limits of the CPU. The strange part (to me) is that the Arduino code should be fast as currently implemented, at least if you go by instruction count: it only takes about 21 instructions to perform a digitalWrite(), including setting up the parameters and calling the routine. What is not clear is why those 21 instructions take about 400 cycles to execute on a processor that is supposed to retire about 1 instr per clock under typical conditions.

I would be bummed if the IO mapping lookup operation is responsible for all of the excess cycles, but knowing is better than not knowing :). If the result of this all is knowing that I should never, ever use ROM data tables for situations that require performance, but rather have ROM-initialized RAM-resident tables, then that is a perfect example of things that are worth knowing when working with this processor.

Does the source code exist somewhere for digitalWrite()?


(Bruce Hoult) #11

Precisely correct on this particular board. That’s why we have a special linker script for Dhrystone (which I think should be the default) to put constant data into the RAM scratchpad instead of ROM.

If that’s too big a hammer, then simply marking that particular global table as not const will also work. Assuming there is one – as I said I haven’t looked at the code.
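A minimal sketch of the "not const" approach, assuming a typical toolchain layout where const objects go to read-only storage and initialized writable objects go to .data, which startup code copies into RAM. The table contents and every name here are hypothetical, since the real Arduino table (if there is one) has not been looked at.

```c
#include <stdint.h>

#define PIN_COUNT 4

/* Declared const: the toolchain is free to leave this in read-only
   storage, which on the board discussed here means SPI flash. */
const uint8_t pin_map_const[PIN_COUNT] = { 16, 17, 18, 19 };

/* Same contents without const: an initialized, writable object belongs
   in .data, which the startup code copies into RAM before main() runs. */
uint8_t pin_map_ram[PIN_COUNT] = { 16, 17, 18, 19 };

/* Translate a logical pin to a GPIO bit; returns -1 when out of range. */
int pin_to_gpio(const uint8_t *map, unsigned pin) {
    if (pin >= PIN_COUNT) return -1;
    return map[pin];
}
```

Both tables give the same answers; only where the load goes differs. The price is that the RAM copy uses scratchpad space and is no longer protected against accidental writes.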


#12

Ah. Then I would also have to believe that interrupt latencies on the E31 processor could potentially be atrocious if an interrupt arrives during a ROM data read.

I would vote for storing all constant data in RAM as a default, at least for this processor. Putting the constant data in ROM should require the user to make a conscious decision knowing that access to it will be very slow.


(Bruce Hoult) #13

Yes I agree, and we discussed it internally some time ago and I expect the next software release will do that, at least for freedom-e-sdk. I guess the pre-built toolchains for Arduino would get updated at the same time.


(Liviu Ionescu) #14

For a simple blinky application this might be enough, but for a more complicated application, with a separate Debug build, which enables assert() and has a lot of trace::printf() messages, storing the constant strings in RAM might very soon fill the 16 KB of internal RAM.

There are also the C++ virtual tables; if stored in flash, they might make the application slow; if stored in RAM, they might use some space, depending on the application.


(Bruce Hoult) #15

In a debug build where you’re poking messages out the UART at 115200 bps (a little over 10 KB/s), being able to read those messages from SPI Flash at “only” 1 MB/s is not a big issue.

People doing big real applications – as opposed to just quickly running Dhrystone or CoreMark or HelloWorld – should be writing their own linker scripts anyway on any small microcontroller.


#16

This is not about being an Arduino app or a benchmark versus being a big, real app, or about writing linker scripts or not. I am evaluating the E31 processor; learning about its strengths, weaknesses, and its various constraints that would affect my design choices. Some of those strengths and weaknesses will be due to RISC-V in general, while others will be specific to the E31. The point is that I can’t make effective system design choices (hardware or software) until I know the limits.

For example, maybe I have a big, real application that requires the MIPS of an E31 processor. Let’s also assume that my big, real application needs to guarantee a certain interrupt latency. It would appear that the simple Arduino app that kicked off this thread is telling me that if the big, real application has even a single instance in its codebase where it reads data from the ROM, the worst-case interrupt latency changes from 40-ish cycles (if the interrupt occurs during a divide instr that causes the pipeline to stall) to being a few hundred cycles (if the interrupt occurs after a ROM data fetch kicks off). That’s a constraint worth knowing about: if a guarantee of short interrupt latency is a design requirement for my big, real E31-based system, then all constant data has to go in RAM. As noted, that immediately puts pressure on the E31’s limited RAM resource, which is another constraint worth knowing about, although a far more obvious one.

All processors are good at some things, and constrained by other things. I’m just trying to figure out what the limits are for this processor so I can make informed choices and effective designs.


(Krste Asanovic) #17

If you are specifically interested in the FE310 chip, then the HiFive1 is a good demonstrator of that chip’s capabilities. If you are interested in the E31 as a soft core, then please be aware that our E3 series cores can be flexibly configured in various different ways, including for example having a data cache or adding floating-point units, and the E31 on the FE310 is only one possible design point.


#18

Oops, my brush was a bit too broad! My apologies. You are entirely correct: this thread only applies to the FE310 instantiation of an E31 core.

I will still argue that the default linker script for FE310 builds should put ROM constant data in RAM. Otherwise, I suspect that the system-related effects of slow constant data loads will just get rediscovered by FE310 users in a variety of forms.


(Bruce Hoult) #19

If you’re looking at volume production using a customised E31 rather than the actual SoC in the HiFive1, then in addition to what Krste said:

  • standard options for the E31 include up to 64 KB of RAM scratchpad, or 64 KB of data cache that can be flexibly on-the-fly reconfigured to (almost) any proportion of cache and scratchpad. The current Arty FPGA E31 demo image comes with 64 KB of scratchpad.

  • the icache size is similarly adjustable, and on-the-fly configurable to a mix of cache and non-evictable instruction scratchpad.

  • you can specify a faster (and bigger) divider. Or, leave it off and do divides in software (or reciprocal multiplication for constant or semi-constant divisors).

Also, with respect to the HiFive1 (and to your own product), the SPI Flash is being clocked very conservatively. With the default 256 (ish) MHz CPU clock, the SPI is being clocked at 32 MHz. The specs for the Flash chip say up to 133 MHz and on my own board I happily run with the SPI divider reduced from 8 to 2 thus clocking it at 128 MHz (line 84 in https://github.com/sifive/freedom-e-sdk/blob/master/bsp/env/freedom-e300-hifive1/init.c)

That reduces the latency by 4x.
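The arithmetic behind that, as a small sketch. It treats "divider" as the effective ratio between CPU clock and SPI clock, as described above, which is not necessarily the raw value written to the divider register; the function name is made up.

```c
/* SPI clock in MHz for a given CPU clock and effective divide ratio.
   Returns 0.0 for a zero ratio rather than dividing by zero. */
double spi_clock_mhz(double cpu_mhz, unsigned ratio) {
    if (ratio == 0) return 0.0;
    return cpu_mhz / (double)ratio;
}
```

`spi_clock_mhz(256.0, 8)` is 32 and `spi_clock_mhz(256.0, 2)` is 128, so each serial transfer takes a quarter of the time.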

I’m also not sure whether we’re enabling the Quad in QSPI by default on the HiFive1.

But in any case, of course any design frequently using off-chip storage accessed by a serial protocol is going to be slower than configuring your SoC with on-chip storage adequate for your application.

The key thing to remember about SiFive processors is that you’re not limited to half a dozen or a dozen configurations that you can find on Element14. You can customise an E31 core to literally thousands of different variations to precisely fit your needs.


(Krste Asanovic) #20

Actually, while the E31 I-cache can be dynamically configured on the fly to have a variable portion of ITIM scratchpad, the data memory has to be selected at design time to be either all D-cache or all DTIM.