Timing issues in the Arduino IDE

I’m currently testing out a HiFive1, and I’ve run into an interesting problem: the timing is off.

I began by investigating low scores in Whetstone (link removed as new users are limited to two links, apparently, but if you search for “Arduino Whetstone” on Google I’m the first hit) and Dhrystone benchmarks I’ve previously run on other microcontrollers via the Arduino IDE. They run, but on the HiFive1 I’m getting results way, way lower than I’d expect: 21.85 DMIPS and just 1.42 MWIPS.

So, I started poking at the timers - and here’s where things got interesting. I wrote a simple timing test which does nothing but record the current micros(); in an unsigned long variable, print that variable to the serial port, then wait for what should be one second.

The first thing I did was run it on an Arduino MKR1000 as a baseline. The result: every output was between 999999 microseconds and 1000001 microseconds, averaging out at 1000000 microseconds. In other words, a perfectly working timer.

Then I came to run it on the HiFive1, and the results were very, very different. The stable results ranged from 1000054 microseconds to 1002099 microseconds, with four major blips over the test run: 12529, 261104, and two instances of 743641 microseconds.

Now, I’m not sure if this is enough for my benchmarks to be thrown quite as far out of whack as they are - and I’m open to suggestions on that front, as while I know I can run the benchmarks included in the SDK I really wanted to run these two so I have a direct comparison to other microcontrollers I’ve run them on - but it certainly can’t be helping.

Is there an issue with the HiFive1’s timers? Is there a way to make a second be a second?

Gareth,

What version of the Arduino IDE are you using? 1.0.1 fixed some issues in this area, but perhaps it’s still wrong.

Oops, I mean the SiFive Board Support package, not the Arduino IDE.

I’m using Arduino IDE 1.8.1 with version 1.0.1 of the SiFive BSP.

Just to add a few more data points: I’ve run the test with the board set to 16MHz crystal and 256MHz PLL, and neither hits a second dead-on. The 16MHz mode, though, is more stable: it gives me a low of 1000426 microseconds and a high of 1003805 microseconds, with none of the massively-wrong outliers of the 256MHz mode.

Plus, just because I had the spreadsheet open anyway, here’s a couple of graphs. This first one shows all the results I collected, including the 256MHz’s outliers:

The second I’ll have to put in a separate post, as I’m limited to one image per post…

This second shows the results up to the first outlier, demonstrating how far both HiFive1 modes are away from the ‘perfect’ second of the MKR1000:

Let me know if there’s anything else you want me to try out.

I tried out your test code, and got the same behavior as you noted.

I noticed that your test code is including the time that it takes to do the Serial.println, so it doesn’t make sense to expect the delay between the two iterations to be 1 second.
I changed your code to just measure micros before and after the delay and subtract, then got mostly consistent result, but you’re right, there are still some way-too-short blips.

Even if the delay between the two iterations isn’t a second - and bear in mind that it is a second ±1 microsecond when that same test is running on the MKR1000 - it should be stable, and it isn’t.

What’s interesting is, the major blips aside, it’s roughly unstable stably - if that makes sense. By that I mean what you see on that second graph: the ‘second’ starts off way too high, then gets close to accurate for a bit, then goes back up again. Notice now the HiFive1 results drop at the same time and rise at the same time in the second graph, though by different amounts, even across two separate test runs and running at 16MHz and 256MHz.

Thanks for running your experiments! I’ve been playing around with your example code. Couple things going on here…

First, to explain why your simple test doesn’t quite make sense for the FE310:
On the FE310, the UART only has an 8-byte TX buffer. Your code prints the output of micros(). If you want to Serial.println(“1000000”), you’re really printing 9 bytes (because of ‘\n’ and ‘\r’ at the end of the string). So you need to wait for 1 byte to actually transmit over the UART so you can send the last one. At 9600 baud, the time it takes to TX a byte is ~(1/9600)*8, or 833 microseconds. So, for the FE310, it really doesn’t make sense to expect your loop to take exactly 1 second, when you are doing both a Serial.println() and delay(1000). I can’t speak for other microcontrollers as I haven’t dug into their HW or SW, perhaps they have deeper TX buffers or other ways they handle the UART transmission. Note that after 10 seconds or so, you’ll want to print another digit, so you have to wait for 2 bytes, which is why it gets even slower.

The reason the first iteration takes so long is caches. The first iteration of the loop has to read all the code from SPI Flash, so it takes longer. After the first iteration, it’s all in cache so executes fairly stably.

Next, I admit there are two issues with our timing code.

First, I do see the ocassional “too short” delays you see. I have a hunch what they are due to (rollover in the MCYCLE counter while we’re reading it), but haven’t convinced myself of that yet.

Second, our micros() function is accurate, but slow, since as currently written it’s calling a SW divide. If you just call ‘micros()’ twice in a row, it’s consistently 4 uS apart @256MHz, which is super inefficient. We can work on tightening it up.

1 Like

Thanks for that explanation - it makes perfect sense! I’m also glad to hear that I’m not doing something silly to make those glitches happen!

Do you think the issue with micros() could be what is throwing the benchmarks off? I’ve had a look at the Dhrystone code included with the SDK, and I’ve noticed it’s using something else for timing - but I’m not sure how to modify my benchmarks in such a way that they’ll run accurately on the HiFive and retain compatibility with other microcontrollers for fair like-with-like comparisons.

I don’t know if maybe this is useful. I just knocked up 64 bit and 32 bit unsigned integer divide by a 32 bit integer using reciprocal. If you’re dividing by the same thing a lot of times (e.g. by the board clock frequency) then it should be quite a bit faster than a normal divide.

It’s using two multiply instructions for the 32 bit case and 5 for the 64 bit.

#include <inttypes.h>

inline uint32_t mulhu(uint32_t a, uint32_t b){
  return (uint32_t)(((uint64_t)a * b) >> 32);
}

typedef struct {
  uint32_t n;
  uint32_t mult;
  uint32_t shift;
} int_inverse ;

int_inverse calc_inv(uint32_t n){
  uint32_t one = ~0;
  uint32_t d = one/n;
  uint32_t r = one%n + 1;
  if (r >= n) ++d;
  if (d == 0) --d;
  uint32_t shift = 0;
  while ((d & 0x80000000) == 0){
    d <<= 1;
    ++shift;
  }
  int_inverse res;
  res.n = n;
  res.mult = d;
  res.shift = shift;
  return res;
}

inline uint32_t divide32_using_inverse(uint32_t n, int_inverse *inv){
  uint32_t d = mulhu(n, inv->mult);
  d >>= inv->shift;
  if (n - d*inv->n >= inv->n) ++d;
  return d;
}

uint32_t divide64_using_inverse(uint64_t n, int_inverse *inv){
  uint32_t d = divide32_using_inverse(n >> (inv->shift+1), inv) << (inv->shift+1);
  uint32_t r = n - (uint64_t)d * inv->n;
  d += divide32_using_inverse(r, inv);
  return d;
}

The divide64_using_inverse function compiles for RV32IMC to:

0001013a <divide64_using_inverse>:
   1013a:       00862803                lw      a6,8(a2)
   1013e:       00180793                addi    a5,a6,1
   10142:       0207f713                andi    a4,a5,32
   10146:       c739                    beqz    a4,10194 <divide64_using_inverse+0x5a>
   10148:       00f5d5b3                srl     a1,a1,a5
   1014c:       4258                    lw      a4,4(a2)
   1014e:       4210                    lw      a2,0(a2)
   10150:       02e5b6b3                mulhu   a3,a1,a4
   10154:       0106d6b3                srl     a3,a3,a6
   10158:       02c688b3                mul     a7,a3,a2
   1015c:       411585b3                sub     a1,a1,a7
   10160:       00c5b5b3                sltu    a1,a1,a2
   10164:       0015c593                xori    a1,a1,1
   10168:       96ae                    add     a3,a3,a1
   1016a:       00f696b3                sll     a3,a3,a5
   1016e:       02c687b3                mul     a5,a3,a2
   10172:       40f507b3                sub     a5,a0,a5
   10176:       02e7b733                mulhu   a4,a5,a4
   1017a:       01075733                srl     a4,a4,a6
   1017e:       02c70533                mul     a0,a4,a2
   10182:       8f89                    sub     a5,a5,a0
   10184:       00c7b7b3                sltu    a5,a5,a2
   10188:       0017c793                xori    a5,a5,1
   1018c:       973e                    add     a4,a4,a5
   1018e:       00e68533                add     a0,a3,a4
   10192:       8082                    ret
   10194:       0586                    slli    a1,a1,0x1
   10196:       fff7c713                not     a4,a5
   1019a:       00e59733                sll     a4,a1,a4
   1019e:       00f555b3                srl     a1,a0,a5
   101a2:       8dd9                    or      a1,a1,a4
   101a4:       b765                    j       1014c <divide64_using_inverse+0x12>

Improved version of divide64_using_inverse(). Now works for much larger input values (nearly full 64 bit range for small divisors), and returns full 64-bit result. Eliminated conditional code/branches in the generated code. Tested for billyuns and bullyuns of random inputs on both x86 native and spike.

// Almost full-range 64/32 divide.
// If divisor-1 has i bits, then the answer is exact for n of up to 64-i bits
// e.g. for divisors up to a million, n can have up to 45 bits
// On RV32IM with divide32_using_inverse inlines this uses 5 multiplies,
// 33 instructions, zero branches, 3 loads, 0 stores.
uint64_t divide64_using_inverse(uint64_t n, int_inverse *inv){
  uint32_t preshift = (31 - inv->shift) & 31;
  uint64_t d = (uint64_t)divide32_using_inverse(n >> preshift, inv) << preshift;
  uint32_t r = n - d * inv->n;
  d += divide32_using_inverse(r, inv);
  return d;
}

Generated code:

00010180 <divide64_using_inverse>:
   10180:       00862883                lw      a7,8(a2)
   10184:       425c                    lw      a5,4(a2)
   10186:       00159313                slli    t1,a1,0x1
   1018a:       fff8c693                not     a3,a7
   1018e:       8afd                    andi    a3,a3,31
   10190:       fff6c593                not     a1,a3
   10194:       00b31333                sll     t1,t1,a1
   10198:       00d55833                srl     a6,a0,a3
   1019c:       01036833                or      a6,t1,a6
   101a0:       02f83733                mulhu   a4,a6,a5
   101a4:       4210                    lw      a2,0(a2)
   101a6:       01175733                srl     a4,a4,a7
   101aa:       02c70333                mul     t1,a4,a2
   101ae:       40680833                sub     a6,a6,t1
   101b2:       00c83833                sltu    a6,a6,a2
   101b6:       00184813                xori    a6,a6,1
   101ba:       9742                    add     a4,a4,a6
   101bc:       00d716b3                sll     a3,a4,a3
   101c0:       02d60833                mul     a6,a2,a3
   101c4:       8305                    srli    a4,a4,0x1
   101c6:       00b755b3                srl     a1,a4,a1
   101ca:       41050533                sub     a0,a0,a6
   101ce:       02f537b3                mulhu   a5,a0,a5
   101d2:       0117d7b3                srl     a5,a5,a7
   101d6:       02c78733                mul     a4,a5,a2
   101da:       8d19                    sub     a0,a0,a4
   101dc:       00c53533                sltu    a0,a0,a2
   101e0:       00154513                xori    a0,a0,1
   101e4:       97aa                    add     a5,a5,a0
   101e6:       00d78533                add     a0,a5,a3
   101ea:       00f537b3                sltu    a5,a0,a5
   101ee:       95be                    add     a1,a1,a5
   101f0:       8082                    ret

Thanks Bruce! Will try it out (if the community doesn’t beat me to it!)

By the way, how do the mcycle/mcycleh CSRs behave in HiFive1, or RV32 in general?

For example, the 8-bit PIC microcontroller can read the program counter atomically by copying upper bits of the program counter to PCLATH and PCLATU registers when PCL is read from.

Does RISC-V also utilize some similar kind of scheme with the cycle counter, or do all reads from mcycle/mcycleh just simply give out the lower/upper bits of the cycle counter value at the time the CSR is read from? The current RISC-V privileged ISA spec doesn’t say much about that.

I noticed the rdmcycle function in GitHub. It just reads from mcycle and then from mcycleh. So if the lower 32 bits of the cycle counter wrap around between the two reads, would it give an incorrect result? In that case, you could maybe read mcycleh first, then mcycle, then mcycleh again, check whether an overflow happened, and calculate a valid result. Maybe disabling interrupts temporarily could be useful, too.

See “User-Level ISA Specification v2.1” https://riscv.org/specifications/ P.23
Figure 2.5: Sample code for reading the 64-bit cycle counter in RV32.

Yep, this is exactly what we should use for that function. Thanks!

I’m still not convinced that the mcycle rollover explains the too-short spikes, but we’re working on cleaning up this whole set of functions.

Confirmed the too-short spikes were from the mcycle rollover. In general hitting this case would be “rare”, but since the code is periodic and predictable we were hitting the “rare” event regularly. The fix from @sflin is the right one for this problem.

Hey @brucehoult , tried out your divide64_using_inverse code but it is giving the the wrong answers. What kinds of test vectors did you try? I think it’s still hitting some overflow cases.

Hi Megan, I tried it exhaustively with up to 20 billion divided by everything from 1 to 256, and with a million random 64 bit numbers with divisors up to 100000 or so.

I saw no failures providing the argument to divide64_using_inverse() is as mentioned in the comment.

Can you give me a couple of examples of failures you’re seeing? What re you trying to divide by what?

Ah, I was trying it out directly with divisors of 256000000 and 16000000 (the F_CPU values in the millis() and micros() functions). Seems that those divisors are too large, which makes sense based on the comments you added.

I moved the division by 1k and 1M into the inverse calculation and now get the right answer, in much shorter time for general F_CPU!