HiFive1 Rev.B Benchmark


#1

In the past I published an article about Arduino benchmarks in a German electronics magazine. I used the Sieve of Eratosthenes to cover boards ranging from small 8-bit controllers to more powerful 32-bit ones.

The links are here:
DESIGN&ELEKTRONIK 5/2016 (Teil 1), 6/2016 (Teil 2)



[Image: BM]

For the HiFive1 Rev. B I built a similar benchmark and got unexpected results.
[Image: Sieve]
At the moment I have no idea why the resulting runtime is so long. Is there anything wrong?

The source is quite similar to the Arduino source:

#include <stdio.h>
#include <stdint.h>

#define RTC_FREQ    32768
#define CLINT_MTIME 0x200bff8

#define TRUE 1
#define FALSE 0

int i,k, prime,count;
const int SIZE = 1000;    
char flags[1001];

void mtime_wait(uint64_t ticks);    /* forward declaration; defined below */

void delay(int sec)
{
    uint64_t ticks = (uint64_t) sec * RTC_FREQ;
    mtime_wait(ticks);
}

void mtime_wait(uint64_t ticks)
{
    volatile uint64_t * mtime = (uint64_t*) (CLINT_MTIME);    
    uint64_t now = *mtime;
    uint64_t then = now + ticks;
 
    while(*mtime<then) {}
}

uint64_t now(void)
{
    volatile uint64_t * mtime = (uint64_t*) (CLINT_MTIME);
    return *mtime;
}


int main(void)
{
    printf("Sieve of Eratosthenes - CPU Benchmark HiFive1 Rev.B\n");
    printf("5000 iterations\n");
    uint64_t runtime = now();
    //printf("Start: %15lld\n", runtime);
    /***************************************************************************/
    for (unsigned int iter = 1; iter <= 5000; iter++) /* do program 5000 times */
    { 
        count = 0;                      /* initialize prime counter */
        for (i = 0; i <= SIZE; i++)     /* set all flags true */
            flags[i] = TRUE;
        for (i = 0; i <= SIZE; i++)
        {
            if (flags[i])               /* found a prime */
            {
                prime = i + i + 3;      /* twice index + 3 */
                for (k = i + prime; k <= SIZE; k += prime)
                    flags[k] = FALSE;   /* kill all multiples */
                count++;                /* primes found */
            }
        }
    }
    //delay(10);
    /***************************************************************************/
    //printf("Stop: %15lld\n", now());
    runtime = now() - runtime; 
    printf("%d primes.\n", count);
    printf("Runtime = %15.2f s\n", (double) runtime / (double) RTC_FREQ);

    return 0;
}

(Bruce Hoult) #2

I think the “const int SIZE = 1000;” declaration is probably your biggest problem. Remove the “const”, or change it to “constexpr” or #define. Check the compiled code, but I expect the program will be reading SIZE from SPI flash every time, which takes around 1 µs. There is no data cache, so frequently used data should be kept in SRAM.

A smaller problem is that all of i, k, prime, count should be local variables in main() so that they can be kept in registers.

Possibly flags too, or else rename that to gFlags and put “char *flags = gFlags;” at the start of main.
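
The advice above could be sketched as a small function that can also be checked on a host PC. This is only an illustration under assumptions: the names sieve_count and SIZE_LIMIT are hypothetical, and whether the compiler really keeps the locals in registers depends on the optimization level.

```c
#define SIZE_LIMIT 1000                 /* hypothetical name; a macro instead of const int SIZE */

static char gFlags[SIZE_LIMIT + 1];     /* global buffer; index i stands for the odd number 2*i + 3 */

/* Runs one sieve pass and returns the number of primes found. */
int sieve_count(void)
{
    char *flags = gFlags;               /* local pointer to the global array */
    int i, k, prime, count = 0;         /* locals, so they can live in registers */

    for (i = 0; i <= SIZE_LIMIT; i++)   /* set all flags true */
        flags[i] = 1;

    for (i = 0; i <= SIZE_LIMIT; i++) {
        if (flags[i]) {                 /* found a prime */
            prime = i + i + 3;          /* twice index + 3 */
            for (k = i + prime; k <= SIZE_LIMIT; k += prime)
                flags[k] = 0;           /* kill all multiples */
            count++;
        }
    }
    return count;
}
```

Calling sieve_count() 5000 times between two now() readings would reproduce the benchmark loop from the first post.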


#3

I changed the source as follows:

#include <stdio.h>
#include <stdint.h>

#define RTC_FREQ    32768
#define CLINT_MTIME 0x200bff8

#define TRUE 1
#define FALSE 0
#define SIZE 1000

char flags[1001];

void mtime_wait(uint64_t ticks);    /* forward declaration; defined below */

void delay(int sec)
{
    uint64_t ticks = (uint64_t) sec * RTC_FREQ;
    mtime_wait(ticks);
}

void mtime_wait(uint64_t ticks)
{
    volatile uint64_t * mtime = (uint64_t*) (CLINT_MTIME);    
    uint64_t now = *mtime;
    uint64_t then = now + ticks;
 
    while(*mtime<then) {}
}

uint64_t now(void)
{
    volatile uint64_t * mtime = (uint64_t*) (CLINT_MTIME);
    return *mtime;
}


int main(void)
{
    int i,k, prime,count; 
    
    printf("Sieve of Eratosthenes - CPU Benchmark HiFive1 Rev.B\n");
    printf("5000 iterations\n");
    uint64_t runtime = now();
    //printf("Start: %15lld\n", runtime);
    /***************************************************************************/
    for (unsigned int iter = 1; iter <= 5000; iter++) /* do program 5000 times */
    { 
        count = 0;                      /* initialize prime counter */
        for (i = 0; i <= SIZE; i++)     /* set all flags true */
            flags[i] = TRUE;
        for (i = 0; i <= SIZE; i++)
        {
            if (flags[i])               /* found a prime */
            {
                prime = i + i + 3;      /* twice index + 3 */
                for (k = i + prime; k <= SIZE; k += prime)
                    flags[k] = FALSE;   /* kill all multiples */
                count++;                /* primes found */
            }
        }
    }
    //delay(10);
    /***************************************************************************/
    //printf("Stop: %15lld\n", now());
    runtime = now() - runtime; 
    printf("%d primes.\n", count);
    printf("Runtime = %15.2f s\n", (double) runtime / (double) RTC_FREQ);

    return 0;
}

As a first step I changed const to #define SIZE; the result did not change.
After the second change, making i, k, … local, the runtime changed from 24.07 s to 25.2 s.

What about compiler options?


#4

Thinking about this behavior, I began to suspect the influence of the on-board (OB) debugger.
Compiling with CONFIGURATION=release was the key!

[Screenshot from 2019-05-16 11-41-46]

Now I get 7.96 s. I am not happy with this value because it is in the range of a Cortex-M0. Is this realistic?

For comparison I compiled and uploaded the Dhrystone benchmark, but it stops running after the first outputs. This time I have no idea why.


(Liviu Ionescu) #5

It is roughly in line with the figures I got with the HiFive1 A.

But for comparable results, you should somehow factor in the CPU speed.
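
One way to factor in the CPU speed is to convert the measured runtime into clock cycles, or to scale it to a common 1 MHz reference. The sketch below is only an illustration: the function names are hypothetical, the clock frequency has to be supplied by the reader, and the scaling assumes that runtime is inversely proportional to the clock, which flash wait states can break.

```c
/* Converts a measured runtime in seconds into CPU cycles at the given core clock. */
double runtime_to_cycles(double runtime_s, double cpu_hz)
{
    return runtime_s * cpu_hz;
}

/* Scales a runtime to a 1 MHz reference clock, assuming the runtime
   is inversely proportional to the clock frequency. */
double runtime_per_mhz(double runtime_s, double cpu_hz)
{
    return runtime_s * (cpu_hz / 1.0e6);
}
```

Two boards can then be compared by their cycle counts for the same 5000 iterations rather than by wall-clock seconds.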


(Bruce Hoult) #6

The scanf() in dhry_1.c needs to be commented out. There is currently no way to input from the tty in metal and I think at some point there was a stub for scanf() that did nothing, but apparently not now.
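
A minimal sketch of that workaround, replacing the interactive input with a compile-time constant. The macro DHRY_FIXED_RUNS and the helper get_number_of_runs are hypothetical names, not part of the Dhrystone sources, and the default run count is an arbitrary placeholder.

```c
#include <stdio.h>

/* Hypothetical compile-time run count; override with -DDHRY_FIXED_RUNS=<n>. */
#ifndef DHRY_FIXED_RUNS
#define DHRY_FIXED_RUNS 100000
#endif

/* Returns the number of benchmark runs without reading from the tty. */
int get_number_of_runs(void)
{
    int n;

    /* scanf("%d", &n); */          /* disabled: no tty input available here */
    n = DHRY_FIXED_RUNS;
    printf("Using %d runs\n", n);
    return n;
}
```

The run count should be large enough that the measured time is well above the timer resolution.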


#7

That’s right. At this time I do not know how to get or set the clock frequency.
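
One possible way to get the frequency, sketched here under assumptions, is to compare a CPU cycle counter against the 32768 Hz mtime counter over the same interval. Only the arithmetic is shown; reading the cycle counter (for example the RISC-V mcycle CSR) is left out, and the function name estimate_cpu_hz is hypothetical. Setting the frequency is not covered.

```c
#include <stdint.h>

#define RTC_FREQ 32768u   /* mtime tick rate used in this thread */

/* Estimates the core clock in Hz from a cycle-counter delta and the
   mtime delta measured over the same interval.
   Returns 0 if no mtime tick has elapsed. */
uint64_t estimate_cpu_hz(uint64_t cycle_delta, uint64_t mtime_delta)
{
    if (mtime_delta == 0)
        return 0;
    return (cycle_delta * RTC_FREQ) / mtime_delta;
}
```

A longer measurement interval gives a better estimate, because the mtime counter only resolves about 30.5 µs per tick.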


#8

I found it. I get 333 Dhrystones/sec. Is this right or what should I expect?
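
To put a Dhrystones/sec figure into context, it is commonly converted to DMIPS by dividing by 1757, the score of the VAX 11/780 reference machine, and then divided by the clock in MHz. The sketch below shows that arithmetic; the function names are hypothetical and the clock frequency must be supplied by the reader.

```c
#define VAX_DHRYSTONES_PER_SEC 1757.0   /* VAX 11/780 reference score, defined as 1 DMIPS */

/* Converts Dhrystones per second into DMIPS. */
double dhrystones_to_dmips(double dhry_per_sec)
{
    return dhry_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalizes the DMIPS figure to the core clock in MHz. */
double dmips_per_mhz(double dhry_per_sec, double cpu_hz)
{
    return dhrystones_to_dmips(dhry_per_sec) / (cpu_hz / 1.0e6);
}
```

For the 333 Dhrystones/sec reported above, dhrystones_to_dmips(333.0) gives roughly 0.19 DMIPS.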