About HiFive1 DMIPS

Hi everyone,

I have tested DMIPS on my HiFive1 board, but the result seems strange. I read the time like this:

Begin_Time = *(volatile unsigned long *)(CLINT_CTRL_ADDR + CLINT_MTIME);

for (j = 0; j < 1000; j++)
{
  for (i = 0; i < 100; i++)
  {
    results_data[i] = input_data1[i] * input_data2[i];
  }
}

End_Time = *(volatile unsigned long *)(CLINT_CTRL_ADDR + CLINT_MTIME);

Use_Time = End_Time - Begin_Time;

Then I got below results:
Time: Begin_Time= 17109, End_Time= 17272, Use_Time= 163

So the elapsed time is 163 / 32768 ≈ 5 ms.

So the throughput is 100k (32-bit int × 32-bit int) multiply operations / 5 ms = 20 DMIPS.

But that seems slow. Is this normal? Thanks in advance.

In general, it’s very difficult to say anything without seeing the entire, exact, program you ran.

Assuming, for example, that input_data1 and input_data2 are initialized global variables, it can make a HUGE difference whether or not they are declared “const”.

If we can see your whole program then we don’t have to guess.

We don’t even know the clock speed you’re using.

If it’s 320 MHz then you have almost exactly 16 clock cycles per iteration. If it’s 256 MHz then it’s between 12 and 13.

We don’t know what compiler flags or optimization level you used.

Assuming your code looks like this:

int input_data1[100];
int input_data2[100];
int results_data[100];

void test()
{
  int i, j;
  for (j = 0; j < 1000; j++)
  {
    for (i = 0; i < 100; i++)
    {
      results_data[i] = input_data1[i] * input_data2[i];
    }
  }
}

… and you compiled with -O or -O1 then you’ll have something like this as your inner loop:

00000024 <.L3>:
  24:	4398                	lw	a4,0(a5)
  26:	420c                	lw	a1,0(a2)
  28:	02b70733          	mul	a4,a4,a1
  2c:	c298                	sw	a4,0(a3)
  2e:	0791                	addi	a5,a5,4
  30:	0611                	addi	a2,a2,4
  32:	0691                	addi	a3,a3,4
  34:	fea798e3          	bne	a5,a0,24 <.L3>

I count eight instructions. The loads and stores take two cycles each, and the multiply takes 1 to 4 cycles depending on your data (which we don’t know). The branch will almost always take 1 cycle.

So, 12 - 15 cycles per loop looks bang-on, depending on your data.

-O2 or -O3 will give the same instructions, just rearranged a little.

If you add -funroll-loops then you can get:

00000046 <.L3>:
  46:	0046ab83          	lw	s7,4(a3)
  4a:	0087ae83          	lw	t4,8(a5)
  4e:	0086aa83          	lw	s5,8(a3)
  52:	00c7ae03          	lw	t3,12(a5)
  56:	00c6aa03          	lw	s4,12(a3)
  5a:	0107a303          	lw	t1,16(a5)
  5e:	0106a983          	lw	s3,16(a3)
  62:	0147a883          	lw	a7,20(a5)
  66:	0006af83          	lw	t6,0(a3)
  6a:	0047af03          	lw	t5,4(a5)
  6e:	0146a903          	lw	s2,20(a3)
  72:	0187a803          	lw	a6,24(a5)
  76:	4e84                	lw	s1,24(a3)
  78:	4fc8                	lw	a0,28(a5)
  7a:	4ec0                	lw	s0,28(a3)
  7c:	538c                	lw	a1,32(a5)
  7e:	0206a383          	lw	t2,32(a3)
  82:	53d0                	lw	a2,36(a5)
  84:	0246a283          	lw	t0,36(a3)
  88:	0007ac03          	lw	s8,0(a5)
  8c:	037f0f33          	mul	t5,t5,s7
  90:	02870713          	addi	a4,a4,40
  94:	02878793          	addi	a5,a5,40
  98:	02868693          	addi	a3,a3,40
  9c:	035e8bb3          	mul	s7,t4,s5
  a0:	fde72e23          	sw	t5,-36(a4)
  a4:	034e0eb3          	mul	t4,t3,s4
  a8:	ff772023          	sw	s7,-32(a4)
  ac:	03330ab3          	mul	s5,t1,s3
  b0:	ffd72223          	sw	t4,-28(a4)
  b4:	03288e33          	mul	t3,a7,s2
  b8:	ff572423          	sw	s5,-24(a4)
  bc:	03fc0fb3          	mul	t6,s8,t6
  c0:	ffc72623          	sw	t3,-20(a4)
  c4:	02980a33          	mul	s4,a6,s1
  c8:	fdf72c23          	sw	t6,-40(a4)
  cc:	02850333          	mul	t1,a0,s0
  d0:	ff472823          	sw	s4,-16(a4)
  d4:	027589b3          	mul	s3,a1,t2
  d8:	fe672a23          	sw	t1,-12(a4)
  dc:	025608b3          	mul	a7,a2,t0
  e0:	ff372c23          	sw	s3,-8(a4)
  e4:	ff172e23          	sw	a7,-4(a4)
  e8:	f4fb1fe3          	bne	s6,a5,46 <.L3>

I’m not going to work this out exactly, but it’s 44 instructions for 10 loop iterations and I’m guessing about 75 - 100 clock cycles total, or 7.5 - 10 per source code iteration, depending on the data you are multiplying.

Hi Bruce,

By the way, is there a cycle count available for each RISC-V instruction (e.g. a table showing how many clock cycles each instruction takes on the E300 platform)?

Thanks,
Dong

Hi Bruce,

Yes, my code is just like your example, except I define the variables as static.
Whole code:

#define DATA_SIZE 100
static int input_data1[DATA_SIZE] = {0};
static int input_data2[DATA_SIZE] = {0};
static int results_data[DATA_SIZE] = {0};
unsigned long Begin_Time;
unsigned long End_Time;
unsigned long Use_Time;

void test()
{
  int i, j;

  Begin_Time = *(volatile unsigned long *)(CLINT_CTRL_ADDR + CLINT_MTIME);

  for (j = 0; j < 1000; j++)
  {
    for (i = 0; i < 100; i++)
    {
      results_data[i] = input_data1[i] * input_data2[i];
    }
  }

  End_Time = *(volatile unsigned long *)(CLINT_CTRL_ADDR + CLINT_MTIME);

  Use_Time = End_Time - Begin_Time;

  printf("Begin_Time= %lu, End_Time= %lu, Use_Time= %lu\n", Begin_Time, End_Time, Use_Time);
}

I use the 256 MHz CPU frequency. In fact, I put my test code into demo_gpio.c and didn’t change the CPU frequency or the compile configuration.

I got following from UART after I run the test:

core freq at 265289728 Hz
Time: Begin_Time= 17109, End_Time= 17272, Use_Time= 163

Certainly. That is documented in section 3.3, “Execution Pipeline”, of the E31 Coreplex manual: https://www.sifive.com/documentation/coreplex/e31-coreplex-manual/

“3.3 Execution Pipeline
The E31 Coreplex execution unit is a single-issue, in-order pipeline. The pipeline comprises five
stages: instruction fetch, instruction decode and register fetch, execute, data memory access, and
register writeback.
The pipeline has a peak execution rate of one instruction per clock cycle, and is fully bypassed so
that most instructions have a one-cycle result latency. There are several exceptions:
• LW has a two-cycle result latency, assuming a cache hit.
• LH, LHU, LB, and LBU have a three-cycle result latency, assuming a cache hit.
• CSR reads have a three-cycle result latency.
• MUL, MULH, MULHU, and MULHSU have a 5-cycle result latency.
• DIV, DIVU, REM, and REMU have between a 2-cycle and 33-cycle result latency, depending
on the operand values.
The pipeline only interlocks on read-after-write and write-after-write hazards, so instructions may be scheduled to avoid stalls.”

Hmm. I thought the multiplier did “early out”, but it doesn’t say so there.
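For quick reference, here is that latency list encoded as a small lookup table (a sketch of my own; the mnemonics and cycle counts come from the manual text above, but the type and function names are made up):

```c
#include <string.h>

/* E31 result latencies from section 3.3, encoded as a lookup helper.
   Assumes cache hits; everything not listed has a one-cycle result latency. */
typedef struct {
    const char *mnemonic;
    int min_latency;   /* cycles */
    int max_latency;
} latency_t;

static const latency_t latencies[] = {
    { "lw",   2, 2 },
    { "lh",   3, 3 }, { "lhu", 3, 3 }, { "lb", 3, 3 }, { "lbu", 3, 3 },
    { "csrr", 3, 3 },
    { "mul",  5, 5 }, { "mulh", 5, 5 }, { "mulhu", 5, 5 }, { "mulhsu", 5, 5 },
    { "div",  2, 33 }, { "divu", 2, 33 }, { "rem", 2, 33 }, { "remu", 2, 33 },
};

/* Return the result-latency range for a mnemonic; 1 cycle for anything else. */
latency_t lookup_latency(const char *mnemonic) {
    for (size_t i = 0; i < sizeof latencies / sizeof latencies[0]; i++)
        if (strcmp(latencies[i].mnemonic, mnemonic) == 0)
            return latencies[i];
    latency_t one = { mnemonic, 1, 1 };
    return one;
}
```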


Thanks Bruce, this helps a lot!