In general, it’s very difficult to say anything without seeing the entire, exact program you ran.
Assuming, for example, that input_data1 and input_data2 are initialized global variables, it can make a HUGE difference whether or not they are declared “const”.
If we can see your whole program then we don’t have to guess.
We don’t even know the clock speed you’re using.
If it’s 320 MHz then you have almost exactly 16 clock cycles per iteration. If it’s 256 MHz then it’s between 12 and 13.
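The arithmetic behind those figures is just elapsed time multiplied by clock frequency, divided by the number of inner-loop iterations. A minimal sketch, assuming a hypothetical helper name (cycles_per_iteration) and a hypothetical measurement; neither comes from your post:

```c
/* Convert a measured run time into clock cycles per inner-loop iteration.
   seconds:    wall-clock time for the whole run (a hypothetical input here)
   clock_hz:   core clock frequency in Hz
   iterations: total number of inner-loop iterations executed */
static double cycles_per_iteration(double seconds, double clock_hz, long iterations)
{
    return seconds * clock_hz / (double)iterations;
}
```

For example, if the run took 5 ms for 100,000 iterations (an assumed value, picked only because it is consistent with the numbers above), cycles_per_iteration(0.005, 320e6, 100000) gives 16 and cycles_per_iteration(0.005, 256e6, 100000) gives 12.8.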
We don’t know what compiler flags or optimization level you used.
Assuming your code looks like this:
int input_data1[100];
int input_data2[100];
int results_data[100];

void test(){
    int i, j;
    for(j = 0; j < 1000; j++)
    {
        for (i = 0; i < 100; i++)
        {
            results_data[i] = input_data1[i] * input_data2[i];
        }
    }
}
… and you compiled with -O or -O1 then you’ll have something like this as your inner loop:
00000024 <.L3>:
24: 4398 lw a4,0(a5)
26: 420c lw a1,0(a2)
28: 02b70733 mul a4,a4,a1
2c: c298 sw a4,0(a3)
2e: 0791 addi a5,a5,4
30: 0611 addi a2,a2,4
32: 0691 addi a3,a3,4
34: fea798e3 bne a5,a0,24 <.L3>
I count eight instructions. The loads and stores take two cycles each, the multiply 1 to 4 depending on your data (which we don’t know). The branch will almost always take 1 cycle.
So, 12 - 15 cycles per inner-loop iteration looks bang-on, depending on your data.
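The tally can be sketched in a few lines of C. This is a crude model that uses only the rough per-instruction costs given above and ignores pipeline stalls and instruction fetch effects, so a straight sum lands at 11 to 14, within a cycle of the range quoted. The names (loop_mix, loop_cycles, o1_loop) are made up for illustration:

```c
/* Rough per-instruction costs as stated in the text, not datasheet figures. */
enum { LOAD_CYCLES = 2, STORE_CYCLES = 2, ALU_CYCLES = 1, BRANCH_CYCLES = 1 };

/* How many instructions of each kind one pass through a loop body executes. */
struct loop_mix {
    int loads, stores, muls, alu_ops, branches;
};

/* Sum the cycles for one pass through the loop body, given the cost of a
   single multiply (1 to 4 cycles, depending on the data being multiplied). */
static int loop_cycles(struct loop_mix m, int mul_cycles)
{
    return m.loads * LOAD_CYCLES
         + m.stores * STORE_CYCLES
         + m.muls * mul_cycles
         + m.alu_ops * ALU_CYCLES
         + m.branches * BRANCH_CYCLES;
}

/* Instruction mix of the -O1 inner loop: 2 lw, 1 sw, 1 mul, 3 addi, 1 bne. */
static const struct loop_mix o1_loop = { 2, 1, 1, 3, 1 };
```

With these assumptions, loop_cycles(o1_loop, 1) is 11 and loop_cycles(o1_loop, 4) is 14.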
-O2 or -O3 will give the same instructions, just rearranged a little.
If you add -funroll-loops then you can get:
00000046 <.L3>:
46: 0046ab83 lw s7,4(a3)
4a: 0087ae83 lw t4,8(a5)
4e: 0086aa83 lw s5,8(a3)
52: 00c7ae03 lw t3,12(a5)
56: 00c6aa03 lw s4,12(a3)
5a: 0107a303 lw t1,16(a5)
5e: 0106a983 lw s3,16(a3)
62: 0147a883 lw a7,20(a5)
66: 0006af83 lw t6,0(a3)
6a: 0047af03 lw t5,4(a5)
6e: 0146a903 lw s2,20(a3)
72: 0187a803 lw a6,24(a5)
76: 4e84 lw s1,24(a3)
78: 4fc8 lw a0,28(a5)
7a: 4ec0 lw s0,28(a3)
7c: 538c lw a1,32(a5)
7e: 0206a383 lw t2,32(a3)
82: 53d0 lw a2,36(a5)
84: 0246a283 lw t0,36(a3)
88: 0007ac03 lw s8,0(a5)
8c: 037f0f33 mul t5,t5,s7
90: 02870713 addi a4,a4,40
94: 02878793 addi a5,a5,40
98: 02868693 addi a3,a3,40
9c: 035e8bb3 mul s7,t4,s5
a0: fde72e23 sw t5,-36(a4)
a4: 034e0eb3 mul t4,t3,s4
a8: ff772023 sw s7,-32(a4)
ac: 03330ab3 mul s5,t1,s3
b0: ffd72223 sw t4,-28(a4)
b4: 03288e33 mul t3,a7,s2
b8: ff572423 sw s5,-24(a4)
bc: 03fc0fb3 mul t6,s8,t6
c0: ffc72623 sw t3,-20(a4)
c4: 02980a33 mul s4,a6,s1
c8: fdf72c23 sw t6,-40(a4)
cc: 02850333 mul t1,a0,s0
d0: ff472823 sw s4,-16(a4)
d4: 027589b3 mul s3,a1,t2
d8: fe672a23 sw t1,-12(a4)
dc: 025608b3 mul a7,a2,t0
e0: ff372c23 sw s3,-8(a4)
e4: ff172e23 sw a7,-4(a4)
e8: f4fb1fe3 bne s6,a5,46 <.L3>
I’m not going to work this out exactly, but it’s 44 instructions for 10 loop iterations and I’m guessing about 75 - 100 clock cycles total, or 7.5 - 10 per source code iteration, depending on the data you are multiplying.
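For anyone who wants to check that guess, here is a minimal sketch of the same back-of-the-envelope sum for the unrolled loop. It assumes the rough costs used earlier (2 cycles per load or store, 1 per addi or branch, 1 to 4 per multiply) and ignores stalls, so treat the result as an estimate only. The function and constant names are made up for illustration:

```c
/* Instruction counts in the unrolled loop body: 44 instructions in total. */
enum { N_LOADS = 20, N_STORES = 10, N_MULS = 10, N_ADDI = 3, N_BRANCH = 1 };

/* Each pass through the unrolled body covers 10 source-level iterations. */
enum { UNROLL_FACTOR = 10 };

/* Estimated cycles for one pass through the unrolled body, assuming
   2 cycles per load/store and 1 cycle per addi/branch (rough figures). */
static int unrolled_body_cycles(int mul_cycles)
{
    return N_LOADS * 2 + N_STORES * 2 + N_MULS * mul_cycles
         + N_ADDI * 1 + N_BRANCH * 1;
}

/* The same estimate expressed per source-level loop iteration. */
static double cycles_per_source_iteration(int mul_cycles)
{
    return (double)unrolled_body_cycles(mul_cycles) / UNROLL_FACTOR;
}
```

Under those assumptions the body comes to 74 cycles with 1-cycle multiplies and 104 with 4-cycle multiplies, or 7.4 to 10.4 per source-level iteration, which is in line with the rough 75 - 100 guess.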