Cycle Counts on Unleashed Board Seem Inaccurate?

Hi Everyone, I hope this is the right place to put this. It’s either here or the Freedom Unleashed forum and I’m not 100% sure which one would be better. I’ll preface this by saying I’m much more used to working on much less complex boards - that will probably become more obvious in a minute.

Anyways, I’ve modified U-Boot to perform a verified boot using RSA 2048 and I’m printing out cycle counts for the actual verification step and I’m seeing a number that doesn’t totally make sense to me. Here’s the code in rsa_verify_with_keynode located in rsa-verify.c:

    printf("Verifying signature.......");
#ifndef _GNU_SOURCE
    //This is int here, but I've tried unsigned long, unsigned long long, uint64_t, etc including changing the appropriate printf formatter below
    int t, t1;
    t = rdcycle();
#endif
    ret = rsa_verify_key(&prop, sig, sig_len, hash,
                         info->crypto->key_len, info->checksum);

#ifndef _GNU_SOURCE
    t1 = rdcycle();
    printf("Successfully verified signature?? %d\n",  t1 - t);
#endif

So I’m getting a cycle count of 8,973,663 for this RSA verification, and I know that when I run RSA-2048 verification on an Arm Cortex-M4 device I hover around 3,500,000 cycles, so something is up. The number is also pretty consistent, so it’s at least reliably high.

As a test I chained together 100 asm("nop") calls and got a cycle count somewhere around 800-1000.

So what am I missing? Am I accessing the cycle counter incorrectly? Are interrupts causing an inflated cycle count? If so, can I turn off interrupts? bootm_disable_interrupts() doesn’t appear to have any effect and disable_interrupts() in HiFive_UBoot/arch/riscv/lib/interrupts.c is literally just a return statement.
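For what it’s worth, if U-Boot is running in machine mode here, my (possibly wrong) understanding is that I could mask interrupts around the measurement myself by clearing mstatus.MIE, something like this untested sketch (my_irq_save/my_irq_restore and the MSTATUS_MIE define are just names I made up; use sstatus.SIE instead if it’s actually running in supervisor mode):

    /* Untested sketch: mask/restore global interrupts around the timed region,
     * assuming U-Boot executes in machine mode on this board. */
    #define MSTATUS_MIE 0x8UL

    static inline unsigned long my_irq_save(void)
    {
        unsigned long prev;
        /* atomically clear mstatus.MIE and return the old mstatus value */
        asm volatile ("csrrc %0, mstatus, %1" : "=r"(prev) : "rK"(MSTATUS_MIE));
        return prev;
    }

    static inline void my_irq_restore(unsigned long prev)
    {
        /* set MIE again only if it was set when we saved it */
        if (prev & MSTATUS_MIE)
            asm volatile ("csrs mstatus, %0" :: "r"(MSTATUS_MIE));
    }

Then I’d wrap the timed rsa_verify_key() call in my_irq_save()/my_irq_restore() and see whether the count changes.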

What happens if you copy and paste that code twice more (or wrap it in a function) and run it three times in a row? Are the 2nd and 3rd runs (a) fewer cycles, and (b) consistent with each other?

It’s also possible that this is simply down to the lack of a rotate instruction in the current RISC-V instruction set. In general-purpose code this is not a big deal, but rsa_verify_key() seems like something that might use rotates heavily, in which case a 2:1 or 3:1 difference in cycles would not be surprising. If RSA is a large part of the workload then of course custom hardware and/or instructions would be the way to go, but this seems like it might be fairly incidental.
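Just to illustrate the kind of overhead I mean (rotl32 is only an example, not something I’m claiming rsa_verify_key() contains): without a rotate instruction, a 32-bit rotate has to be emulated with two shifts and an OR, e.g.

    #include <stdint.h>

    /* Rough illustration only: emulating a 32-bit rotate-left with shifts and
     * an OR, which is what the compiler has to emit when there is no rotate
     * instruction available. */
    static inline uint32_t rotl32(uint32_t x, unsigned int r)
    {
        r &= 31;                                   /* keep the shift amount legal */
        return (x << r) | (x >> ((32 - r) & 31));  /* two shifts + OR vs. one rot */
    }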

I increased it to a total of 300 nops as you suggested and now get 2300-2600 cycles, which is roughly the same 8 or so cycles per nop, so I guess it scales? Not sure what to make of that.

Just for kicks I threw micro-ECC on there with a hardcoded signature and timed that and got 16 million cycles. I’ve got micro-ECC running with the same curve on another 64-bit RISC-V processor at only 5 million.

Can you try this https://github.com/tmagik/secp256k1/commit/8186725c7db92f196c13c2e96e187a9e61cea5e9 (it builds and links into U-Boot, but I’ve never tested it) and do the cycle count?

It might be a lot easier to compare the cycle counts if we are running on a full Linux system. Can you point to some example code for micro-ECC so I can just compile it and put it on a machine Bruce can look at?

Also, given the stage you are running U-Boot at, we might not have the cache(s) fully enabled, so that could also explain the delta.

Not nops … I meant calls to rsa_verify_key().

Preferably put all the code from “Verifying signature” to “Successfully verified signature” into a loop that runs three or more times (not straight-line code).
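Something along these lines (untested sketch, reusing the variables from your snippet, with t and t1 as unsigned long and a %lu format):

	/* Untested sketch: time each verify pass separately so we can see whether
	 * the first run is the slow one (warmup/caches) or all three are equal. */
	for (int i = 0; i < 3; i++) {
		t = rdcycle();
		ret = rsa_verify_key(&prop, sig, sig_len, hash,
				     info->crypto->key_len, info->checksum);
		t1 = rdcycle();
		printf("run %d: %lu cycles (ret=%d)\n", i + 1, t1 - t, ret);
	}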

Bruce,

Performing three calls into the RSA verify results in a cycle count of 26,896,534, so roughly 3 times as long.

Troy,

Here is micro-ECC: https://github.com/kmackay/micro-ecc (ECDH and ECDSA for 8-bit, 32-bit, and 64-bit processors). If running on a RISC-V processor it helps to set uECC_OPTIMIZATION_LEVEL=3 and uECC_SQUARE_FUNC=1, and to explicitly define uECC_WORD_SIZE as 8, otherwise it defaults to a 4-byte word size.
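In other words, something like this in a config header (or the equivalent -D flags) before building micro-ECC:

    /* Build-time configuration I use for micro-ECC on RV64 */
    #define uECC_OPTIMIZATION_LEVEL 3
    #define uECC_SQUARE_FUNC        1
    #define uECC_WORD_SIZE          8   /* otherwise it defaults to a 4-byte word size */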

I got secp256k1 up and running but when I try and generate public keys it fails right out of the gate:

[libsecp256k1] internal consistency check failed: Out of memory
[libsecp256k1] internal consistency check failed: Out of memory
[libsecp256k1] internal consistency check failed: Out of memory
[libsecp256k1] internal consistency check failed: Out of memory
exception code: 7 , , epc fff80a4c , ra fff81218
exception code: 7 , , epc fff80a4c , ra fff81218
exception code: 7 , , epc fff80a4c , ra fff81218

I dunno if it really is running out of memory or if I implemented libsecp256k1 incorrectly. I’ll keep poking around and if I have to I’ll generate the public keys and signature elsewhere off the board and just embed them in the bootloader and run verify over them.

That would be impressive to generate keys in U-boot.

I imagine key generation could easily take more memory than U-boot is used to running with, or there is some other problem with using the minimal malloc implemented in it.

Somewhere there’s likely a U-Boot malloc debug option, but at this point I’d say stick a printf in the U-Boot malloc code, try embedding the keys/sig in the bootloader, and just verify them.
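Even something as blunt as a temporary wrapper would tell you what it’s asking for. A sketch (dbg_malloc is a hypothetical name, purely for debugging and then removing again):

    /* Hypothetical debug wrapper: log every allocation size and result while
     * chasing the out-of-memory failure. */
    static void *dbg_malloc(size_t size)
    {
        void *p = malloc(size);
        printf("malloc(%lu) -> %p\n", (unsigned long)size, p);
        return p;
    }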

Oh, I wasn’t aware ECDSA key/signature generation was so resource-intensive. I guess I naively assumed that because it’s so prominent as an embedded solution for signature verification, key/signature generation would be equally lightweight.

I’ll just get some hardcoded values in there and report back.
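For the record, my plan is to do the generation on my desktop with the stock libsecp256k1 API and paste the resulting byte arrays into the bootloader. Roughly this (untested sketch; the private key and message hash are placeholders, and the fork Troy linked may differ from upstream):

    /* Untested host-side sketch using the upstream libsecp256k1 API: generate
     * the byte arrays to hard-code into the bootloader. The seckey and hash
     * values below are placeholders, not real values. */
    #include <stdio.h>
    #include <secp256k1.h>

    int main(void)
    {
        unsigned char seckey[32] = { 1 };  /* fill with 32 bytes from a real CSPRNG */
        unsigned char hash[32]   = { 2 };  /* SHA-256 of whatever the bootloader verifies */
        unsigned char pub_ser[65];
        unsigned char sig_ser[64];
        size_t publen = sizeof(pub_ser);

        secp256k1_context *ctx =
            secp256k1_context_create(SECP256K1_CONTEXT_SIGN | SECP256K1_CONTEXT_VERIFY);
        secp256k1_pubkey pubkey;
        secp256k1_ecdsa_signature sig;

        if (!secp256k1_ec_pubkey_create(ctx, &pubkey, seckey) ||
            !secp256k1_ecdsa_sign(ctx, &sig, hash, seckey, NULL, NULL))
            return 1;

        /* 65-byte uncompressed pubkey and 64-byte compact (r||s) signature,
         * matching what pubkey_parse / signature_parse_compact expect on the board */
        secp256k1_ec_pubkey_serialize(ctx, pub_ser, &publen, &pubkey,
                                      SECP256K1_EC_UNCOMPRESSED);
        secp256k1_ecdsa_signature_serialize_compact(ctx, sig_ser, &sig);

        for (size_t i = 0; i < publen; i++)
            printf("0x%02x, ", pub_ser[i]);
        printf("\n");
        for (size_t i = 0; i < sizeof(sig_ser); i++)
            printf("0x%02x, ", sig_ser[i]);
        printf("\n");

        secp256k1_context_destroy(ctx);
        return 0;
    }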

Thanks for that. I was actually looking for the time for each call, to see the progression and see if it stabilizes.

However, that total is so close to 3x the earlier number that warmup and cache effects are clearly at most a very minor issue.

Bruce, let’s pretend I always knew what you wanted from the start and I was just doing it wrong on purpose. That’s a little easier on my ego :grinning:

You are correct that the three runs complete in roughly the same time:

run 1: 8627750
run 2: 8616522
run 3: 8616088

So it doesn’t look like I can even make it past allocating the first context without running out of memory with that secp256k1 fork. A quick look online indicates that particular implementation is “optimized for large fast chips with huge cache”, although I don’t know where the Unleashed falls on that spectrum. I know it’s far and away the fastest board I’ve worked on, since I’m used to Cortex-M3s and M4s.

Here’s my code:

    t1 = rdcycle();
    printf("Successfully verified signature %d? %lu\n", success, t1 - t);

    /* Contexts: one used for parsing the key/signature, one for verification */
    secp256k1_context *sign = secp256k1_context_create(SECP256K1_CONTEXT_VERIFY);
    printf("Created context\n");
    secp256k1_context *vrfy = secp256k1_context_create(SECP256K1_CONTEXT_VERIFY);
    printf("Created context\n");
    secp256k1_ecdsa_signature sigeccc;
    secp256k1_pubkey ecpubkey;

    printf("Declared things\n");

    /* Hardcoded uncompressed public key (0x04 prefix + 32-byte X + 32-byte Y) */
    unsigned char eccpr[65] = {
        0x04, 0x1b, 0x84, 0xc5, 0x56, 0x7b, 0x12, 0x64,
        0x40, 0x99, 0x5d, 0x3e, 0xd5, 0xaa, 0xba, 0x05,
        0x65, 0xd7, 0x1e, 0x18, 0x34, 0x60, 0x48, 0x19,
        0xff, 0x9c, 0x17, 0xf5, 0xe9, 0xd5, 0xdd, 0x07,
        0x8f, 0x70, 0xbe, 0xaf, 0x8f, 0x58, 0x8b, 0x54,
        0x15, 0x07, 0xfe, 0xd6, 0xa6, 0x42, 0xc5, 0xab,
        0x42, 0xdf, 0xdf, 0x81, 0x20, 0xa7, 0xf6, 0x39,
        0xde, 0x51, 0x22, 0xd4, 0x7a, 0x69, 0xa8, 0xe8,
        0xd1,
    };

    /* Hardcoded 64-byte compact signature (r || s) */
    unsigned char sigbytes[64] = {
        0x4c, 0x18, 0xaf, 0x94, 0xcc, 0xcd, 0x72, 0x34,
        0xe6, 0x45, 0x80, 0x96, 0x2d, 0x9b, 0x9e, 0xe8,
        0x0a, 0xe2, 0xeb, 0x09, 0x2a, 0x4c, 0x02, 0xed,
        0xe1, 0xe8, 0xfb, 0xaa, 0x30, 0xe6, 0xf5, 0xc7,
        0x00, 0x00, 0x7f, 0xff, 0xff, 0xff, 0xdd, 0x40,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x18, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x64, 0xb0, 0x30,
        0x1b, 0xc4, 0x32, 0xc3, 0xde, 0xfe, 0x4b, 0x91,
    };

    secp256k1_ec_pubkey_parse(sign, &ecpubkey, eccpr, 65);
    printf("Parsing pubkey\n");
    secp256k1_ecdsa_signature_parse_compact(sign, &sigeccc, sigbytes);
    printf("Parsing signature\n");

    t = rdcycle();
    if (secp256k1_ecdsa_verify(vrfy, &sigeccc, hashecc, &ecpubkey) != 1)
    {
        printf("Error verifying\n");
    }
    t1 = rdcycle();
    printf("New ECDSA: %lu cycles\n", t1 - t);

And output:

Verifying signature.......Successfully verified signature 1? 16675577 (previous run of micro ECDSA)
[libsecp256k1] internal consistency check failed: Out of memory
[libsecp256k1] internal consistency check failed: Out of memory
[libsecp256k1] internal consistency check failed: Out of memory
[libsecp256k1] internal consistency check failed: Out of memory
exception code: 7 ,  , epc fff807ac , ra fff80f78
exception code: 7 ,  , epc fff807ac , ra fff80f78
exception code: 7 ,  , epc fff807ac , ra fff80f78

Dunno, I’ll keep digging

edit: The first 5 malloc calls that version of ECDSA makes are for 208 bytes, 1.04 million bytes, 2.03 million bytes, 1.3 million bytes, and 665,360 bytes. I dunno what sort of RAM the bootloader has carved out for itself, though.

rsa_verify_key() ends up as a number of calls to montgomery_mul_add_step:

static void montgomery_mul_add_step(const struct rsa_public_key *key,
		uint32_t result[], const uint32_t a, const uint32_t b[])
{
	uint64_t acc_a, acc_b;
	uint32_t d0;
	uint i;

	acc_a = (uint64_t)a * b[0] + result[0];
	d0 = (uint32_t)acc_a * key->n0inv;
	acc_b = (uint64_t)d0 * key->modulus[0] + (uint32_t)acc_a;
	for (i = 1; i < key->len; i++) {
		acc_a = (acc_a >> 32) + (uint64_t)a * b[i] + result[i];
		acc_b = (acc_b >> 32) + (uint64_t)d0 * key->modulus[i] +
				(uint32_t)acc_a;
		result[i - 1] = (uint32_t)acc_b;
	}

	acc_a = (acc_a >> 32) + (acc_b >> 32);

	result[i - 1] = (uint32_t)acc_a;

	if (acc_a >> 32)
		subtract_modulus(key, result);
}

That looks OK for us. The repeated casting of uint32_t values to uint64_t is not so good on RISC-V, but gcc manages to hoist it out of the loop or otherwise optimise it.

The inner loop:

0000000000000052 <.L3>:
  52:   0008e303                lwu     t1,0(a7)
  56:   000e6503                lwu     a0,0(t3)
  5a:   00486e83                lwu     t4,4(a6)
  5e:   02c30333                mul     t1,t1,a2
  62:   9381                    srli    a5,a5,0x20
  64:   97f6                    add     a5,a5,t4
  66:   9301                    srli    a4,a4,0x20
  68:   0811                    addi    a6,a6,4
  6a:   0891                    addi    a7,a7,4
  6c:   0e11                    addi    t3,t3,4
  6e:   02750533                mul     a0,a0,t2
  72:   979a                    add     a5,a5,t1
  74:   01f7f333                and     t1,a5,t6
  78:   972a                    add     a4,a4,a0
  7a:   971a                    add     a4,a4,t1
  7c:   fee82e23                sw      a4,-4(a6)
  80:   fd1699e3                bne     a3,a7,52 <.L3>

That looks fine. The results of the loads and multiplies aren’t consumed until well after those instructions, so I think it should run at essentially 1.0 IPC.

It should be fewer cycles than a Cortex-M4 because we can actually do 64-bit arithmetic directly. Also, compiling it for the M4 I see a lot of spilling to the stack.

Troy may be correct that the caches are not set up at the stage U-Boot runs this, because otherwise I think it should run fine.
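As a very rough sanity check: with a 2048-bit key and 32-bit words, key->len is 64, so that inner loop runs 63 times per montgomery_mul_add_step at about 17 instructions per iteration, a full Montgomery multiplication is 64 such steps, and a verify with the usual 65537 exponent needs on the order of 17 to 20 Montgomery multiplications. That works out to very roughly 1.2 to 1.5 million instructions, so at anything close to 1 IPC the 8.6 million cycles you are measuring looks several times slower than I’d expect the core itself to be.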

Would the lack of caching also explain why 300 nops take almost 10 times as many cycles as they theoretically should? I’m in pretty unfamiliar territory here, but I wouldn’t have guessed that caching comes into play with nops?

Instruction fetch works better with caching too: potentially one RAM access for every 64 bytes instead of one for every instruction. (I don’t know the exact details.)
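With 4-byte instructions that would be up to 16 instructions per 64-byte line, so uncached straight-line code could plausibly be paying a memory access for every instruction fetched, which would go a fair way toward the roughly 8 cycles per nop you measured earlier.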

If you can, try extracting that code and running it as a normal application.

Getting everything running in a normal application is my next step, I’ll report back when I get that going.

Scanning through the user manual I see some mention of cache misses incurring three-cycle penalties, so that seems to add up.

So as a quick test I compiled a small application on my desktop that runs 100 nops and scp’d it over to the board.

#include <stdio.h>

#define read_csr(reg) ({ unsigned long __tmp; \
  asm volatile ("csrr %0, " #reg : "=r"(__tmp)); \
  __tmp; })

#define write_csr(reg, val) ({ \
  asm volatile ("csrw " #reg ", %0" :: "rK"(val)); })

#define swap_csr(reg, val) ({ unsigned long __tmp; \
  asm volatile ("csrrw %0, " #reg ", %1" : "=r"(__tmp) : "rK"(val)); \
  __tmp; })

#define set_csr(reg, bit) ({ unsigned long __tmp; \
  asm volatile ("csrrs %0, " #reg ", %1" : "=r"(__tmp) : "rK"(bit)); \
  __tmp; })

#define clear_csr(reg, bit) ({ unsigned long __tmp; \
  asm volatile ("csrrc %0, " #reg ", %1" : "=r"(__tmp) : "rK"(bit)); \
  __tmp; })

#define rdtime() read_csr(time)
#define rdcycle() read_csr(cycle)
#define rdinstret() read_csr(instret)

int main()
{
	unsigned long t = rdcycle();
	asm("nop");
	....
	asm("nop");

	unsigned long t1 = rdcycle();
	printf("cycles: %lu\n", t1 - t);

}

I had to statically link all the dependencies because the program was giving me an error about not having the right version of GLIBC on the HiFive device, but once I did that I got the following output:

# ./a.out
cycles: 342
# ./a.out
cycles: 236
# ./a.out
cycles: 640
# ./a.out
cycles: 404
# ./a.out
cycles: 390
# ./a.out
cycles: 342
# ./a.out
cycles: 346
# ./a.out
cycles: 676
# ./a.out
cycles: 222
# ./a.out
cycles: 344
# ./a.out
cycles: 378
# ./a.out
cycles: 220

I’m not sure what to make of that. I guess now that the entire system is running and interrupts are firing, it’s unlikely that any numbers out of the cycle counter can really be trusted to that level of granularity?

edit: I should also point out that just a single nop in there reports out as 14 cycles

It’s impossible to measure anything useful (at least about instruction execution times) with a program like this. Code always runs slower the first time because it needs to be copied from RAM into cache (or even from disk into RAM) the first time you run it.

You’ll get much more useful data by doing it like this:

int main()
{
  for (int i=0; i<10; ++i) {
    unsigned long t = rdcycle();
    asm("nop");
    ....
    asm("nop");

    unsigned long t1 = rdcycle();
    printf("cycles: %lu\n", t1 - t);
  }
}

There we go:

# ./a.out
cycles: 327
cycles: 157
cycles: 112
cycles: 112
cycles: 112
cycles: 112
cycles: 112
cycles: 112
cycles: 112
cycles: 134

Great! So it looks like 100 nops really do take 100 cycles, plus 12 cycles for returning from the first rdcycle() after it reads the actual CSR and getting into the second rdcycle() up to the point where it reads the CSR.

You could verify that by running it with just the two rdcycle() calls and no nops between them.
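i.e. the same loop as before with nothing between the two reads:

int main()
{
  for (int i=0; i<10; ++i) {
    unsigned long t = rdcycle();
    /* nothing between the two reads: measures the fixed overhead itself */
    unsigned long t1 = rdcycle();
    printf("cycles: %lu\n", t1 - t);
  }
}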

Interesting question why the last iteration took 22 cycles longer. Maybe that really was an interrupt, though it would be a pretty quick one.