RISC-V U540 inference is too slow

Toward the bottom of the “SiFive Essential™” page on the official SiFive website, it says "U54 core can compare with A53".

So I assumed that if the software is almost identical on both platforms (same glibc version, same gcc compiler parameters, and so on), inference time on an A53 and on a U54 should be close. I ran two tests using the same model, “mobilenet_v1.tflite”:

  • Python tflite_runtime inference time comparison
    I ran tflite_runtime (Python) inference on both the U540 board (CPU core: U54) and the Raspberry Pi 3B (CPU core: ARM A53).

  • C++ tensorflow_lite inference time comparison
    I ran the benchmark tool (tensorflow-r2.5\tensorflow\lite\tools\benchmark) on both the U540 board (CPU core: U54) and the Raspberry Pi 3B (CPU core: ARM A53). I compiled the C++ benchmark with both -O0 and -O3 to compare.

The test results show that inference on the U540 always takes far longer than on the Raspberry Pi 3B. Why is that? Can anyone explain?
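
One way to make a comparison like this repeatable on both boards is to time only the inference call, after a few warm-up runs, and report the median. The following is a minimal sketch, not part of tflite: the names time_inference, warmup and runs are hypothetical, and the workload here is a stand-in. With tflite_runtime, the callable would typically be the interpreter's invoke method, called after allocate_tensors() and set_tensor().

```python
import statistics
import time


def time_inference(invoke, warmup=3, runs=20):
    """Return the median wall-clock time, in milliseconds, of calling invoke()."""
    # Warm-up calls are executed but not timed (caches, lazy initialization).
    for _ in range(warmup):
        invoke()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional scheduler hiccups than the mean.
    return statistics.median(samples_ms)


if __name__ == "__main__":
    # Stand-in workload; replace with the real inference call on each board.
    median_ms = time_inference(lambda: sum(i * i for i in range(100_000)))
    print(f"median: {median_ms:.2f} ms")
```

Running the same script with the same warmup and runs values on both boards keeps the measurement method itself from becoming a source of difference.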

“can compare” doesn’t mean same performance for all benchmarks. The Unleashed FU540 is roughly equivalent to an A53 for integer code, but has a slower memory system (no hardware cache prefetch at that time), and has slower FP. tflite is an FP benchmark I think, so I would expect the performance to be worse on U54. How much slower depends on exactly what the benchmark is doing. I’m not familiar with that benchmark so I don’t know how to read the results you got, or explain why you got those results.

Also pay attention to clock speed. Wikipedia says the Raspberry Pi Model 3B runs at 1.2GHz and the 3B+ at 1.4 GHz. The Unleashed runs at 1.0GHz by default. Clock rate will affect the performance.
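
To separate the clock-rate effect from the per-cycle difference between the cores, the measured times can be converted to approximate cycle counts before comparing. This is only a sketch: the function name cycles_ratio is made up for illustration, and the timings are placeholders, not measurements from either board. Only the clock rates come from the paragraph above.

```python
def cycles_ratio(time_a_ms, clock_a_ghz, time_b_ms, clock_b_ghz):
    """Ratio of clock cycles spent on platform A versus platform B.

    time * clock is proportional to the number of cycles, so this removes
    the part of the gap that is explained by clock rate alone.
    """
    return (time_a_ms * clock_a_ghz) / (time_b_ms * clock_b_ghz)


# Placeholder timings for illustration only, NOT real measurements.
u540_ms, u540_ghz = 300.0, 1.0    # Unleashed default clock
rpi3b_ms, rpi3b_ghz = 100.0, 1.2  # Raspberry Pi 3B clock

raw = u540_ms / rpi3b_ms
per_clock = cycles_ratio(u540_ms, u540_ghz, rpi3b_ms, rpi3b_ghz)
print(f"raw slowdown: {raw:.2f}x, per-clock slowdown: {per_clock:.2f}x")
```

With these placeholder numbers, a 3.0x gap in wall time corresponds to a 2.5x gap in cycles; the remainder is just the difference in clock rate.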

Also keep in mind that SiFive is an IP Core company, and we release improved versions of our cores every 3 months. The Unleashed FU540 is a 3.5 year old part, and does not have the same performance as current U54 cores. The web site is listing the performance of the current U54 core.

Also performance depends on how the core is configured. The U54 for instance can be configured with a 4-cycle multiplier or a 1-cycle pipelined multiplier. I believe the FU540 has the slower one which was the default at the time. That will be a problem for benchmarks that do a lot of multiplies. The faster multiplier is now the default.
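
As a rough back-of-the-envelope illustration of why the multiplier matters, the sketch below estimates a worst-case slowdown from multiplier latency alone. It is a deliberately simple model, not a description of the actual U54 pipeline: it assumes a base cost of one cycle per instruction and that every multiply stalls for the full extra latency. The function name and the 20% multiply fraction are hypothetical.

```python
def estimated_slowdown(multiply_fraction, slow_latency=4, fast_latency=1, base_cpi=1.0):
    """Worst-case slowdown from a slower multiplier under a simple stall model.

    multiply_fraction: share of executed instructions that are multiplies (0..1).
    Assumes each multiply adds (slow_latency - fast_latency) stall cycles.
    """
    extra_cycles = multiply_fraction * (slow_latency - fast_latency)
    return (base_cpi + extra_cycles) / base_cpi


# Hypothetical workload where 20% of instructions are multiplies.
print(f"estimated slowdown: {estimated_slowdown(0.2):.2f}x")
```

Real code overlaps some of that latency with independent instructions, so the measured effect would be smaller than this upper bound.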