Memory access is too slow

I tried running memory-copy tests on 512 MB blocks and got about 120 MB/s. For comparison, the same test gives about 3 GB/s on my x86 Intel PC. I ran the tests on both Linux and Haiku, and I also tried an assembly memcpy() version with multiple loads/stores per step; the result is mostly the same. memset() to a non-zero value was performed on the memory blocks first to ensure that physical memory was actually allocated by the kernel.

Is this the designed memory speed, or is something wrong (DDR controller, cache-controller configuration, etc.)?

I get similar results running my benchmark at https://hoult.org/test_memcpy.c:

ubuntu@ubuntu:~/programs$ ./test_memcpy 
Byte size :              ns     Speed
        0 :            18.3       0.0 MB/s
        1 :            23.3      40.9 MB/s
        2 :            23.8      80.3 MB/s
        4 :            34.4     110.8 MB/s
        8 :            45.6     167.4 MB/s
       16 :            38.1     400.3 MB/s
       32 :            39.1     779.7 MB/s
       64 :            45.2    1351.8 MB/s
      128 :            56.8    2150.2 MB/s
      256 :            84.8    2880.1 MB/s
      512 :           135.1    3614.7 MB/s
     1024 :           243.8    4006.0 MB/s
     2048 :           447.8    4361.3 MB/s
     4096 :           861.9    4532.0 MB/s
     8192 :          1682.8    4642.7 MB/s
    16384 :          3481.7    4487.7 MB/s
    32768 :         20896.7    1495.5 MB/s
    65536 :         47393.2    1318.8 MB/s
   131072 :         96372.7    1297.0 MB/s
   262144 :        193140.3    1294.4 MB/s
   524288 :        400208.0    1249.4 MB/s
  1048576 :       2133293.0     468.8 MB/s
  2097152 :       9486804.7     210.8 MB/s
  4194304 :      22763531.2     175.7 MB/s
  8388608 :      45851468.8     174.5 MB/s
 16777216 :      92099687.5     173.7 MB/s
 33554432 :     183821750.0     174.1 MB/s
 67108864 :     367601500.0     174.1 MB/s

That’s with the CPU running at 1.5 GHz. Note it’s 174 MB/s read plus 174 MB/s write for a total bandwidth of around 350 MB/s.

At least it’s much faster than the HiFive Unleashed.

The BeagleV beta board (with SiFive U74 cores but possibly different DDR controller) gives similar results. I have made BeagleBoard and StarFive aware of my concerns about the very slow DRAM speed and they have assured me that the SoC I have now is only a test item and all will be fixed in the mass produced version. I’m dubious, to be honest.

In contrast, the $99 Allwinner D1 “Nezha” evaluation board gives much higher figures with an Alibaba C906 single-issue core running at 1.0 GHz (extract from https://hoult.org/d1_memcpy.txt):

rvbtest@RVboards:~$ ./test_memcpy_std 
Byte size :              ns     Speed
        0 :            50.3       0.0 MB/s
        1 :            54.8      17.4 MB/s
        2 :            61.6      31.0 MB/s
        4 :            71.6      53.3 MB/s
        8 :            91.6      83.3 MB/s
       16 :            93.7     162.9 MB/s
       32 :            99.7     306.2 MB/s
       64 :           111.6     546.8 MB/s
      128 :           140.5     868.5 MB/s
      256 :           198.4    1230.6 MB/s
      512 :           314.0    1554.9 MB/s
     1024 :           551.7    1770.0 MB/s
     2048 :          1011.4    1931.1 MB/s
     4096 :          1937.8    2015.8 MB/s
     8192 :          3795.8    2058.2 MB/s
    16384 :          8336.3    1874.3 MB/s
    32768 :         20937.3    1492.5 MB/s
    65536 :         58882.3    1061.4 MB/s
   131072 :        113748.5    1098.9 MB/s
   262144 :        225554.1    1108.4 MB/s
   524288 :        446150.4    1120.7 MB/s
  1048576 :        927754.9    1077.9 MB/s
  2097152 :       1849499.0    1081.4 MB/s
  4194304 :       3666302.7    1091.0 MB/s
  8388608 :       7309773.4    1094.4 MB/s
 16777216 :      14528070.3    1101.3 MB/s
 33554432 :      28922562.5    1106.4 MB/s
 67108864 :      57848562.5    1106.3 MB/s

A speed difference of 6.3x in favour of the Allwinner is not insignificant.

There’s nothing wrong with the U74’s core or L1 cache (2.3x faster than the D1), though the Unmatched’s L2 cache is barely faster than the D1’s RAM at about 1250 vs 1100 MB/s.


My test code and results are below. Linux has slightly better results, probably because of different scheduling-quantum settings and/or interrupt overhead.

#include <malloc.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <OS.h>

extern "C" void *Memcpy(void *, const void *, size_t);

uint8* AllocArea(size_t size)
{
	uint8* adr;
	/* Haiku API: create_area() maps an anonymous, unlocked memory region. */
	create_area("mem area", (void**)&adr, B_ANY_ADDRESS, size, B_NO_LOCK, B_READ_AREA | B_WRITE_AREA);
	return adr;
}

int main()
{
	bigtime_t t0, t1;
	size_t size = 512*1024*1024LL;
	uint8* mem1;
	uint8* mem2;
	mem1 = AllocArea(size); /* or: new uint8[size] */
	mem2 = AllocArea(size); /* or: new uint8[size] */
	printf("(1)\n");
	memset(mem1, 0x11, size);
	printf("(2)\n");
	memset(mem2, 0x11, size);
	printf("(3)\n");
	memset(mem1, 0xcc, size);
	printf("(4)\n");
	t0 = system_time();
	memcpy(mem2, mem1, size);
	t1 = system_time();
	printf("%g MB/s\n", double(size) / (double(t1 - t0)/1000000.0) / (1024*1024));
	return 0;
}

/*
HiFive Unmatched
111.122 MB/s
110.626 MB/s
110.475 MB/s
110.233 MB/s

x86
6550.33 MB/s
6491.53 MB/s
6562.92 MB/s
6433.45 MB/s
*/

Here’s the result from a Beagleboard-X15 (1.5 GHz Cortex-A15).

ubuntu@arm:~/xfer$ ./test_mem
Byte size :              ns     Speed
        0 :            12.7       0.0 MB/s
        1 :            12.7      75.0 MB/s
        2 :            12.7     150.0 MB/s
        4 :            12.7     300.0 MB/s
        8 :            12.7     599.8 MB/s
       16 :            13.4    1139.6 MB/s
       32 :            14.7    2071.9 MB/s
       64 :            18.8    3255.2 MB/s
      128 :            25.4    4797.0 MB/s
      256 :            34.9    6997.4 MB/s
      512 :            58.3    8377.4 MB/s
     1024 :           104.4    9353.1 MB/s
     2048 :           195.2   10003.6 MB/s
     4096 :           386.4   10108.7 MB/s
     8192 :           748.9   10431.5 MB/s
    16384 :          1492.4   10469.5 MB/s
    32768 :          4405.8    7092.9 MB/s
    65536 :         11066.2    5647.8 MB/s
   131072 :         22732.4    5498.8 MB/s
   262144 :         45678.1    5473.1 MB/s
   524288 :         93299.5    5359.1 MB/s
  1048576 :        197394.5    5066.0 MB/s
  2097152 :        806877.0    2478.7 MB/s
  4194304 :       1990074.2    2010.0 MB/s
  8388608 :       4217566.4    1896.8 MB/s
 16777216 :       8436390.6    1896.5 MB/s
 33554432 :      16840359.4    1900.2 MB/s
 67108864 :      34565812.5    1851.5 MB/s
ubuntu@arm:~/xfer$

BTW, to get rid of the coloring, just put the word "text" after the first set of backticks.

The core on the Unmatched doesn’t have hardware prefetch, which significantly reduces memory throughput for easily predicted streams like a memcpy benchmark. I believe we got something like a 4× speedup on some memory-bound benchmarks when we added hardware prefetch. I don’t know what that would translate to for a memcpy benchmark, but it probably explains most of the speed difference with the Nezha board, assuming that board does have hardware prefetch, which is likely.
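To illustrate what the missing prefetcher does, here is a sketch (assuming GCC/Clang, which provide `__builtin_prefetch`) of a copy loop with explicit software prefetch hints. The prefetch distance is an assumption and would need tuning per core; on a machine that already has a hardware prefetcher this typically makes little difference.

```c
#include <stddef.h>
#include <stdint.h>

/* Prefetch distance in bytes — an assumed value; the right distance
   depends on DRAM latency and how fast the copy loop retires. */
#define PF_DIST 512

void copy_with_prefetch(uint64_t *dst, const uint64_t *src, size_t nwords)
{
	for (size_t i = 0; i < nwords; i++) {
		/* Hint the cache line we'll need soon: args are
		   read access (0), low temporal locality (0).
		   Prefetch hints never fault, so running past the
		   end of the array is harmless. */
		__builtin_prefetch((const char *)(src + i) + PF_DIST, 0, 0);
		dst[i] = src[i];
	}
}
```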

The part on the Unmatched is not a high-performance one. It isn’t in the same class as any recent x86_64 CPU, and it isn’t in the same class as a Cortex-A15 either. It is a Cortex-A53-class processor.

The BeagleV beta should have a newer version of the U7 core than on Unmatched, and I would have thought it was new enough to have hardware prefetch, but I don’t know for sure. It does have problems with a non-coherent system bus that reduces performance, and that is one of the things they plan to fix in the production part.


Correct. The BeagleV (7100 SoC) doesn’t contain an L2 prefetcher, as it uses an older U74 release. If you look into the 7100 datasheet (public) and go to the memory-map section, there are no “U7 Hart X L2 Prefetcher” devices listed.

You can look into the “SiFive U74-MC Core Complex Manual” (public) for the 21G2.01.00 release. Appendix B, the SiFive RISC-V Implementation Registers section, contains mimpid values for all U74 releases. Note this only shows which release of the U74 cores the SoC is using, not the actual U74 core configuration; Appendix A lists all possible configurations for the core.


My scepticism has unfortunately proven to be well-founded. The BeagleV Starlight project has been cancelled: The Future of BeagleV™ community - #2 by DrewFustini - BeagleV - BeagleBoard


Can a DMA-controller RAM → RAM copy be used to speed up memory copies?

I tried using the Radeon GPU’s DMA engine and found that it is MUCH faster: 3.3 GiB/s vs. 140 MiB/s for CPU sequential writes. The GPU has its own MMU, so it can linearize physical-page access, and its DMA can work at byte granularity.

Is there any practical way to utilize that?

Implementing memcpy()/memset() for large memory blocks, and graphics-related processing.

But I hope new SiFive CPU designs will be released with faster sequential access to large memory blocks. Using DMA is basically a workaround for unbalanced hardware (the memory bus works much faster than the CPU can access it).
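One practical shape for such a workaround would be a size-thresholded dispatcher: copies above some measured crossover go to the DMA engine, everything smaller stays on the CPU, since DMA setup cost dominates small copies. The sketch below is purely illustrative — dma_copy() is a hypothetical stand-in for whatever the GPU/DMA driver actually exposes (here it just falls back to memcpy()), and the threshold is a placeholder that would have to be benchmarked.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical stand-in for a real DMA-engine submission + completion
   wait; no such portable API exists, so this stub just uses memcpy(). */
static void dma_copy(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

/* Assumed crossover point — must be measured on real hardware, since
   DMA setup/completion overhead dominates for small transfers. */
#define DMA_THRESHOLD ((size_t)1 << 20)

void big_memcpy(void *dst, const void *src, size_t n)
{
	if (n >= DMA_THRESHOLD)
		dma_copy(dst, src, n);  /* setup cost amortized over a large block */
	else
		memcpy(dst, src, n);    /* CPU path wins for small copies */
}
```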

I just started looking into why my Unmatched seemed so slow as well, compared to some of the ARM boards I am using.

I noticed that the memory latency is extremely high: 140 ns to DRAM!

I have been using the original firmware binaries that came out at the same time as the boards delivered, but looking at the upstream opensbi and u-boot repos, I don’t see any relevant changes to memory init or other configs since then that could indicate that performance bugs were fixed.

Have others seen similar numbers? This is from lmbench’s lat_mem_rd, with an RPi4 on the left and the Unmatched on the right.

Both the FU540 and FU740 are “slow” in terms of bandwidth and latency. This hasn’t changed since the SoCs and/or boards were released. Also, the U74 cores used in the FU740 do not contain an L2 prefetcher; that became available only in newer releases of the U74 core.