Memory access is too slow

I ran memory copy tests on 512 MB blocks and got about 120 MB/s. For comparison, the same test gives about 3 GB/s on my x86 Intel PC. I ran the tests on both Linux and Haiku, and also tried an assembly memcpy() version with multiple loads/stores per step; the results are mostly the same. Before timing, I memset() the blocks to a non-zero value to ensure that the kernel had allocated physical memory.

Is this the designed memory speed, or is something wrong (DDR controller, cache controller configuration, etc.)?

I get similar results running my benchmark at https://hoult.org/test_memcpy.c

ubuntu@ubuntu:~/programs$ ./test_memcpy 
Byte size :              ns     Speed
        0 :            18.3       0.0 MB/s
        1 :            23.3      40.9 MB/s
        2 :            23.8      80.3 MB/s
        4 :            34.4     110.8 MB/s
        8 :            45.6     167.4 MB/s
       16 :            38.1     400.3 MB/s
       32 :            39.1     779.7 MB/s
       64 :            45.2    1351.8 MB/s
      128 :            56.8    2150.2 MB/s
      256 :            84.8    2880.1 MB/s
      512 :           135.1    3614.7 MB/s
     1024 :           243.8    4006.0 MB/s
     2048 :           447.8    4361.3 MB/s
     4096 :           861.9    4532.0 MB/s
     8192 :          1682.8    4642.7 MB/s
    16384 :          3481.7    4487.7 MB/s
    32768 :         20896.7    1495.5 MB/s
    65536 :         47393.2    1318.8 MB/s
   131072 :         96372.7    1297.0 MB/s
   262144 :        193140.3    1294.4 MB/s
   524288 :        400208.0    1249.4 MB/s
  1048576 :       2133293.0     468.8 MB/s
  2097152 :       9486804.7     210.8 MB/s
  4194304 :      22763531.2     175.7 MB/s
  8388608 :      45851468.8     174.5 MB/s
 16777216 :      92099687.5     173.7 MB/s
 33554432 :     183821750.0     174.1 MB/s
 67108864 :     367601500.0     174.1 MB/s

That’s with the CPU running at 1.5 GHz. Note it’s 174 MB/s read plus 174 MB/s write for a total bandwidth of around 350 MB/s.

At least it’s much faster than the HiFive Unleashed.

The BeagleV beta board (with SiFive U74 cores but possibly a different DDR controller) gives similar results. I have made BeagleBoard and StarFive aware of my concerns about the very slow DRAM speed, and they have assured me that the SoC I have now is only a test item and that all will be fixed in the mass-produced version. I’m dubious, to be honest.

In contrast, the $99 Allwinner D1 “Nezha” evaluation board gives much higher figures with an Alibaba C906 single-issue core running at 1.0 GHz (extract from https://hoult.org/d1_memcpy.txt):

rvbtest@RVboards:~$ ./test_memcpy_std 
Byte size :              ns     Speed
        0 :            50.3       0.0 MB/s
        1 :            54.8      17.4 MB/s
        2 :            61.6      31.0 MB/s
        4 :            71.6      53.3 MB/s
        8 :            91.6      83.3 MB/s
       16 :            93.7     162.9 MB/s
       32 :            99.7     306.2 MB/s
       64 :           111.6     546.8 MB/s
      128 :           140.5     868.5 MB/s
      256 :           198.4    1230.6 MB/s
      512 :           314.0    1554.9 MB/s
     1024 :           551.7    1770.0 MB/s
     2048 :          1011.4    1931.1 MB/s
     4096 :          1937.8    2015.8 MB/s
     8192 :          3795.8    2058.2 MB/s
    16384 :          8336.3    1874.3 MB/s
    32768 :         20937.3    1492.5 MB/s
    65536 :         58882.3    1061.4 MB/s
   131072 :        113748.5    1098.9 MB/s
   262144 :        225554.1    1108.4 MB/s
   524288 :        446150.4    1120.7 MB/s
  1048576 :        927754.9    1077.9 MB/s
  2097152 :       1849499.0    1081.4 MB/s
  4194304 :       3666302.7    1091.0 MB/s
  8388608 :       7309773.4    1094.4 MB/s
 16777216 :      14528070.3    1101.3 MB/s
 33554432 :      28922562.5    1106.4 MB/s
 67108864 :      57848562.5    1106.3 MB/s

A speed difference of 6.3x in favour of the Allwinner is not insignificant.

There’s nothing wrong with the U74’s core or L1 cache (2.3x faster than the D1), though the Unmatched’s L2 cache is barely faster than the D1’s RAM at about 1250 vs 1100 MB/s.

My test code and results. Linux gives slightly better results, probably because of different scheduling quantum settings and/or interrupt overhead.

#include <malloc.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#include <OS.h>

extern "C" void *Memcpy(void *, const void *, size_t);

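// Allocate an anonymous memory area; returns NULL if create_area() fails.
uint8* AllocArea(size_t size)
{
	uint8* adr = NULL;
	area_id area = create_area("mem area", (void**)&adr, B_ANY_ADDRESS, size, B_NO_LOCK, B_READ_AREA | B_WRITE_AREA);
	return area < B_OK ? NULL : adr;
}

int main()
{
	bigtime_t t0, t1;
	size_t size = 512*1024*1024LL;
	uint8* mem1;
	uint8* mem2;
	mem1 = AllocArea(size); /*new uint8[size]*/
	mem2 = AllocArea(size); /*new uint8[size]*/
	if (mem1 == NULL || mem2 == NULL) {
		fprintf(stderr, "create_area failed\n");
		return 1;
	}
	// Write non-zero data to both blocks so that physical memory is allocated before timing.
	printf("(1)\n");
	memset(mem1, 0x11, size);
	printf("(2)\n");
	memset(mem2, 0x11, size);
	printf("(3)\n");
	memset(mem1, 0xcc, size);
	printf("(4)\n");
	// Time one copy of the whole block; system_time() returns microseconds.
	t0 = system_time();
	memcpy(mem2, mem1, size);
	t1 = system_time();
	// Bytes per second, reported in units of 2^20 bytes.
	printf("%g MB/s\n", double(size) / (double(t1 - t0)/1000000.0) / (1024*1024));
	return 0;
}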

/*
HiFive Unmatched
111.122 MB/s
110.626 MB/s
110.475 MB/s
110.233 MB/s

x86
6550.33 MB/s
6491.53 MB/s
6562.92 MB/s
6433.45 MB/s
*/

Here’s the result from a Beagleboard-X15 (1.5 GHz Cortex-A15).

ubuntu@arm:~/xfer$ ./test_mem
Byte size :              ns     Speed
        0 :            12.7       0.0 MB/s
        1 :            12.7      75.0 MB/s
        2 :            12.7     150.0 MB/s
        4 :            12.7     300.0 MB/s
        8 :            12.7     599.8 MB/s
       16 :            13.4    1139.6 MB/s
       32 :            14.7    2071.9 MB/s
       64 :            18.8    3255.2 MB/s
      128 :            25.4    4797.0 MB/s
      256 :            34.9    6997.4 MB/s
      512 :            58.3    8377.4 MB/s
     1024 :           104.4    9353.1 MB/s
     2048 :           195.2   10003.6 MB/s
     4096 :           386.4   10108.7 MB/s
     8192 :           748.9   10431.5 MB/s
    16384 :          1492.4   10469.5 MB/s
    32768 :          4405.8    7092.9 MB/s
    65536 :         11066.2    5647.8 MB/s
   131072 :         22732.4    5498.8 MB/s
   262144 :         45678.1    5473.1 MB/s
   524288 :         93299.5    5359.1 MB/s
  1048576 :        197394.5    5066.0 MB/s
  2097152 :        806877.0    2478.7 MB/s
  4194304 :       1990074.2    2010.0 MB/s
  8388608 :       4217566.4    1896.8 MB/s
 16777216 :       8436390.6    1896.5 MB/s
 33554432 :      16840359.4    1900.2 MB/s
 67108864 :      34565812.5    1851.5 MB/s
ubuntu@arm:~/xfer$

BTW, to get rid of the coloring, just put the word text after the first set of backticks.

The core on the Unmatched doesn’t have hardware prefetch. That significantly reduces memory throughput for easily predicted streams like a memcpy benchmark. I believe we got something like a 4 times speed up on some memory bound benchmarks when we added hardware prefetch. I don’t know what that would translate to for a memcpy benchmark. But that probably explains most of the speed difference with the Nezha board, assuming it does have hardware prefetch which is likely.

This is not a high performance part on the Unmatched. It isn’t in the same class as any recent x86_64 cpu. And it isn’t even in the same class as a Cortex-A15 either. It is a Cortex-A53 class processor.

The BeagleV beta should have a newer version of the U7 core than on Unmatched, and I would have thought it was new enough to have hardware prefetch, but I don’t know for sure. It does have problems with a non-coherent system bus that reduces performance, and that is one of the things they plan to fix in the production part.

Correct. The BeagleV (7100 SoC) doesn’t contain the L2 prefetcher, as it uses an older U74 release. If you look at the memory map section of the (public) 7100 datasheet, there are no “U7 Hart X L2 Prefetcher” devices listed.

You can look into the (public) “SiFive U74-MC Core Complex Manual” for the 21G2.01.00 release. Appendix B, in the SiFive RISC-V Implementation Registers section, contains the mimpid values for all U74 releases. Note that this only shows which release of the U74 cores the SoC is using, not the actual U74 core configuration. Appendix A lists all possible configurations for the core.

My scepticism has unfortunately proven to be well-founded. The BeagleV Starlight project has been cancelled: The Future of BeagleV™ community - #2 by DrewFustini - BeagleV - BeagleBoard

Can the DMA controller’s RAM → RAM copy be used to speed up memory copies?