Low 1 core STREAM bandwidth

I found this doc from SiFive/StarFive that explains the L2 prefetcher of the U74-MC (the previous generation) in detail:

See Chapter 13.2.5
We now have private L1/L2 caches and a shared L3, but I assume some of the terminology still applies. That doc is far more readable than the TRM released by ESWIN.

drmpeg’s code suffered a significant performance regression: it took ~2x the time to finish. I did some tweaking and found that you don’t need to change the L1/L2 prefetcher settings that much to boost the STREAM workload, and a smaller change avoids penalizing drmpeg’s LDPC workload along the way. Starting from the original values of CSRs 0x7c3 and 0x7c4, as they were before the patch hifive-premier-p550: opensbi: Modify CSR registers · sifiveinc/meta-sifive@9759264 · GitHub, increasing maxL1PFDist a little is all you need to boost STREAM performance. I tried setting it to 2 or 3 and got pretty good STREAM performance (on par with the new firmware release) with no noticeable regression in LDPC. ESWIN/SiFive need to do more testing so that the L1/L2 prefetcher settings fit a wider range of workloads. Even better, they could provide an SBI interface so the settings can be adjusted without having to flash new firmware.
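For anyone who wants to try other maxL1PFDist values, here is a minimal Python sketch of how the 0x7c3 value could be recomputed. The field position (bit 39) is inferred from the 0x7c3 value and decoded fields posted in this thread, and the 6-bit width is a guess; neither is taken from the TRM, and the helper names are made up for illustration.

```python
# Hypothetical helper: shift and width are inferred from the value posted in
# this thread (0x1005c1be649 decodes to maxL1PFDist = 2), not from the TRM.
MAX_L1_PF_DIST_SHIFT = 39  # inferred from the posted value
MAX_L1_PF_DIST_WIDTH = 6   # guess; only the low two bits are confirmed


def set_field(value: int, shift: int, width: int, field: int) -> int:
    """Return `value` with bits [shift+width-1:shift] replaced by `field`."""
    mask = (1 << width) - 1
    if not 0 <= field <= mask:
        raise ValueError(f"{field} does not fit in {width} bits")
    return (value & ~(mask << shift)) | (field << shift)


def with_max_l1_pf_dist(csr_7c3: int, dist: int) -> int:
    """Return a 0x7c3 value with only maxL1PFDist changed."""
    return set_field(csr_7c3, MAX_L1_PF_DIST_SHIFT, MAX_L1_PF_DIST_WIDTH, dist)


if __name__ == "__main__":
    current = 0x1005C1BE649  # the 0x7c3 value posted in this thread
    for dist in (2, 3):
        print(dist, hex(with_max_l1_pf_dist(current, dist)))
```

This only computes the value. Writing it to the CSR still has to happen in firmware, which is why an SBI interface would help.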

FYI, my current settings:

0x7c3: 0x1005c1be649  {
  "reg": "0x7c3",
  "name": "L1 Prefetcher CSR",
  "fields": {
    "l1pfEnable": 1,
    "window": 36,
    "initialDist": 12,
    "maxAllowedDist": 31,
    "linToExpThrd": 3,
    "qFullnessThrdL1": 14,
    "hitCacheThrdL1": 2,
    "hitMSHRThrdL1": 0,
    "issueBubble": 0,
    "maxL1PFDist": 2,
    "forgiveThrd": 0,
    "numL1PFIssQEnt": 0
  }
}
0x7c4: 0x929f  {
  "reg": "0x7c4",
  "name": "L2 Prefetcher CSR",
  "fields": {
    "l2pfEnable": 1,
    "qFullnessThrdL2": 15,
    "hitCacheThrdL2": 20,
    "hitMSHRThrdL2": 4,
    "numL2PFIssQEnt": 2
  }
}
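In case it helps anyone check their own values, here is a rough Python sketch of a decoder that reproduces the two dumps above. The bit layouts are inferred from those posted values, not copied from the TRM; the widths of fields that happen to be zero (and of the last field in each register) are guesses, so treat the tables as assumptions.

```python
# Field layouts as (name, shift, width). Inferred from the two values posted
# above; widths of fields that read as zero there are guesses, not TRM facts.
L1_PF_FIELDS = [
    ("l1pfEnable", 0, 1),
    ("window", 1, 6),
    ("initialDist", 7, 6),
    ("maxAllowedDist", 13, 6),
    ("linToExpThrd", 19, 6),
    ("qFullnessThrdL1", 25, 4),
    ("hitCacheThrdL1", 29, 5),
    ("hitMSHRThrdL1", 34, 4),   # width is a guess
    ("issueBubble", 38, 1),     # width is a guess
    ("maxL1PFDist", 39, 6),     # width is a guess
    ("forgiveThrd", 45, 2),     # width is a guess
    ("numL1PFIssQEnt", 47, 2),  # width is a guess
]

L2_PF_FIELDS = [
    ("l2pfEnable", 0, 1),
    ("qFullnessThrdL2", 1, 4),
    ("hitCacheThrdL2", 5, 5),
    ("hitMSHRThrdL2", 10, 4),
    ("numL2PFIssQEnt", 14, 3),  # width is a guess
]


def decode(value: int, fields) -> dict:
    """Extract each bit field from a raw CSR value."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, shift, width in fields
    }


if __name__ == "__main__":
    print(decode(0x1005C1BE649, L1_PF_FIELDS))
    print(decode(0x929F, L2_PF_FIELDS))
```

With these layouts, 0x1005c1be649 and 0x929f decode to the same field values as listed in the dumps.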