Intermittent kernel oops under heavy load

I get kernel oops like this under heavy load:

[36126.041244] Unable to handle kernel paging request at virtual address ffffffe0011cc010
[36126.048452] Oops [#1]          
[36126.050697] Modules linked in: amdgpu mfd_core gpu_sched backlight snd_hda_codec_hdmi drm_ttm_helper snd_hda_intel snd_intel_dspcfg snd_hda_codec ttm snd_hda_core fuse
[36126.065642] CPU: 1 PID: 39422 Comm: … Not tainted 5.12.4 #1
[36126.071804] Hardware name: SiFive HiFive Unmatched (DT)
[36126.077017] epc : get_page_from_freelist+0x79e/0xdc4
[36126.081968]  ra : get_page_from_freelist+0x71c/0xdc4                                                           
[36126.086919] epc : ffffffe00017a8b4 ra : ffffffe00017a832 sp : ffffffe0fc33bbe0
[36126.094127]  gp : ffffffe001a1b770 tp : ffffffe0fc322800 t0 : ffffffffffffffff
[36126.101338]  t1 : ffffffe3fec8ee40 t2 : ffffffe001000218 s0 : ffffffe0fc33bd20
[36126.108546]  s1 : 0000000000000010 a0 : ffffffe0019cc320 a1 : ffffffcf057226b0
[36126.115754]  a2 : ffffffe0011cc010 a3 : 0000000000000000 a4 : ffffffe0019cbf50
[36126.122964]  a5 : ffffffe0019cbf40 a6 : 00000000000000d0 a7 : ffffffe001a57770
[36126.130173]  s2 : 000000000000003f s3 : 0000000000000010 s4 : ffffffe3fec8ee40
[36126.137383]  s5 : ffffffe0019cbf40 s6 : 0000000000000000 s7 : ffffffe3fec8ee30
[36126.144592]  s8 : 0000000000000001 s9 : 00000000000000d0 s10: ffffffcf03c669e8
[36126.151801]  s11: 0000000000000000 t3 : 0000000000000000 t4 : 0000000000000000
[36126.159010]  t5 : ffffffcf03c669e0 t6 : 0000000000000171
[36126.164307] status: 0000000200000100 badaddr: ffffffe0011cc010 cause: 000000000000000f
[36126.172213] Call Trace: 
[36126.174644] [<ffffffe00017a8b4>] get_page_from_freelist+0x79e/0xdc4
[36126.180900] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4
[36126.187066] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08
[36126.192885] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100
[36126.198444] [<ffffffe000009bc8>] do_page_fault+0x128/0x426
[36126.203917] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc
[36126.209654] ---[ end trace 292e34538dff91f5 ]---

and

[ 5331.322051] Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000000008                                                                                           
[ 5331.332050] Oops [#1]                                                                                          
[ 5331.334291] Modules linked in: amdgpu snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mfd_core gpu_sched backlight drm_ttm_helper snd_hda_core ttm fuse                                                          
[ 5331.349238] CPU: 1 PID: 21644 Comm: … Not tainted 5.12.4 #1                                             
[ 5331.355401] Hardware name: SiFive HiFive Unmatched (DT)                                                        
[ 5331.360614] epc : get_page_from_freelist+0x19e/0xdc4                                                                                                                                                                             
[ 5331.365563]  ra : __alloc_pages_nodemask+0xf4/0x1a4
[ 5331.370428] epc : ffffffe00017a2b4 ra : ffffffe00017bd74 sp : ffffffe0893b3be0                                                                                                                                                   
[ 5331.377637]  gp : ffffffe001a1b770 tp : ffffffe0893e4600 t0 : ffffffe000009aa0                                 
[ 5331.384845]  t1 : ffffffe3fec8ee40 t2 : ffffffe001000218 s0 : ffffffe0893b3d20                                 
[ 5331.392055]  s1 : 0000000000000010 a0 : 0000000000100cca a1 : 0000000000000000
[ 5331.399265]  a2 : 0000000000001bf5 a3 : 0000000000000000 a4 : ffffffe3fec8ee40                                 
[ 5331.406474]  a5 : ffffffcf0367e660 a6 : 0000000000000cc0 a7 : ffffffe001a57770
[ 5331.413683]  s2 : ffffffe3fec8ee20 s3 : ffffffcf0367e658 s4 : ffffffe3fec8ee30                                 
[ 5331.420893]  s5 : ffffffe0019cbf40 s6 : ffffffe0019cc8c0 s7 : ffffffe0893b3d28
[ 5331.428100]  s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000001
[ 5331.435310]  s11: 0000000000000001 t3 : 0000003fc6df6494 t4 : 000000000000000f                                 
[ 5331.442519]  t5 : 0000000000000001 t6 : 0000000000000000                                                       
[ 5331.447816] status: 0000000200000100 badaddr: 0000000000000008 cause: 000000000000000f                         
[ 5331.455721] Call Trace:                                                                                        
[ 5331.458153] [<ffffffe00017a2b4>] get_page_from_freelist+0x19e/0xdc4                                            
[ 5331.464409] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4                                             
[ 5331.470576] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08                                                 
[ 5331.476393] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100                                                    
[ 5331.481955] [<ffffffe000009bc8>] do_page_fault+0x128/0x426                                                     
[ 5331.487425] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc                                                    
[ 5331.493079] ---[ end trace 4e418ad0738e1f80 ]---

and

[12094.443042] BUG: Bad page state in process …  pfn:1d7cec
[12094.448241] page:0000000005166fc2 refcount:0 mapcount:0 mapping:000000008b33616d index:0x1 pfn:0x1d7cec
[12094.457623] failed to read mapping contents, not a valid kernel address?
[12094.464305] flags: 0x4000000000080000(swapbacked)             
[12094.469001] raw: 4000000000080000 0000000000000100 0000000000000122 0000000000650000 
[12094.476728] raw: 0000000000000001 0000000000000000 00000000ffffffff
[12094.482980] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
[12094.489406] Modules linked in: amdgpu snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mfd_core gpu_sched snd_hda_core backlight drm_ttm_helper ttm fuse
[12094.504349] CPU: 1 PID: 93058 Comm: … Not tainted 5.12.4 #1
[12094.510511] Hardware name: SiFive HiFive Unmatched (DT)
[12094.515720] Call Trace:
[12094.518153] [<ffffffe00000576a>] walk_stackframe+0x0/0xe2
[12094.523539] [<ffffffe000ac2bf4>] dump_backtrace+0x4c/0x5a
[12094.528922] [<ffffffe000ac2c26>] show_stack+0x24/0x2c
[12094.533959] [<ffffffe000ac9fa6>] dump_stack+0x7a/0x94
[12094.538997] [<ffffffe00017797e>] bad_page+0xf8/0x11e
[12094.543947] [<ffffffe00017ab04>] get_page_from_freelist+0x9ee/0xdc4
[12094.550202] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4
[12094.556368] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08
[12094.562188] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100
[12094.567746] [<ffffffe000009bc8>] do_page_fault+0x128/0x426
[12094.573219] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc
[12094.578779] Disabling lock debugging due to kernel taint
[12094.584086] Unable to handle kernel paging request at virtual address ffffff8000000008

and

[12130.749844] Unable to handle kernel paging request at virtual address ffffffe0016cc0a0
[12130.757046] Oops [#1]
[12130.759292] Modules linked in: fuse
[12130.762768] CPU: 0 PID: 86990 Comm: python3 Not tainted 5.12.4 #1
[12130.768848] Hardware name: SiFive HiFive Unmatched (DT)
[12130.774061] epc : get_page_from_freelist+0x79e/0xdc4
[12130.779010]  ra : get_page_from_freelist+0x71c/0xdc4
[12130.783959] epc : ffffffe00017a8b4 ra : ffffffe00017a832 sp : ffffffe099d47be0
[12130.791170]  gp : ffffffe001a1b770 tp : ffffffe0fbbec600 t0 : 0000000000000001
[12130.798379]  t1 : ffffffe3fec73e40 t2 : ffffffe001000218 s0 : ffffffe099d47d20
[12130.805590]  s1 : 0000000000000010 a0 : 0000000000000000 a1 : ffffffcf0b0000c8
[12130.812798]  a2 : ffffffe0016cc0a0 a3 : 0000000000000002 a4 : ffffffe0019cbfe0
[12130.820007]  a5 : ffffffe0019cbfd0 a6 : ffffffe0019cc010 a7 : ffffffe001a57770
[12130.827216]  s2 : 000000000000003f s3 : 0000000000000010 s4 : ffffffe3fec73e40
[12130.834425]  s5 : ffffffe0019cbf40 s6 : 0000000000000003 s7 : ffffffe3fec73e30
[12130.841635]  s8 : 0000000000000001 s9 : 00000000000000d0 s10: ffffffcf04b3a968
[12130.848843]  s11: 0000000000000003 t3 : 0000000000000000 t4 : 0000000000000000
[12130.856052]  t5 : ffffffcf04b3a960 t6 : ffffffffffffffff
[12130.861350] status: 0000000200000100 badaddr: ffffffe0016cc0a0 cause: 000000000000000f
[12130.869255] Call Trace:
[12130.871687] [<ffffffe00017a8b4>] get_page_from_freelist+0x79e/0xdc4
[12130.877940] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4
[12130.884109] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08
[12130.889927] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100
[12130.895485] [<ffffffe000009bc8>] do_page_fault+0x128/0x426
[12130.900958] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc
[12130.906641] ---[ end trace 2f8d60d8322c02e5 ]---

and

[44732.714147] page dumped because: bad pte
[44732.718056] addr:0000003fb0099000 vm_flags:00200073 anon_vma:ffffffe09592f058 mapping:0000000000000000 index:3fb0099
[44732.728566] file:(null) fault:0x0 mmap:0x0 readpage:0x0
[44732.733779] CPU: 3 PID: 143969 Comm: b-addcon Tainted: G    B D           5.12.4 #1
[44732.741408] Hardware name: SiFive HiFive Unmatched (DT)
[44732.746620] Call Trace:
[44732.749052] [<ffffffe00000576a>] walk_stackframe+0x0/0xe2
[44732.754436] [<ffffffe000ac2bf4>] dump_backtrace+0x4c/0x5a
[44732.759822] [<ffffffe000ac2c26>] show_stack+0x24/0x2c
[44732.764859] [<ffffffe000ac9fa6>] dump_stack+0x7a/0x94
[44732.769897] [<ffffffe000162c9a>] print_bad_pte+0x172/0x1b8
[44732.775370] [<ffffffe0001643dc>] unmap_page_range+0x45e/0x642
[44732.781102] [<ffffffe0001647fe>] unmap_vmas+0x92/0xda
[44732.786140] [<ffffffe00016c028>] exit_mmap+0xb0/0x1ba
[44732.791178] [<ffffffe00000c15c>] mmput+0x4c/0x10a
[44732.795868] [<ffffffe000013054>] do_exit+0x248/0x7fa
[44732.800819] [<ffffffe0000143ec>] do_group_exit+0x3e/0xce
[44732.806117] [<ffffffe00001fb0a>] get_signal+0x1a0/0x828
[44732.811328] [<ffffffe000004b90>] do_notify_resume+0x88/0x356
[44732.816974] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc

I’m not sure if this is some kind of page table corruption or something else. It happens intermittently, not after any specific action.

The system is unresponsive after that. I haven’t changed the clock speed, and the board is mounted in an enclosure with proper airflow (haven’t seen temperatures >50°C in sensors) . This happens with 5.12.4 kernel pre-built from the SDK image, and also with the previous 5.11.x kernel.

An AMD RX 550 GPU is installed, but I am not using it at the moment. I have tried without the GPU connected, same problem. No USB devices are connected. The current workload is headless software testing utilizing CPU (and disk/memory) only.

The file system is a Debian RISC-V rootfs on a Samsung SSD 970 EVO Plus 2TB.

There doesn’t seem to be anything wrong with the board’s RAM, running a (single-threaded) memory tester for a few days didn’t uncover a single problem.

# memtester 15G                                                                                                                                                                                                      
memtester version 4.5.0 (64-bit)                                                                                                                                                                                                    
Copyright (C) 2001-2020 Charles Cazabon.                                                                          
Licensed under the GNU General Public License version 2 (only).                                                                                                                                                                     
                                                                                                                  
pagesize is 4096                                                                                                  
pagesizemask is 0xfffffffffffff000                                                                                                                                                                                                  
want 15360MB (16106127360 bytes)                                                                                                                                                                                                    
got  15360MB (16106127360 bytes), trying mlock ...locked.                                                         
Loop 1:                                                                                                                                                                                                                             
  Stuck Address       : ok                                                                                                                                                                                                          
  Random Value        : ok                                                                                        
  Compare XOR         : ok          
  Compare SUB         : ok                                                                                        
  Compare MUL         : ok                                                                                        
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : setting 216

I can’t really think of anything else to try. I could try removing the NVME, but that will be kind of inconvenient (and it might stop triggering the problem for sake of I/O just being a lot slower, instead of helping narrow down the underlying issue). Or maybe downclocking the CPU.

Disabling two of the cores stops the problem from happening:

echo 0 > ./devices/system/cpu/cpu3/online
echo 0 > ./devices/system/cpu/cpu2/online

(disabling only cpu2 or cpu3 was not enough)
My vague hypothesis is that while creating and destroying a lot of processes, allocating and de-allocating memory, something goes wrong with memory synchronization in the page tables, causing corruption.

The issue still exists on kernel 5.12.18 from demo-coreip-cli-unmatched-2021.07.00.rootfs.wic.xz.

I probably narrowed it down to (frequent?) interaction with the NVME. When mounting a tmpfs as working area, the problem doesn’t arise. Even when running a long time with all cores enabled.