Intermittent kernel oops under heavy load

I get kernel oops like this under heavy load:

[36126.041244] Unable to handle kernel paging request at virtual address ffffffe0011cc010
[36126.048452] Oops [#1]          
[36126.050697] Modules linked in: amdgpu mfd_core gpu_sched backlight snd_hda_codec_hdmi drm_ttm_helper snd_hda_intel snd_intel_dspcfg snd_hda_codec ttm snd_hda_core fuse
[36126.065642] CPU: 1 PID: 39422 Comm: … Not tainted 5.12.4 #1
[36126.071804] Hardware name: SiFive HiFive Unmatched (DT)
[36126.077017] epc : get_page_from_freelist+0x79e/0xdc4
[36126.081968]  ra : get_page_from_freelist+0x71c/0xdc4                                                           
[36126.086919] epc : ffffffe00017a8b4 ra : ffffffe00017a832 sp : ffffffe0fc33bbe0
[36126.094127]  gp : ffffffe001a1b770 tp : ffffffe0fc322800 t0 : ffffffffffffffff
[36126.101338]  t1 : ffffffe3fec8ee40 t2 : ffffffe001000218 s0 : ffffffe0fc33bd20
[36126.108546]  s1 : 0000000000000010 a0 : ffffffe0019cc320 a1 : ffffffcf057226b0
[36126.115754]  a2 : ffffffe0011cc010 a3 : 0000000000000000 a4 : ffffffe0019cbf50
[36126.122964]  a5 : ffffffe0019cbf40 a6 : 00000000000000d0 a7 : ffffffe001a57770
[36126.130173]  s2 : 000000000000003f s3 : 0000000000000010 s4 : ffffffe3fec8ee40
[36126.137383]  s5 : ffffffe0019cbf40 s6 : 0000000000000000 s7 : ffffffe3fec8ee30
[36126.144592]  s8 : 0000000000000001 s9 : 00000000000000d0 s10: ffffffcf03c669e8
[36126.151801]  s11: 0000000000000000 t3 : 0000000000000000 t4 : 0000000000000000
[36126.159010]  t5 : ffffffcf03c669e0 t6 : 0000000000000171
[36126.164307] status: 0000000200000100 badaddr: ffffffe0011cc010 cause: 000000000000000f
[36126.172213] Call Trace: 
[36126.174644] [<ffffffe00017a8b4>] get_page_from_freelist+0x79e/0xdc4
[36126.180900] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4
[36126.187066] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08
[36126.192885] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100
[36126.198444] [<ffffffe000009bc8>] do_page_fault+0x128/0x426
[36126.203917] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc
[36126.209654] ---[ end trace 292e34538dff91f5 ]---

and

[ 5331.322051] Unable to handle kernel access to user memory without uaccess routines at virtual address 0000000000000008                                                                                           
[ 5331.332050] Oops [#1]                                                                                          
[ 5331.334291] Modules linked in: amdgpu snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mfd_core gpu_sched backlight drm_ttm_helper snd_hda_core ttm fuse                                                          
[ 5331.349238] CPU: 1 PID: 21644 Comm: … Not tainted 5.12.4 #1                                             
[ 5331.355401] Hardware name: SiFive HiFive Unmatched (DT)                                                        
[ 5331.360614] epc : get_page_from_freelist+0x19e/0xdc4                                                                                                                                                                             
[ 5331.365563]  ra : __alloc_pages_nodemask+0xf4/0x1a4
[ 5331.370428] epc : ffffffe00017a2b4 ra : ffffffe00017bd74 sp : ffffffe0893b3be0                                                                                                                                                   
[ 5331.377637]  gp : ffffffe001a1b770 tp : ffffffe0893e4600 t0 : ffffffe000009aa0                                 
[ 5331.384845]  t1 : ffffffe3fec8ee40 t2 : ffffffe001000218 s0 : ffffffe0893b3d20                                 
[ 5331.392055]  s1 : 0000000000000010 a0 : 0000000000100cca a1 : 0000000000000000
[ 5331.399265]  a2 : 0000000000001bf5 a3 : 0000000000000000 a4 : ffffffe3fec8ee40                                 
[ 5331.406474]  a5 : ffffffcf0367e660 a6 : 0000000000000cc0 a7 : ffffffe001a57770
[ 5331.413683]  s2 : ffffffe3fec8ee20 s3 : ffffffcf0367e658 s4 : ffffffe3fec8ee30                                 
[ 5331.420893]  s5 : ffffffe0019cbf40 s6 : ffffffe0019cc8c0 s7 : ffffffe0893b3d28
[ 5331.428100]  s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000001
[ 5331.435310]  s11: 0000000000000001 t3 : 0000003fc6df6494 t4 : 000000000000000f                                 
[ 5331.442519]  t5 : 0000000000000001 t6 : 0000000000000000                                                       
[ 5331.447816] status: 0000000200000100 badaddr: 0000000000000008 cause: 000000000000000f                         
[ 5331.455721] Call Trace:                                                                                        
[ 5331.458153] [<ffffffe00017a2b4>] get_page_from_freelist+0x19e/0xdc4                                            
[ 5331.464409] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4                                             
[ 5331.470576] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08                                                 
[ 5331.476393] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100                                                    
[ 5331.481955] [<ffffffe000009bc8>] do_page_fault+0x128/0x426                                                     
[ 5331.487425] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc                                                    
[ 5331.493079] ---[ end trace 4e418ad0738e1f80 ]---

and

[12094.443042] BUG: Bad page state in process …  pfn:1d7cec
[12094.448241] page:0000000005166fc2 refcount:0 mapcount:0 mapping:000000008b33616d index:0x1 pfn:0x1d7cec
[12094.457623] failed to read mapping contents, not a valid kernel address?
[12094.464305] flags: 0x4000000000080000(swapbacked)             
[12094.469001] raw: 4000000000080000 0000000000000100 0000000000000122 0000000000650000 
[12094.476728] raw: 0000000000000001 0000000000000000 00000000ffffffff
[12094.482980] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
[12094.489406] Modules linked in: amdgpu snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec mfd_core gpu_sched snd_hda_core backlight drm_ttm_helper ttm fuse
[12094.504349] CPU: 1 PID: 93058 Comm: … Not tainted 5.12.4 #1
[12094.510511] Hardware name: SiFive HiFive Unmatched (DT)
[12094.515720] Call Trace:
[12094.518153] [<ffffffe00000576a>] walk_stackframe+0x0/0xe2
[12094.523539] [<ffffffe000ac2bf4>] dump_backtrace+0x4c/0x5a
[12094.528922] [<ffffffe000ac2c26>] show_stack+0x24/0x2c
[12094.533959] [<ffffffe000ac9fa6>] dump_stack+0x7a/0x94
[12094.538997] [<ffffffe00017797e>] bad_page+0xf8/0x11e
[12094.543947] [<ffffffe00017ab04>] get_page_from_freelist+0x9ee/0xdc4
[12094.550202] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4
[12094.556368] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08
[12094.562188] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100
[12094.567746] [<ffffffe000009bc8>] do_page_fault+0x128/0x426
[12094.573219] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc
[12094.578779] Disabling lock debugging due to kernel taint
[12094.584086] Unable to handle kernel paging request at virtual address ffffff8000000008

and

[12130.749844] Unable to handle kernel paging request at virtual address ffffffe0016cc0a0
[12130.757046] Oops [#1]
[12130.759292] Modules linked in: fuse
[12130.762768] CPU: 0 PID: 86990 Comm: python3 Not tainted 5.12.4 #1
[12130.768848] Hardware name: SiFive HiFive Unmatched (DT)
[12130.774061] epc : get_page_from_freelist+0x79e/0xdc4
[12130.779010]  ra : get_page_from_freelist+0x71c/0xdc4
[12130.783959] epc : ffffffe00017a8b4 ra : ffffffe00017a832 sp : ffffffe099d47be0
[12130.791170]  gp : ffffffe001a1b770 tp : ffffffe0fbbec600 t0 : 0000000000000001
[12130.798379]  t1 : ffffffe3fec73e40 t2 : ffffffe001000218 s0 : ffffffe099d47d20
[12130.805590]  s1 : 0000000000000010 a0 : 0000000000000000 a1 : ffffffcf0b0000c8
[12130.812798]  a2 : ffffffe0016cc0a0 a3 : 0000000000000002 a4 : ffffffe0019cbfe0
[12130.820007]  a5 : ffffffe0019cbfd0 a6 : ffffffe0019cc010 a7 : ffffffe001a57770
[12130.827216]  s2 : 000000000000003f s3 : 0000000000000010 s4 : ffffffe3fec73e40
[12130.834425]  s5 : ffffffe0019cbf40 s6 : 0000000000000003 s7 : ffffffe3fec73e30
[12130.841635]  s8 : 0000000000000001 s9 : 00000000000000d0 s10: ffffffcf04b3a968
[12130.848843]  s11: 0000000000000003 t3 : 0000000000000000 t4 : 0000000000000000
[12130.856052]  t5 : ffffffcf04b3a960 t6 : ffffffffffffffff
[12130.861350] status: 0000000200000100 badaddr: ffffffe0016cc0a0 cause: 000000000000000f
[12130.869255] Call Trace:
[12130.871687] [<ffffffe00017a8b4>] get_page_from_freelist+0x79e/0xdc4
[12130.877940] [<ffffffe00017bd74>] __alloc_pages_nodemask+0xf4/0x1a4
[12130.884109] [<ffffffe000165d2a>] __handle_mm_fault+0x3c8/0xa08
[12130.889927] [<ffffffe0001663e4>] handle_mm_fault+0x7a/0x100
[12130.895485] [<ffffffe000009bc8>] do_page_fault+0x128/0x426
[12130.900958] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc
[12130.906641] ---[ end trace 2f8d60d8322c02e5 ]---

and

[44732.714147] page dumped because: bad pte
[44732.718056] addr:0000003fb0099000 vm_flags:00200073 anon_vma:ffffffe09592f058 mapping:0000000000000000 index:3fb0099
[44732.728566] file:(null) fault:0x0 mmap:0x0 readpage:0x0
[44732.733779] CPU: 3 PID: 143969 Comm: b-addcon Tainted: G    B D           5.12.4 #1
[44732.741408] Hardware name: SiFive HiFive Unmatched (DT)
[44732.746620] Call Trace:
[44732.749052] [<ffffffe00000576a>] walk_stackframe+0x0/0xe2
[44732.754436] [<ffffffe000ac2bf4>] dump_backtrace+0x4c/0x5a
[44732.759822] [<ffffffe000ac2c26>] show_stack+0x24/0x2c
[44732.764859] [<ffffffe000ac9fa6>] dump_stack+0x7a/0x94
[44732.769897] [<ffffffe000162c9a>] print_bad_pte+0x172/0x1b8
[44732.775370] [<ffffffe0001643dc>] unmap_page_range+0x45e/0x642
[44732.781102] [<ffffffe0001647fe>] unmap_vmas+0x92/0xda
[44732.786140] [<ffffffe00016c028>] exit_mmap+0xb0/0x1ba
[44732.791178] [<ffffffe00000c15c>] mmput+0x4c/0x10a
[44732.795868] [<ffffffe000013054>] do_exit+0x248/0x7fa
[44732.800819] [<ffffffe0000143ec>] do_group_exit+0x3e/0xce
[44732.806117] [<ffffffe00001fb0a>] get_signal+0x1a0/0x828
[44732.811328] [<ffffffe000004b90>] do_notify_resume+0x88/0x356
[44732.816974] [<ffffffe0000039ea>] ret_from_exception+0x0/0xc

I’m not sure if this is some kind of page table corruption or something else. It happens intermittently, not after any specific action.

The system is unresponsive after that. I haven’t changed the clock speed, and the board is mounted in an enclosure with proper airflow (haven’t seen temperatures >50°C in sensors) . This happens with 5.12.4 kernel pre-built from the SDK image, and also with the previous 5.11.x kernel.

An AMD RX 550 GPU is installed, but I am not using it at the moment. I have tried without the GPU connected, same problem. No USB devices are connected. The current workload is headless software testing utilizing CPU (and disk/memory) only.

The file system is a Debian RISC-V rootfs on a Samsung SSD 970 EVO Plus 2TB.

There doesn’t seem to be anything wrong with the board’s RAM, running a (single-threaded) memory tester for a few days didn’t uncover a single problem.

# memtester 15G                                                                                                                                                                                                      
memtester version 4.5.0 (64-bit)                                                                                                                                                                                                    
Copyright (C) 2001-2020 Charles Cazabon.                                                                          
Licensed under the GNU General Public License version 2 (only).                                                                                                                                                                     
                                                                                                                  
pagesize is 4096                                                                                                  
pagesizemask is 0xfffffffffffff000                                                                                                                                                                                                  
want 15360MB (16106127360 bytes)                                                                                                                                                                                                    
got  15360MB (16106127360 bytes), trying mlock ...locked.                                                         
Loop 1:                                                                                                                                                                                                                             
  Stuck Address       : ok                                                                                                                                                                                                          
  Random Value        : ok                                                                                        
  Compare XOR         : ok          
  Compare SUB         : ok                                                                                        
  Compare MUL         : ok                                                                                        
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : setting 216

I can’t really think of anything else to try. I could try removing the NVME, but that will be kind of inconvenient (and it might stop triggering the problem for sake of I/O just being a lot slower, instead of helping narrow down the underlying issue). Or maybe downclocking the CPU.

Disabling two of the cores stops the problem from happening:

echo 0 > ./devices/system/cpu/cpu3/online
echo 0 > ./devices/system/cpu/cpu2/online

(disabling only cpu2 or cpu3 was not enough)
My vague hypothesis is that while creating and destroying a lot of processes, allocating and de-allocating memory, something goes wrong with memory synchronization in the page tables, causing corruption.

The issue still exists on kernel 5.12.18 from demo-coreip-cli-unmatched-2021.07.00.rootfs.wic.xz.

I probably narrowed it down to (frequent?) interaction with the NVME. When mounting a tmpfs as working area, the problem doesn’t arise. Even when running a long time with all cores enabled.

I now do think it’s clock speed related. At 1.2Ghz and below it doesn’t happen anymore given that tmpfs is mounted in /tmp. But I tried higher speeds and I get what looks like this issue again. Of course this is a risk when clocking higher than spec.

I see similar problems during some high I/O loads. I’ve been trying to transfer a few dozen GB to the NVME card and it always locks up quite quickly but not 100% deterministically, i.e. at different amounts and with different kinds of problems, e.g., segfaults of the copying application (pv or dd), kernel ooopses (sometimes with a complete lockup, sometimes with continuous but unreliable service).

Initially, I though it might be related to the PCIe SATA controller I am using but this is not the case. I have not exhaustively tested it, but observed it in the following cases only (but even with transfer rates limited to 15MB/s):

  • PCIe SATA to NVMe
  • USB SATA to NVMe

I have not seen it with:

  • USB SATA to PCIe SATA
  • PCIe SATA to USB SATA
  • GbE to NVMe
  • GbE to /dev/null
  • /dev/zero to NVMe
  • /dev/zero to /dev/null
  • USB SATA to /dev/null

This is with FreedomUSDK 2021.08.00 (kernel 5.13.9) and the NVMe SSD is the recommended Samsung. The CPU frequency is set to 1.2GHz in U-Boot already… (is there any reliable way to determine it in Linux?).

I have also tested it with only 2 cores enabled (as suggested by vmedea) and this indeed also seem to mitigate the problem.

I use “perf stat /bin/ls” to get an estimate of the frequency.

Is there any resolution to these issues?

I’m getting similar problems to these with stock Ubuntu 5.11.10 and unmodified (no overclocking, nor core-reduction) Unmatched.

They are easily reproduced.

Here’s the latest, which halted the Unmatched:

[14988.302143] BUG: Bad page state in process cp pfn:27e92c
[14988.306805] page:00000000b10bfc10 refcount:0 mapcount:0 mapping:00000000bdf980c7 index:0x1 pfn:0x27e92c
[14988.316184] failed to read mapping contents, not a valid kernel address?
[14988.322867] flags: 0x4000000000000000()
[14988.326693] raw: 4000000000000000 0000000000000100 0000000000000122 0000000000610000
[14988.334420] raw: 0000000000000001 0000000000000000 00000000ffffffff
[14988.340673] page dumped because: non-NULL mapping
[14988.345363] Modules linked in: bluetooth ecdh_generic ecc fuse
[14988.351185] CPU: 1 PID: 5826 Comm: cp Not tainted 5.11.10 #1
[14988.356830] Call Trace:
[14988.359262] [] walk_stackframe+0x0/0xaa
[14988.364646] Disabling lock debugging due to kernel taint
[14988.369952] Unable to handle kernel paging request at virtual address 00000000001c03b4
[14988.377850] Oops [#1]
[14988.380106] Modules linked in: bluetooth ecdh_generic ecc fuse
[14988.385927] CPU: 1 PID: 5826 Comm: cp Tainted: G B 5.11.10 #1
[14988.392961] epc: ffffffe00013f2ce ra : ffffffe00013f534 sp : ffffffe0b8563900
[14988.400083] gp : ffffffe000e56510 tp : ffffffe082299d40 t0 : ffffffe000e6803f
[14988.407292] t1 : ffffffe000e68030 t2 : 0000000000000000 s0 : ffffffe0b8563a50
[14988.414502] s1 : 0000000000000010 a0 : 0000000000000001 a1 : 00000000001c03ac
[14988.421711] a2 : ffffffe000e25fe0 a3 : 0000000000000002 a4 : ffffffcf07113ba0
[14988.428920] a5 : ffffffe000e25f10 a6 : ffffffcf08bb0250 a7 : ffffffe000e25f50
[14988.436129] s2 : ffffffcf07113ba8 s3 : 0000000000000010 s4 : 0000000000000021
[14988.443339] s5 : 0000000000000022 s6 : ffffffe000e25e80 s7 : ffffffe3fecd1650
[14988.450547] s8 : 0000000000000001 s9 : ffffffe3fecd1640 s10: 00000000000000d0
[14988.457757] s11: 000000000000003f t3 : ffffffffffffffff t4 : ffffffffffffffff
[14988.464966] t5 : 0000000000000000 t6 : ffffffe0b85635d8
[14988.470265] status: 0000000200000100 badaddr: 00000000001c03b4 cause: 000000000000000f

As far as I know, no, besides disabling cores which is really sub-optimal.

hmm, it does sound like a bug in the silicon.

Any comments from SiFive on this?

This smells like a kernel bug (there were a bunch of similar issues before, riscv64 and other architectures too). Here is what advice:

  • File a bug report with Ubuntu bug tracker (post a link here too).
  • Try to find a reproducer (consistently fails). Record all those failures and attach to the bug report.
  • Remove all non-essiantial HW (e.g. keep just NVMe). Re-run. Some driver for your HW might be broken (happens from time to time).
  • Bisect kernel to find the bad commit.
  • The issue might be related to CONFIG_* entry too. Try reduced defconfig (e.g. take the official riscv64 from the upstream kernel) or/and try a different distribution.

As you can see it takes some effort to reproduce it (e.g. try stress-ng) and narrow down exactly what’s causing it. Most likely it’s yet another bug in the kernel, but someone needs to do this work to catch it at least.

Thanks David! These are great suggestions and exactly what I was looking for originally.

I just found the links to Ubuntu’s Launchpad, so will file the bug there and post it here as you suggest.

BTW, the hardware is minimum already and the problem is easily reproduced with just the recommended M.2 Samsung drive.

Don’t know my way around to change kernels, etc. So, will do the easy one first by trying a different distribution that hopefully has a different kernel version.

Thanks!

You can try this 5.15.0 kernel that I’m running right now. Just sudo dpkg -i linux-image-5.15.0_5.15.0-1_riscv64.deb and sudo dpkg -i linux-headers-5.15.0_5.15.0-1_riscv64.deb.

https://www.w6rz.net/linux-image-5.15.0_5.15.0-1_riscv64.deb

https://www.w6rz.net/linux-headers-5.15.0_5.15.0-1_riscv64.deb

Thanks Ron!! Will take a few days to try these and the previous pointers by David and report back.

Unfortunately, compiling and running a new kernel (5.16.0) did not solve the issue for me. I just got:

[67297.470558] Unable to handle kernel paging request at virtual address ffffff800000203e
[67297.477752] Oops [#1]
[67297.480000] Modules linked in: xts dm_crypt dm_mod amdgpu drm_ttm_helper ttm gpu_sched drm_kms_helper ftdi_sio snd_hda_codec_hdmi backlight snd_hda_intel snd_intel_dspcfg uas cfbfillrect syscopyarea snd_hda_codec cfbimgblt sysfillrect sysimgblt snd_hda_core fb_sys_fops cfbcopyarea fuse
[67297.505279] CPU: 3 PID: 386888 Comm: python3 Not tainted 5.16.0-00010-gb351901d62a5 #22
[67297.513266] Hardware name: SiFive HiFive Unmatched (DT)
[67297.518478] epc : rmqueue_bulk+0xca/0x42c
[67297.522474]  ra : rmqueue_bulk+0x4a/0x42c
[67297.526470] epc : ffffffff8018bf7e ra : ffffffff8018befe sp : ffffffd11a213b50
[67297.533677]  gp : ffffffff81a2c278 tp : ffffffe0a0db4b40 t0 : ffffffff80009c6a
[67297.540886]  t1 : 000000000000000a t2 : ffffffff81000250 s0 : ffffffd11a213c00
[67297.548097]  s1 : ffffffff819da640 a0 : ffffffff819daa20 a1 : 0000000000000001
[67297.555306]  a2 : 4000000000002036 a3 : 0000000000000000 a4 : ffffffff819da710
[67297.562515]  a5 : ffffffff819da640 a6 : 0000000000000cc0 a7 : 00000000000000d0
[67297.569723]  s2 : 0000000000000010 s3 : 0000000000000000 s4 : 0000000000000000
[67297.576934]  s5 : 000000000000000a s6 : 000000000000000a s7 : ffffffe3fed63430
[67297.584142]  s8 : 000000000000003f s9 : 0000000000000001 s10: ffffffcf0c659708
[67297.591353]  s11: ffffffff819da710 t3 : 0000000000000001 t4 : 05000065756c6176
[67297.598562]  t5 : ffffffcf0c659700 t6 : 000000000000000b
[67297.603860] status: 0000000200000100 badaddr: ffffff800000203e cause: 000000000000000f
[67297.611765] [<ffffffff8018bf7e>] rmqueue_bulk+0xca/0x42c
[67297.617061] [<ffffffff8018d3c6>] get_page_from_freelist+0x67c/0x9e0
[67297.623317] [<ffffffff8018e5ee>] __alloc_pages+0x10e/0x1b2
[67297.628789] [<ffffffff80178182>] __handle_mm_fault+0x39e/0x9c2
[67297.634609] [<ffffffff80178820>] handle_mm_fault+0x7a/0x100
[67297.640165] [<ffffffff80009da2>] do_page_fault+0x138/0x39c
[67297.645640] [<ffffffff8000375e>] ret_from_exception+0x0/0xc
[67297.651367] ---[ end trace 74f2ea64d69c4ca9 ]--

I don’t have time at the moment to narrow it down, sorry. I tried with the bare minimum hw before, above. Bisecting is not possible as I have the problem with every kernel.