The situation is a bit complicated, but I’ll try to explain it concisely:
- We’re stress-testing a Xilinx 118 FPGA board configured to simulate two cores.
- No matter what kind of tests we run, after about 8-12 hours the entire system hangs and prints this log:
[25480.961569] INFO: rcu_sched detected stalls on CPUs/tasks:
[25480.966414] (detected by 0, t=5367657 jiffies, g=314389, c=314388, q=8925110)
[25480.973689] All QSes seen, last rcu_sched kthread activity 5367657 (4302763136-4297395479), jiffies_till_next_fqs=1, root ->qsmask 0x0
[25480.985721] swapper/0 R running task 0 0 0 0x00000000
[25480.992749] Call Trace:
[25480.995320] [<00000000eed15f23>] walk_stackframe+0x0/0xa2
[25481.000657] [<000000002cfb051f>] show_stack+0x26/0x34
[25481.005678] [<00000000b38ada7e>] sched_show_task+0xa6/0xfc
[25481.011165] [<00000000a284f5dc>] rcu_check_callbacks+0x65a/0x660
[25481.017178] [<00000000345a99ce>] update_process_times+0x1e/0x48
[25481.023085] [<00000000c5067c62>] tick_periodic+0x40/0xac
[25481.028374] [<000000004313fe88>] tick_handle_periodic+0x1a/0x5c
[25481.034289] [<00000000c525e7e4>] riscv_timer_interrupt+0x26/0x32
[25481.040266] [<000000007b2d8fa8>] riscv_intc_irq+0xb4/0xf2
[25481.045666] [<0000000062a56ec1>] ret_from_syscall+0xa/0xe
[25481.051058] rcu_sched kthread starved for 5367657 jiffies! g314389 c314388 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
[25481.061677] rcu_sched R running task 0 8 2 0x00000000
[25481.068704] Call Trace:
[25481.071253] [<000000008c7032a6>] __schedule+0x1c6/0x4ea
- The log seems to suggest that CPU1 (note the ->cpu=1 on the starved rcu_sched kthread line) wasn’t initialized properly, or is otherwise not functioning.
- I suspect the problem might have to do with “RCU” in the Linux kernel, but it’s hard to believe that RCU itself is at fault, since it isn’t part of the RISC-V-specific code in the kernel.
- We’re using a frozen, older version of the RISC-V Linux kernel port: riscvarchive/riscv-linux on GitHub, at commit 758d792057a2c0276844bc88e790f3ddabfc43ae.
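For anyone who wants to sanity-check the numbers in the log, here is a minimal Python sketch that pulls the jiffies values and the CPU number out of the two stall lines. The regular expressions and the helper names (parse_stall, stall_seconds) are my own, not part of any kernel tooling. I don’t know the CONFIG_HZ of this build, so the values 100, 250 and 1000 below are just assumed candidates. I am also assuming that the two numbers in parentheses are the current jiffies and the jiffies at the kthread’s last activity, and that ->cpu is the CPU the kthread last ran on.

```python
import re

# Two lines copied verbatim from the log above.
LOG = """\
[25480.973689] All QSes seen, last rcu_sched kthread activity 5367657 (4302763136-4297395479), jiffies_till_next_fqs=1, root ->qsmask 0x0
[25481.051058] rcu_sched kthread starved for 5367657 jiffies! g314389 c314388 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
"""

# Hypothetical patterns written for this log format only.
ACTIVITY_RE = re.compile(r"kthread activity (\d+) \((\d+)-(\d+)\)")
STARVED_RE = re.compile(r"kthread starved for (\d+) jiffies!.*->cpu=(\d+)")


def parse_stall(log):
    """Return (reported_delta, jiffies_now, jiffies_last, starved_jiffies, cpu)."""
    act = ACTIVITY_RE.search(log)
    starved = STARVED_RE.search(log)
    if act is None or starved is None:
        raise ValueError("log does not contain the expected RCU stall lines")
    reported, now, last = (int(x) for x in act.groups())
    return reported, now, last, int(starved.group(1)), int(starved.group(2))


def stall_seconds(jiffies, hz):
    """Convert a jiffies count to seconds for a given tick rate."""
    return jiffies / hz


reported, now, last, starved, cpu = parse_stall(LOG)
print(f"now - last = {now - last} jiffies (log reports {reported})")
print(f"kthread starved for {starved} jiffies, last ran on CPU {cpu}")
for hz in (100, 250, 1000):  # assumed candidate CONFIG_HZ values
    print(f"HZ={hz}: stalled for about {stall_seconds(starved, hz):.0f} s")
```

Comparing the printed durations with the roughly 25481 s of uptime shown in the log timestamps gives a rough idea of when the rcu_sched kthread stopped getting scheduled, and which HZ values are even plausible.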
Has anyone else encountered this before?