Strange dual-core hang, possibly due to the RISC-V Linux port

The situation is a bit complicated, but I’ll try to explain it concisely:

  • I’m stress-testing a Xilinx 118 FPGA board configured to simulate two cores.
  • No matter which tests are run, after about 8-12 hours the entire system hangs and prints this log:

[25480.961569] INFO: rcu_sched detected stalls on CPUs/tasks:
[25480.966414] (detected by 0, t=5367657 jiffies, g=314389, c=314388, q=8925110)
[25480.973689] All QSes seen, last rcu_sched kthread activity 5367657 (4302763136-4297395479), jiffies_till_next_fqs=1, root ->qsmask 0x0
[25480.985721] swapper/0 R running task 0 0 0 0x00000000
[25480.992749] Call Trace:
[25480.995320] [<00000000eed15f23>] walk_stackframe+0x0/0xa2
[25481.000657] [<000000002cfb051f>] show_stack+0x26/0x34
[25481.005678] [<00000000b38ada7e>] sched_show_task+0xa6/0xfc
[25481.011165] [<00000000a284f5dc>] rcu_check_callbacks+0x65a/0x660
[25481.017178] [<00000000345a99ce>] update_process_times+0x1e/0x48
[25481.023085] [<00000000c5067c62>] tick_periodic+0x40/0xac
[25481.028374] [<000000004313fe88>] tick_handle_periodic+0x1a/0x5c
[25481.034289] [<00000000c525e7e4>] riscv_timer_interrupt+0x26/0x32
[25481.040266] [<000000007b2d8fa8>] riscv_intc_irq+0xb4/0xf2
[25481.045666] [<0000000062a56ec1>] ret_from_syscall+0xa/0xe
[25481.051058] rcu_sched kthread starved for 5367657 jiffies! g314389 c314388 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
[25481.061677] rcu_sched R running task 0 8 2 0x00000000
[25481.068704] Call Trace:
[25481.071253] [<000000008c7032a6>] __schedule+0x1c6/0x4ea

  • The log seems to suggest that CPU1 wasn’t initialized properly, or is otherwise not functioning.
  • I suspect the problem has to do with RCU in the Linux kernel, but that is hard to believe, since RCU is not part of the RISC-V-specific code.
  • We’re using a frozen, older version of the RISC-V Linux kernel port: riscvarchive/riscv-linux on GitHub, at 758d792057a2c0276844bc88e790f3ddabfc43ae.
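For anyone reading the numbers in the stall report: the "last rcu_sched kthread activity" line carries its own arithmetic, and it can be sanity-checked with a short script. This is only a rough sketch. The two values in parentheses appear to be the current jiffies count and the jiffies count at the kthread's last activity, and `ASSUMED_HZ` is a guess, since the kernel's CONFIG_HZ isn't stated anywhere in this thread. `parse_stall` is a helper name made up for this example.

```python
import re

# Line copied verbatim from the stall report above.
LOG_LINE = (
    "[25480.973689] All QSes seen, last rcu_sched kthread activity 5367657 "
    "(4302763136-4297395479), jiffies_till_next_fqs=1, root ->qsmask 0x0"
)

# ASSUMPTION: CONFIG_HZ is not given in the thread; 250 is only a placeholder.
ASSUMED_HZ = 250


def parse_stall(line):
    """Return (uptime_s, starved_jiffies, now_jiffies, last_activity_jiffies)."""
    m = re.search(
        r"\[\s*(\d+\.\d+)\].*kthread activity (\d+) \((\d+)-(\d+)\)", line
    )
    if m is None:
        raise ValueError("not an rcu_sched 'kthread activity' line")
    return float(m.group(1)), int(m.group(2)), int(m.group(3)), int(m.group(4))


uptime_s, starved, now_j, last_j = parse_stall(LOG_LINE)

# The reported starvation should equal the difference of the two jiffies values.
consistent = (now_j - last_j) == starved

# Convert to seconds under the assumed tick rate and compare with uptime.
starved_s = starved / ASSUMED_HZ

print("difference matches reported value:", consistent)
print("starved for about %.0f s (at assumed HZ=%d)" % (starved_s, ASSUMED_HZ))
print("uptime at report: %.0f s" % uptime_s)
```

Comparing the starvation time with the uptime gives a rough idea of when the rcu_sched kthread last ran, but only once the real HZ value is substituted in.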

Has anyone else encountered this before?

Does this patch help?

https://lore.kernel.org/linux-riscv/CALoQrwdLANaOaYiGvFxt23PBdHcgcc_LWVFORNwrAXWBhOyJsA@mail.gmail.com/

The patch did change one thing: the log above is no longer printed when the cores hang. Unfortunately, the cores still hang.

That seems to be a known issue I kept running into with 4.15.

Are you able to upgrade to 4.19 [1]? That version seems to be quite reliable.

[1] https://github.com/sifive/riscv-linux/tree/20eeb6522e3302c5f6e435c0bdba40ff57ffa41a

I’m aware that 4.19 probably fixes the issue: the part of the kernel I’ve been looking at (the initialization code) changes in ways that would likely resolve this core hang. However, the underlying structure in 4.19 is so different from that of 4.15 that implementing the same change in my local version of the kernel doesn’t work. Upgrading isn’t an option either; the FPGA, for whatever reason, only seems to work with 4.15.

Are you using one of SiFive’s SoC builds?