Strange dual-core hang problem, may be due to riscv linux port

benjaminou4412 · June 20, 2019, 10:35pm

The situation is a bit complicated, but I’ll try to explain it concisely:

I'm stress-testing a Xilinx 118 FPGA board configured to simulate two cores.
No matter which tests are run, after about 8-12 hours the entire system hangs and prints this log:

[25480.961569] INFO: rcu_sched detected stalls on CPUs/tasks:
[25480.966414] (detected by 0, t=5367657 jiffies, g=314389, c=314388, q=8925110)
[25480.973689] All QSes seen, last rcu_sched kthread activity 5367657 (4302763136-4297395479), jiffies_till_next_fqs=1, root ->qsmask 0x0
[25480.985721] swapper/0 R running task 0 0 0 0x00000000
[25480.992749] Call Trace:
[25480.995320] [<00000000eed15f23>] walk_stackframe+0x0/0xa2
[25481.000657] [<000000002cfb051f>] show_stack+0x26/0x34
[25481.005678] [<00000000b38ada7e>] sched_show_task+0xa6/0xfc
[25481.011165] [<00000000a284f5dc>] rcu_check_callbacks+0x65a/0x660
[25481.017178] [<00000000345a99ce>] update_process_times+0x1e/0x48
[25481.023085] [<00000000c5067c62>] tick_periodic+0x40/0xac
[25481.028374] [<000000004313fe88>] tick_handle_periodic+0x1a/0x5c
[25481.034289] [<00000000c525e7e4>] riscv_timer_interrupt+0x26/0x32
[25481.040266] [<000000007b2d8fa8>] riscv_intc_irq+0xb4/0xf2
[25481.045666] [<0000000062a56ec1>] ret_from_syscall+0xa/0xe
[25481.051058] rcu_sched kthread starved for 5367657 jiffies! g314389 c314388 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
[25481.061677] rcu_sched R running task 0 8 2 0x00000000
[25481.068704] Call Trace:
[25481.071253] [<000000008c7032a6>] __schedule+0x1c6/0x4ea
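
As a sanity check on the log above, the numeric fields can be pulled out with a short script. This is only a sketch: `parse_stall` is a hypothetical helper, the regular expressions are written against the exact lines pasted here, and the field meanings in the comments (for example, that `->cpu` is the CPU the rcu_sched kthread was last placed on) are my reading of the kernel's stall-warning output rather than something confirmed in this thread.

```python
import re

# Three lines copied verbatim from the stall report above.
LOG = """\
[25480.966414] (detected by 0, t=5367657 jiffies, g=314389, c=314388, q=8925110)
[25480.973689] All QSes seen, last rcu_sched kthread activity 5367657 (4302763136-4297395479), jiffies_till_next_fqs=1, root ->qsmask 0x0
[25481.051058] rcu_sched kthread starved for 5367657 jiffies! g314389 c314388 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0 ->cpu=1
"""


def parse_stall(log):
    """Extract the numeric fields of an rcu_sched stall report (hypothetical helper)."""
    detected = re.search(
        r"detected by (\d+), t=(\d+) jiffies, g=(\d+), c=(\d+), q=(\d+)", log)
    activity = re.search(r"kthread activity (\d+) \((\d+)-(\d+)\)", log)
    starved = re.search(
        r"kthread starved for (\d+) jiffies! g(\d+) c(\d+) f(0x[0-9a-f]+) "
        r"(\w+)\((\d+)\) ->state=(0x[0-9a-f]+) ->cpu=(\d+)", log)
    if detected is None or activity is None or starved is None:
        raise ValueError("log does not contain all three stall lines")
    return {
        "detecting_cpu": int(detected.group(1)),     # CPU that printed the warning
        "stall_jiffies": int(detected.group(2)),     # t=
        "gp_in_progress": int(detected.group(3)),    # g= (assumed: current grace period)
        "gp_completed": int(detected.group(4)),      # c= (assumed: last completed one)
        "queued_callbacks": int(detected.group(5)),  # q=
        "jiffies_now": int(activity.group(2)),
        "jiffies_last_activity": int(activity.group(3)),
        "starved_jiffies": int(starved.group(1)),
        "gp_state": starved.group(5),
        "kthread_cpu": int(starved.group(8)),        # ->cpu= (assumed: kthread's last CPU)
    }


if __name__ == "__main__":
    info = parse_stall(LOG)
    gap = info["jiffies_now"] - info["jiffies_last_activity"]
    print("jiffies since last kthread activity:", gap)
    print("matches t= field:", gap == info["stall_jiffies"])
    print("rcu_sched kthread last on CPU", info["kthread_cpu"],
          "in state", info["gp_state"])
```

On this log the difference of the two jiffies values equals the `t=` field, and the grace-period counters differ by one, so a grace period was started but never finished. The warning was printed by CPU 0 while the starved kthread is reported on CPU 1.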

It seems to suggest CPU1 wasn’t initialized properly, or is otherwise not functioning.
I suspect the problem might have to do with RCU in the Linux kernel, but that's hard to believe, since RCU isn't part of the RISC-V-specific code in the kernel.
We’re using a frozen older version of the RISC-V Linux kernel port: riscvarchive/riscv-linux on GitHub, at commit 758d792057a2c0276844bc88e790f3ddabfc43ae.

Anyone else encounter this before?

paulw · June 21, 2019, 3:46pm

Does this patch help?

https://lore.kernel.org/linux-riscv/CALoQrwdLANaOaYiGvFxt23PBdHcgcc_LWVFORNwrAXWBhOyJsA@mail.gmail.com/

benjaminou4412 · June 26, 2019, 9:13pm

The patch did seem to change one thing - that big mess of a log above no longer gets printed when the cores hang. Unfortunately, the cores still hang.

tmagik · June 26, 2019, 9:19pm

That seems to be a known issue I kept running into with 4.15.

Are you able to upgrade to 4.19 [1]? That version seems to be quite reliable.

[1] https://github.com/sifive/riscv-linux/tree/20eeb6522e3302c5f6e435c0bdba40ff57ffa41a

benjaminou4412 · June 26, 2019, 9:36pm

I’m aware 4.19 probably fixes the issue - the part of the kernel I’ve been looking at (the initialization code) changes in ways that would likely resolve this core hang. However, the underlying structure in 4.19 is so different from the one in 4.15 that applying the same change to my local version of the kernel doesn’t work. Upgrading isn’t an option either - the FPGA, for whatever reason, only seems to work with 4.15.

paulw · June 26, 2019, 11:12pm

Are you using one of SiFive’s SoC builds?
