Compilation failure

toshipp · July 24, 2021, 7:42am

My builds are failing with an internal compiler error.
The error message when building systemd is as follows.

malloc(): unaligned tcache chunk detected
malloc(): unaligned tcache chunk detected
cc: internal compiler error: Aborted signal terminated program cc1
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.

It seems to be related to build parallelism: with make -j1 the build usually succeeds and only rarely fails.
I found a similar issue on the Funtoo bug tracker at https://bugs.funtoo.org/browse/FL-8483?attachmentOrder=asc, and tried the latest GCC and glibc, but that did not solve it.

Has anyone else run into a similar issue?

My environment is Ubuntu 21.04 with kernel 5.11.0-1009 and gcc 10.3.0-1ubuntu1.
I also tested with an upstream 5.13 kernel and with gcc 11.1.0-1ubuntu1~21.04, but neither helped.

These are the steps I use to reproduce it:

apt-get source systemd
apt-get build-dep systemd
cd systemd-*
dpkg-buildpackage
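Because the failure is intermittent, it can help to repeat the build and count how often it fails at a given parallelism. The following is a minimal sketch, not something used in this thread: `run_trials` is a hypothetical helper name, and the `trial-N.log` file names are an assumption chosen for illustration.

```shell
# run_trials N CMD...: run CMD N times and print how many runs failed.
# Hypothetical helper for illustration; not part of any package.
run_trials() {
    n=$1
    shift
    fails=0
    i=1
    while [ "$i" -le "$n" ]; do
        # Keep each run's output so the compiler error can be inspected later.
        if ! "$@" > "trial-$i.log" 2>&1; then
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails"
}
```

Inside the unpacked source directory, something like `run_trials 10 dpkg-buildpackage -j4 -b -uc -us` would then print the number of failed builds out of ten. Comparing that count between `-j1` and `-j4` gives a rough idea of how strongly the failure depends on parallelism.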

bruce · July 24, 2021, 9:52am

If you’ve changed the CPU clock speed, try running it 100 MHz slower.

jimw · July 24, 2021, 8:02pm

Is the error reproducible? Do you have USB devices attached? If so, what kind of devices and which sockets are they attached to?

I’ve seen the same problem. The error is not reproducible in my case, and so far it has only happened with USB devices attached. I suspect that only non-HID USB devices can cause the problem, but my system is headless so I’m not sure. I suspect a hardware or kernel problem with the USB support. I would suggest not using any non-HID USB devices when compiling.

vmedea · July 25, 2021, 9:02am

Does anything unexpected appear in dmesg when this happens?
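One way to look for this is to filter the kernel log for common signs of trouble right after a failed build. This is only a sketch: `scan_kernel_log` is a hypothetical helper, and the keyword list is an assumption about what might be worth looking for, not an exhaustive set.

```shell
# scan_kernel_log: read kernel log text on stdin and print suspicious lines.
# The keyword list below is an assumption and can be extended as needed.
scan_kernel_log() {
    grep -E -i 'oops|BUG:|unhandled signal|segfault|call trace|out of memory'
}
```

It could be used as `dmesg | scan_kernel_log` (reading the kernel log may require root on some systems). No output means none of the keywords matched, which does not rule out a kernel or hardware problem.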

toshipp · July 25, 2021, 2:25pm

I have not changed the clock rate. I use the official u-boot image, so the clock rate is 1.2GHz.
I will test with a lower clock rate anyway.

toshipp · July 25, 2021, 2:28pm

It reproduces with high probability when the parallelism is 4.
I also run the board headless, without any USB devices attached.

toshipp · July 25, 2021, 2:30pm

No, there are no messages.

toshipp · July 25, 2021, 2:55pm

Unfortunately, it also reproduces at 1.0GHz.

toshipp · August 28, 2021, 9:28am

I removed the NVMe SSD from the board and tested again, but the problem still occurred.
I also ran the same test on my HiFive Unleashed and it succeeded, so I suspect my board is defective.

jimw · August 28, 2021, 4:27pm

Multiple people have reported this problem. I don’t think it is an issue with your board. I suspect a Linux kernel bug or an issue with the SoC.

toshipp · August 29, 2021, 1:48pm

You are right.
I found erratum CIP-1200, which causes improper TLB flushing, but the fix for it is already in 5.13.

github.com/torvalds/linux

riscv: sifive: Apply errata "cip-1200" patch

committed 02:26PM - 22 Mar 21 UTC

+40 -2

For certain SiFive CPUs, "sfence.vma addr" cannot exactly flush addr from TLB in the particular cases. The details could be found here: https://sifive.cdn.prismic.io/sifive/167a1a56-03f4-4615-a79e-b2a86153148f_FU740_errata_20210205.pdf
In order to ensure the functionality, this patch uses the Alternative scheme to replace all "sfence.vma addr" with "sfence.vma" at runtime.
Signed-off-by: Vincent Chen <vincent.chen@sifive.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>

Could you tell me your kernel version?
I tested with meta-sifive releases 2021.07 and 2021.08, but neither helped.
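To confirm that a given kernel was actually built with this workaround, one option is to look for the matching option in the kernel config. This is a minimal sketch under two assumptions: that the option is named `CONFIG_ERRATA_SIFIVE_CIP_1200`, as in the mainline RISC-V errata Kconfig, and that the distribution installs a plain-text config file. `has_cip1200_workaround` is a hypothetical helper name.

```shell
# has_cip1200_workaround CONFIG_FILE
# Succeeds if the given kernel config enables the CIP-1200 errata option.
# Assumes the option name CONFIG_ERRATA_SIFIVE_CIP_1200.
has_cip1200_workaround() {
    grep -q '^CONFIG_ERRATA_SIFIVE_CIP_1200=y' "$1"
}
```

On Ubuntu this could be called as `has_cip1200_workaround "/boot/config-$(uname -r)" && echo enabled`. On kernels that only expose `/proc/config.gz`, the file would need to be decompressed first, for example with `zcat`. The option being enabled only shows that the patch was compiled in, not that it resolves the failures described here.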

jimw · August 29, 2021, 4:38pm

I’ve been testing freedom-u-sdk releases. What I have on my board now is from about a week before the 2021.08 release, so I might be missing a few patches. It is a 5.13 kernel. I did see two non-reproducible failures with about 3 days of make -j4 builds.

toshipp · August 30, 2021, 2:17pm

Thanks. I’ll try to test with the new kernel release 5.14.

X512 · August 30, 2021, 3:13pm

So “sfence.vma addr” is broken on HiFive Unmatched and must not be used?

bruce · August 31, 2021, 12:27am

Title
Instruction TLB can fail to respect a non-global SFENCE
Implication
If an SFENCE.VMA with rs1 != x0 or rs2 != x0 happens on the same cycle as an I-TLB refill,
the refill still occurs, even if the SFENCE.VMA should’ve flushed the entry being refilled.
This can lead to stale page mappings marked as valid in the TLB, which can in-turn allow
unprivileged accesses, a security hole.
A global sfence.vma must be issued to properly invalidate TLB entries, which would have
only performance implications and not functional.
Workaround
Flush the TLB using SFENCE.VMA x0, x0

A global SFENCE.VMA is a pretty big hammer to use, with performance implications. It probably doesn’t matter if you’re changing a lot of mappings e.g. on a process switch. But it could be pretty detrimental for things such as a JIT compiler.

I wonder if it’s practical to instead ensure that an iTLB refill isn’t happening in the same clock cycle?

How would that even happen? Because the SFENCE.VMA is in the last bytes of a 4K page? Because the other instruction issued in the same cycle is a branch to a different 4K page, or at least is predicted to branch to one?

Something like that?

If so, it seems to me that could be avoided with careful coding. Not as easily as just using a global SFENCE.VMA instead, but it might be worth it.

toshipp · August 31, 2021, 2:46pm

As bruce says, it seems OK to use a global SFENCE.VMA.

toshipp · September 12, 2021, 8:39am

It still happens on 5.14.
