Compilation failure

I am hitting a compilation failure caused by an internal compiler error.
The error message when building systemd is as follows.

malloc(): unaligned tcache chunk detected
malloc(): unaligned tcache chunk detected
cc: internal compiler error: Aborted signal terminated program cc1
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-10/README.Bugs> for instructions.

It seems to be related to build parallelism: with make -j1 the build usually succeeds and fails only rarely.
I found a similar issue on the Funtoo bug tracker at https://bugs.funtoo.org/browse/FL-8483?attachmentOrder=asc and tried the latest GCC and glibc, but that did not solve it.

Has anyone else faced a similar issue?

My environment is Ubuntu 21.04 with kernel 5.11.0-1009 and gcc 10.3.0-1ubuntu1.
I also tested with the upstream 5.13 kernel and gcc 11.1.0-1ubuntu1~21.04, but neither helped.

The steps I use to reproduce it are:

apt-get source systemd
apt-get build-dep systemd
cd systemd-*
dpkg-buildpackage

If you’ve changed the CPU clock speed, try running it 100 MHz slower.

Is the error reproducible? Do you have USB devices attached? If so, what kind of devices and which sockets are they attached to?

I’ve seen the same problem. The error is non-reproducible in my case, and so far has only happened with USB devices attached. I suspect that only non-HID USB devices can cause a problem, but my system is headless so I’m not sure. I suspect a hardware or kernel problem with the USB support. I would suggest not using any non-HID USB devices when compiling.

Does anything unexpected appear in dmesg when this happens?

I have not changed the clock rate. I use the official u-boot image, so the clock rate is 1.2GHz.
BTW, I will test with a lower clock rate.

It reproduces with high probability when the parallelism is 4.
I also use the board headless, without any USB devices.
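
For anyone who wants to put a number on "high probability", here is a minimal sketch of a loop that repeats a build and counts the runs that hit an internal compiler error. The function name count_ice_runs is hypothetical, and the example invocation assumes the parallelism is passed to dpkg-buildpackage with -j4; adjust it to however you set the job count.

```shell
#!/bin/sh
# Hypothetical helper: run a build command repeatedly and count the runs
# whose output contains an internal compiler error (ICE).
# Usage: count_ice_runs RUNS COMMAND [ARGS...]
count_ice_runs() {
    runs=$1
    shift
    ice=0
    i=1
    while [ "$i" -le "$runs" ]; do
        log=$(mktemp)
        # Capture stdout and stderr together; ignore the exit status,
        # since a failed build is exactly what we are counting.
        "$@" >"$log" 2>&1 || true
        if grep -q "internal compiler error" "$log"; then
            ice=$((ice + 1))
        fi
        rm -f "$log"
        i=$((i + 1))
    done
    # Print the number of runs that contained an ICE.
    echo "$ice"
}

# Example (assumed invocation, run from the unpacked systemd source tree):
#   count_ice_runs 10 dpkg-buildpackage -j4
```

The function prints only the count, so it can be compared across kernels, clock rates, or job counts.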

No, there are no messages.

Unfortunately, it also reproduces at 1.0GHz.

I removed the NVMe SSD from the board and tested again, but the problem still occurred.
I also ran the same test on my Unleashed and it succeeded, so I suspect my board is defective.

Multiple people have reported this problem. I don’t think it is an issue with your board. I suspect a Linux kernel bug or an issue with the SoC.

You are right.
I found an erratum, CIP-1200, that causes improper TLB flushing, but it is already fixed in 5.13.

Could you tell me your kernel version?
I tested with meta-sifive releases 2021.07 and 2021.08, but neither helped.

I’ve been testing freedom-u-sdk releases. What I have on my board now is from about a week before the 2021.08 release, so I might be missing a few patches. It is a 5.13 kernel. I did see two non-reproducible failures with about 3 days of make -j4 builds.

Thanks. I’ll try to test with the new kernel release 5.14.

So “sfence.vma addr” is broken on HiFive Unmatched and must not be used?

Title
Instruction TLB can fail to respect a non-global SFENCE
Implication
If an SFENCE.VMA with rs1 != x0 or rs2 != x0 happens on the same cycle as an I-TLB refill,
the refill still occurs, even if the SFENCE.VMA should’ve flushed the entry being refilled.
This can lead to stale page mappings marked as valid in the TLB, which can in-turn allow
unprivileged accesses, a security hole.
A global sfence.vma must be issued to properly invalidate TLB entries, which would have
only performance implications and not functional.
Workaround
Flush the TLB using SFENCE.VMA x0, x0

A global SFENCE.VMA is a pretty big hammer to use, with performance implications. It probably doesn’t matter if you’re changing a lot of mappings e.g. on a process switch. But it could be pretty detrimental for things such as a JIT compiler.

I wonder if it’s practical to instead ensure that an iTLB refill isn’t happening in the same clock cycle?

How would that even happen? Because the SFENCE.VMA is in the last bytes of a 4k page? Because the other instruction issued in the same cycle is a branch to a different 4k page? Or at least is predicted to branch to a different 4K page.

Something like that?

If so, it seems to me that could be avoided with careful coding. Not as easily as just using a global SFENCE.VMA instead, but it might be worth it.

As Bruce says, it seems OK to use a global sfence.vma.

It still happens on 5.14.