this post was submitted on 25 May 2024
You haven't given us much information about the CPU. That is very important when dealing with Machine Check Errors (MCEs).
I've done a bit of work with MCEs and AMD CPUs, so I'll help with understanding what may be going wrong and what you probably can do.
I did a bit of searching based on the microcode and the Dell Wyse thin client that you mentioned. From what I can gather, you are using a Dell Wyse 5060 Thin Client with an AMD Steppe Eagle GX-424 [1]; is that right? This is my assumption for the rest of this comment.
Machine Check Errors (MCEs) are hard to decipher without the right documentation. As far as I can tell from AMD's data sheet for the G-Series [2], this CPU belongs to Family 16h.
You have two MCEs in your image:
Now, you can attempt to decipher these with a tool I used some time ago, MCE-Ryzen-Decoder [4]. You may note that the name says Ryzen - this tool only decodes MCEs of Ryzen architectures. MCE designs may not change much between families, but I wouldn't bank (pun not intended) on it, because the G-Series parts are embedded SoCs and the Ryzen CPUs are not. I gave it a shot and the tool spit out that you may have an issue in:
Wouldn't bank (pun intended this time) on it though.
What you can do is go through the AMD Family 16h BIOS and Kernel Developer's Guide [3] (Section 2.16.1.5 Error Code). From Section 2.16.1.1 Machine Check Registers, it looks like Bank 01 corresponds to the IC (Instruction Cache) and Bank 04 corresponds to the NB (Northbridge). This means that the CPU reported an NB error on core 0 and an IC error on core 1. You can go even further and check what those exact codes decode to, but I wouldn't put in that much effort - there's not much you can do with that info (maybe for the NB, but... too much effort). There are some MSRs that you can read out that correspond to errors of these banks (from Table 86: Registers Commonly Used for Diagnosis), but like I said, there's not much you can do with this info anyway.
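If you do want to poke at a raw status value, here is a minimal Python sketch of how a bank's status register splits into fields. Assumptions on my part: the flag bit positions are the common x86 machine-check layout (please confirm against the BKDG [3] before trusting them for this family), the bank table only contains the two banks named above, and the example value is made up, not read from your screenshot.

```python
# Minimal sketch: split a raw MCi_STATUS value into a few fields.
# Bit positions follow the common x86 machine-check layout (assumption:
# verify against the Family 16h BKDG before relying on them).

# Only the two banks discussed in this comment are mapped.
BANK_NAMES = {1: "IC (Instruction Cache)", 4: "NB (Northbridge)"}

FLAG_BITS = {
    "VAL": 63,    # the register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # the MISC register holds valid data
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(bank, status):
    """Return the bank name, the low 16-bit error code, and the flag bits."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "bank": BANK_NAMES.get(bank, "unknown bank %d" % bank),
        "error_code": status & 0xFFFF,
        "flags": flags,
    }


if __name__ == "__main__":
    # Hypothetical status value for illustration only.
    example = (1 << 63) | (1 << 61) | (1 << 60) | 0x0151
    print(decode_status(1, example))
```

The 16-bit error code is what Section 2.16.1.5 of [3] would then let you interpret; this sketch deliberately stops before that step.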
Okay, now that the boring part is over (it was fun for me), what can you do? It looks like the CPU is a quad-core CPU. As far as I know, Family 16h cores don't support SMT, so that means four logical CPUs, numbered 0 to 3. If you have access to the Linux kernel command line parameters [5], say via GRUB, I would try to isolate the two faulty cores we see here: core 0 and core 1. Add isolcpus=0,1 to the command line and see whether the kernel boots. There's a good chance that only these two cores are failing, but others may also be faulty without their errors having been printed. It's worth a shot, but it may not work.
Alternatively, you can tell the kernel to disable MCE checks entirely and continue executing; this can be done with the mce=off command line parameter [6]. Beware that this means you're now willingly running code on a CPU with two cores that have been shown to be faulty (so far).
isolcpus will make sure that the kernel doesn't execute any "user" code on those cores unless asked to (via taskset, for example).
Apart from this, like others have pointed out, the red dots on the screen aren't a great sign. Maybe you can individually replace defective parts, or maybe you have to buy a new machine entirely. What I'm suggesting with this comment is a way to check whether your machine still works with the two faulty cores taken out of use.
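If the machine does boot with isolcpus=0,1, you can check that the isolation took effect. Below is a small Python sketch that reads the kernel's CPU lists from sysfs and works out which CPUs are left for normal scheduling. Assumptions on my part: the sysfs files /sys/devices/system/cpu/online and /sys/devices/system/cpu/isolated are present on your kernel, and the lists only use plain numbers and ranges such as "0,1" or "0-3". The helper names are mine.

```python
# Minimal sketch: work out which CPUs remain after isolcpus=0,1.
# Handles only plain CPU lists such as "0,1" or "0-3" (assumption).


def parse_cpu_list(text):
    """Turn a CPU list string like '0,2-3' into a set of integers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means no CPUs
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def usable_cpus(online, isolated):
    """Return the sorted CPUs that are online and not isolated."""
    return sorted(parse_cpu_list(online) - parse_cpu_list(isolated))


def read_sysfs(name):
    """Read one CPU list from sysfs, or return '' if it is unavailable."""
    try:
        with open("/sys/devices/system/cpu/" + name) as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    online = read_sysfs("online")
    isolated = read_sysfs("isolated")
    print("online:  ", sorted(parse_cpu_list(online)))
    print("isolated:", sorted(parse_cpu_list(isolated)))
    print("usable:  ", usable_cpus(online, isolated))
```

On a four-CPU machine booted with isolcpus=0,1, I would expect the isolated list to contain 0 and 1 and the usable list to contain 2 and 3. If the isolated list comes back empty, the parameter probably didn't make it onto the kernel command line.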
Good luck and I hope you fix your server 🤞.
Edited to add: I have seen MCEs appear due to extremely low/high/fluctuating voltages. As others pointed out, your PSU or other components related to power could be busted.
[1] https://www.dell.com/support/manuals/en-us/wyse-5060-thin-client/5060_wie10_ug/system-specifications?guid=guid-cbeecec5-25ac-4103-8b4b-7d3a975e91f0&lang=en-us
[2] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/datasheets/52259_KB_G-Series_Product_Data_Sheet.pdf
[3] https://www.amd.com/content/dam/amd/en/documents/archived-tech-docs/programmer-references/52740_16h_Models_30h-3Fh_BKDG.pdf
[4] https://github.com/DimitriFourny/MCE-Ryzen-Decoder
[5] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
[6] https://elixir.bootlin.com/linux/v6.9.2/source/Documentation/arch/x86/x86_64/boot-options.rst
Amazing. I'm not OP and have no use for this info, but it was fun to learn it still.