this post was submitted on 05 Aug 2024

Hi All. I'm having an issue that I am hoping I can get some help with.

I have been using Linux on this particular laptop for over a year now, and for the past 6 or so months (right around the time I upgraded to Plasma 6, but I think it is just a coincidence) about 50% of the time, when I update all my packages via the package manager, the whole system freezes. Like, hard freezes. Waiting any amount of time won't get me out of it. I have to hold the power button to power it down. I can't use Ctrl+Alt+F3 or whatever to get another TTY. Mouse doesn't move. Nothing works.

It originally happened with OpenSUSE Tumbleweed on btrfs. I thought maybe it was btrfs, so I reinstalled with ext4. Same issue. I tried Manjaro. Same issue. I tried EndeavourOS (wasn't really expecting different behavior at this point). Same issue.

Now I am thinking, what could cause an issue like this? Well, a package manager update is just a ton of file I/O operations, right? Could I have bad RAM, with corrupted data getting written to disk? Well, I did a memtest today and it came back perfect. So now I'm thinking it might be the SSD, but I'm not even sure how to check that.

Does anyone have any ideas of what might be going on or what I should do to fix it or debug it?

all 26 comments
[–] thayer@lemmy.ca 13 points 3 months ago* (last edited 3 months ago) (1 children)

It definitely sounds like a hardware issue since it has survived multiple disk wipes and distro changes.

  1. Make and verify your backups now if you don't already have them
  2. Are you using the command line package manager or GUI?
  3. What is your current distro?
  4. Are you near capacity on your storage?
  5. Run a S.M.A.R.T. test and review the results
[–] dandroid@sh.itjust.works 2 points 3 months ago* (last edited 3 months ago) (1 children)
  1. Already done. :)
  2. It started with the GUI, but I switched to the command line, and even did it in a separate TTY to make sure it wasn't something weird going on with updating Plasma from Plasma. After switching to Arch-based distros, I have only used the CLI.
  3. Currently I'm on EndeavourOS, but after the most recent time this happened, it won't boot, and I can't even mount and chroot to it (I get an I/O error message).
  4. No, I'm at about 1% capacity.
  5. I ran this from a live USB, and it came back with no errors detected, but returned instantaneously, so I'm not sure if it ran the right thing. Doing more research on it now. Edit: I did it wrong. The test is running now. Edit 2: SMART says it passed. :/
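
For reference, a quick health query and an actual self-test are separate smartctl invocations: the health query answers immediately, while a long self-test runs inside the drive and its result only shows up in the self-test log later. A minimal sketch, assuming smartmontools is installed; /dev/sdX is a placeholder and the function names are made up for illustration:

```shell
# Start an extended (long) SMART self-test on the given device.
# The command itself returns right away; the test keeps running in the drive.
smart_start_long() {
    sudo smartctl -t long "$1"
}

# Print the overall health verdict, the attribute table, and the self-test log.
# Run this again once the estimated test duration has passed.
smart_report() {
    sudo smartctl -H -A -l selftest "$1"
}

# Usage (replace /dev/sdX with your device, e.g. /dev/sda or /dev/nvme0n1):
#   smart_start_long /dev/sdX
#   smart_report /dev/sdX
```

A "PASSED" verdict only means the drive does not report itself as failing, so it does not rule the drive out if I/O errors are showing up elsewhere.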
[–] thayer@lemmy.ca 5 points 3 months ago (1 children)

If you're now getting I/O errors that won't even get you booted, it sounds to me like drive failure is imminent.

[–] dandroid@sh.itjust.works 2 points 3 months ago (1 children)

Yeah, I just ordered a new SSD. I'll give that a try when it arrives tomorrow. SMART says it passed, but I suspect my SSD enough that I think it's worth it to just try another one.

[–] terminhell@lemmy.dbzer0.com 2 points 3 months ago

Make sure SMART is enabled in the BIOS. Too often I see it flipped off by default.

[–] eugenia@lemmy.ml 4 points 3 months ago (1 children)

I have the same problem with an older MacBook Air and Linux, and I bought a new SSD for it (will be here at the end of the month, since it's a special pre-SSD model). Husband, who's an engineer, said that such hard freezes are usually the SSD's fault, and not the memory's (he said memory creates other kinds of crashes, but not this kind of hard freeze).

[–] dandroid@sh.itjust.works 1 points 3 months ago

Well, I just reinstalled on a new SSD this morning. Fingers crossed it all works out! It usually takes a few weeks for the issues to start happening each time, so I guess I'll just wait in agony until then. 🥲

[–] Drathro@dormi.zone 3 points 3 months ago (1 children)

Try swapping to BFQ io scheduler and see if that makes a difference.

[–] dandroid@sh.itjust.works 1 points 3 months ago (2 children)

Thank you for the suggestion. I am unfamiliar with this, but I am reading about it now.

[–] thayer@lemmy.ca 3 points 3 months ago* (last edited 3 months ago)

For what it's worth, I've never had to change my io scheduler in the nearly twenty years I've used Linux. You can check your current scheduler with the following command: cat /sys/block/sda/queue/scheduler (change the block device to whatever yours is...sda, nvme0n1, etc.).

In my case, it was already bfq: none mq-deadline kyber [bfq]
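
The active scheduler is the one shown in square brackets. If you want to check every disk at once, something like the following sketch should work. The function names are hypothetical, and it assumes the usual sysfs layout under /sys/block:

```shell
# Print the scheduler name that appears inside square brackets in a
# sysfs scheduler line such as: none mq-deadline kyber [bfq]
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print "device: scheduler" for every block device that exposes a scheduler file.
list_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        dev=${f#/sys/block/}             # strip the leading path
        dev=${dev%%/*}                   # keep only the device name
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}

# Usage:
#   list_schedulers
```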

[–] Drathro@dormi.zone 2 points 3 months ago* (last edited 3 months ago)

Even with NVMe drives, which supposedly "don't need" BFQ, I STILL always swap to it since it maintains responsiveness across the system during heavy I/O loads. I used to have similar full system freezes when downloading Steam games, which notoriously overloads your I/O in Linux. BFQ was the solution every single time.

Edit: Try following the instructions detailed in this post to add a udev rule to set the scheduler: https://stackoverflow.com/questions/1009577/selecting-a-linux-i-o-scheduler

The second answer, which shows an actual rules.d file example, has always worked for me. If using NVMe or old school spinning rust you'll need to change it up a bit. Instead of "noop", set it to "bfq".
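
Roughly what that looks like for bfq, as a sketch rather than a copy of the linked answer. The rule file name 60-ioschedulers.rules and the device pattern are assumptions, so adjust them to your disks, and your kernel needs to ship the bfq scheduler:

```shell
# Print a udev rule that selects bfq for sd* and nvme* block devices.
bfq_udev_rule() {
    cat <<'EOF'
# Select the bfq I/O scheduler when a matching disk is added or changed.
ACTION=="add|change", KERNEL=="sd[a-z]*|nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="bfq"
EOF
}

# Usage (needs root; the file name is only a suggestion):
#   bfq_udev_rule | sudo tee /etc/udev/rules.d/60-ioschedulers.rules
#   sudo udevadm control --reload
#   sudo udevadm trigger
```

To try bfq without making it permanent, you can also write the name straight into sysfs, e.g. `echo bfq | sudo tee /sys/block/sda/queue/scheduler`, which lasts until the next reboot.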

[–] ChojinDSL@discuss.tchncs.de 3 points 3 months ago (1 children)

How long did you run the memtest for? Ideally it should run a couple of times, since just a single pass might not detect any errors.

But it's weird that it happens when you try to update. Could it maybe be related to your network hardware, either LAN or WiFi? If you're using WiFi, try LAN, or vice versa. Perhaps even a USB dongle, and disable the onboard network hardware completely.

[–] dandroid@sh.itjust.works 1 points 3 months ago

It did 4 passes by default. It took almost 3 hours to run on 16 GB of RAM.

It's possible there is an issue with the WiFi. I had lots of WiFi issues on this laptop when I used Windows (micro freezing when typing into an SSH shell, and pings would drop at the same time), but since switching to Linux, those went away. I'll definitely keep that in mind. I'll try using a wired network if the issues come back after swapping SSDs.

[–] nublug@lemmy.blahaj.zone 2 points 3 months ago (2 children)

sounds like maaaaybe too little ram and no swap? what is your ram size and do you have any swap or zram enabled? i kinda doubt it because multiple distros should have a swap space or zram on by default on a fresh install but maybe not or you explicitly chose not to and it's running out of memory.

[–] dandroid@sh.itjust.works 2 points 3 months ago (1 children)

I have 16GB of RAM and 16GB of swap as a swap partition. Though I have also tried a 16GB swapfile and saw no difference. I don't know about zram.

[–] seaQueue@lemmy.world 0 points 3 months ago

If you haven't intentionally set up zram swap you're almost certainly not using it.
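
If you'd rather check than guess, here is a small sketch. It assumes the standard /proc/swaps format and the util-linux tools swapon and zramctl; the function names are made up:

```shell
# Succeed if any active swap area listed in the given file
# (default: /proc/swaps) is a zram device.
has_zram_swap() {
    grep -q '^/dev/zram' "${1:-/proc/swaps}"
}

# Show active swap areas and any configured zram devices.
show_swap() {
    swapon --show
    zramctl
}

# Usage:
#   show_swap
#   has_zram_swap && echo "zram swap is active" || echo "no zram swap"
```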

[–] seaQueue@lemmy.world 2 points 3 months ago

Even then the recent LRU swap changes have largely eliminated pathological swap behavior cascading into an unusable system state. Those changes went in like 2y ago and should have been picked up by most distros by now.

[–] seaQueue@lemmy.world 2 points 3 months ago* (last edited 3 months ago) (1 children)

Try firing up btop in a terminal before you kick off an update in another; that should give you a better indication of what's happening when the system hangs. Turn on "show kernel threads" so you can spot anything kernel-side eating CPU.

Check your kernel journal from the last boot after a freeze; there should be some indication of what went wrong before you rebooted the machine. journalctl -k -b -1 will show you what was going on with the kernel before the machine froze.

Edit: things to watch for in btop: CPU pegged at 100% and no disk activity? Look at the top process, there's your offender. Super high IO latency but otherwise the system looks normal? Try another drive. Memory completely used and swap endlessly thrashing? Find something to kill to make more memory available.

Turn on "show kernel threads" in btop, they're off by default, so you can see if something in the kernel is eating CPU time.
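
To cut the journal down to the interesting part, you can filter by priority. A sketch, assuming the journal is persistent (with the default Storage=auto that means /var/log/journal exists), otherwise there is no previous boot to look at; the function name is hypothetical:

```shell
# Show kernel messages of priority "warning" or more severe
# from the previous boot, without opening a pager.
prev_boot_kernel_warnings() {
    journalctl -k -b -1 -p warning --no-pager
}

# Usage:
#   journalctl --list-boots        # confirm that a previous boot was recorded
#   prev_boot_kernel_warnings | tail -n 50
```

Keep in mind that with a hard freeze the last few messages may never have reached the disk, so an empty tail doesn't prove nothing happened.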

[–] dandroid@sh.itjust.works 1 points 3 months ago (1 children)

I will try this once I get my system back up and running tomorrow! I'm going to install the distro on a new SSD and see what happens.

[–] seaQueue@lemmy.world 2 points 3 months ago

I made a couple of edits above re: btop and troubleshooting; if you're not used to diagnosing hardware and kernel issues, they might help.

[–] forbiddenlake@lemmy.world 2 points 3 months ago* (last edited 3 months ago) (1 children)

What hardware? And can you narrow down when during updates?

I had this problem on Arch on a 5 year old Lenovo laptop with an Nvidia 1660 Ti GPU. With judicious use of set -x I narrowed it down to systemctl daemon-reload.

I actually changed my ext4 journal mode and added a pacman hook that calls sync before any systemd hooks run, after the second time half of the package updates got lost due to the freeze.
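
A hook along those lines could look like the sketch below. This is an illustration, not the exact hook described above: the file name 00-sync.hook is an assumption, chosen because pacman runs hooks in alphabetical order of file name, so a low number sorts ahead of the systemd hooks.

```shell
# Print an alpm hook that flushes file system buffers after every
# package install, upgrade, or removal.
sync_hook() {
    cat <<'EOF'
[Trigger]
Operation = Install
Operation = Upgrade
Operation = Remove
Type = Package
Target = *

[Action]
Description = Flushing file system buffers...
When = PostTransaction
Exec = /usr/bin/sync
EOF
}

# Usage (needs root; the file name is only a suggestion):
#   sudo mkdir -p /etc/pacman.d/hooks
#   sync_hook | sudo tee /etc/pacman.d/hooks/00-sync.hook
```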

Because the problem only happened most times, and usually not soon after a reboot, I can't prove it, but the problem hasn't reoccurred since I switched the Nvidia driver to the open flavor.

[–] dandroid@sh.itjust.works 2 points 3 months ago (1 children)

It's a 2021 Asus Zephyrus G15 with an AMD CPU and an Nvidia GPU. I got an aftermarket SSD off of Amazon so I could dual boot with Windows, but I haven't booted back into Windows a single time since installing Linux. Though that might be a good test.

I can try set -x once I reinstall my distro and get it back up and running tomorrow, as it is currently borked. Since zypper does all the downloads then all of the installs, I was able to see that it always happened during the install phase, not the download phase.

I am definitely interested in the possibility of it being related to the proprietary Nvidia driver. When it happened yesterday, the proprietary Nvidia driver was being updated (not sure if it was at that exact time, but it was in the list of packages to be updated). I'll keep an eye on that for sure.

[–] SilverCode@lemm.ee 3 points 3 months ago (1 children)

I had a similar problem with hard lockups, especially when doing package updates (Arch). After seeing a report on Gaming on Linux about the Nvidia 550 driver (I think it was that one) causing freezes, I uninstalled it and just ran on the Intel iGPU. Never had a single freeze again. Waited for the 555 driver, installed that, and immediately got lockups during package updates (and randomly sometimes) again. I've now installed the nvidia-open package to see if it fixes it, and so far so good.

[–] dandroid@sh.itjust.works 2 points 3 months ago

Well, I reinstalled this morning and I'm keeping the nouveau driver this time instead of going with the proprietary one. I don't play games on this laptop anymore since I set up Sunshine/Moonlight, so it shouldn't be a problem.

[–] Boxscape@lemmy.sdf.org 1 points 3 months ago* (last edited 3 months ago)

"So now I'm thinking it might be the SSD, but I'm not even sure how to check that."

There are definitely some utilities for that. SMART or something? They'll at least show you rough drive health/wear.

Edit: never mind, another responder recommended that already.