Has anyone seen weirdness with proxmox guests using 100% CPU for no reason?

So, bit of a weird question this, and not really sure I've got enough information to go on, but thought I'd ask anyway...

So, my two dedis are running manually set-up VMs that I created using virsh, with my own script to generate cloud-init configs, etc. Neither of these machines ever seems to have any problems with its VMs.

But when I upgraded my home network to fibre, I decided to get one of those multi-port NUCs and run pfSense (FreeBSD) virtualised on the free version of Proxmox, set up with pretty much the default configuration, following the guide from STH. This pfSense instance has been 100% reliable; I've never had any downtime with it.

But I also decided to make use of the spare capacity with a couple of smaller VMs. The main one is an extra borg backup destination for my most critical VMs from the dedis, plus a couple of ancillary ones that get very little use, e.g. a proxy to a site that requires a whitelisted IP (it uses WireGuard to tunnel to one of my dedis), and an SSH jump host into my home network (which the borg clients use).

Now, all of these VMs see very little use, so I don't often notice when they have issues (apart from borg, which triggers an email when a client fails to back up), so I don't know exactly when they fail. But sometimes, after a couple of weeks of not really doing much, the CPU on these instances will suddenly jump to 100% and the machine won't respond to pings or to input on the (serial) console. Interestingly, it's usually not borg that this happens to, but the really idle machines that are basically just forwarding things.

Anyway, when this happens, the VM shows 100% CPU usage, but the Linux kernel itself doesn't appear to be running at all, since it doesn't even respond to pings. The only solution is to connect to the Proxmox interface and force a stop/reset. The safer reboot/shutdown options are ignored, presumably because they just send an ACPI event to the guest.
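For anyone wanting to do the same from the host CLI rather than the web interface, a rough sketch (the VMID of 101 is hypothetical; substitute your own):

```shell
# Graceful attempts first - these send ACPI events, so a wedged guest ignores them:
qm shutdown 101 --timeout 60
qm reboot 101

# Hard stop / reset, equivalent to the GUI's Stop and Reset buttons:
qm stop 101
qm reset 101
```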

Today, this happened to the borg instance while I was able to jump on the Proxmox interface to examine it, and it looks like the load suddenly jumped while extra disk space was being allocated to the VM (Proxmox seems to use these sparse LVM thin volumes, which is the main difference from the instances on my dedis, which are all allocated space from a pre-allocated volume group).

Now, the graphs only seem to have one-minute resolution, but the first jump in disk allocation corresponded to the start of the CPU spike from 0% to 50% (which is understandable, as the borg server was actually doing something) and a spike in network traffic; a minute later, CPU hits 90% and traffic increases further; another minute later, CPU is still at 90%, traffic hits its maximum, and the disk allocation increases again. The next minute, CPU is stuck at 100% and network traffic is negligible (presumably SSH retrying to connect), and zero the minute after that.

So now my theory is that maybe this random Proxmox thing is triggered whenever an LVM thin allocation is increased, maybe because of logging, log rotation, or something else. Mostly, I'm concluding this because I've changed everything about the virtual hardware on these machines, so no matter what disk type, video card type, or network card type, they all seem to end up in this stuck-CPU state at some random point. Of course, it could just be a genuine hardware issue, but the fact that the pfSense instance has been rock solid (only rebooted once in the 7 months I've been using this setup, and that was because I moved the router to a new location and had to unplug it) makes me think there's some weirdness with Proxmox itself...

There's no real reason for Proxmox to be using sparse (thin) LVM volumes here anyway, as I've only allocated about half of my 500G SSD to VMs, but it was the default setting when I set it up, so I left it alone.
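For anyone wanting to check the same thing on their own host, whether a volume is thin-provisioned and how full the pool is can be seen with something like this (assuming the default `pve` volume group name; adjust to yours):

```shell
# List logical volumes, their type, and how full the thin pool/volumes are.
# segtype "thin-pool" / "thin" indicates thin provisioning:
lvs -o lv_name,lv_size,segtype,data_percent,metadata_percent pve

# Proxmox storage definitions; "lvmthin:" entries are thin-provisioned:
cat /etc/pve/storage.cfg
```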

Has anyone else seen weirdness like this in proxmox before? Or just as valid feedback, do most people just use the sparse LVM allocation and have no issues at all?

Comments

  • I should maybe add I'm using Proxmox VE 7.2-3.

  • Oh, and the reason for using the thinpool stuff in the first place is that I thought it was supposed to be useful for doing snapshots/backups.

  • Not as drastic as your scenario, but I had something similar about 2 years ago and it resolved itself, likely through Proxmox updates. I did find that VMs were more stable when I made sure they had qemu-guest-agent running. You may wish to check that.

    Thanked by (1)ralf

    It wisnae me! A big boy done it and ran away.
    NVMe2G for life! until death (the end is nigh)

  • @AlwaysSkint said:
    Not as drastic as your scenario, but I had something similar about 2 years ago and it resolved itself, likely through Proxmox updates. I did find that VMs were more stable when I made sure they had qemu-guest-agent running. You may wish to check that.

    I don't have qemu-guest-agent on my dedi's VMs, but it's worth a try, so I've enabled it and I'll keep watching out for it happening again.
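    For reference, enabling it takes two steps, one in the guest and one on the host; a sketch assuming a Debian-based guest and a hypothetical VMID of 101:

    ```shell
    # Inside the guest:
    apt install qemu-guest-agent
    systemctl enable --now qemu-guest-agent

    # On the Proxmox host, enable the agent option for the VM:
    qm set 101 --agent enabled=1

    # After the guest reboots, this should get a reply if the agent is talking:
    qm guest cmd 101 ping
    ```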

  • Your post had too many words for me to read it all, but how many CPUs/cores? Do you know if the CPU usage is coming from the VM or the host - can you try limiting the number of cores on the VM to leave at least one for the host so that it responds? When you say "CPU" is 100%, what does that mean - sys, user, si, so, ...? What is the memory usage - specifically, is it swapping?

  • The host is fine. It has 4 cores, and all the VMs have only 1 core assigned. When I say the CPU of the guest is 100%, I mean exactly that: in the Proxmox admin page, the summary page for that VM shows 100% of 1 CPU, and the host summary chart shows 25% of 4 CPUs used. Memory is not overcommitted, there's no swap space, and I've only allocated about 10GB of the 16GB installed.

    But perhaps, if you want to help, reading the description of what was actually going wrong would be useful. As I said, the guest VM is no longer doing anything useful - the kernel no longer responds to pings, the userspace isn't doing anything, etc, just the CPU is maxed out.

    Also, as I've said, I've been running VMs on a dedi for years using virsh and never experienced anything like this before. It also randomly affects all of the VMs on this machine, with the exception of the pfSense VM, which has been rock solid. I've only ever seen this behaviour on this machine, where I decided to experiment with Proxmox, which is why I was asking if anyone has ever experienced this behaviour with Proxmox before. It's also using lvm-thin for storage, which I suspect is the culprit, but again, I don't know.

  • edited January 2023

    @ralf said: there's no swap space

    Hmm, always allocate at least a little, even a 256MB swapfile and set vm.swappiness=0
    (My advice: take it or leave it.)
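    For anyone following along, a minimal sketch of that advice, run as root inside the guest (the 256MB size is just the suggestion above):

    ```shell
    # Create and enable a 256MB swapfile:
    fallocate -l 256M /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    echo '/swapfile none swap sw 0 0' >> /etc/fstab

    # Only swap as a last resort:
    sysctl vm.swappiness=0
    echo 'vm.swappiness=0' > /etc/sysctl.d/99-swappiness.conf
    ```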

    Thanked by (1)Falzo


  • I like saying swappiness in a Sean Connery accent.

    Thanked by (1)jugganuts
  • @ralf said:
    The host is fine. It has 4 cores, and all the VMs have only 1 core assigned. When I say the CPU of the guest is 100%, I mean exactly that: in the Proxmox admin page, the summary page for that VM shows 100% of 1 CPU, and the host summary chart shows 25% of 4 CPUs used. Memory is not overcommitted, there's no swap space, and I've only allocated about 10GB of the 16GB installed.

    But perhaps, if you want to help, reading the description of what was actually going wrong would be useful. As I said, the guest VM is no longer doing anything useful - the kernel no longer responds to pings, the userspace isn't doing anything, etc, just the CPU is maxed out.

    Also, as I've said, I've been running VMs on a dedi for years using virsh and never experienced anything like this before. It also randomly affects all of the VMs on this machine, with the exception of the pfSense VM, which has been rock solid. I've only ever seen this behaviour on this machine, where I decided to experiment with Proxmox, which is why I was asking if anyone has ever experienced this behaviour with Proxmox before. It's also using lvm-thin for storage, which I suspect is the culprit, but again, I don't know.

    The short answer to your question is that I use proxmox on a variety of hardware with LVM thin and do not have this problem.

    Thanked by (1)ralf
  • if the crash is reproducible / happens more than once, why not just setup kdump?

    Thanked by (1)ralf

    Fuck this 24/7 internet spew of trivia and celebrity bullshit.

  • havoc OG
    edited January 2023

    Yep. Seen it too. Something with 7.2 I think

    But only on one of my proxmox machines. No idea why only the one, and no meaningful clues as to what is happening.

    The firewall-like appliance - also an STH-style N6005 - is fine; the Ryzen mini PC (4700U) has the issue.

    Thanked by (1)ralf
  • @ralf said: I'm using Proxmox VE 7.2-3
    @havoc said: Seen it too. Something with 7.2 I think

    I'm on 7.3-4, as of now.

    Thanked by (1)ralf


  • @Encoders said:
    if the crash is reproducible / happens more than once, why not just setup kdump?

    That's the thing, it doesn't seem like the guest kernel is crashing - it's just that it's occupying 100% CPU on the host for some reason.

    I'll try to find a convenient time to upgrade proxmox to the newer version.
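    In the meantime, one way to see what a wedged vCPU is actually doing is to poke it from the host side; a rough sketch, again assuming a hypothetical VMID of 101:

    ```shell
    # Find the QEMU/KVM process backing the VM:
    pgrep -a -f 'id 101'

    # Ask QEMU directly what state the vCPU is in:
    qm monitor 101
    # then at the qm> prompt:  info status
    #                          info registers

    # Sample where the guest is spinning, from the host (needs perf installed):
    perf kvm --guest top -p <pid>
    ```

    If `info registers` shows the instruction pointer stuck at the same address on repeated calls, that would point at a tight spin rather than a normal workload.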
