Here is a copy of the e-mail sent out to all clients impacted by the outages:

This morning we experienced instability on both of our KVM nodes, and one or more of your VPSs were impacted by this.

The Timeline of Events

At roughly 9AM EST we received alerts that some KVM VPSs on both nodes were offline, and shortly after we started receiving tickets from clients whose KVM VPSs were also unreachable. Upon investigating, we found that our Tampa KVM node was under a load of over 1000% (load average: 158.91, 150.52, 127.89) and was barely responding to commands. A kernel error kept scrolling in our SSH session, so we decided to reboot the server in hopes of clearing the kernel error and the high load, since the CPU cores were barely being used. After waiting 20 minutes for the node to come back up, we contacted our data center to forcefully power cycle the server. Our Denver KVM node had a load of over 300%, but it was more responsive, so we were able to reboot it and restore connectivity shortly after logging in.
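For those curious, the load figures above are the standard Linux load averages. A minimal sketch of the kind of check that can raise this sort of alert is below; the threshold is illustrative and is not our actual monitoring configuration:

    #!/usr/bin/env python3
    # Minimal sketch: read the 1-, 5-, and 15-minute load averages from /proc/loadavg
    # and flag a node whose load far exceeds its core count. Threshold is illustrative.
    import os

    LOAD_ALERT_MULTIPLIER = 4  # assumed threshold: alert when load is 4x the core count

    def check_load():
        with open("/proc/loadavg") as f:
            one_min, five_min, fifteen_min = (float(x) for x in f.read().split()[:3])
        cores = os.cpu_count() or 1
        if one_min > cores * LOAD_ALERT_MULTIPLIER:
            print(f"ALERT: 1-minute load {one_min:.2f} on {cores} cores "
                  f"(5m {five_min:.2f}, 15m {fifteen_min:.2f})")
        else:
            print(f"OK: load {one_min:.2f}/{five_min:.2f}/{fifteen_min:.2f} on {cores} cores")

    if __name__ == "__main__":
        check_load()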


When it became apparent that the Tampa KVM node was not going to come back online, we went to the data center with a spare server, expecting to have to rebuild the node from scratch and restore from our latest backups. Luckily, once we got there we found that the server was failing to boot because the 5th hard drive we use for backups was preventing the fsck from completing; removing that drive allowed the server to boot into the OS. Once it was booted, however, the server became unresponsive again while the same kernel error kept scrolling on the crash cart. Attempts to SSH into the node timed out, so we were unable to get into the OS to check logs or troubleshoot.

After power cycling the node for the 5th or 6th time, we noticed that the first kernel error message was different from the rest and referred to kernel memory, which made us think of the KernelCare software applied to our servers, which allows us to receive security updates to our kernel without needing to reboot. The latest KernelCare patch had been applied to the nodes just before 9AM EST this morning, so now we had something to work with. The kernel error kept showing up randomly on the Denver KVM node, so we decided to remove KernelCare from that node, and sure enough the error stopped appearing, the server load dropped back to normal, and the VPSs were accessible again. It took a few reboots, but we were finally able to remove KernelCare from our Tampa KVM node as well, and once we did the server booted up correctly without any errors and all of the VPSs came back online around 2PM EST.
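For readers running KernelCare themselves, the removal step essentially comes down to unloading the live patches with the kcarectl utility. The sketch below is illustrative only and is not the literal procedure we ran; the exact flags should be confirmed against KernelCare's documentation:

    #!/usr/bin/env python3
    # Illustrative sketch: report the KernelCare patch state, then unload the live
    # patches. Requires root; verify kcarectl flags against the KernelCare docs.
    import subprocess

    def run(cmd):
        print(f"$ {' '.join(cmd)}")
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout.strip() or result.stderr.strip())
        return result.returncode

    if __name__ == "__main__":
        run(["kcarectl", "--info"])    # show kernel/patch status before unloading
        run(["kcarectl", "--unload"])  # unload the currently applied live patches
        run(["kcarectl", "--info"])    # confirm the node is back on the unpatched kernel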

Shortly after we posted this information on our Twitter, another person mentioned to us that they had the same experience this morning and pointed out that KernelCare pulled its latest patch around 2PM EST. We have since received notice that KernelCare is aware of this issue and is working to fix it.

After-Action Items

One thing that prevented us from addressing the issue much sooner was the lack of remote management access for our Tampa KVM node (our management network was not correctly configured). This was resolved while we were on site in the data center, so we can now properly work on the node remotely.


KernelCare is releasing a fix for this, so we will be re-installing it on our KVM nodes in the future; our OpenVZ nodes were unaffected by this. We are also discussing internally whether to disable automatic updates for kernel patches. Since this is the first issue we've ever had with KernelCare, we are debating whether the security benefits of automatic patching outweigh the chance of issues like this going forward.
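If we do decide to disable automatic updates, the relevant switch lives in KernelCare's configuration file. The sketch below simply reports the current setting; it is based on our understanding that the file is /etc/sysconfig/kcare/kcare.conf and the key is AUTO_UPDATE, both of which should be verified against the vendor's documentation:

    #!/usr/bin/env python3
    # Hedged sketch: report whether KernelCare automatic updates are enabled on this node.
    # The config path and AUTO_UPDATE key reflect our understanding of KernelCare defaults;
    # confirm against the vendor documentation before relying on this.
    from pathlib import Path

    KCARE_CONF = Path("/etc/sysconfig/kcare/kcare.conf")  # assumed default location

    def auto_update_enabled(conf_path=KCARE_CONF):
        if not conf_path.exists():
            return None  # KernelCare not installed (or config kept elsewhere)
        for line in conf_path.read_text().splitlines():
            line = line.strip()
            if line.startswith("AUTO_UPDATE"):
                _, _, value = line.partition("=")
                return value.strip().lower() in ("true", "yes", "1")
        return True  # assumption: enabled by default when the key is absent

    if __name__ == "__main__":
        state = auto_update_enabled()
        print({None: "KernelCare config not found",
               True: "AUTO_UPDATE is enabled",
               False: "AUTO_UPDATE is disabled"}[state])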

We have removed the extra hard drive from our KVM VPS nodes and will be performing backups to remote servers instead, to prevent a repeat of the fsck issue we experienced today.
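As a rough illustration of the new backup flow (the hostname and paths below are placeholders, and rsync over SSH is just one of several ways to do this), pushing archives to a remote server rather than a local drive could look like:

    #!/usr/bin/env python3
    # Illustrative sketch: push local backup archives to a remote backup server over SSH
    # instead of writing them to an extra local drive. Host and paths are placeholders.
    import subprocess
    import sys

    LOCAL_BACKUP_DIR = "/var/backups/kvm/"                             # placeholder staging dir
    REMOTE_TARGET = "backups@backup01.example.com:/srv/kvm-backups/"   # placeholder remote

    def push_backups():
        cmd = [
            "rsync", "-az", "--partial", "--delete",
            "-e", "ssh",
            LOCAL_BACKUP_DIR, REMOTE_TARGET,
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        sys.exit(push_backups())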

Client Actions

At this time we ask that all clients log in to their KVM VPSs and validate that they are working as expected. We had a few clients report that their VPSs were stuck in a boot loop or were not booting into the OS at all. If you need assistance, please open a ticket with our Support department including your login credentials for your VPS and we will do our best to assist (we also have backups from the 16th available to restore from if needed). For clients who are experiencing issues and would like to fix them themselves, we do provide a helpful ISO called System Rescue CD that can be used to fix many issues (it also offers a GUI to make some things easier).


Conclusion

We are deeply sorry to those clients impacted by this issue today and will be providing a 20% SLA credit for all active KVM VPS services. Feel free to contact us if you have any questions regarding this or any other issues you might have.



Friday, May 20, 2016