Deja vu? No. We posted an announcement earlier saying the problem was resolved, when in fact it was just lying dormant. The actual problem has now been found and remedied. The high CPU loads were caused by a VPS using a massive amount of disk IO that none of our monitoring software could detect, which made it extremely hard to troubleshoot. That VPS has been moved to another server by itself so it can continue to process a huge number of MySQL queries without impacting anybody else.

Yesterday the CPU usage was spiking to 1000% due to massive disk IO, which was not being displayed by any monitoring software we had installed; today it hit 2000% while I was sleeping. Until we found the VPS causing this, our only option was to reboot the whole node, which took 20-30 minutes due to the massive CPU queue. We tried shutting down VPSs one at a time to find the culprit by process of elimination, but each shutdown was taking close to 30 minutes, so we had to abandon that approach rather quickly.

We want to apologize to all of our users and assure you that we take such issues very seriously. Both of us have spent many hours over the past 2 days working specifically on this issue and have made it a priority. We have determined a better method of finding such disk IO usage in the future, and while it's not pretty, it will have to do until OpenVZ finds a way to limit or monitor IO usage properly. In response to this problem we have configured a new, identical server that will remain empty and be used specifically when issues like this occur, so that we can selectively migrate a small batch of VPSs at a time to determine the VPS(s) responsible while hopefully lessening the impact on other clients.
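
As a rough illustration of the batch-migration idea, here is a minimal Python sketch. It assumes a single VPS is responsible. The callback `load_drops_without` is hypothetical (it is not an OpenVZ command or an actual tool) and stands in for "migrate these VPSs to the spare server and check whether the node recovers". The VPS IDs and batch size are made up for the example.

```python
from typing import Callable, List, Optional, Sequence


def find_culprit(
    vps_ids: Sequence[int],
    load_drops_without: Callable[[List[int]], bool],
    batch_size: int = 4,
) -> Optional[int]:
    """Locate the one VPS whose removal from the node makes the load drop.

    Assumption: exactly one VPS is responsible. load_drops_without(batch)
    is a hypothetical check that moves `batch` off the node and reports
    whether the node's load returned to normal.
    """
    for start in range(0, len(vps_ids), batch_size):
        batch = list(vps_ids[start:start + batch_size])
        if not load_drops_without(batch):
            continue  # node still overloaded, so the culprit is elsewhere
        # The culprit is in this batch: halve it until one VPS remains.
        while len(batch) > 1:
            half = batch[: len(batch) // 2]
            batch = half if load_drops_without(half) else batch[len(half):]
        return batch[0]
    return None  # no batch made a difference


# Simulated usage: pretend VPS 107 is the heavy one.
simulated_culprit = 107
found = find_culprit(
    list(range(101, 121)),
    lambda batch: simulated_culprit in batch,
)
```

Working in small batches means only a handful of clients are moved at any one time, at the cost of a few more migration rounds than splitting the whole node in half.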

If you would like to discuss this issue or our future response strategy in more detail, we welcome you to open a support ticket or contact me directly on Google Talk or by e-mail at JOE(@)SECUREDRAGON.NET.



Sunday, July 17, 2011




