gocept.net [en]: October 2013

After the kernel upgrade applied yesterday to our KVM servers we are experiencing compatibility issues on some VMs in our cluster causing them to loose network connectivity.

All VMs that have been affected by the issue have been configured with a watchdog process to reboot them cleanly and immediately the next time the issue occurs - this should keep service downtimes to a minimum while we're working on a long term fix.

Likely, we will have to provide a newer kernel for virtual machines within the next days. We will review this tomorrow and hopefully get back to a completely stable environment quickly.

Unfortunately we did not discover this incompatibility in our extensive staging and test environments and we have not found a pattern to trigger this reliably.

We apologize for any inconvenience and kindly ask for your patience while we we're getting everything back to normal.

Update 1 (2013-10-07 09:19)

We have started to provision an updated virtual machine kernel in our development environment. Our goal is to have the kernel available for selected virtual machines until noon to gather data about its stability on the VMs affected until now. Based on that data we will make a decision how to proceed later today.

Update 2 (2013-10-07 12:02)

We have successfully prepared an updated kernel (3.10) for virtual machines in our development environment. We have not found any adverse affects regarding application compatibility and are now rolling it out in our staging environment and in selected production VMs that run data-center internal services. If those prove stable we will roll out updates to all previously affected VMs in the next 2 hours.

Update 3 (2013-10-07 16:00)

We deployed the new kernel to selected machines but have not achieved the desired result. However, additionally we acquired information that a specific checksum offloading feature tends to break under certain conditions. We were able to reproduce the problem and also appear to have worked around it by disabling this specific offloading feature. We are currently monitoring a few affected VMs for stability and will continue rolling out the prepared kernel and offload disabling later this evening.

Update 4 (2013-10-08 01:48)

The combination of a new kernel and disabling the troublesome offloading feature appears to have fixed the problem (or at least resulted in a reliable workaround). We have not seen a VM network crash since about 7 hours now and all virtual machines have been updated.

We are currently in contact with upstream developers to make a root cause analysis how this problem came into existence. Hopefully this will result in an upstream open source bugfix that will also become available to other parties.

Update 5 (2013-10-08 15:40)

Since yesterday evening, we have been continuing monitoring our infrastructure closely and did not see any VM network crash again. We would therefore consider the problem as solved.

In the meantime we have started debugging the issue together with the upstream developers in an isolated environment. This will hopefully help to find the root cause and support the upstream developers in providing a fix.

[Solved] Some VMs affected by network issues [update 5]

E-mail server maintenance on Sun 2013-10-06, 21:00-22:00 CEST

Reboot of all machines on Sat 2013-10-05 21:00-23:30 CEST