gocept.net [en]

Slow response times last night

During the night, between 2013-11-26 21:00 and 2013-11-27 07:00, services hosted with us temporarily experienced increased response times.

According to our data center operator, this was caused by an issue at DECIX (Frankfurt), where maintenance operations led to temporary outages and a reduced number of peering sessions.

This caused packet loss and slowed down various processes involved in delivering your services, such as DNS lookups and downloads.

We apologize for any inconvenience. Unfortunately, there is nothing we can do directly about the availability of major internet exchanges. However, the people at DECIX are aware of their importance, and we are sure they will respond to this incident accordingly.

[Solved] Some VMs affected by network issues [update 5]

After the kernel upgrade applied yesterday to our KVM servers, we are experiencing compatibility issues on some VMs in our cluster, causing them to lose network connectivity.

All VMs that have been affected by the issue have been configured with a watchdog process that reboots them cleanly and immediately the next time the issue occurs. This should keep service downtimes to a minimum while we are working on a long-term fix.
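To illustrate the idea, here is a minimal sketch of how such a watchdog could work. It is not the watchdog we deployed: the probe target (a gateway address), the failure limit, the probe interval and all function names are assumptions made for this example. It pings the gateway at regular intervals and triggers a clean reboot once several probes in a row have failed.

```python
import subprocess
import time


def gateway_reachable(gateway: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to the gateway succeeds."""
    # Assumes the Linux iputils `ping` (-c count, -W timeout in seconds).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), gateway],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


class FailureCounter:
    """Counts consecutive failed probes and reports when a limit is reached."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True once the limit is reached."""
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.limit


def watch(gateway: str, limit: int = 3, interval_s: float = 10.0) -> None:
    """Probe the gateway and reboot after `limit` failed probes in a row."""
    counter = FailureCounter(limit)
    while True:
        if counter.record(gateway_reachable(gateway)):
            # Clean reboot through the regular shutdown path; needs root.
            subprocess.run(["shutdown", "-r", "now"], check=False)
            return
        time.sleep(interval_s)
```

Requiring several consecutive failures keeps a single lost packet from rebooting a healthy VM; a successful probe resets the counter.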

We will likely have to provide a newer kernel for virtual machines within the next few days. We will review this tomorrow and hope to get back to a completely stable environment quickly.

Unfortunately, we did not discover this incompatibility in our extensive staging and test environments, and we have not yet found a way to trigger it reliably.

We apologize for any inconvenience and kindly ask for your patience while we are getting everything back to normal.

Update 1 (2013-10-07 09:19)

We have started to provision an updated virtual machine kernel in our development environment. Our goal is to have the kernel available for selected virtual machines by noon to gather data about its stability on the VMs affected so far. Based on that data, we will decide how to proceed later today.

Update 2 (2013-10-07 12:02)

We have successfully prepared an updated kernel (3.10) for virtual machines in our development environment. We have not found any adverse effects regarding application compatibility and are now rolling it out in our staging environment and on selected production VMs that run internal data center services. If those prove stable, we will roll out updates to all previously affected VMs within the next 2 hours.

Update 3 (2013-10-07 16:00)

We deployed the new kernel to selected machines but have not achieved the desired result. However, we have also learned that a specific checksum offloading feature tends to break under certain conditions. We were able to reproduce the problem and appear to have worked around it by disabling this specific offloading feature. We are currently monitoring a few affected VMs for stability and will continue rolling out the prepared kernel and the offloading change later this evening.
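For readers who want to try a similar workaround on their own Linux systems, the sketch below builds and runs an `ethtool -K` command that switches offloading features off on a network interface. This update does not name the exact feature involved, so the default used here (`tx`, transmit checksumming) and the function names are assumptions for illustration only.

```python
from __future__ import annotations

import subprocess
from typing import Sequence


def offload_command(device: str, features: Sequence[str] = ("tx",)) -> list[str]:
    """Build an `ethtool -K` command that switches the given features off."""
    if not device or not features:
        raise ValueError("device and at least one feature are required")
    cmd = ["ethtool", "-K", device]
    for feature in features:
        # ethtool expects "<feature> on|off" pairs after the device name.
        cmd += [feature, "off"]
    return cmd


def disable_offload(device: str, features: Sequence[str] = ("tx",)) -> None:
    """Run the command; needs root and the ethtool binary on the system."""
    subprocess.run(offload_command(device, features), check=True)
```

For example, `offload_command("eth0", ("tx", "rx"))` yields the equivalent of `ethtool -K eth0 tx off rx off`. A setting made this way does not survive a reboot, so it would also have to be added to the system's network configuration.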

Update 4 (2013-10-08 01:48)

The combination of a new kernel and disabling the troublesome offloading feature appears to have fixed the problem (or at least resulted in a reliable workaround). We have not seen a VM network crash for about 7 hours now, and all virtual machines have been updated.

We are currently in contact with upstream developers to analyze the root cause of this problem. Hopefully this will result in an upstream open source bugfix that will also become available to other parties.

Update 5 (2013-10-08 15:40)

Since yesterday evening, we have continued to monitor our infrastructure closely and have not seen any further VM network crashes. We therefore consider the problem solved.

In the meantime we have started debugging the issue together with the upstream developers in an isolated environment. This will hopefully help to find the root cause and support the upstream developers in providing a fix.

E-mail server maintenance on Sun 2013-10-06, 21:00-22:00 CEST

On Sunday, 2013-10-06, we are going to increase the disk space of our mail system. Because of that, mail delivery will be delayed between 21:00 and 22:00 CEST. Sending email and accessing mailboxes may not be possible during this time.

Reboot of all machines on Sat 2013-10-05 21:00-23:30 CEST

We are updating to a new version of the Linux kernel on VM hosts and need to reboot all hosts. The reboot is scheduled for

Saturday, 2013-10-05 between 21:00 and 23:30 CEST (19:00-21:30 UTC).

During this period, expect one or several short service interruptions. We try to minimize outages, but when the hosts reboot, the virtual machines must be restarted as well.

We apologize for any inconvenience.

Storage server outage - few VMs affected [Update 1]

Today around 19:30 CEST one of our storage servers encountered a massive hard drive failure: 3 out of 7 disks stopped working at once and crashed the system entirely.

However, only a small number of VMs are affected by this, and none of our central services are affected: if you have not experienced any issues by now, you are not affected.

We are currently restoring VMs that were located on this storage server (some have been restored already). You should see your VMs and services come back within the next few hours.

We will provide a follow-up report later.

[Update 1: 2013-04-18 01:04 CEST]

We finished restoring all business-critical machines, and all customer services are back online. Our disaster recovery plan estimates about 24 hours for restoring a completely failed storage server. Today it took about 5.5 hours to perform the necessary analysis, write some scripts, answer customer inquiries, and get services running again.

2013-03-05 00:00 Urgent network maintenance

Following up on the recent IPv6 outage, our provider has received updates for their Cisco equipment. These will be deployed tonight, 2013-03-05, between 00:00 and 06:00 CET.

Due to bugs in the existing firmware, a "hot update" is not possible, and the provider expects multiple outages of about 10 minutes each in connectivity to the data center.

We do not expect any network issues within our cluster, but connections to and from the internet will be unavailable during these outages.

IPv6 connectivity outage [2013-02-26 10:17 CET, update 2, solved]

We are currently experiencing an unplanned outage of our IPv6 connectivity in the data center.

We are working with our uplink providers to restore connectivity.

IPv4 connectivity is not affected. However, services that rely on external services offering dual-stack networking (IPv4+IPv6) may experience timeouts and delays.
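If one of your applications is affected, one possible mitigation on the client side is to try IPv4 addresses first and to use a short connection timeout, so that an unreachable IPv6 address does not stall every request. The following is only a sketch of that idea; the function names and the timeout value are assumptions for illustration and not part of our platform.

```python
import socket


def ipv4_first(addrinfos):
    """Return getaddrinfo() results with IPv4 entries before all others."""
    # sorted() is stable, so the resolver's order is kept within each family.
    return sorted(addrinfos, key=lambda info: info[0] != socket.AF_INET)


def connect(host: str, port: int, timeout_s: float = 3.0) -> socket.socket:
    """Try each resolved address in turn, IPv4 first, with a short timeout."""
    last_error = None
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in ipv4_first(infos):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout_s)
        try:
            sock.connect(sockaddr)
            return sock
        except OSError as exc:
            # Remember the error and fall through to the next address.
            last_error = exc
            sock.close()
    raise last_error or OSError(f"no addresses found for {host}")
```

Once IPv6 connectivity is back, the reordering is harmless: IPv6 addresses are still tried whenever no IPv4 address accepts the connection.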

We will post updates here as we work towards a solution.

We are sorry for any inconvenience.

[Update 1 - 11:52 CET]

The cause seems to have been in the data center's upstream router infrastructure. We are seeing improvements, and IPv6 traffic is picking up again. We know of a couple of remaining edge cases and continue working to restore full connectivity.

[Update 2 - 14:51 CET, solved]

IPv6 connectivity has been restored. Some nodes are still recovering from the outage but we are seeing continuous improvement in our monitoring.

The root cause was traced to a Cisco IOS bug involving more than 10,000 IPv6 routing entries, and our upstream provider has implemented a workaround for the problem. We expect network maintenance some time in the coming weeks to install a fixed IOS version on the network operator's equipment.