Recent incidents have shown that using this blog for status and maintenance announcements does not deliver reliable information to our customers in time and is also quite cumbersome for us.
In an effort to improve this part of our service, we're now announcing our official status page:
http://status.flyingcircus.io
This page gives you a quick overview of the current state of our hosting environment and lets you review past incidents and some performance data. It also allows you to subscribe not only via RSS but also by email and SMS.
This is the last post on this blog - see you on the other side!
Performance problems during storage maintenance
Today we are moving some of the storage servers within our data center to balance power consumption, performance, and failure zones. We are also adding new routers and removing the (now unused) old servers.
Although none of this was expected to have user-visible impact, the migration did not go as smoothly as planned. The main reason is that hardware defects hit us in a critical phase.
As a result, some VMs are currently suffering from reduced storage performance. This also affects our mail server.
We are working hard to restore our storage cluster to its full I/O capacity by the end of the day. Spare parts have already arrived at the data center and we are putting them in service now.
Finally, once all of this is over, we will publish a more comprehensive review of the events and make a plan for how to better communicate the ongoing low-level maintenance tasks we perform. We apologize for any inconvenience.
Short outage during storage server roll-out
Unfortunately, we had a short partial outage today at around 16:20 CET that lasted roughly 20 minutes.
We are currently rolling out new storage hardware: bigger and faster machines with SSDs, better CPUs, and a lot more RAM.
As we finished migrating to Ceph a while ago, these servers can be integrated into our live cluster without taking the system down.
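For those curious what such a rolling integration looks like in practice: it mostly comes down to adding the new storage nodes and watching cluster health while data rebalances. The following minimal sketch (Python, assuming the standard ceph CLI is available; the polling intervals are placeholders, not our actual tooling) shows the kind of health check we mean.

# Hypothetical sketch: poll Ceph cluster health while a new storage node is
# being integrated, and report when the cluster has settled again.
import subprocess
import time

def ceph_health() -> str:
    """Return the current cluster health summary, e.g. 'HEALTH_OK'."""
    result = subprocess.run(["ceph", "health"], capture_output=True, text=True)
    return result.stdout.strip()

def wait_for_health_ok(poll_seconds: int = 30, timeout_seconds: int = 3600) -> bool:
    """Poll until the cluster reports HEALTH_OK or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = ceph_health()
        print(status)
        if status.startswith("HEALTH_OK"):
            return True
        time.sleep(poll_seconds)  # backfilling/recovery can take a while
    return False

if __name__ == "__main__":
    wait_for_health_ok()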
Unfortunately, when integrating the first server, the whole system got stuck. We diagnosed the issue as a conflicting setting on the switch ports of the storage cluster. We took the new server out of the cluster again, which got everything unstuck. After we had corrected the setting, we were able to continue integrating the new servers. Over the next few days you should see reduced "iowait" on virtual machines with a lot of throughput.
Technical details
The conflict consisted of switch ports configured with both "jumbo frames" and "flow control", which caused partial packet loss on the storage backend network. Flow control is known to interfere with jumbo frames, but the ProCurve switches usually handle conflicting settings safely. Today we moved ports that were previously used on a different (non-jumbo-frame) VLAN to a jumbo frame VLAN. When we noticed that the cluster was stuck, we investigated on multiple levels and found the switches emitting warnings about this setting. Disabling flow control removed the packet loss.
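To give a flavour of how such a problem shows up, here is a minimal sketch of the kind of check that verifies jumbo frames actually pass a path without loss. It uses the standard Linux iputils ping with "don't fragment" set; the host name and packet count are placeholders rather than our actual configuration, and 8972 bytes of ICMP payload corresponds to a 9000-byte MTU minus 28 bytes of IP/ICMP headers.

# Hypothetical example: probe a storage backend host with maximum-size,
# non-fragmentable ICMP packets and report the packet loss rate.
import subprocess

def jumbo_loss(host: str, count: int = 50) -> float:
    """Return the packet loss percentage for jumbo-frame-sized probes."""
    result = subprocess.run(
        ["ping", "-M", "do",   # "don't fragment": the probe fails if jumbo frames are dropped
         "-s", "8972",         # 9000-byte MTU minus 28 bytes of IP/ICMP headers
         "-c", str(count), "-q", host],
        capture_output=True, text=True,
    )
    for line in result.stdout.splitlines():
        if "packet loss" in line:
            # e.g. "50 packets transmitted, 50 received, 0% packet loss, time 49100ms"
            return float(line.split("%")[0].split()[-1])
    return 100.0  # no summary line: treat the host as unreachable

if __name__ == "__main__":
    print(jumbo_loss("storage-backend.example"))  # placeholder host name

With flow control interfering on the affected ports, a check like this reports partial loss for large packets even though small pings still go through.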
As we do not require flow control in our network at all, we have disabled this setting on all ports and will make it part of our default configuration.