Short outage during storage server roll-out

Unfortunately, we had a short partial outage today around 16:20 CET, lasting roughly 20 minutes.

We are currently rolling out new storage hardware: bigger and faster machines with SSDs, better CPUs, and a lot more RAM.

Since we finished migrating to Ceph a while ago, these servers can be integrated into our live cluster without taking the system down.
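
For the curious: from the Ceph side, integrating a new storage node roughly looks like the following. This is only a sketch; the exact commands depend on the Ceph version in use.

    # Check overall cluster health before touching anything
    $ ceph -s

    # Once the OSD daemons on the new server are running, they appear in the CRUSH map
    $ ceph osd tree

    # Ceph rebalances data onto the new disks in the background; watch the progress live
    $ ceph -w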

Unfortunately, when integrating the first server, the whole system got stuck. We diagnosed the issue as a conflicting setting on the switch ports of the storage cluster. Taking the new server out of the cluster again got everything unstuck. We then fixed the setting and were able to continue integrating the new servers. Over the next few days you should see reduced "iowait" on virtual machines with a lot of I/O throughput.
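
If you want to check the effect on your own virtual machine, "iowait" is reported by standard tools such as top, vmstat, or iostat (from the sysstat package), for example:

    # Extended statistics every 5 seconds; %iowait is the share of time the CPU
    # sat idle while waiting for outstanding disk I/O
    $ iostat -x 5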

Technical details

The conflict consisted of switch ports configured with both "jumbo frames" and "flow control", which caused partial packet loss on the storage backend network. Flow control is known to interfere with jumbo frames, but ProCurve switches usually guard against conflicting settings. Today we moved ports that were previously used on a different (non-jumbo-frame) VLAN to a jumbo frame VLAN. When we noticed that the cluster was stuck, we investigated on multiple levels and found the switches emitting warnings about this setting. Disabling it removed the packet loss.
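
As a side note, a quick way to check whether jumbo frames actually make it across a network path end to end is a ping with the "don't fragment" option set and a payload just below the 9000-byte MTU (the host and interface names below are only placeholders):

    # 8972 bytes payload + 8 bytes ICMP header + 20 bytes IP header = 9000 bytes
    $ ping -M do -s 8972 -c 10 storage-node-01

    # Double-check the MTU configured on the sending interface
    $ ip link show eth0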

As we do not require flow control in our network at all, we have disabled this setting on all ports and will make it part of our default switch configuration.
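
For reference, on the Linux side of such links the Ethernet pause (flow control) settings can be inspected and disabled with ethtool; the interface name here is a placeholder, and the switch-side commands are omitted because their exact syntax depends on the ProCurve firmware:

    # Show the current pause parameters of the interface
    $ ethtool -a eth0

    # Disable pause frame negotiation as well as rx/tx pause frames
    $ ethtool -A eth0 autoneg off rx off tx off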