Heartbleed bug and the Flying Circus

tl;dr: The Flying Circus is not affected by the Heartbleed bug

As reported by several media outlets, there is a serious bug in the OpenSSL library, widely known as the Heartbleed bug. The bug was introduced into the OpenSSL development tree on January 1st, 2012 and first shipped with the OpenSSL 1.0.1 release.

The Flying Circus platform makes use of the Gentoo Linux distribution. The OpenSSL version maintained for the Flying Circus is 1.0.0j, so it is not affected by the Heartbleed bug.

To be certain that we are not affected by the bug, for example through possible backports by the Gentoo maintainers, we also audited the OpenSSL sources in use and verified that they do not implement the vulnerable heartbeat functionality.
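
Unrelated to our audit, readers who want a quick check of their own systems can see which OpenSSL build their Python runtime is linked against. This is only an illustrative sketch; it covers the library Python uses, not necessarily the one other services on your system are linked against:

    import ssl

    # Print the OpenSSL version the Python interpreter is linked against.
    # Heartbleed affects OpenSSL 1.0.1 through 1.0.1f; the 1.0.0 and 0.9.8
    # branches never contained the vulnerable heartbeat code.
    print(ssl.OPENSSL_VERSION)        # e.g. "OpenSSL 1.0.0j ..."
    print(ssl.OPENSSL_VERSION_INFO)   # (major, minor, fix, patch, status)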

The Flying Circus is not affected by Heartbleed, and at no point in the past did we roll out a vulnerable version.

There is no need to replace your certificates or keys.

Retiring this status blog - introducing our new status page

Recent incidents have shown that using this blog for status and maintenance announcements does not deliver information to our customers reliably and in time, and is also quite cumbersome for us.

In an effort to improve this part of our service, we're now announcing our official status page:
http://status.flyingcircus.io

This page gives you a quick overview of the current state of our hosting environment, lets you see past incidents, and shows some performance data. It also allows you to subscribe not only via RSS but also by email and SMS.

This is the last post in this blog - see you on the other side!

Performance problems during storage maintenance

Today we are moving some of the storage servers within our data center to balance power consumption, performance, and failure zones. We are also adding new routers and removing the (now unused) old servers.

Although none of this was expected to have user-visible impact, the migration did not go as smoothly as planned, mainly because hardware defects hit us in a critical phase.

As a result, some VMs are currently suffering from reduced storage performance. This also affects our mail server.

We are working hard to restore our storage cluster to its full I/O capacity by the end of the day. Spare parts have already arrived at the data center and we are putting them in service now.

Finally, once all of this is over, we will publish a more comprehensive review of the events and work out a plan for better communicating the ongoing low-level maintenance tasks we perform. We apologize for any inconvenience.

Short outage during storage server roll-out

Unfortunately, we had a short partial outage today at around 16:20 CET that lasted roughly 20 minutes.

We are currently rolling out new storage hardware: bigger and faster machines with SSDs, better CPUs, and a lot more RAM.

As we finished migrating to Ceph a while ago, these servers can be integrated into our live cluster without taking the system down.

Unfortunately, when integrating the first server, the whole system got stuck. We diagnosed the issue as a conflicting setting on switch ports of the storage cluster. We took the new server out of the cluster again, which got everything unstuck. After resolving the conflicting setting we were able to continue integrating the new servers. Over the next few days you should see reduced "iowait" on virtual machines with a lot of throughput.

Technical details

The conflict consisted of switch ports configured with both "jumbo frames" and "flow control", which caused partial packet loss on the storage backend network. Flow control is known to interfere with jumbo frames, but the ProCurve switches usually handle conflicting settings safely. Today we moved ports that were previously used on a different (non-jumbo-frame) VLAN to a jumbo frame VLAN. When we noticed that the cluster was stuck, we investigated on multiple levels and found that the switches emitted warnings about this setting. Disabling flow control removed the packet loss.

As we do not require flow control in our network at all, we disabled the setting on all ports and will make it part of our default configuration.
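
As a side note for readers running their own setups: a quick way to confirm that an interface actually has a jumbo-frame MTU is to read it from the kernel. The following minimal Python sketch is purely illustrative and not part of our tooling (Linux only; the interface name "eth0" is just an example):

    from pathlib import Path

    def interface_mtu(ifname):
        # On Linux, the kernel exposes the current MTU of every interface
        # under /sys/class/net/<interface>/mtu.
        return int(Path("/sys/class/net/{}/mtu".format(ifname)).read_text())

    # Standard Ethernet uses an MTU of 1500; jumbo frames are typically 9000.
    print(interface_mtu("eth0"))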

IPv4 and IPv6 addressing now fully enabled

As announced in the previous post, we are now enabling dual-stack address resolution in the server (SRV) network for all VMs.

The new scheme goes as follows (as already announced):
  • vm.gocept.net resolves to both IPv4 and IPv6 addresses,
  • vm.ipv4.gocept.net resolves only to IPv4 addresses,
  • vm.ipv6.gocept.net resolves only to IPv6 addresses.
Please note that private IPv4 addresses will not be visible outside our data center. In this case, our DNS servers will respond to queries from the public Internet with IPv6 addresses only.

However, the private IPv4 addresses can be used for connections inside our data center. When queried from the inside, our DNS servers will happily provide private IPv4 addresses as well. Thus, there is no need to use hard-coded IP addresses instead of proper DNS names in service configurations.
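
To see which addresses a VM name resolves to from your point in the network, a simple lookup is enough. Here is a minimal Python sketch; the name "myvm00.gocept.net" is just a placeholder for one of your own VMs:

    import socket

    # List every address family and address that DNS returns for a VM name.
    # Replace "myvm00.gocept.net" with the canonical name of one of your VMs.
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            "myvm00.gocept.net", None, proto=socket.IPPROTO_TCP):
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        print(label, sockaddr[0])

Run from inside our data center, this should show both the private IPv4 and the IPv6 address; run from the outside, only the IPv6 address will show up for VMs with private SRV addresses.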

IPv4 and IPv6 addressing in the backend (SRV) network

We are going to add IPv6 addresses to the canonical VM names in DNS. This will improve backend reachability and does not require configuration changes in most cases.

Our VMs have had both IPv4 and IPv6 addresses for a long time, but the canonical VM name (VM.gocept.net) used to resolve to the IPv4 address only. The IPv6 address used to be reachable only via VM.ipv6.gocept.net.

We have already started to introduce private IPv4 addresses from the 172.16.0.0/12 space in the SRV network. This makes it harder to reach the SRV interface from the outside.

Thus, during the next few days we will introduce a DNS change that associates both the IPv4 and the IPv6 address with a VM's canonical name. It will look like this:
  • VM.gocept.net resolves to both the IPv4 and IPv6 address,
  • VM.ipv4.gocept.net resolves only to the IPv4 address,
  • VM.ipv6.gocept.net resolves only to the IPv6 address.
We will activate the new addressing scheme for VMs in staging and test environments first. A few days later, we will activate it for all other VMs.

What does this mean for SSH logins and awstats reachability?

SSH logins and awstats usually go to the backend (SRV) interface. This means that you will be able to log in to your VM at VM.gocept.net via IPv6 as well. Logins to VMs with private IPv4 backend addresses must use IPv6 or go to the frontend network.

We are currently working on an alternative solution to keep awstats reachable on VMs with private SRV addresses for customers without IPv6 connectivity.

What does this mean for inter-service communication within resource groups?

In a typical scenario, the frontend web server (nginx) talks to application instances, which in turn talk to a database server. In the vast majority of cases, just keep using the canonical VM name to address a backend service (for example, use "proxy_pass http://VM.gocept.net:8080;" in nginx.conf). Usually the internal services will be reached regardless of the IP address family. If this does not work for some reason, state the IP address family explicitly (using VM.ipv6.gocept.net or VM.ipv4.gocept.net).
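
If you connect to backend services from your own code rather than through nginx, the standard library already does the right thing. A minimal Python sketch, with host name and port as placeholders for your own backend:

    import socket

    # create_connection() tries every address that DNS returns in order,
    # so it transparently uses IPv6 or IPv4, whichever works first.
    # "myvm00.gocept.net" and port 8080 are placeholders for your backend.
    with socket.create_connection(("myvm00.gocept.net", 8080), timeout=5) as conn:
        print("connected via", conn.getpeername()[0])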

New SSL certificate for mail.gocept.net

We updated the SSL certificate for our mail server today. SSL encryption is used to send and receive mail and to access the management frontend https://mail.gocept.net in a secure way. The fingerprint of the new certificate is:

SHA1 Fingerprint
06:7C:19:B3:CD:3C:08:0A:A5:E6:73:B2:55:9A:8F:BD:E4:E7:A9:84
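
If you want to compare the fingerprint yourself, you can fetch the certificate and compute its SHA1 hash. A minimal Python sketch; it checks the HTTPS port of the management frontend, so adjust the port if you want to check SMTP or IMAP instead:

    import hashlib
    import ssl

    # Fetch the certificate presented on the HTTPS port of the mail server
    # and print its SHA1 fingerprint in the usual colon-separated notation.
    pem = ssl.get_server_certificate(("mail.gocept.net", 443))
    der = ssl.PEM_cert_to_DER_cert(pem)
    digest = hashlib.sha1(der).hexdigest().upper()
    print(":".join(digest[i:i + 2] for i in range(0, len(digest), 2)))

The output should match the fingerprint published above.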