IPv4 and IPv6 addressing in the backend (SRV) network

We are going to add IPv6 addresses to the canonical VM names in DNS. This will improve backend reachability and does not require configuration changes in most cases.

Our VMs have had both IPv4 and IPv6 addresses for a long time, but until now the canonical VM name (VM.gocept.net) resolved to the IPv4 address only. The IPv6 address was reachable only via VM.ipv6.gocept.net.

We have already started to introduce private IPv4 addresses from the 172.16.0.0/12 space in the SRV network. Because private addresses are not routed on the public Internet, this makes it harder to reach the SRV interface from the outside.
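To find out whether a given backend address falls into this private range, a check along the following lines may help. It is a minimal sketch using Python's standard ipaddress module; the function name is a hypothetical helper, not part of any platform tooling.

```python
import ipaddress

# The private range named in the announcement.
SRV_PRIVATE_NET = ipaddress.ip_network("172.16.0.0/12")


def is_private_srv_address(address):
    """Return True if `address` is an IPv4 address inside 172.16.0.0/12."""
    ip = ipaddress.ip_address(address)
    # IPv6 addresses can never be part of an IPv4 network.
    return ip.version == 4 and ip in SRV_PRIVATE_NET
```

The /12 prefix covers 172.16.0.0 through 172.31.255.255, so an address such as 172.32.0.1 is already outside the range.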

Thus, over the next few days we will introduce a DNS change that publishes both the IPv4 and the IPv6 address under a VM's canonical name. It will look like this:
  • VM.gocept.net resolves to both the IPv4 and IPv6 address
  • VM.ipv4.gocept.net resolves only to the IPv4 address
  • VM.ipv6.gocept.net resolves only to the IPv6 address
We will activate the new addressing scheme for VMs in staging and test environments first. A few days later, we will activate it for all other VMs.
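To see which address families a name resolves to once the scheme is active, something like the following sketch could be used. It relies only on Python's standard library; the function names are hypothetical, and VM is a placeholder for the actual VM name.

```python
import ipaddress
import socket


def group_by_family(addresses):
    """Sort textual IP addresses into IPv4 and IPv6 lists, dropping duplicates."""
    grouped = {"ipv4": [], "ipv6": []}
    for address in addresses:
        # Link-local IPv6 results may carry a "%scope" suffix; strip it for parsing.
        bare = address.split("%", 1)[0]
        key = "ipv4" if ipaddress.ip_address(bare).version == 4 else "ipv6"
        if address not in grouped[key]:
            grouped[key].append(address)
    return grouped


def resolve_families(hostname):
    """Resolve `hostname` and report its addresses per address family."""
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return group_by_family(info[4][0] for info in infos)
```

With the new scheme, resolve_families("VM.gocept.net") would be expected to list entries under both keys, while the .ipv4. and .ipv6. names should each fill only one of them.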

What does this mean for SSH logins and awstats reachability?

SSH logins and awstats usually go to the backend (SRV) interface. This means that you will be able to log in to your VM at VM.gocept.net via IPv6 as well. Logins to VMs with private IPv4 backend addresses must use IPv6 or go to the frontend network.

We are currently working on an alternative solution to keep awstats reachable on VMs with private SRV addresses for customers without IPv6 connectivity.

What does this mean for inter-service communication within resource groups?

In a typical scenario, the frontend web server (nginx) talks to application instances, which in turn talk to a database server. In the vast majority of cases, just keep using the canonical VM name to address a backend service (for example, use "proxy_pass http://VM.gocept.net:8080;" in nginx.conf). Usually the internal services will be reached regardless of the IP address family. If this does not work for some reason, state the IP address family explicitly (using VM.ipv6.gocept.net or VM.ipv4.gocept.net).
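For application code that opens its own connections, the same advice can be sketched as follows: try the canonical name first and fall back to the family-specific names only if that fails. This is an illustration under assumptions, not a prescribed client. The helper names, the fallback order (IPv6 before IPv4) and the timeout are hypothetical choices.

```python
import socket


def candidate_names(canonical_name):
    """Return the canonical name followed by its family-specific variants."""
    vm, domain = canonical_name.split(".", 1)
    return [canonical_name, f"{vm}.ipv6.{domain}", f"{vm}.ipv4.{domain}"]


def connect_with_fallback(canonical_name, port,
                          connect=socket.create_connection, timeout=5.0):
    """Connect via the canonical name, then via the explicit family names."""
    last_error = None
    for name in candidate_names(canonical_name):
        try:
            # create_connection itself tries every address the name resolves to.
            return connect((name, port), timeout)
        except OSError as error:  # covers DNS failures and refused connections
            last_error = error
    raise last_error
```

Since socket.create_connection already walks through all addresses returned for a name, the first attempt normally succeeds regardless of address family, and the explicit names act only as a safety net.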

New SSL certificate for mail.gocept.net

We updated the SSL certificate for our mail server today. SSL encryption is used to send and receive mail and to access the management frontend https://mail.gocept.net in a secure way. The fingerprint of the new certificate is:

SHA1 Fingerprint
06:7C:19:B3:CD:3C:08:0A:A5:E6:73:B2:55:9A:8F:BD:E4:E7:A9:84
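A fingerprint in this notation can be compared against what a server actually presents with a few lines of Python. The sketch below uses only the standard library; the function names are hypothetical, and the comparison is meaningful only for as long as the certificate announced here is the one being served.

```python
import hashlib
import ssl

# Fingerprint as published in this announcement.
EXPECTED_SHA1 = "06:7C:19:B3:CD:3C:08:0A:A5:E6:73:B2:55:9A:8F:BD:E4:E7:A9:84"


def sha1_fingerprint(der_bytes):
    """Format the SHA1 digest of a DER-encoded certificate as colon-separated hex."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def server_fingerprint(host, port=443):
    """Fetch the certificate presented by host:port and return its fingerprint."""
    pem = ssl.get_server_certificate((host, port))
    return sha1_fingerprint(ssl.PEM_cert_to_DER_cert(pem))
```

Comparing server_fingerprint("mail.gocept.net") with EXPECTED_SHA1 then shows whether the presented certificate is the announced one.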

Happy holidays! (and how we handle support)

The Christmas holidays are around the corner, as is 2014. We thank all of our customers for their trust and wish everyone a restful Christmas.

Our offices will be closed between 2013-12-23 and 2014-01-06.

Our support team will be watching out for major incidents, and we will respond to your mails at support@gocept.com on regular German working days. Feature requests and non-urgent issues will be delayed until 2014-01-07, when we will resume our regular work schedule.

Here's a schedule of our support for the next weeks:

2013-12-21 Saturday SLA-covered emergency support only
2013-12-22 Sunday SLA-covered emergency support only
2013-12-23 Monday regular support, limited to major incidents
2013-12-24 Tuesday regular support, limited to major incidents
2013-12-25 Wednesday SLA-covered emergency support only
2013-12-26 Thursday SLA-covered emergency support only
2013-12-27 Friday regular support, limited to major incidents
2013-12-28 Saturday SLA-covered emergency support only
2013-12-29 Sunday SLA-covered emergency support only
2013-12-30 Monday regular support, limited to major incidents
2013-12-31 Tuesday regular support, limited to major incidents
2014-01-01 Wednesday SLA-covered emergency support only
2014-01-02 Thursday regular support, limited to major incidents
2014-01-03 Friday regular support, limited to major incidents
2014-01-04 Saturday SLA-covered emergency support only
2014-01-05 Sunday SLA-covered emergency support only
2014-01-06 Monday SLA-covered emergency support only

Upcoming OS update will break old lxml installations

The upcoming Flying Circus platform update will introduce a libxml2 version that is not compatible with lxml 2.x anymore. We ask all users who compiled their own lxml library to check if this is lxml 2.x or 3.x.

All applications should switch to lxml 3.x now, which supports both old and new libxml2 versions.
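To check which lxml version an application environment actually uses, a short script like the following may be enough. It is a sketch: the helper names are hypothetical, and treating every major version below 3 as affected is an assumption that goes slightly beyond the 2.x case named above.

```python
def needs_lxml_upgrade(version):
    """Return True if an lxml version tuple such as (2, 3, 6, 0) predates 3.x."""
    return version[0] < 3


def installed_lxml_version():
    """Return the installed lxml version tuple, or None if lxml is not importable."""
    try:
        from lxml import etree
    except ImportError:
        return None
    # LXML_VERSION is a tuple of integers, for example (3, 2, 4, 0).
    return etree.LXML_VERSION
```

Run it with the same Python interpreter the application uses, because a self-compiled lxml may be installed in only one of several environments.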

Slow response times last night

During the night, between 2013-11-26 21:00 and 2013-11-27 07:00, we temporarily experienced increased response times for services hosted with us.

According to our data center operator, this was caused by an issue at DECIX (Frankfurt), where maintenance operations led to temporary outages and a reduced number of peering sessions.

This caused packet loss and slowed down aspects of various processes involved in delivering your services, like DNS lookups, downloads, etc.

We apologize for any inconvenience - unfortunately there is nothing we can directly do about the availability of major internet exchanges. However, the people at DECIX are aware of their importance and we are sure they will respond to this incident accordingly.

[Solved] Some VMs affected by network issues [update 5]

After the kernel upgrade applied to our KVM servers yesterday, we are experiencing compatibility issues on some VMs in our cluster, causing them to lose network connectivity.

All VMs that have been affected by the issue have been configured with a watchdog process to reboot them cleanly and immediately the next time the issue occurs - this should keep service downtimes to a minimum while we're working on a long-term fix.

We will likely have to provide a newer kernel for virtual machines within the next few days. We will review this tomorrow and hopefully get back to a completely stable environment quickly.

Unfortunately we did not discover this incompatibility in our extensive staging and test environments and we have not found a pattern to trigger this reliably.

We apologize for any inconvenience and kindly ask for your patience while we are getting everything back to normal.

Update 1 (2013-10-07 09:19)

We have started to provision an updated virtual machine kernel in our development environment. Our goal is to have the kernel available for selected virtual machines by noon to gather data about its stability on the VMs affected so far. Based on that data we will decide later today how to proceed.

Update 2 (2013-10-07 12:02)

We have successfully prepared an updated kernel (3.10) for virtual machines in our development environment. We have not found any adverse effects regarding application compatibility and are now rolling it out in our staging environment and on selected production VMs that run data-center internal services. If those prove stable, we will roll out updates to all previously affected VMs in the next 2 hours.

Update 3 (2013-10-07 16:00)

We deployed the new kernel to selected machines but have not achieved the desired result. However, we also learned that a specific checksum offloading feature tends to break under certain conditions. We were able to reproduce the problem and appear to have worked around it by disabling this specific offloading feature. We are currently monitoring a few affected VMs for stability and will continue rolling out the prepared kernel and the offloading change later this evening.

Update 4 (2013-10-08 01:48)

The combination of a new kernel and disabling the troublesome offloading feature appears to have fixed the problem (or at least resulted in a reliable workaround). We have not seen a VM network crash for about 7 hours now and all virtual machines have been updated.

We are currently in contact with upstream developers to analyze the root cause of this problem. Hopefully this will result in an upstream open source bugfix that will also become available to other parties.

Update 5 (2013-10-08 15:40)

Since yesterday evening, we have continued to monitor our infrastructure closely and have not seen any further VM network crashes. We therefore consider the problem solved.

In the meantime we have started debugging the issue together with the upstream developers in an isolated environment. This will hopefully help to find the root cause and support the upstream developers in providing a fix.

E-mail server maintenance on Sun 2013-10-06, 21:00-22:00 CEST

On Sunday, 2013-10-06, we are going to increase the disk space of our mail system. Because of that, mail delivery will be delayed between 21:00 and 22:00 CEST. Sending of emails as well as the access to the mailboxes may not be possible during this time.