Reboot of all machines on Sat 2013-10-05 21:00-23:30 CEST
We are updating to a new version of the Linux kernel on our VM hosts and need to reboot all of them. The reboot is scheduled for Saturday, 2013-10-05 between 21:00 and 23:30 CEST (19:00-21:30 UTC). During this period, expect one or several short service interruptions. We try to minimize outages, but when a host reboots, the virtual machines on it must be restarted as well. We apologize for any inconvenience.
Storage server outage - few VMs affected [Update 1]
Today around 19:30 CEST one of our storage servers encountered a massive hard drive failure: 3 out of 7 disks stopped working at once and crashed the system entirely.
However, only a small number of VMs are affected, and none of our central services are impacted: if you have not experienced any issues so far, you are not affected.
We are currently restoring VMs that were located on this storage server (some have been restored already). You should see your VMs and services come back within the next few hours.
We will provide an aftermath update later.
[Update 1: 2013-04-18 01:04 CEST]
We have finished restoring all business-critical machines and all customer services are back online. Our disaster recovery plan estimates about 24 hours for restoring a completely failed storage server. Today it took about 5.5 hours to perform the necessary analysis, write some scripts, answer customer inquiries, and get services running again.
2013-03-05 00:00 Urgent network maintenance
Following up on the recent IPv6 outage, our provider has received updates for their Cisco equipment that will be deployed tonight between 2013-03-05 00:00 and 06:00 CET.
Due to bugs in the existing firmware, a "hot update" is not possible, and the provider expects multiple outages of about 10 minutes each in connectivity to the data center.
We do not expect any network issues within our cluster, but connectivity from and to the internet will be unavailable at times.
IPv6 connectivity outage [2013-02-26 10:17 CET, update 2, solved]
There is currently an unplanned outage of our IPv6 connectivity in the data center.
We are working with our uplink providers to restore connectivity.
IPv4 connectivity is not affected. However, services that rely on external services providing dual-stack networking (IPv4+IPv6) may experience timeouts and delays.
We will post updates here as we work towards a solution.
We are sorry for any inconvenience.
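Until IPv6 connectivity is restored, clients that contact external dual-stack services can sidestep the timeouts by resolving IPv4 addresses only. A minimal sketch of that fallback strategy in Python (illustrative only, not our infrastructure's code):

```python
import socket

def ipv4_addresses(host, port=80):
    """Resolve only the IPv4 (A record) addresses of a host,
    skipping IPv6 results that would time out during the outage."""
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})

# "localhost" resolves without any external connectivity:
print(ipv4_addresses("localhost"))
```

Passing `socket.AF_INET` to `getaddrinfo` is what restricts the lookup to IPv4; once the outage is over, dropping that argument restores normal dual-stack resolution.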
[Update 1 - 11:52 CET]
The cause appears to have been in the data center's upstream router infrastructure. We are seeing improvements and IPv6 traffic is picking up again. We know of a couple of remaining edge cases and continue to work on restoring full connectivity soon.
[Update 2 - 14:51 CET, solved]
IPv6 connectivity has been restored. Some nodes are still recovering from the outage but we are seeing continuous improvement in our monitoring.
The root cause was traced to a Cisco IOS bug triggered by more than 10,000 IPv6 routing entries, and our upstream provider has implemented a workaround for the problem. We expect network maintenance in the coming weeks to deploy a fixed IOS version on the network operator's equipment.
Major OS update roll-out starting from 2013-03-04
It has been a long time since we provided the last operating system update - longer than we expected. However, we are happy to announce that we have finished QA on a major set of updates covering many of the packages maintained in our platform.
We will roll out the update incrementally between 2013-03-04 and 2013-03-09. All machines will receive a maintenance slot in the coming days; we will inform you of the assigned slots by email.
If you own any testing or staging environments then those will be updated at least 24 hours before the production systems.
Downtime expectations
Your VMs will be restarted at least two times:
- once during the week in their assigned slot to activate a new kernel configuration, and
- once on Saturday, 2013-03-09 around 10:00-14:00 CET to restart all physical KVM hosts; this will take less than 30 minutes for any specific VM.
Important changes
There have been many improvements and bugfixes, most notably on our infrastructure, to make the platform more reliable and faster without you having to worry about it.
However, there are a few changes that we would like you to be aware of to avoid pitfalls:
- Varnish will be upgraded to version 3. Most configurations from Varnish 2 will continue to work correctly, but some subtle changes may break your config. The Varnish community has provided a nice upgrade document that summarises the relevant changes.
- Due to a change in the license for Oracle/Sun JDK we decided to switch the Java VM to OpenJDK as we are no longer allowed to automatically distribute the Oracle/Sun JDK.
- Python 2.4 is now in its "sunset" period: we will no longer install it on any additional machines, and we will announce a roadmap for uninstalling Python 2.4 within the next months. This version of Python is very old, unsupported, and has known security issues.
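If any of your own automation still targets Python 2.4, an explicit interpreter guard makes the dependency visible before the uninstall roadmap takes effect. A hypothetical sketch (the minimum version simply mirrors the sunset described above):

```python
import sys

# Python 2.4 is being sunset on the platform; require at least 2.5.
MINIMUM_VERSION = (2, 5)

def interpreter_supported(version_info=None):
    """Return True if the given (or the running) interpreter
    meets the minimum supported version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MINIMUM_VERSION

if not interpreter_supported():
    sys.exit("This Python interpreter is sunset on the platform; please upgrade.")
```

Placing such a guard at the top of a script turns a silent breakage after the uninstall into an immediate, explicit error message.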
- We no longer install "swftools" and will actively uninstall it, as the upstream developers no longer maintain it.
If you are interested in more technical details, feel free to take a look at our change log. A detailed list of updated package versions will appear there.
What's next?
Having finished this major release, we are already looking forward to the next big thing. In the coming months we expect to drastically improve our storage system by introducing Ceph instead of iSCSI. Preliminary work on the new storage systems has already started and we are really excited to get this done.
If you look closely, you will notice that we have not upgraded the kernel with this update. This is intentional and related to the upcoming storage overhaul: the existing Linux iSCSI stack unfortunately does not port very well to newer kernel versions, and rather than switching our iSCSI implementation while working on Ceph, we decided to stay on a kernel that allows a smooth transition to Ceph.
[Update 1, 2013-02-14]
The date for the second reboot had a typo: it showed 2013-02-09 for the KVM server reboot, which should have read 2013-03-09.
Filesystem integrity checks
During the next days, we will run a filesystem integrity check on all machines. This means that all machines will experience a minor downtime (less than 10 minutes for most machines). Technical contacts will be informed about the individual times in advance.
The global filesystem check is a preventive measure to ensure that we will not run into problems during the upcoming OS update.
A few machines in the development network showed inconsistent filesystems under certain stress conditions. Although we have not noticed any problems in the production network, we identified some scenarios that could be triggered by the OS update process. The integrity check will effectively remediate any of these potential problems.
We apologize for the inconvenience caused by this safety measure.
Filesystem migration to ext4
As part of our ongoing platform improvements, we will migrate all ext3 filesystems to the ext4 format. The migration requires an offline filesystem check and thus a short downtime for each machine to reboot.
The downtimes will be scheduled individually according to the agreed maintenance windows. We will inform the technical contacts about the exact times for their VMs in advance. Please write to our support if any reboot time does not fit.
Except for the reboot, the change will be completely transparent.
Benefits of the new filesystem
The ext4 filesystem offers a number of advantages over the old ext3 filesystem:
- Reduced filesystem check time: both planned machine reboots and error recovery after outages will be sped up significantly.
- Large file handling: access to large files like ZODB databases require less system overhead.
- Large directory handling: file lookup in directories that contain a lot of files is more efficient.
- More subtle run-time optimizations like delayed allocation and others.
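After the migration reboot, you can verify the filesystem type yourself by inspecting /proc/mounts. A small sketch of such a check (the device name is only an example):

```python
def mounted_filesystems(mounts_text):
    """Parse /proc/mounts-style text into a {mount_point: fstype} dict."""
    filesystems = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            # Fields are: device, mount point, filesystem type, options, ...
            filesystems[fields[1]] = fields[2]
    return filesystems

# On a migrated VM, the root filesystem should report ext4:
sample = "/dev/vda1 / ext4 rw,relatime 0 0\n"
print(mounted_filesystems(sample))
```

On a running machine, pass `open("/proc/mounts").read()` to the function and check the entry for `/`.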