gocept.net [en]

Major OS update roll-out starting from 2013-03-04

It has been a long time since our last operating system update, longer than we expected. However, we are happy to announce that we have finished QA on a major set of updates covering many of the packages maintained in our platform.

We will roll out the update incrementally between 2013-03-04 and 2013-03-09. Each machine will be assigned a maintenance slot, and we will inform you of the assigned slots by email in the next few days.

If you own any testing or staging environments then those will be updated at least 24 hours before the production systems.

Downtime expectations

Your VMs will be restarted at least two times:

once during the week in their assigned slot to activate a new kernel configuration, and
once on Saturday, 2013-03-09, between about 10:00 and 14:00 CET, when we restart all physical KVM hosts. This will take less than 30 minutes for any specific VM.

Please note that you will receive a specific maintenance email for the first downtime but not for the second.

Important changes

There have been many improvements and bugfixes, most notably on our infrastructure, to make the platform more reliable and faster without you having to worry about it.

However, there are a few changes that we would like you to be aware of to avoid pitfalls:

Varnish will be upgraded to version 3 - most configurations from Varnish 2 will continue to work correctly, but some subtle changes may break your config. The Varnish community has provided a nice upgrade document that summarises relevant changes.
Due to a change in the license for Oracle/Sun JDK we decided to switch the Java VM to OpenJDK as we are no longer allowed to automatically distribute the Oracle/Sun JDK.
Python 2.4 is now in its "sunset" period: we will no longer install it on any additional machines, and we will announce a roadmap to uninstall Python 2.4 within the next few months. This version of Python is very old, unsupported and has known security issues.
We no longer install "swftools" and will even actively uninstall them as the upstream developers do not maintain them any longer.
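
To get a rough idea of whether a Varnish 2 configuration needs attention, a VCL file can be scanned for a few constructs that, to our knowledge, changed in Varnish 3. The following is a minimal sketch in Python, not an official tool: the function name and the pattern list are our own assumptions, the list is incomplete, and the upgrade document remains the authoritative reference.

```python
import re

# Varnish 2 constructs that, to our knowledge, changed in Varnish 3.
# Illustrative and incomplete; consult the official upgrade notes.
VARNISH2_PATTERNS = [
    (re.compile(r"\breq\.hash\s*\+="),
     "use hash_data(...) instead of 'req.hash +='"),
    (re.compile(r"\bpurge(_url)?\s*\("),
     "purge()/purge_url() were renamed to ban()/ban_url()"),
    (re.compile(r"\bberesp\.cacheable\b"),
     "beresp.cacheable was removed; check beresp.ttl instead"),
    (re.compile(r"^\s*esi\s*;"),
     "use 'set beresp.do_esi = true;' instead of 'esi;'"),
]


def find_varnish2_constructs(vcl_text):
    """Return (line_number, hint) pairs for lines matching a known pattern."""
    findings = []
    for number, line in enumerate(vcl_text.splitlines(), start=1):
        # Heuristic: drop everything after '#' so comments are not flagged.
        code = line.split("#", 1)[0]
        for pattern, hint in VARNISH2_PATTERNS:
            if pattern.search(code):
                findings.append((number, hint))
    return findings
```

Calling find_varnish2_constructs() with the contents of a VCL file returns the line numbers worth reviewing. An empty result does not prove that the configuration is compatible.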

Further details

If you are interested in more technical details, feel free to take a look at our change log. A detailed list of updated package versions will appear there.

What's next?

Having finished this major release, we are already looking forward to the next big thing. In the coming months we expect to drastically improve our storage system by introducing Ceph instead of iSCSI. Preliminary work on the new storage systems has already started and we are really excited to get this done.

If you look closely you will notice that we haven't upgraded the kernel with this update. This is intentional and related to the upcoming storage overhaul: the existing Linux iSCSI stack unfortunately doesn't port well to newer kernel versions. Instead of switching our iSCSI implementation while working on Ceph, we decided to stay on a kernel that allows a smooth transition to Ceph.

[Update 1, 2013-02-14]
The date for the second reboot contained a typo: it showed 2013-02-09 for the KVM server reboot but should have read 2013-03-09.

Filesystem integrity checks

Over the next few days, we will run a filesystem integrity check on all machines. This means that all machines will experience a minor downtime (less than 10 minutes for most machines). Technical contacts will be informed about the individual times in advance.

The global filesystem check is a preventive measure to ensure that we will not run into problems during the upcoming OS update.

A few machines in the development network showed inconsistent filesystems under certain stress conditions. Although we did not notice any current problems in the production network, we identified some scenarios that could be triggered by the OS update process. The integrity check will remediate any of those potential problems.

We apologize for the inconvenience caused by this safety measure.

Filesystem migration to ext4

As part of our ongoing platform improvement, we will migrate ext3 filesystems to the ext4 format. The migration requires an off-line filesystem check and thus a little downtime for each machine to reboot.

The downtimes will be scheduled individually according to the agreed maintenance windows. We will inform the technical contacts about the exact times for their VMs in advance. Please write to our support if any reboot time does not fit.

Benefits of the new filesystem

The ext4 filesystem offers a number of advantages over the old ext3 filesystem:

Reduced filesystem check time: this means that both planned machine reboots and error recovery after outages will be sped up significantly.
Large file handling: access to large files like ZODB databases requires less system overhead.
Large directory handling: file lookup in directories that contain a lot of files is more efficient.
More subtle run-time optimizations like delayed allocation and others.

After the VMs' disks have been migrated to the new format, applications will benefit from the performance optimizations incrementally as their files are modified. ZODB databases are usually packed once a week, so they will pick up the new on-disk format automatically. Blob directories and PostgreSQL databases will use the new on-disk format only for newly written data. We can assist users with the conversion of old data on request.

Except for the reboot, the change will be completely transparent.
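
If you would like to confirm after the reboot that a VM's filesystem has been migrated, the mount table shows the filesystem type in use. The following Python sketch parses text in the format of /proc/mounts on Linux. The function name is our own choice, and we assume the root filesystem is the one of interest.

```python
def filesystem_type(mounts_text, mount_point="/"):
    """Return the filesystem type of mount_point from /proc/mounts-style text.

    Returns None if the mount point does not appear in the text.
    """
    fs_type = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Line format: device mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[1] == mount_point:
            # Keep the last match: later mounts shadow earlier ones.
            fs_type = fields[2]
    return fs_type
```

On a Linux VM, passing the contents of /proc/mounts to filesystem_type() should return "ext4" for a migrated filesystem and "ext3" for one that has not been converted yet.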

Maintenance on 2012-10-09 22:00 CEST - update 2

To finally finish the exchange of the faulty switch, we will perform a series of hardware maintenance steps tomorrow night (2012-10-09) between 22:00 and 24:00 CEST.

The following tasks will be performed:

Finish the migration to a new power distribution system in our racks
Move our standby switch next to the faulty switch, verify correct operation
Move the existing network connections from the faulty switch to the replacement
Install a new standby switch

As our switches do not have a redundant power supply there will be a short outage of the whole network for about 1-2 minutes. We do not expect any failures in operation but existing connections may hang for this period.

Also, when moving the cables from the faulty switch to its replacement there will be short lags in storage or server network connectivity for a few seconds but no outages.

We are sorry that this preventive measure has taken multiple attempts to implement. We think that our decision to support a stable environment with careful, small adjustments is in the interest of your operational needs.

Update 1 [2012-10-08 12:26]

The previous version of this post mentioned 2012-10-08 as the date of the maintenance. The actual scheduled date is 2012-10-09. The text above was corrected.

Update 2 [2012-10-09 23:32]

The faulty switch has finally been replaced successfully. For a window of about 5 minutes the redundant routers were in an inconsistent state, causing some outgoing connections to fail. Otherwise all interruptions were short and intermittent, without further consequences.

Switch maintenance on 2012-10-07 - update 1

On Sunday, 2012-10-07, between 22:00 and 24:00 CEST we will replace a switch which has accumulated defective ports over the last few months. We do not expect any services to have any visible interruption due to the change.

The exchange will be performed by adding the new switch and slowly migrating all connections from the old switch to the new one causing only small, intermittent interruptions (time needed to move the cable plus a few seconds for RSTP to enable the port).

Individual affected services will probably show a temporary increase in response times but no actual failures.

Update 2012-10-08

Unfortunately we had a cascade of technical difficulties that again stopped us from exchanging the faulty switch. We are currently reviewing the steps we need to perform at the data center and are planning another attempt in the next few days, probably tomorrow evening (Tuesday, 2012-10-09).

We intend to stick to a tight schedule, as we want to prevent any further faults in this switch from causing actual problems in operations.

We will write another announcement once the details are fixed.

Switch maintenance on 2012-08-24 – cancelled

The maintenance has been cancelled for organizational reasons. A new date will be announced separately.

On Friday, 2012-08-24, between 22:00 and 24:00 CEST we will replace a switch which has accumulated defective ports over the last few months. We do not expect any services to have any visible interruption due to the change.

The exchange will be performed by adding the new switch and slowly migrating all connections from the old switch to the new one causing only small, intermittent interruptions (time needed to move the cable plus a few seconds for RSTP to enable the port).

Individual affected services should show a temporary increase in response times but no actual failures.

Intermittent connection problems with various hosted web sites

Since Monday (2012-07-09), several users have reported connection problems with web sites hosted at gocept.net. Typical symptoms include painfully slow pages and browser timeouts.

So far we have had a hard time diagnosing the problems, as we cannot consistently reproduce them. It looks like only some users and some sites are affected. To get this issue fixed soon, we would appreciate user feedback.

So if you are experiencing problems like intermittent hangs or connection timeouts with gocept.net hosted websites right now, please go to http://supportdetails.com/?recipient=support@gocept.com, fill in your name and e-mail address, and send a report. This would help us greatly to gather relevant data and work towards a solution.
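
If you are comfortable running a script, timing a handful of requests to the affected site can add useful detail to such a report. The following is a minimal Python sketch under our own assumptions: the function names, the number of attempts and the timeout are illustrative and not part of any gocept.net tooling.

```python
import time
import urllib.request


def fetch_once(url, timeout):
    """Fetch url once and discard the body.

    Connection failures and timeouts typically surface as OSError subclasses.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def probe(url, attempts=5, timeout=10.0, fetch=fetch_once, clock=time.monotonic):
    """Return one (seconds, error) tuple per attempt; error is None on success."""
    results = []
    for _ in range(attempts):
        started = clock()
        try:
            fetch(url, timeout)
            error = None
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            error = repr(exc)
        results.append((round(clock() - started, 3), error))
    return results
```

Call probe() with the URL of the site that shows the problem and include the resulting list of timings and errors, together with the time of day, in your report.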