Unplanned VM outage 2011-09-05 16:25-17:00 CEST

Unfortunately, some VMs experienced an unplanned outage yesterday (2011-09-05) between 16:25 and 17:00 CEST when their root disks turned read-only.

To recover quickly from this state, we forcibly shut down the affected VMs and rebooted the associated KVM server. The VMs recovered fine after this reboot.

Owners of affected VMs were notified individually immediately after the incident.

It appears that we have hit a bug in our iSCSI initialisation code that caused all VMs on this physical host to lose connectivity to their storage server.

The issue was triggered while we were bootstrapping a new virtual machine.

We are currently analysing the bug further and reproducing it in our development environment; it will be fixed soon.

For the time being we have stopped performing the actions that trigger this bug and do not expect your VM to be affected by this again.

System updates between August 29th and September 2nd

We proudly announce that the pending system updates will be performed during the week of August 29th to September 2nd.

Stability issue resolved

The stability issue with our iSCSI storage server that required us to postpone the system update has been resolved. It was caused by an incompatibility between the iSCSI server software and Linux 2.6.39. We resolved the issue by selecting 2.6.38 as the new kernel version for this update instead.

Order of events

With the system update next week, events will take place in the following order:

This week:
  • Announcement of general resource group maintenance settings by email
  • Announcement of specific maintenance windows for every VM by email
Next week:
  • Monday morning: update of infrastructure machines; we will watch for any deviations and fix last-minute bugs if necessary; no downtime expected. If anything goes wrong, this will be our chance to cancel any customer-affecting updates.
  • Monday evening: update of a representative set of machines including selected customer VMs. 
  • Tuesday evening: initiate update on 20% of all machines
  • Wednesday evening: initiate update on 30% of all machines
  • Thursday evening: initiate update on the remaining machines
  • Saturday: reboot of all VM host machines

During the week we have also scheduled early-morning and late-night shifts for our support personnel, who will keep an eye on the services during and after the system update, fix any issues that arise, and contact you about issues we cannot fix for you.

Please note that all VMs will be rebooted twice: once during their regular maintenance window to activate the new kernel, and a second time on Saturday, outside the regular maintenance windows, because the KVM hosts need to be restarted.

Automatic maintenance scheduling

With the growing number of machines we support in gocept.net, and with the ultimate goal of providing a transparent, flexible, yet automated service, we took your feedback from the last months and implemented a better mechanism for maintenance activities. It replaces the existing "automatic reboot" scheduler, which only paid attention to system load and did not communicate its decisions.

The new implementation allows any machine to queue "maintenance activities" such as memory resizes, changes to the number of CPUs, kernel updates, or larger system updates. Depending on the settings of the resource group, our central directory can then automatically schedule a window for those activities and notify the machine of the time at which the activity should be performed.

Every resource group now has settings for:

  • your technical contact email addresses
  • your preferred timezone
  • a daily interval that can be used for scheduling regular maintenance automatically
  • how far in advance you need to be informed of any maintenance activity

Every machine has an additional setting that controls whether it is allowed to automatically schedule new maintenance windows.
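To illustrate how these settings interact, here is a minimal sketch of the scheduling step, assuming hypothetical field names and simplified logic (the actual directory implementation may differ):

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

@dataclass
class ResourceGroupSettings:
    contacts: list[str]   # technical contact email addresses
    timezone: str         # preferred timezone, e.g. "Europe/Berlin"
    window_start: time    # start of the daily maintenance interval
    notice: timedelta     # how far in advance you must be informed

def next_window(settings: ResourceGroupSettings, now: datetime) -> datetime:
    """Return the earliest start of the daily maintenance interval
    that still respects the required notification period.
    `now` must be a timezone-aware datetime."""
    tz = ZoneInfo(settings.timezone)
    earliest = (now + settings.notice).astimezone(tz)
    candidate = earliest.replace(hour=settings.window_start.hour,
                                 minute=settings.window_start.minute,
                                 second=0, microsecond=0)
    if candidate < earliest:
        candidate += timedelta(days=1)  # today's interval has already passed
    return candidate
```

A machine that queues an activity such as a kernel update would then be notified of the window computed this way, unless automatic scheduling has been disabled for that machine.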

To introduce this feature you will first receive individual emails for each resource group showing the values we used to initialize these settings for you. Once we have prepared the full roll-out schedule for next week, you will receive emails about the windows we schedule for each individual machine.

More swap for small VMs

All VMs now get at least 1GB of swap space. This helps smaller VMs, which need to run the same memory-resident system administration tools as larger ones: the kernel can now swap those tools out regularly. This way you can make more effective use of the memory in smaller VMs without our system management getting in your way.

Multi-core VMs

Due to popular demand we are now introducing multi-core VMs: you can choose to run up to 12 cores per virtual machine.

However, we still recommend using the multi-core feature wisely: dividing your application across multiple smaller VMs has positive side effects such as load balancing and higher fault tolerance on the infrastructure level, and it ultimately scales to many more cores. It also usually means that the setup of each individual VM is much simpler, more testable, and thus more maintainable. Also remember: the operating system for VMs is still 32-bit, so you are limited to a total of 4GB of memory per VM and 3GB per process.

We will charge 25 EUR per additional core per month.

Software updates

The package catalog has been updated and now includes, amongst others, Linux 2.6.38, Python 2.5.4, 2.6.6, 2.7.1 and 3.1.3. A more detailed list of package updates is shown in our official ChangeLog.

To perform the package updates faster than in the past, we have improved our binary host system: packages are pre-compiled in our development environment and then pushed directly onto the data center mirrors. Since we use a careful check-summing mechanism to ensure binary compatibility, the packages are already present in the data center when the machines start updating.
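As a rough illustration of the idea behind such a check-summing scheme (a sketch only, not our actual implementation), one can hash all inputs that determine binary compatibility and reuse a pre-built package only when the fingerprints match:

```python
import hashlib

def build_fingerprint(package: str, version: str, use_flags: list[str],
                      compiler: str, cflags: str) -> str:
    """Hash everything that influences the resulting binary so that a
    pre-compiled package is only reused when all inputs match."""
    h = hashlib.sha256()
    for part in (package, version, compiler, cflags, *sorted(use_flags)):
        h.update(part.encode())
        h.update(b"\0")  # unambiguous field separator
    return h.hexdigest()

# An updating machine compares its locally computed fingerprint with the
# one published on the mirror; on a match it downloads the pre-compiled
# binary package instead of building from source.
print(build_fingerprint("dev-lang/python", "2.7.1",
                        ["ssl", "sqlite"], "gcc-4.4.5", "-O2 -march=i686"))
```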

CPU visibility

Do you wonder which CPU actually powers your VM? Running `uname -a` now shows the actual physical CPU identification instead of the generic `qemu virtual CPU`.
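If you prefer to read the model string programmatically, a small sketch like the following (assuming a Linux /proc filesystem) extracts it from /proc/cpuinfo:

```python
def cpu_model(path: str = "/proc/cpuinfo") -> str:
    """Return the model name of the first CPU listed in /proc/cpuinfo."""
    with open(path) as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

print(cpu_model())  # e.g. "Intel(R) Xeon(R) CPU E5540 @ 2.53GHz"
```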

As we run a mixed environment that is constantly updated, you might see different CPU models on different machines. If you think you could benefit from a better CPU, drop us a note and we'll see whether we have free capacity on a more powerful system.

New configuration schema for PostgreSQL 9

The PostgreSQL configuration schema has been adjusted by Gentoo to be more Unix-like: the configuration files are now located in /etc/postgresql-9.0 instead of /srv/postgresql/9.0/data.

More documentation

In case you haven't noticed: we have also silently been updating our gocept.net documentation to give a better overview of our architecture, help you get started and explain the typical tasks in our environment.

We're very excited to put those improvements to good use and hope that they will further improve your experience with our hosting services.

Your gocept.net system administrators

System updates delayed

Unfortunately, preparations for next week's planned system updates have uncovered a stability issue in the storage system that might endanger the availability and reliability of your services.

Therefore we have decided to delay the planned update as long as necessary to provide a reliable solution.

We would like to re-schedule the update as soon as possible but cannot give a specific date at this time. However, we will inform you at least one week in advance with a detailed schedule.

We apologize for any inconvenience,
Your gocept.net system administrators

System updates August 15th–19th

We'd like to pre-announce a comprehensive system update of our hosting infrastructure in the week of August 15th–19th.

The update will require some downtime for all of our services. Details about the specific downtimes will be published separately.

Backup, Nagios, DNS, and locking issues [Update 5]

At 2:45pm today we performed a hardware upgrade on the machine running our disk backup server: we added memory to reduce performance limitations that had occurred over the last few days. We performed this upgrade during the day because the machine does not provide production services during the day, but it is needed at night to perform the backups.

Unfortunately, both system disks, which run in a RAID 1, did not spin up again and appear to have died. We are still diagnosing the situation, but as today's deadline for ordering spare parts has already passed, we are postponing further diagnosis until tomorrow morning.

Please note: this means that although all primary production systems are working correctly, we are currently not able to run backup or restore jobs. The data in our existing backups is not affected by this problem.

This machine also carried a few other secondary services: one of the DNS servers, the Nagios server, and the locking daemon. We are working to bring these back up quickly on other machines.

  • DNS is available on two other machines, and we have improved the situation by taking the failed server out of the resolver list, so any hangs due to missing DNS should have vanished by now.
    ETA: solved
  • The Nagios server will be migrated to another machine, which is generally unproblematic thanks to our fully managed environment.
    ETA: Friday, 12:00 noon
  • The locking daemon, which is responsible for fencing distributed storage access during VM startup and shutdown, is also being migrated to another host.
    ETA: Thursday, 9:00pm

[Update Thursday, 2011-05-19 21:22 CEST]

The locking daemon is fully operational again.

[Update Friday, 2011-05-20 09:58 CEST]

Nagios has been operating again since yesterday at about 23:30 CEST.

However, the web UI had a DNS configuration issue, which was resolved a few minutes ago. Unfortunately, custom Nagios checks are currently not available; they will be restored.

[Update Friday, 2011-05-20 17:35 CEST]

The backup server has been recovered to an intermediate working state. We are currently scanning the backup volumes to restore the catalog and resume the backup service. This will probably take until Monday, 2011-05-23.

Additionally, as the server was due for replacement in the near future, we started our procurement process for replacement hardware.

The previous Nagios performance data and availability archives will be integrated later but are available from backups.

[Update Tuesday, 2011-05-24 17:15 CEST]

The backup server's catalog was restored successfully after two attempts. Starting tonight we will resume making backups, and we are able to restore data from backups created before 2011-05-19.

We still need to restore the customer-specific Nagios checks as well as the archive data (availability and performance), which we will follow up on during the next days.

[Update Tuesday, 2011-05-31 13:09 CEST]

We struggled a bit with restoring the performance data into the newly created databases, but we have found a solution. We are currently running the restore scripts, which have succeeded for about 30% of all machines so far. We expect the historical performance data to be available again later today.

We have also restored the historical Nagios availability data.

Limited support available due to connectivity issues in the office [update 3]

At around 11:30am CEST a machine at a nearby construction site broke a cable that carries all our communication lines in the Halle office.

For this reason we are currently unable to answer requests on our regular phone line, but we are answering requests by email as usual.

Our services in the Oberhausen data center will experience a few limitations regarding system configuration updates and access to services that require live LDAP (such as web statistics access).

However, we expect regular application operations to continue without any further issues.

[Update 2011-05-09 6:42pm CEST]

Our ticket system is unfortunately also unavailable at the moment, as email cannot be forwarded from our primary email server to the ticket system. If you need urgent assistance, please send an email to --- directly or call ---.

Our communications provider reviewed the situation at our premises around 5pm and hopes to start work on restoring connectivity by 7am tomorrow, which would result in restored services by 8am.

We are also in the process of migrating our LDAP service to the data center to restore system configuration services and authentication.

[Update 2011-05-09 6:58pm CEST]

A few minutes ago a technician arrived and started repairing the broken cable. We therefore hope that services may be restored within a few hours.

[Update 2011-05-09 8:54pm CEST]

The technician successfully repaired the broken cable. Connectivity to the office, and thus all services, has been restored since around 7:30pm CEST.

Problems with IPv6 on 2011-04-14 [update 10:54am CEST]

We are currently experiencing problems with our IPv6 uplink. Users who access our services over IPv6 may experience outages. Our data center provider has been informed and expects the problem to be resolved by 9:00am CEST.

[Update 10:07am]

Our provider is still working on the issue but has withdrawn its ETA for a solution. We'll provide another update as soon as we know more specifics or receive a new ETA.

[Update 10:54am]

IPv6 connectivity has been restored. Minor clean-up tasks are still being performed at the data center, so a slight chance of minor connectivity issues remains for the near future.