Backup, Nagios, DNS, and locking issues [Update 5]

At 2:45pm today we performed a hardware upgrade on the machine running our disk backup server.
We added memory to reduce performance limitations that have appeared over the last few days. We performed this upgrade during the day because this machine does not provide services that production depends on during the day; it is only needed at night for performing the backups.

Unfortunately, both system disks, which run in a RAID 1 array, did not spin up again and appear to have died. We are still diagnosing the situation, but as today's deadline for ordering spare parts has already passed, we are putting off further diagnosis until tomorrow morning.

Please note: this means that although all primary production systems are working correctly, we are currently unable to run backup or restore jobs. The data in our backups is not affected by this problem.

In the meantime, this machine also carried a few other secondary services: one of the DNS servers, the Nagios server, and the locking daemon. We are working to bring these back up on other machines quickly.

  • DNS is available on two more machines, and we improved the situation by taking the failed server out of the resolver list, so any hangs due to missing DNS should have vanished by now.
    ETA: solved
  • The Nagios server will be migrated to another machine, which is generally unproblematic thanks to our fully managed environment.
    ETA: Friday, 12:00 noon
  • The locking daemon, which is responsible for fencing distributed storage access during VM startup and shutdown, is also being migrated to another host.
    ETA: Thursday, 9:00pm

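For illustration, taking a failed server out of the resolver list amounts to dropping its `nameserver` line from each client's `/etc/resolv.conf`. A minimal sketch, assuming that file format (the IP addresses below are made-up documentation examples, not our actual servers):

```python
# Sketch: return resolv.conf content with the failed nameserver removed.
# All other lines (other nameservers, search domains, options) are kept.
def drop_nameserver(resolv_conf, failed_ip):
    kept = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        # Skip only the exact "nameserver <failed_ip>" line.
        if fields[:2] == ["nameserver", failed_ip]:
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"
```

With the failed server gone from the list, resolver libraries no longer wait out a timeout against it, which is why the hangs disappear.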
[Update Thursday, 2011-05-19 21:22 CEST]

The locking daemon is fully operational again.

[Update Friday, 2011-05-20 09:58 CEST]

Nagios has been operating again since yesterday around 23:30 CEST.

However, the web UI had a DNS configuration issue, which was resolved a few minutes ago.
Unfortunately, custom Nagios checks are currently unavailable; they will be restored.

[Update Friday, 2011-05-20 17:35 CEST]

The backup server has been recovered to an intermediate working state. We are currently scanning the backup volumes to restore the catalog and resume the backup service. This will probably take until Monday, 2011-05-23.

Additionally, as the server was due for replacement in the near future, we started our procurement process for replacement hardware.

The previous Nagios performance data and availability archives will be integrated later but are available from backups.


[Update Tuesday, 2011-05-24 17:15 CEST]

The backup server's catalog was restored successfully after two attempts. Starting tonight we will resume making backups, and we are able to restore data from backups created before 2011-05-19.

We still need to restore the customer-specific Nagios checks as well as the archive data (availability and performance); we will follow up on this during the next few days.

[Update Tuesday, 2011-05-31 13:09 CEST]

We struggled a bit with restoring the performance data into the newly set-up databases, but found a solution. We are currently running the restore scripts, which have succeeded for about 30% of all machines so far. We expect the historical performance data to be available again later today.

Also, we restored the historical Nagios availability data.

Limited support available due to connectivity issues in the office [update 3]

At around 11:30am CEST a machine at a nearby construction site cut a cable that carries all our communication lines in the Halle office.

For this reason we are currently unable to answer requests on our regular phone line, but we are answering requests by email as usual.

Our services in the Oberhausen data center will experience a few limitations regarding system configuration updates and access to services that require live LDAP (such as web statistics access).

However, we expect regular application operations to continue without any further issues.

[Update 2011-05-09 6:42pm CEST]

Our ticket system is unfortunately also unavailable at the moment, as email cannot be forwarded from our primary email server to the ticket system. If you need urgent assistance, please send an email to --- directly or call ---.

Our communications provider reviewed the situation at our premises around 5pm and hopes to start work on restoring connectivity by 7am tomorrow, which would result in restored services at 8am.

We are also in the process of migrating our LDAP service to the data center to restore system configuration services and authentication.

[Update 2011-05-09 6:58pm CEST]

A few minutes ago a technician arrived and started repairing the broken cable. We therefore hope that services may be restored within a few hours.

[Update 2011-05-09 8:54pm CEST]

The technician successfully repaired the broken cable. Connectivity to the office, and thus all services, has been restored since around 7:30pm CEST.

Problems with IPv6 on 2011-04-14 [update 10:54am CEST]

We are currently experiencing problems with our IPv6 uplink. Users who access our services via IPv6 may experience outages. Our data center provider has been informed and expects the problem to be resolved by 9:00am CEST.

[Update 10:07am]

Our provider is still working on the issue but has revoked the ETA for a solution. We'll provide another update as soon as we know more specifics or get another ETA.

[Update 10:54am]

IPv6 connectivity has been restored. Minor clean-up tasks are still being performed at the data center, so a slight chance of minor connectivity issues remains for the near future.

Infrastructure maintenance in the computing center, 2011-04-14 05:00 - 07:00 CEST

Our computing center operator has announced firmware updates on network components on 2011-04-14 between 05:00 and 07:00 CEST. Due to device restarts there will be short interruptions in network traffic.

If you have problems or questions, please send an email to support@gocept.com. We are also available by telephone: +49 345 12298890.

Unexpected downtime due to storage failure [update 3:48pm]

gocept.net is currently experiencing an infrastructure-wide downtime due to an error in the storage layer.

We are currently recovering the affected services and will provide further updates soon.

2011-04-12 2:58pm CEST

We identified an issue in the iSCSI server software for which a vendor-side patch is available. After initial tests in our staging environment, we are preparing a package to apply this patch to our production systems within the next 30 minutes.

2011-04-12 3:26pm CEST

The patch has been applied on the storage server that was hung, and the affected VMs are now being brought back into a working state.

The other storage server, which has the same issue but is currently working, will receive this patch immediately as well. However, we do not expect further interruptions on VMs that are currently running, as the patch applied cleanly on the staging systems while under load.

2011-04-12 3:48pm CEST

The patch has now been applied cleanly on all storage servers. VMs that had a read-only filesystem have been rebooted and are back online.

If you are affected, please check that your services are back online. Nagios still needs some time to return to a completely green state, and we will follow up on any remaining issues shortly.
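As a rough aid for that check: a filesystem that was forced read-only by the storage outage shows up with the `ro` option in the Linux mount table. A minimal sketch, assuming a Linux guest where `/proc/mounts` lists each mount as `device mountpoint fstype options dump pass`:

```python
# Sketch: report mount points whose mount options include "ro",
# parsing the Linux /proc/mounts format.
def readonly_mounts(mounts_text):
    ro = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # fields[3] is the comma-separated option list, e.g. "ro,relatime"
        if len(fields) >= 4 and "ro" in fields[3].split(","):
            ro.append(fields[1])
    return ro

# On an affected VM you would feed it the live table:
#   readonly_mounts(open("/proc/mounts").read())
```

An empty result (ignoring intentionally read-only mounts) suggests the VM came back cleanly after the reboot.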


We apologize for any inconvenience.

Comprehensive system update on Tuesday 2011-03-01 20:00 CET

On Tuesday next week we will perform a regular but comprehensive system update on all of our servers in the computing centre Oberhausen. Unfortunately, we are not able to avoid downtime of services, as the operating system updates require a reboot of the systems.

Due to that, all services will be affected for periods during the time frame of 8pm until midnight CET. We expect the interruption of each individual service to be short.

We perform updates to sustain a secure, modern, and highly productive environment and to provide you with new features.

We apologize for any inconvenience caused by the update and will gladly respond to any questions you have. To contact us, send an email to support@gocept.com.

[Update 2011-03-02 08:00]

The update unfortunately required more fine-tuning than expected from the development and staging environments. We are currently in the process of rebooting the last servers and expect all services to be back to normal by 09:00 CET.

Storage server failure on 2011-02-17

Today we experienced a failure of the iSCSI storage server between 09:36 and 10:00 CET. This caused short service interruptions on some VMs. There was also a short outage of our mail server (mail.gocept.net), but no mails were lost.

We apologize for any inconvenience.