Backup, Nagios, DNS, and locking issues [Update 5]

At 2:45pm today we performed a hardware upgrade on the machine running our disk backup server.
We added additional memory to address performance limitations we had observed over the last few days. We performed this upgrade during the day because this machine does not provide services required for live production; it is only needed at night to run the backups.

Unfortunately, both system disks, which run in a RAID 1, did not spin up again and appear to have died. We are still diagnosing the situation, but as today's deadline for ordering spare parts has already passed, we are postponing further diagnosis until tomorrow morning.

Please note: this means that although all primary production systems are working correctly, we are currently unable to run backup or restore jobs. The data in our backups is not affected by this problem.

This machine also carried a few other secondary services: one of the DNS servers, the Nagios server, and the locking daemon. We are working on quickly bringing those back up on other machines.

  • DNS is still available on two other machines, and we have removed the failed server from the resolver list, so any hangs caused by the missing DNS server should be gone by now (see the sketch after this list).
    ETA: solved
  • The Nagios server will be migrated to another machine, which is generally unproblematic thanks to our fully managed environment.
    ETA: Friday, 12:00 noon
  • The locking daemon is responsible for fencing distributed storage access during VM startup and shutdown. It is also being migrated to another host.
    ETA: Thursday, 9:00pm
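
For illustration, a minimal sketch of the resolver change mentioned above, assuming a standard glibc /etc/resolv.conf (the addresses below are placeholders, not our actual servers): the failed machine is simply removed from the nameserver list so lookups no longer wait for timeouts against a dead server.

    # /etc/resolv.conf (hypothetical example; addresses are placeholders)
    # The entry for the failed DNS server has been removed, so the
    # resolver only queries the two remaining, reachable servers.
    nameserver 192.0.2.11
    nameserver 192.0.2.12
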
[Update Thursday, 2011-05-19 21:22 CEST]

The locking daemon is fully operational again.

[Update Friday, 2011-05-20 09:58 CEST]

Nagios has been operating again since yesterday at about 23:30 CEST.

However, the web UI had a DNS configuration issue, which was resolved a few minutes ago.
Unfortunately, custom Nagios checks are currently not available; they will be restored later.

[Update Friday, 2011-05-20 17:35 CEST]

The backup server has been recovered to an intermediate working state. We are currently scanning the backup volumes to restore the catalog and resume the backup service. This will probably take until Monday, 2011-05-23.

Additionally, as the server was due for replacement in the near future, we started our procurement process for replacement hardware.

The previous Nagios performance data and availability archives will be integrated later but are available from backups.

[Update Tuesday, 2011-05-24 17:15 CEST]

The backup server's catalog was successfully restored after two attempts. Starting tonight we will resume making backups, and we are able to restore data from the backups created before 2011-05-19.

We still need to restore the customer-specific Nagios checks as well as the archive data (availability and performance); we will follow up on this over the next few days.

[Update Tuesday, 2011-05-31 13:09 CEST]

We struggled a bit with restoring the performance data into the newly accumulated databases but found a way to do so. We are currently running the restore scripts, which have succeeded for about 30% of all machines so far. We expect the historical performance data to be available again later today.

Also, we restored the historical Nagios availability data.

Limited support available due to connectivity issues in the office [Update 3]

At around 11:30am CEST a machine at a nearby construction site severed a cable that carries all communication lines for our Halle office.

For this reason we are currently unable to answer requests on our regular phone line, but we are answering requests by email as usual.

Our services in the Oberhausen data center will experience a few limitations regarding system configuration updates and access to services that require live LDAP (such as web statistics access).

However, we expect regular application operations to continue without any further issues.

[Update 2011-05-09 6:42pm CEST]

Unfortunately, our ticket system is currently also unavailable, as email cannot be forwarded from our primary email server to the ticket system. If you need urgent assistance, please send an email to --- directly or call ---.

Our communications provider reviewed the situation at our premises around 5pm and hopes to start work on restoring connectivity by 7am tomorrow, which would result in services being restored by 8am.

We are also currently in the process of migrating our LDAP service to the data center to restore system configuration services and authentication.

[Update 2011-05-09 6:58pm CEST]

A few minutes ago a technician arrived and started repairing the broken cable. We therefore hope that services may be restored within a few hours.

[Update 2011-05-09 8:54pm CEST]

The technician successfully repaired the broken cable. Connectivity to the office, and thus all services, has been restored since around 7:30pm CEST.