Backup, Nagios, DNS, and locking issues [Update 5]

At 2:45pm today we performed a hardware upgrade on the machine running our disk backup server.
We added memory to address performance limitations that had shown up over the last few days. We performed this upgrade during the day because this machine does not provide services needed for production operations and is only needed at night to run the backups.

Unfortunately, both system disks, which run in a RAID 1, did not spin up again and appear to have died. We are still diagnosing the situation, but as today's deadline for ordering spare parts has already passed, we are postponing further diagnosis until tomorrow morning.

Please note: although all primary production systems are working correctly, we are currently not able to run backup or restore jobs. The data in our existing backups is not affected by this problem.

This machine also carried a few other secondary services: one of the DNS servers, the Nagios server, and the locking daemon. We are working on bringing these back up on other machines quickly.

  • DNS is available on two other machines, and we have improved the situation by removing the failed server from the resolver list, so any hangs caused by the missing DNS server should be gone by now (see the sketch after this list).
    ETA: solved
  • The Nagios server will be migrated to another machine, which is generally unproblematic thanks to our fully managed environment.
    ETA: Friday, 12:00 noon
  • The locking daemon is responsible for fencing distributed storage access during VM startup and shutdown. It is also being migrated to another host (a conceptual sketch follows this list).
    ETA: Thursday, 9:00pm
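
For illustration only: taking a failed server out of the resolver list boils down to dropping its nameserver entry from /etc/resolv.conf on each machine. The address and the direct file edit in the sketch below are placeholders; in a managed environment the file would normally be regenerated by configuration management rather than edited by hand.

    #!/usr/bin/env python
    # Sketch: drop a dead nameserver from /etc/resolv.conf.
    # 192.0.2.53 is a placeholder, not the address of the actual failed server.
    DEAD_NAMESERVER = '192.0.2.53'

    def filter_resolv_conf(path='/etc/resolv.conf'):
        with open(path) as f:
            lines = f.readlines()
        # Keep everything except the nameserver line pointing at the dead host.
        kept = [line for line in lines
                if line.split() != ['nameserver', DEAD_NAMESERVER]]
        with open(path, 'w') as f:
            f.writelines(kept)

    if __name__ == '__main__':
        filter_resolv_conf()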
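
To make the role of the locking daemon a bit more concrete, here is a conceptual sketch of lock-based fencing around VM startup and shutdown. This is not our actual daemon: a real implementation coordinates locks across hosts, while the local advisory file lock below only stands in for that coordination; the lock directory and the helper function are hypothetical.

    #!/usr/bin/env python
    # Conceptual sketch only: exclusive lock around storage access for one VM.
    import fcntl
    import os
    from contextlib import contextmanager

    LOCK_DIR = '/run/vm-locks'  # hypothetical location for per-VM lock files

    @contextmanager
    def vm_storage_lock(vm_name):
        os.makedirs(LOCK_DIR, exist_ok=True)
        path = os.path.join(LOCK_DIR, vm_name + '.lock')
        with open(path, 'w') as lockfile:
            # Block until we hold the exclusive lock, so no concurrent
            # start/stop can touch this VM's storage at the same time.
            fcntl.flock(lockfile, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(lockfile, fcntl.LOCK_UN)

    # Usage (hypothetical helper):
    # with vm_storage_lock('vm23'):
    #     attach_disks_and_boot('vm23')
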
[Update Thursday, 2011-05-19 21:22 CEST]

The locking daemon is fully operational again.

[Update Friday, 2011-05-20 09:58 CEST]

Nagios has been operating again since about 23:30 CEST yesterday.

However, the web UI had a DNS configuration issue, which was resolved a few minutes ago.
Unfortunately, custom Nagios checks are currently not available; they will be restored later.

[Update Friday, 2011-05-20 17:35 CEST]

The backup server has been recovered to an intermediate working state. We are currently scanning the backup volumes to rebuild the catalog and resume the backup service in general; this will probably take until Monday, 2011-05-23.
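
As an illustration of what scanning the volumes involves, assuming a Bacula-style setup (the backup software is not named above): each volume is re-read with bscan, which re-creates the media and file records in the catalog database. Volume names, the device name and the configuration path in the sketch are placeholders.

    #!/usr/bin/env python
    # Sketch: rebuild the catalog by re-scanning each volume with bscan
    # (assumes a Bacula-style setup; names and paths are placeholders).
    import subprocess

    VOLUMES = ['Full-0001', 'Full-0002', 'Incr-0001']   # hypothetical volume names
    DEVICE = 'FileStorage'                              # hypothetical storage device
    CONFIG = '/etc/bacula/bacula-sd.conf'

    for volume in VOLUMES:
        # -s: re-create file records in the catalog, -m: update media records
        subprocess.check_call(
            ['bscan', '-c', CONFIG, '-s', '-m', '-V', volume, DEVICE])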

Additionally, as the server was due for replacement in the near future, we started our procurement process for replacement hardware.

The previous Nagios performance data and availability archives will be integrated later but are available from backups.


[Update Tuesday, 2011-05-24 17:15 CEST]

The backup server's catalog was restored successfully after two attempts. Starting tonight we will resume making backups, and we are again able to restore data from backups created before 2011-05-19.

We still need to restore the customer-specific Nagios checks as well as the archive data (availability and performance), which we will follow up on over the next few days.

[Update Tuesday, 2011-05-31 13:09 CEST]


We struggled a bit with restoring the historical performance data into the newly accumulated databases, but found a solution. We are currently running the restore scripts, which have succeeded for about 30% of all machines so far. We expect the historical performance data to be available again later today.

Also, we restored the historical Nagios availability data.