Community Digital Archiving

Self-Sufficient Culture, Heritage and Free Software

Complete disk failure – DR plan to the rescue

  • January 17, 2013 4:44 pm

The Assynt Community Digital Archive is built mostly on five virtual servers, separating services across virtual machines to aid resilience and to ease system updates by containing risk to just the service in question.  This tactic has other advantages, as we shall see.  Four of the virtual servers are essential; the fifth, a remote-access virtual desktop system, is not strictly required but is definitely useful in a disaster recovery situation.  This article uses technical terms which may be most meaningful to a technical reader, but the outcome should be of general interest.

Just before Christmas 2012, remote monitoring software showed that the entire archive was down: all five virtual machines.  The firewall, a separate little unit, was still accessible via ssh, and through the firewall access was possible to a tiny £100 test server Stevan uses (an HP microserver with just 1GB of memory and a 2TB disk).  The test server is also used as a staging post for backups: the main systems back up to the small server, and from there the backups are copied to removable disks for offsite storage.  As it happened, Stevan was unable to go to the Archive for a few weeks, but this represented a good opportunity to look at disaster recovery.  One of the glories of Free and Open Source software is that it tends to be quite modest in terms of hardware requirements.  With virtual systems, you can also cut your cloth to suit, as will be shown.

The backups on the little microserver already contained all the data but, as it happened, copies of the main virtual server containers had also been taken just a few weeks before.  As a test server, it was already set up as a KVM virtual system host.  It was therefore simple to copy over the various XML definition files for each virtual system.  Each file needed slight modification, as there were differences between the file format on the Debian Squeeze installation on the microserver and the Scientific Linux installation on the main server.  (This difference is now a thing of the past, as the main server now also runs Debian.)  Other edits changed the location of the container files, though it may have been possible to put in a symlink to the location instead.  Further edits were required to make sure the memory requirements for each server fitted into the miserably small 1GB of memory available.  This was done without much trouble, and each server was brought up with manual commands.  This is quite important, as automatic starts are not wanted on this server: if it rebooted for some reason and a second instance of the same virtual machine tried to come up on the network, address clashes would occur.
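The XML edits described above (a smaller memory allocation and a new location for the container files) can be sketched in Python.  This is only an illustration, not the procedure actually used, which was done by hand: the function name, the guest name "archive-web", the directory paths and the 192 MiB figure are all hypothetical, and the sample XML is a cut-down libvirt domain definition rather than one of the Archive's real files.

```python
import posixpath
import xml.etree.ElementTree as ET


def adapt_domain_xml(xml_text, memory_kib, image_dir):
    """Return a copy of a libvirt domain definition with reduced memory
    and with file-backed disk images relocated to image_dir."""
    root = ET.fromstring(xml_text)

    # <memory> is the maximum allocation, <currentMemory> the actual one;
    # set both so the guest fits on the small recovery host.
    for tag in ("memory", "currentMemory"):
        node = root.find(tag)
        if node is not None:
            node.text = str(memory_kib)
            node.set("unit", "KiB")

    # Point each file-backed disk at the same file name in the new directory.
    for source in root.findall("./devices/disk/source"):
        old_path = source.get("file")
        if old_path:
            new_path = posixpath.join(image_dir, posixpath.basename(old_path))
            source.set("file", new_path)

    return ET.tostring(root, encoding="unicode")


# Hypothetical, minimal domain definition for illustration only.
EXAMPLE_XML = """<domain type='kvm'>
  <name>archive-web</name>
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/archive-web.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
  </devices>
</domain>"""

if __name__ == "__main__":
    # 196608 KiB = 192 MiB, an assumed share of the 1GB host.
    print(adapt_domain_xml(EXAMPLE_XML, 196608, "/srv/dr-images"))
```

Differences in the definition format between distributions, such as those mentioned above, would still need checking by eye; the sketch only covers the two edits that are easy to automate.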

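The commands are quite straightforward – “virsh define <XML Definition file>” to set up the virtual machine, followed by “virsh start <MACHINE_NAME>” to start it.  (Alternatively, “virsh create <XML Definition file>” sets up and starts a transient virtual machine in a single step.)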
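As a rough sketch, the manual bring-up of one guest could be wrapped in a few lines of Python.  In current libvirt, "virsh define" registers a guest from its XML file without starting it and "virsh start" then boots it, so the sketch uses that pair, and it also switches off autostart explicitly so that a reboot of the recovery host cannot bring up a duplicate machine.  The helper names, the file path and the guest name are hypothetical, and by default the script only prints the commands rather than running them.

```python
import subprocess


def recovery_commands(xml_path, domain_name):
    """Build the virsh commands to register and manually start one guest."""
    return [
        # Register the guest from its (edited) XML definition.
        ["virsh", "define", xml_path],
        # Make sure it will NOT come up by itself if the host reboots.
        ["virsh", "autostart", "--disable", domain_name],
        # Start it by hand.
        ["virsh", "start", domain_name],
    ]


def bring_up(xml_path, domain_name, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on failure."""
    for argv in recovery_commands(xml_path, domain_name):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical path and guest name, shown as a dry run.
    bring_up("/srv/dr-images/archive-web.xml", "archive-web")
```

Keeping the dry run as the default means the commands can be reviewed before anything touches the hypervisor, which suits a recovery carried out remotely.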

So within an hour or two the basic system was back up and running.  We were a tiny bit fortunate in that no work at all had been done on the Archive itself since the last backup, so it was possible to restore the systems to the last backup point quite quickly.  What is more, the entire process was done remotely, in this case from just 7 miles away, but it would not have mattered had the systems been on the other side of the planet.

Later, when Stevan was able to go in to the Archive, the additional flexibility of virtual systems came to the fore.  A small desktop machine was brought into service running Debian Squeeze, and some of the virtual machines were transferred to this machine, reducing the load on the brave little microserver.

This was obviously a full DR exercise rather than a mere test, and it worked well.  Some lessons to ease the process have been noted for future reference, but in these instances it is often less a case of blindly following a disaster recovery plan and more a case of having the required flexibility to work with what one has.  A virtualised Free Software environment certainly provides that.  While a community archive suffers mere inconvenience rather than business or financial peril in the event of a complete disaster, it is still a source of professional pride to be able to deliver such a robust and resilient solution.

The cause of the main server failure, as noted in the title, was the failure of three of the four disks in the IBM RAID array.  All three (but not the fourth) had the same manufacturing date.  All three were Seagates, which, had they been bought from Seagate, would have had a five-year warranty, but IBM only offers a year on the exact same disks.  Details about this event are included here.