It happens, and we all know it: no matter how good your program is or how expensive your infrastructure was, it will fail once in a while. There is no code without bugs, there is no user that always follows the rules, there is no perfect filesystem or database. If it's not in your software it'll be in the filesystem, or in the database, the network or a faulty memory chip, but the inevitable truth is: it'll happen.
Therefore, you must be prepared before it happens, otherwise you'll run around like a headless chicken and the fixes will be much worse than the original problem, creating many collateral effects that in turn will create more problems. The few rules of software quality apply to disaster recovery as well, even if following them takes more time than a quick fix. The main argument is, again: the uglier the hack, the longer its life.
When you quick-fix disasters, new problems will appear sooner or later (normally sooner) with a higher magnitude, and the pressure to fix them will be even bigger. That demands an even quicker fix than the previous one, which completes the cycle and increases, at every turn, both the magnitude of the problem and the pressure to fix it. It is therefore imperative that you stop, think and do it right.
The most effective disaster recovery strategies are put in place before the disaster happens. Back up often (but not too often) and spread the knowledge (through distributed filesystems and database replication) and you'll be quite safe when bad things happen. It likely won't save the day, but at least it'll relieve the pressure that forces you to quick-fix and hack your way through.
There are many backup strategies, and most databases and filesystems already have native backups (much faster than hand-made ones). The problem then is how often you need backups. Backing up too often makes you spend more time on backups and more space holding unnecessary copies of old data when your failure rate is quite low, but backing up too seldom makes your backup useless to restore, because so much has changed since the last snapshot.
Normally, the weekday/month strategy is quite useful. Keep one backup for each day of the week (Monday, Tuesday, etc.) and recycle the same tape/file, so every Tuesday you write on the Tuesday tape. Also keep one tape for the 1st day of each month, and you have a historical backup in case something has been gone for a while and you need to know what it was. With that, only 7 + 12 = 19 tapes/files are necessary for each dataset.
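The rotation above can be sketched in a few lines of Python. This is only an illustration: the slot names (`daily-monday`, `monthly-01`, ...) and the function names are made up for the example, and a real setup would map each slot to a tape label or a file path.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def backup_slots(day):
    """Return the (hypothetical) slot names to overwrite on the given day."""
    # date.weekday() returns 0 for Monday, so it indexes WEEKDAYS directly.
    slots = ["daily-" + WEEKDAYS[day.weekday()]]
    if day.day == 1:
        # On the 1st of the month, also refresh that month's historical slot.
        slots.append("monthly-%02d" % day.month)
    return slots


def slots_used_in_year(year):
    """Collect every distinct slot written during one calendar year."""
    used = set()
    day = date(year, 1, 1)
    while day.year == year:
        used.update(backup_slots(day))
        day += timedelta(days=1)
    return used


if __name__ == "__main__":
    print(backup_slots(date(2024, 1, 1)))   # a Monday and the 1st of a month
    print(len(slots_used_in_year(2024)))    # 7 daily + 12 monthly slots
```

Because each slot is overwritten in place, the number of tapes/files stays fixed at 19 per dataset no matter how long the rotation runs.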
But if your data changes much faster (by the hour) and restoring it is very important (like bank transactions), you really need a live backup (see replication below). That is very seldom the case in bioinformatics, though, where even a daily backup is sometimes too frequent.
The other problem is where to store the backups. Tapes are a good and cheap alternative, but restoring from them is far too slow, especially when you don't know which tape you need. Having the backups on disk would be much better for that but would use a substantial share of your disks, which can be very expensive. Mixing the two kinds of storage is a good alternative: keep the newest backups on disk and move the older ones to tape, recycling as before.
Although backup robots and programs can be very easy to deal with and cheap in the long run, they must not be the first source of data in case of a crash. Everything that delays the fix adds pressure on the people fixing the problem, and backup recoveries tend to be very slow and uncertain. Restoring from backups should run in parallel with other approaches, such as restoring the replicated database or recovering the distributed filesystem.
Most databases have a replication feature, where all data inserted or changed in one database is automatically added to (or changed in) a second database, and sometimes more. It gives you a live snapshot of your data that can be used when the master database fails. That doesn't protect you, though, from data corruption that is your program's fault. As every statement is executed in all the other databases, you'll propagate the error to all of them.
This setup is used for high availability, when you can switch databases in case of hardware or software failure, but it also has an additional benefit: the log. The communication between databases generates a log with every single transaction executed since the last time you (or the automatic administration routine) cleaned it, so it's possible to debug and even roll back manually (by updating values back to the original ones) just by reading the logs. Of course, triggers and procedures must be taken into account, so it's not that simple, but for most cases it's enough.
Replication can also serve as a snapshot for read-only programs, preventing them from breaking the main database. This way you separate your main database and your main application from the rest: no other application is allowed to access the master database, which avoids unwanted software problems from external or third-party programs and procedures.
NFS (Network File System) setups have been outdated for a long time. Any central repository is bound to fail (by Murphy's law) and therefore can't be accepted as a safe setup. In the past, NFS was the only solution (as it's easier to implement in software and hardware) and lots of hardware vendors came up with very clever ideas on how to maximize availability and throughput, but every major enterprise filesystem vendor (including EMC, NetApp, Acopia, Bluearc) had several catastrophic failures with several days of downtime, terabytes of data lost and loads of faulty hardware being replaced.
The solution to this problem is also the cheaper one: a truly distributed filesystem, a parallel virtual filesystem (PFS). PFSs have no single point of failure, the machines don't have to be the same, the disks don't have to be the same size, you can grow as large as you want, and the throughput is kept as high as possible by taking into account the distance between your data and all the available disks.
Also, your data will be replicated across those disks (as many times as you want), so even losing one disk, a whole machine or even more than one machine will still leave your data intact and ready for use. A success story for real massively parallel filesystems is Google. Such filesystems are also the cheapest option, as they rely on commodity hardware, and because they are distributed you can scale as fast as you want (or can).
Following the same logic, big vector machines and massive fileserver solutions won't scale as fast or as cheaply as you want, and they are still a single point of failure. Using (high-quality) commodity hardware in clusters for high-performance computing will scale much better, outperform most vector machines and cost a fraction of the price. The same goes for filesystems, as discussed above.
Software has the same effect on scalability. Paid software may require expensive licenses and will block the growth of your group, or at least reduce the money you have for other things like manpower, more commodity computers and power supply. Basic software like operating systems, databases and web servers should be easily scalable and readily fixed whenever you find a bug or request a feature.
Last but not least, infrastructure is fundamental to high availability. Organized cabling and redundant power supply (with generators tested weekly) are a sine qua non of a good computing infrastructure. This has little to do with software quality and therefore deserves only a little space in this text, but it's nevertheless an important background against which programmers can feel confident and not worry about dodgy networks and frequent outages.
Even with all the protection, backups and replication, disasters will happen and you'll have to fix them whenever they happen, wherever you are. If the protection guidelines were followed and your software follows good standards of software quality, you shouldn't be too worried: the fix will be obvious, quick and clean. When the ball of mud is big the problems can be a bit more worrying, but you can still use that to your advantage if you follow a few guidelines.
If the problem was in a production pipeline, re-running the same procedure with the fix in place is normally enough, but for that you need the pipeline to be stable enough. Unfortunately, most of the time that's not the case. When scripts call programs that call scripts again, when the original file is replaced by a (buggy) copy, when you have multiple code repositories in multiple filesystems, when there is no documentation about how programs perform their tasks (as opposed to how to run them), it becomes impossible to understand the pipeline and therefore to safely re-run the missing bits through it.
Disasters can be scary and the pressure to fix the problem is big, but with a clear mind and the argument of quality as an ally, you can use this stressful moment for a very good purpose: understanding the procedure well enough to fix it properly. If instead you change the output, move files around manually, update the production database or write very quick scripts, you're just adding more mud to the ball, reducing clarity, understanding and, consequently, software quality.
Wherever manual processes occur, disasters will follow. If they're an important part of your production schedule, you're doomed to have a disaster every day. Smaller programs tend not to have as many manual interactions as production pipelines, but that still guarantees neither safety nor quality.
To turn the disaster into something positive you need to take the following actions (ideally recording them in a bug tracker) while understanding the problem and searching for the solution:
Even if it feels like overkill for small cases, try to be more verbose than you would like. It's much more for you in the future than for other people in the present. Later on, when the problem is fixed and you have time again to refactor your code, those bug reports will give you plenty of information on where and how to optimize and increase the quality of your code.
Database crashes are rather serious and should not be dealt with manually. Most manual updates in the database, especially during the stress of a recent disaster, create more problems than they solve. First of all, you should never have put into production code that wasn't fully tested with real data in a staging database. Second, if the code is tested, the problematic changes will be consistent (i.e. all the same) and it will be easy to roll back or update the data without much hassle.
If you don't fall into those two categories (tested code and consistent changes), restore the most recent backup. It's not worth trying to go further when you have no idea what state your data is in, especially if you have loads of triggers updating lots of other tables. Not all triggers are consistent (sometimes for performance reasons) and therefore updating the data may not be enough to restore the original state.
As an example, suppose you have two tables, Main and Accessory, and an update trigger on Main that updates Accessory automatically: when you change Main.value to B, the trigger changes Accessory.value to F.
| Main.value | Accessory.value |
|---|---|
| B | G |
| C | K |
| A | A |
Then you run an update that changes Main.value to B, and Accessory.value automatically changes to F:
| Main.value | Accessory.value |
|---|---|
| B | F |
| B | F |
| B | F |
But you suddenly find out that the update was wrong and you have to fix the problem. If you update Main.value back to C and A, the trigger will not restore the original values in Accessory.value. Also, some rows already had the value B, so you can't possibly know which rows you have to change unless you have the source of your data elsewhere. Normally, though, the database is the source of all data and therefore a backup must be restored.
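The scenario can be reproduced with a small sketch using Python's built-in `sqlite3` module. The table names `main_t` and `accessory_t`, the `id` columns and the trigger name are assumptions made for the illustration; the data mirrors the tables above. The sketch is even generous to the manual fix, since it assumes you still know which rows held C and A.

```python
import sqlite3

# Hypothetical schema mirroring the Main/Accessory example.
SCHEMA = """
CREATE TABLE main_t (id INTEGER PRIMARY KEY, value TEXT);
CREATE TABLE accessory_t (main_id INTEGER PRIMARY KEY, value TEXT);
INSERT INTO main_t VALUES (1, 'B'), (2, 'C'), (3, 'A');
INSERT INTO accessory_t VALUES (1, 'G'), (2, 'K'), (3, 'A');
CREATE TRIGGER main_to_accessory
AFTER UPDATE OF value ON main_t
FOR EACH ROW WHEN NEW.value = 'B'
BEGIN
    UPDATE accessory_t SET value = 'F' WHERE main_id = NEW.id;
END;
"""


def snapshot(conn):
    """Return (Main.value, Accessory.value) pairs ordered by row id."""
    return conn.execute(
        "SELECT m.value, a.value FROM main_t m "
        "JOIN accessory_t a ON a.main_id = m.id ORDER BY m.id"
    ).fetchall()


def run_example():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    before = snapshot(conn)

    # The wrong update: every row becomes B and the trigger fires on each.
    conn.execute("UPDATE main_t SET value = 'B'")
    after_bad_update = snapshot(conn)

    # Manual "rollback" of Main only; the trigger does not fire for C or A,
    # so Accessory keeps the value F everywhere.
    conn.execute("UPDATE main_t SET value = 'C' WHERE id = 2")
    conn.execute("UPDATE main_t SET value = 'A' WHERE id = 3")
    after_manual_fix = snapshot(conn)

    conn.close()
    return before, after_bad_update, after_manual_fix


if __name__ == "__main__":
    for label, rows in zip(("before", "bad update", "manual fix"), run_example()):
        print(label, rows)
```

After the manual fix Main looks right again, but the original G, K and A in Accessory are gone, which is exactly why the backup is the only reliable way back.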
Although outages like database, OS or hardware failures can completely stop you from doing anything, there are ways of preventing small or localised outages by generalising your production pipelines.
If your programs are independent of the filesystem structure (no hard-coded directories), the network structure (no hard-coded IP addresses or server names) and the cluster configuration (no hard-coded number of nodes), and you can change those values quite easily (via the command line or a configuration file), temporary localised outages can normally be worked around easily.
If you have a filesystem outage, just running the command on a different directory is not that simple, as you have probably written scripts to help you with the boring task of running production procedures. Depending on the complexity of your pipeline, you can just copy and paste the commands from within the scripts and run them, but if your production is large and has lots of steps it'll be a complete nightmare to do so.
The solution is to have a set of global variables, either environment variables or values in a configuration file, that can be easily changed and put into production. You can then change the global file, run all your programs and roll back when the filesystem is fixed. That won't be so simple, however, if some of your programs are not using the same variables (i.e. if at least one variable is hard-coded in at least one program).
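A minimal sketch of this idea in Python, assuming the globals live in environment variables. The variable names (`PIPELINE_DATA_DIR`, `PIPELINE_DB_HOST`, `PIPELINE_CLUSTER_NODES`) and the default values are hypothetical and only illustrate the pattern.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    data_dir: str        # no hard-coded directories in the programs
    db_host: str         # no hard-coded server names
    cluster_nodes: int   # no hard-coded number of nodes


def load_config(env=os.environ):
    """Build the configuration from a mapping of variables (hypothetical names)."""
    return PipelineConfig(
        data_dir=env.get("PIPELINE_DATA_DIR", "/data/production"),
        db_host=env.get("PIPELINE_DB_HOST", "localhost"),
        cluster_nodes=int(env.get("PIPELINE_CLUSTER_NODES", "1")),
    )


if __name__ == "__main__":
    # During a filesystem outage, only the environment changes:
    #   PIPELINE_DATA_DIR=/mnt/spare python pipeline_step.py
    print(load_config())
```

Every program in the pipeline would call the same `load_config()`, so redirecting production to a spare directory, and rolling back later, means changing one value in one place.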
The same applies to other outages: if your server died and it's easy to set up a temporary one (or you have a staging one), just switch to it; if the network is down and you can run on a single machine, change the configuration accordingly. If your cluster grows and you can use more nodes, the number of nodes should be a global variable too, so all your programs can use the same value.
It doesn't matter what you use to centralise your configuration as long as all programs use the same infrastructure. You can store it in text files, databases or even a configuration server. Another possibility is to have one single configuration file and write a few converters to different formats, such as shell and Perl scripts, Java properties and INI files, so all your programs can use the same set of variables no matter which language they're written in. That will save you from writing a configuration manager for each language and provide you with a single source for all values, be it a text file in revision control or a table in a database.
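The converter idea could look roughly like the following Python sketch. It assumes the single source has already been loaded into a flat dictionary of simple values; the keys, the `pipeline` section name and the function names are invented for the example, and escaping of unusual characters in Java properties or INI values is deliberately left out.

```python
import shlex


def to_shell(cfg):
    """Render the settings as shell 'export' lines, quoting values safely."""
    return "\n".join(
        "export %s=%s" % (key.upper(), shlex.quote(str(value)))
        for key, value in cfg.items()
    )


def to_properties(cfg):
    """Render the settings as Java-style key=value lines (simple values only)."""
    return "\n".join("%s=%s" % (key, value) for key, value in cfg.items())


def to_ini(cfg, section="pipeline"):
    """Render the settings as one INI section (the section name is an assumption)."""
    lines = ["[%s]" % section]
    lines.extend("%s = %s" % (key, value) for key, value in cfg.items())
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical single source of truth, e.g. parsed from one text file.
    config = {"data_dir": "/data/production", "cluster_nodes": 8}
    print(to_shell(config))
    print(to_properties(config))
    print(to_ini(config))
```

Generating every format from the same dictionary keeps the values in step; a Perl converter would follow the same pattern.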