It's all about data. All software is written to consume, manipulate or produce data, so efficient programs mean efficient data manipulation. Most of the time it's enough to apply efficient algorithms to the data to get a good program: for instance, a program that uses quicksort will be much more efficient than one that uses bubble sort if both spend most of their time sorting lists. However, when the amount of data is huge, even very efficient algorithms, such as those that run in O(log N) time (where N is the number of items), still take considerable time.
For example, searching for an item among 5 million with a binary search takes about log2(5,000,000) ≈ 22.25 iterations, so looking up 200,000 entries in a 5-million-entry database takes about 4.5 million iterations. Normally that magnitude is still small compared to other time-consuming operations like I/O (reading and writing). But in reality it's quite rare to find algorithms as efficient as binary search: most common algorithms run in linear, N·log(N), N² time or worse, and can easily take days, sometimes weeks, to run. I/O is also an issue: although its cost normally grows linearly, each operation is very slow, so minimizing I/O is sometimes more important than optimizing algorithms. The bottom line is simple: the less you need to read or write, the faster your program will be.
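The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes one comparison per binary-search step and ignores everything else a real lookup does. The function and variable names are made up for the illustration.

```python
import math


def binary_search_steps(n_items: int) -> float:
    """Approximate number of steps for one binary search over n_items."""
    return math.log2(n_items)


database_size = 5_000_000  # entries in the database
lookups = 200_000          # entries we want to find

steps_per_lookup = binary_search_steps(database_size)
total_steps = lookups * steps_per_lookup

print(f"{steps_per_lookup:.2f} steps per lookup")          # 22.25 steps per lookup
print(f"{total_steps / 1e6:.2f} million steps in total")   # 4.45 million steps in total
```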
The first optimization, then, is to cut unnecessary I/O by avoiding writing too much to disk, reducing inter-process communication and keeping in memory only the minimum the program needs to run. The most common techniques are using pipes or FIFOs instead of temporary files, parallelizing programs as independently as you can, and streamlining memory usage. After that is done, reducing the complexity of your programs is the best shot. In simpler programs it's easier to spot bottlenecks and to change algorithms when you find an inefficiency.
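As a rough illustration of the "pipes instead of temporary files" advice, the sketch below connects two processes directly with a pipe, so nothing is written to disk in between. The producer and consumer are tiny hypothetical one-liners invented for this example; real production steps would take their place.

```python
import subprocess
import sys

# Hypothetical producer: prints the numbers 0..9, one per line.
producer_code = "for i in range(10): print(i)"
# Hypothetical consumer: sums the numbers it reads from standard input.
consumer_code = "import sys; print(sum(int(line) for line in sys.stdin))"

producer = subprocess.Popen(
    [sys.executable, "-c", producer_code],
    stdout=subprocess.PIPE,
)
consumer = subprocess.Popen(
    [sys.executable, "-c", consumer_code],
    stdin=producer.stdout,   # the consumer reads straight from the producer
    stdout=subprocess.PIPE,
    text=True,
)
# Close our copy of the pipe so the producer notices if the consumer exits early.
producer.stdout.close()

result, _ = consumer.communicate()
producer.wait()
print(result.strip())  # 45
```

The consumer starts working as soon as the first line arrives, and there is no intermediate file to create, check or clean up.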
After each optimization you can guarantee that your program will run faster for the same amount of data, but what happens if your data grows at an exponential rate? Optimizations have limits, and so does the time you have to do them, but the growth of your data seems to have none. It's not uncommon to find extremely inefficient programs in the scientific community, and quite often just a bit of thought can speed them up 10- or 20-fold, sometimes hundreds of times, but that's still not enough when the data itself keeps growing at a pace technology can't compete with.
Bioinformatics is facing a big inefficiency today that could easily be optimized hundreds of times over. Still, with the huge growth of the data and the exponential growth of the connections between datasets, it will become impossible to continue over the next few years without a major change in how data is manipulated on a global scale.
As said earlier, most biological databases are exposed in full to the general public using lots of different flat-file formats. There are, however, some systems where you can search for a particular subset and get the files for that subset only. Funnily enough, most internal processes all over the globe still use the huge flat-files.
Not only are the flat-files kept for historical reasons, but the whole scheme is built on their idiosyncrasies, so banal things like a particular (non-alphabetical) order and the tokens used are critical and must be carried over to the database or to the numerous programs written to deal with the files. Whenever you change or get rid of one of them, all the rest stops working.
Databases can return results in almost any order you want, but the order of information should never change its quality. Tokens, separators and line breaks are all flat-file concepts; database engines deal with them at the schema level (tables, columns, rows, graphs, objects, lists, etc.), which, again, means using standards to avoid reinventing the wheel while achieving performance and consistency.
Scientists really don't care if one piece of information comes before or after another as long as it's there. They don't care if you use a colon or a comma to separate it as long as it's separated. Actually, scientists would very much prefer to have the information in a graphical display, or at least in a key=value layout, so the only ones that really care about order and tokens are the programs that read the flat-files.
It's said in non-technical circles that "computers came to solve problems that they created themselves", and when software quality is low they're absolutely right. Professional software engineers know that good software can solve lots of problems without creating many more, but that was not the rule for the decades when standards were being created.
It's not uncommon to see major failures of software upgrades; the recent refactoring of the NHS's infrastructure software was a good example. But they had chosen to upgrade precisely because the cost of maintenance was already too high and the possibilities of extending the software were getting fewer and fewer. Migration failures have nothing to do with the necessity of migration. You are absolutely not better off staying on legacy software out of fear of migration: incompetence in migrating is as bad as incompetence in not migrating when it's needed.
Understandably, most scientists today who deal with gene or protein data are used to reading the complicated flat-files, and they quite like them. When they're redirected to websites, some just can't find the information they need at all and end up requesting a flat-file version of the entries to read. That is probably the biggest excuse for why things don't change, but the real reasons are the ones mentioned so far.
It happens in industry as well: whenever a new system is developed the old one still lingers for a while, sometimes for far too long. Some users are keener on changing and renewing, others end up using the same legacy system for the rest of their lives. Users have their role in defining development paths but they should never be used as an excuse to reduce software quality. They can learn new interfaces, and they'll like them if the results are better, if the interface is easier to use and if it's much faster than the previous one. As those qualities are tied to software quality, changing the user interface is a fair price to pay to achieve it.
Unlike users, data does not care at all what the interface is, and programs can be taught whatever interface you like: as long as it remains consistent, it'll work. That's the reason why legacy software can keep running for decades after it becomes obsolete, but it's also the reason why there is nothing besides work blocking software evolution.
For website software it doesn't matter if the data is in a flat-file or in a database: it'll read it and present the user with a webpage. It doesn't matter if the database is relational, object-oriented or a graph-like structure; the only requirement is to have a proper way of reading from and writing to the storage engine. So, from the data's perspective it really doesn't matter in what format, engine, operating system or machine architecture it's stored, as long as it works.
For algorithms it's a different matter, though. It's much easier to perform a numerical or alphabetical search in a binary tree than in an unordered list, or to establish connections between different concepts in a generic graph than in a fixed-structure tree, and so on. So the real decision of which data format and which storage technology to use depends exclusively on what you want to do with your data. Furthermore, whatever the format, it's essential to have an efficient storage engine to be able to retrieve only the relevant data.
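To make the point about structure concrete, here is a small sketch that counts the steps needed to find the same item in an unordered list and in ordered data. A sorted Python list searched by bisection stands in for the binary tree mentioned above (that substitution is a simplification for the example), and the step counters are only there for illustration.

```python
def linear_search(items, target):
    """Scan a list front to back; return (index, steps taken)."""
    steps = 0
    for index, value in enumerate(items):
        steps += 1
        if value == target:
            return index, steps
    return -1, steps


def binary_search(sorted_items, target):
    """Halve the search range at every step; return (index, steps taken)."""
    low, high, steps = 0, len(sorted_items) - 1, 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1, steps


items = list(range(1_000_000))
target = 999_999  # worst case for the front-to-back scan

linear_index, linear_steps = linear_search(items, target)
binary_index, binary_steps = binary_search(items, target)
print(f"linear scan: {linear_steps} steps, binary search: {binary_steps} steps")
```

The data is identical in both cases; only the way it is organized, and therefore the algorithm you can run on it, changes.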
Providing complete flat-files to your users comes at a very high price in several forms: you'll need to keep track of releases, provide the indexes and accessory information in separate files, synchronize all copies (which increases the complexity of production systems) and, worst of all, force your users to download everything when they normally want only one or two entries.
To help users download only a subset of the data, the common approach is to make the hack live longer: start creating smaller files that don't exactly cover the needs of all users. The list of files grows bigger every time, until producing them takes much longer than the actual internal work in the database.
If, instead of all that hassle, users were instructed to use websites and web services to get the data, none of those files would be necessary. You could even create pre-formatted searches on the website for those special users, so every time there is an update, or whenever users want the data, they just need to run the search again and get the new results. Adding saved searches or a list of previous searches is not difficult, and that's the kind of thing you start thinking about when you stop trying to solve the big flat-file problem and start solving the real problem: users need their own particular data, nothing else.
Not exactly funny, but the internal procedures get the same treatment as the external users. Because the flat-files are treated as the source of all truth, internal procedures normally take only them as input for their computations. They do everything all over again: identify the entries that have changed, recalculate whatever they need and, most definitely, read a load of useless stuff just to get a few bytes of information.
Not to mention that errors can happen with hardware, filesystems, new programs or even old programs in new environments and, unlike storage engines, files accept whatever you want to write to them, so running the same syntax checks after every single procedure that changes the files is compulsory. Because the information comes from different places, the same information can be duplicated in different files, so duplicate checks must also run before any data is marked as checked. Add these steps to the splitting and cutting already mentioned and it's easy to see why flat-file production is really that much slower than complex data-mining algorithms.
Users are better off with smaller customized result sets or intelligent integrators, the amount of data is increasing exponentially, internal procedures repeat the same computation over and over again, and full runs every time obviously do not scale at all. The scheme must change once and for all.
An incremental run will definitely scale much better. The number of changes grows much more slowly than the total amount of data (in the same way that, for large values, the derivative of a function is often much smaller than its value), so procedures that handled a few thousand entries in the past and now handle a few million can go back to handling only a few thousand again.
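A minimal sketch of an incremental run, assuming entries are available as an id-to-text mapping and that a fingerprint of each entry is remembered from the previous run. All names here are hypothetical, and hashing every entry is itself a simplification: a real storage engine would more likely record what changed, so that unchanged entries are never read at all.

```python
import hashlib


def fingerprint(entry_text: str) -> str:
    """Stable digest of an entry's content, used to detect changes."""
    return hashlib.sha256(entry_text.encode("utf-8")).hexdigest()


def incremental_run(entries: dict, previous: dict, process) -> list:
    """Process only new or changed entries; update `previous` in place.

    entries:  entry id -> current text
    previous: entry id -> fingerprint recorded by the last run
    process:  callable(entry_id, text) doing the expensive work
    Returns the ids that were actually processed.
    """
    processed = []
    for entry_id, text in entries.items():
        digest = fingerprint(text)
        if previous.get(entry_id) == digest:
            continue  # unchanged since the last run, skip the expensive work
        process(entry_id, text)
        previous[entry_id] = digest
        processed.append(entry_id)
    return processed


# Usage with made-up entries: the second run touches only what changed.
entries = {"E1": "alpha", "E2": "beta", "E3": "gamma"}
seen = {}
first_run = incremental_run(entries, seen, lambda eid, text: None)
entries["E2"] = "beta, revised"
second_run = incremental_run(entries, seen, lambda eid, text: None)
print(first_run)   # ['E1', 'E2', 'E3']
print(second_run)  # ['E2']
```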
Changes in internal procedures can be very painful, because one little change can propagate to disastrous proportions in later procedures or even in later runs, so reducing the complexity of such production systems must be done gradually. With today's better computers, it's easy to show that the incremental approach will free up plenty of time to do it properly.
Together with incremental runs, I/O optimizations and parallelization are very welcome, but the big problem is the concept of a release. The release concept appeared because programs were written by different people and one process needed to run after another to complete the whole production procedure. While the dataset was small there was no problem.
But with the amount of data available today, and with the number of different checks and lists there are to make, it's impossible for a change in the database to go public in less than a month after it was made. It's not the time needed to run all the procedures: the whole cycle is tied together and the strong human factor just makes it slower, so sometimes you'll only see your change go public half a year later.
Again, data does not care if it's handled in bundles or independently, as long as it's handled correctly. That's the reason why an on-demand pipeline is no worse than incremental updates, and there are also good reasons why it's actually better.
The pipeline is quite an old concept, first shown to be effective in the 18th century during the industrial revolution. It's true that it has the problem of standardizing goods too much, ignoring personal preferences and cultural differences, but for massive-scale production it's the only thing that has worked so far. Nowadays you'll find plenty of post-production customization of cars (check the long tail theory), so the biggest problem of pipelines has its solutions as well.
The shock pipelines brought to craftsmen was that no worker knew about the others' work, so the worker no longer had control over the quality of the final product. Also, a worker would never again receive a block of clay and turn it into a painted, fired teapot: if they received clay they would make the pot, someone else would attach the handle, others would add the lid, others would paint it, fire it, sell it and so on. Each worker received just one thing and produced one other thing but, most importantly, they received one at a time.
It's a waste of space and time to keep a bundle of cups ready, waiting for the other cups to be made, and since you can normally only make one cup at a time, why bother? Just use a conveyor belt to deliver clay blocks to you, one by one, and put the cups on another belt as they're finished.
In computing, pipelines are used in processors to perform all the operations needed on a set of data. Because of this design, several optimizations are possible by putting more than one dataset in the pipeline at the same time, each at a different instruction, one after the other, exactly like the conveyor belt. The most notable feature of such pipelines is that it's trivial to scale each part independently.
For example, it's much easier to cut the clay into blocks than to produce cups from them, so having one cutter and four cuppers would do the job if you have four conveyor belts from the cutters to the cuppers. You can have as many belts as you want from whatever point to whatever other point, as long as the flow remains balanced.
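The cutter-and-cuppers arrangement can be sketched with threads and queues playing the role of workers and conveyor belts. This is only an illustration under simplifying assumptions: a single shared belt feeds all four cuppers (instead of one belt each), the "work" is a string replacement, and every name is invented for the example.

```python
import queue
import threading

STOP = None  # marker placed on the belt to tell a cupper to go home


def cutter(n_blocks: int, blocks: queue.Queue) -> None:
    """Fast stage: cut clay into blocks and put them on the belt one by one."""
    for i in range(n_blocks):
        blocks.put(f"block-{i}")


def cupper(blocks: queue.Queue, cups: queue.Queue) -> None:
    """Slow stage: take one block at a time and turn it into a cup."""
    while True:
        block = blocks.get()
        if block is STOP:
            break
        cups.put(block.replace("block", "cup"))


def run_pipeline(n_blocks: int, n_cuppers: int = 4) -> list:
    blocks = queue.Queue(maxsize=8)  # a short belt: the cutter waits when it is full
    cups = queue.Queue()
    workers = [
        threading.Thread(target=cupper, args=(blocks, cups))
        for _ in range(n_cuppers)
    ]
    for worker in workers:
        worker.start()
    cutter(n_blocks, blocks)
    for _ in workers:
        blocks.put(STOP)  # one stop marker per cupper
    for worker in workers:
        worker.join()
    return [cups.get() for _ in range(n_blocks)]


finished_cups = run_pipeline(20)
print(len(finished_cups))  # 20
```

Scaling the slow stage is a matter of changing `n_cuppers`; the cutter and the belts do not need to know.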
Applying the same concept to software, it's easy to show that this solution scales much better than the incremental approach as long as some rules are followed. It is not trivial to implement such pipelines but the solution is the same for all procedures so once you've solved for one you've solved for all. You can even use the same infrastructure for all parts, treated as black boxes, and whenever you fix problems for one part you'll have fixed for all of them.
There are a few ways of doing inter-process communication (IPC), but the most generic is through sockets, either network or Unix sockets. Internally, the best way to store the list of datasets to deal with is using queues, although other approaches, like Apache's pre-fork standby, are also very efficient architectures. But since you'll only receive one dataset at a time, you won't be able to rely on other datasets in the queue (otherwise you go back to incremental updates), so you must have your data in an efficient and parallel storage engine.
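As a rough sketch of socket-based communication between two pipeline stages, the example below sends datasets one at a time over a connected socket pair and reads back one result per dataset. To stay self-contained it runs the worker in a thread rather than a separate process, uses a newline-delimited text protocol, and does placeholder "work" (upper-casing); all of these are assumptions made for the illustration.

```python
import socket
import threading


def worker(conn: socket.socket) -> None:
    """Receive one dataset per line and answer with one result per line."""
    with conn:
        with conn.makefile("r", encoding="utf-8") as reader:
            with conn.makefile("w", encoding="utf-8") as writer:
                for line in reader:
                    dataset = line.rstrip("\n")
                    writer.write(dataset.upper() + "\n")  # placeholder work
                    writer.flush()


def process_one_at_a_time(datasets: list) -> list:
    """Feed datasets to the worker one by one and collect the results."""
    parent, child = socket.socketpair()
    thread = threading.Thread(target=worker, args=(child,))
    thread.start()
    results = []
    with parent:
        with parent.makefile("r", encoding="utf-8") as reader:
            with parent.makefile("w", encoding="utf-8") as writer:
                for dataset in datasets:
                    writer.write(dataset + "\n")
                    writer.flush()
                    results.append(reader.readline().rstrip("\n"))
    thread.join()  # the worker stops when it sees the closed connection
    return results


results = process_one_at_a_time(["entry a", "entry b", "entry c"])
print(results)  # ['ENTRY A', 'ENTRY B', 'ENTRY C']
```

The worker never sees more than one dataset at a time, which is exactly why anything else it needs has to come from the storage engine rather than from its neighbours in the queue.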
To achieve a good pipeline infrastructure you need to ensure:
The same on-demand concept also applies to the output. Building huge release files is only necessary when programs use the whole dataset to run, but once they're incrementally updated or pipelined there is no need for release files at all. Users, be they humans or programs, only need a subset of the data most of the time, especially when they store the previous data themselves in storage engines or, much better, don't store it at all and rely on efficient web services to grab data on demand.
Moreover, with an on-demand infrastructure users won't need to download all files from all places and build programs to read them all. Whenever you provide online services like the DAS Server, which integrates data from a lot of different databases, a user just needs to start searching, saving their searches for future use. Therefore, what is really needed is to focus on a very good search engine (such work is already in progress for the new UniProt website, for example) instead of providing the raw data to users. After all, raw data increases the redundancy and duplication of data across the internet, which would be (and already is) a nightmare when it comes to controlling all the releases and versions available.
It is true that there are an awful lot of users with a huge code base relying on flat-files, and it would be painful for them to convert all that software to an on-demand basis. But, as said earlier, neither users nor data need the whole dataset every time. It's just work… long and painful work, but doable.
Whenever data is on demand, for input and output, there is no need for releases. History can be retrieved on a per-entry basis and the links between different databases are as simple as pointing to the right entry on the remote site.
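Per-entry history could look something like the sketch below: every entry keeps its own list of versions, so there is no global release number to track. The class, its methods and the entry ids are all hypothetical, and an in-memory dictionary stands in for a real storage engine.

```python
from typing import Dict, List, Optional


class EntryStore:
    """Keeps every version of every entry, addressed by entry id."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}

    def save(self, entry_id: str, content: str) -> int:
        """Store a new version of an entry and return its version number."""
        versions = self._versions.setdefault(entry_id, [])
        versions.append(content)
        return len(versions)

    def fetch(self, entry_id: str, version: Optional[int] = None) -> str:
        """Return the latest version, or a specific one (numbered from 1)."""
        versions = self._versions[entry_id]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{entry_id} has no version {version}")
        return versions[version - 1]

    def history(self, entry_id: str) -> List[str]:
        """Return all versions of one entry, oldest first."""
        return list(self._versions[entry_id])


store = EntryStore()
store.save("ENTRY1", "first annotation")
store.save("ENTRY1", "corrected annotation")
store.save("ENTRY2", "another entry")

print(store.fetch("ENTRY1"))             # corrected annotation
print(store.fetch("ENTRY1", version=1))  # first annotation
print(len(store.history("ENTRY2")))      # 1
```

Each entry moves forward at its own pace, and a link from another database only needs the entry id (plus a version number if it wants a fixed point in time).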