 

Standards

When you need to store a large amount of very complex data, with thousands of relationships between its parts, and you constantly need to search it from different points of view, retrieving a different set of data each time, what kind of storage engine would you use?

Back in '86, when the protein database Swiss-Prot was created, relational databases were still mostly academic, so the decision was taken to create a database management system from scratch. The idea was to store every piece of information available in the world about every known protein, and to keep curating and improving it once the database was up and running. And that was done.

Heroically, I must admit, they stretched their computing skills to build a flat-file system in which all the information was stored in a single huge file. You had to build the indexes yourself, save them in dozens of other files and, of course, read them back every time you wanted a single piece of information.

At that time I was only a child, and I did the same thing with TV shows for my mum. I built a flat-file system to enter TV shows and later search for them ordered by any field. The solution was pretty much the same: I just used ** as the record separator instead of //, but the rest was similar. I learned a lot about data integrity, the complexity of indexes and especially that, as data grows, storing everything in one single file simply does not scale. Later on I found out about relational databases like Ingres, DB2, dBASE and later Oracle, MySQL and PostgreSQL.
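
To make the flat-file idea concrete, here is a minimal sketch in Python of that kind of system: records in one text blob, terminated by //, plus an index built by hand for one field. The two-letter field codes (ID, TI, YR) and the sample shows are invented for illustration; they are not the real Swiss-Prot layout nor my old TV show format.

```python
from collections import defaultdict

# Hypothetical flat file: "KEY value" lines, each record terminated by "//".
FLAT_FILE = """\
ID show-1
TI Some Show
YR 1986
//
ID show-2
TI Another Show
YR 1987
//
ID show-3
TI Third Show
YR 1986
//
"""

def parse_records(text):
    """Split the flat file into records, each a dict of field -> value."""
    records, current = [], {}
    for line in text.splitlines():
        if line.strip() == "//":          # record separator
            if current:
                records.append(current)
            current = {}
        elif line.strip():
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return records

def build_index(records, field):
    """Map each value of `field` to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        if field in record:
            index[record[field]].append(position)
    return dict(index)

records = parse_records(FLAT_FILE)
year_index = build_index(records, "YR")
titles_1986 = [records[i]["TI"] for i in year_index["1986"]]
```

Every new search field needs another index like year_index, and every index has to be rebuilt or patched whenever the file changes, which is exactly the bookkeeping a database engine does for you.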

I was amazed by how they had thought of everything. They had papers and papers about every single detail, they had the best programmers working full time to make it better than all the others, and they did it right. If I had continued with my TV show search engine I would probably have moved completely to some sort of database system, but I was only a child.

If you check Swiss-Prot you'll see that it's still the same flat-file system, with all its legacy structures and indexes, in one huge chunk that takes ages to search for a single piece of data that would take milliseconds to find in an indexed table on any modern database system available today, even the free ones. I had the opportunity to rewrite all the indexers, splitters and search engines for Swiss-Prot and TrEMBL. It was really fun to do it all over again but, let's be realistic, why do we still spend time on it?

I'm absolutely sure that I made lots of horrid mistakes, and my indexer is certainly much worse than the first version of dBASE from decades ago, so in the end we would be better off using dBASE instead of my programs! Why not use standards? Why re-invent the broken wheel? Honestly, I don't know.

Core business

Suppose you're a physicist and you have an idea for a start-up: wireless network cards using teleportation of electrons. You build the first prototype and it works, excellent! Now you've got your funding and need to start making the boards to sell. You're the owner but, face it, you're a physicist, you can't run a company all by yourself! You'll need people to manufacture the boards, take care of the money, sell the boards, deliver them to the shops, fix them later if they have problems, and so on. Would you hire only physicists for all those jobs?

OK, now you've got your company, with professionals from all areas, and the sales manager is going nuts: he can't remember all his clients, so he asks you for a small database to store them. Would you hire programmers to write it or buy some cheap software from the nearest tech shop? And what if your CFO asked for a financial calculator, would you hire programmers to write that too instead of getting a cheap HP?

Why is the answer always the same? Because your company does not build software, it builds network cards! The core business of your company is hardware, more specifically network cards, even more specifically quantum network cards, and NOT software.

There isn't anything wrong with having a few programmers on board for emergency fixes, of course. If you need an extension to that cheap database that the vendor cannot produce in time (or at all), you may use a few coders to make it. You may need software to test your cards, or to simulate them to reduce the cost of new developments. There are lots of places for programmers in any company today, but never, ever, to spend time rewriting programs that are already available on the market (or available for scientific use).

In a nutshell, whenever you write commodity software in-house you end up with two problems: first, you'll just waste your time and money, and second, the result will most definitely be much worse than the original.

Software evolution

You don't need to go far to see that technology evolves much faster than most things you know. In the last century, humanity invented more gadgets than in all the years before (millions), and more will come this century. Software is no exception to this rule.

Imagine yourself today still using a PC XT (one of the first versions of the IBM PC) running PC-DOS and able to run only one program at a time. Browsing the internet, receiving email and chatting on a messenger all at once was not an option, not to mention playing 3D games or embedding images and video in your documents.

If you've been using computers for a while you have probably changed your computer a few times, and your operating system a few dozen times, even when keeping the same machine. It's painful: you always lose something along the way, everything changes, programs cease to exist while others appear from nowhere, but still you do it, don't you? You face the challenge and, better, you don't even mind losing a few things just to get your hands on the next generation of operating systems. This is the new culture of technology, because it evolves so fast.

There are hundreds of developers at Oracle or MySQL AB earning their salaries by improving and securing their products to unprecedented levels, because that's their core business. They have to spend their time building database storage engines, indexing algorithms, replication, distribution, search, update and so on. That's their job.

So, here's the punchline: why didn't Swiss-Prot evolve as much as all the other technologies around it during the last 25 years? Because they were all busy building storage engines, and still are.

De facto standards

Standards are not only for software products: network technologies, algorithms, programming languages, APIs, almost everything has an international body that regulates its standards, and that people should follow to increase quality. The web has the W3C, network protocols are regulated by the IETF, mobile developments follow the OMA, and so on.

Why invent some weird rendering language to display documents if you can use HTML? Why write your own server to store emails when there are lots of open (and closed) programs that implement IMAP, POP and SMTP much more efficiently than you could ever imagine? Why write a class called HashMap in C++ when there are much better implementations of maps in the STL? (I confess, I did it.)

There are other standards that do not come from international organizations but built their reputation on quality over time. Some examples are the Boost library for C++ and the Gtk and Qt toolkits for desktop development. There isn't any organization that endorses them, but people still use them when they need a bit more than the basics.

The developers of the Boost library are not only the best, they are also the best team. Their work is almost flawless, they think of every detail, and whenever something isn't exactly perfect they accept criticism from anyone and, if it's for the best, they change accordingly. Someone alone in a laboratory simply can't compete with such perfection, with such a collaborative environment of the best programmers around, who have thought about those problems for years.

I'm not saying that there aren't better programmers than them, in Boost or behind any other standard. The fact is that, for those programmers, that's their core business: it's what they do best and what they spend many years of their lives on.

Third-party software

Standards are built with time, need and great effort. Not everything is standardized, and not everything needs to be, but for the things that really need it there will be a period with lots of competing solutions until one (or more) of them becomes the standard, either by consortium definition or by market preference. So there will always be a time, while the standards are being built or the market is still deciding what to use (VHS vs. Betamax, CD vs. MiniDisc, HD DVD vs. Blu-ray, etc.), when you'll have to rely on non-standard third-party software.

There is also a good chance that no third-party software will meet your needs and the only way is to do it yourself. That doesn't contradict the core business approach; in the industry it's sometimes called strategic development. It is a bit weird to see biologists developing databases or banks developing instant messaging systems, but the fact is that at the time they did it there was nothing like it, and it gave them a great advantage in their core business.

But, as I said before, software evolves and, especially when it's not your core business, your own solution will not. Third-party solutions will eventually evolve, but the risk that they won't become the next standard is usually greater than the chance that they will (normally there are two or more solutions in the market). So you need to come up with a migration strategy whenever you rely on in-house or third-party solutions.

Sometimes the migration is painful, but it must be made at some point, and the sooner the better. Imagine if you had chosen Betamax for your internal CCTV recordings: at some point you would have had to change to VHS. Betamax tapes stopped selling long before VHS tapes did and, even though they were of much superior quality, each tape still survives only a limited number of recordings. With software it's a bit more complicated because it doesn't depend on the medium it runs on, so the reason to migrate to newer standards is generally performance, connectivity and stability.

Legacy systems and upgrades

Legacy systems exist almost everywhere software runs, and sometimes stability is thought of as not upgrading. There is a fuzzy threshold, depending on many variables, that defines when to migrate, and in industry it normally comes down to money. Whenever the cost of maintenance is higher than the return, they either retire the system or migrate to a newer standard. But again, academia is quite different.

Maintenance cost is also a problem there, of course, but there is no financial return to set the threshold for when to switch, so it normally becomes a scientific decision instead of a software quality decision: while the results are acceptable the software will keep running, and the scheme will be kept forever.

There must be clear thresholds, defined by software engineers, for when systems should be written in-house, bought from third parties or based on standards, and for whether the cost of maintenance is acceptable or not. It is software engineers who maintain the software, so they are the ones who know how long the whole structure can hold given the amount of data predicted, and what future changes can be made to the current code base.

Furthermore, software engineers should always focus on the future, avoiding quick hacks, predicting failure scenarios and being prepared for them well before they happen. Also, whenever standards are set, the sooner you migrate to them the better. Because new standards take an awful lot of care not to conflict with previous ones, if you followed the past standards the migration will be much quicker and easier.

Unique identifiers

Almost every table in a database needs a primary key to index the data. It's a code (usually a sequential number) that uniquely identifies each row, so you can reference the exact row you want from other tables without searching the whole table. Because the primary key is ordered, you can always perform an ultra-fast binary search on it (no more than about log2(n) comparisons per search), and that's the big magic of having those keys.
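
A rough sketch of that lookup in Python, assuming the keys are kept in a sorted list (real database engines use more elaborate on-disk structures, but the logarithmic behaviour is the point here). The function names and the sample rows are made up for the example.

```python
from bisect import bisect_left

def find_row(keys, rows, wanted):
    """Return the row whose primary key is `wanted`, or None if absent."""
    pos = bisect_left(keys, wanted)       # binary search on the sorted keys
    if pos < len(keys) and keys[pos] == wanted:
        return rows[pos]
    return None

def count_probes(keys, wanted):
    """Hand-written binary search that also counts the comparisons made."""
    lo, hi, probes = 0, len(keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if keys[mid] < wanted:
            lo = mid + 1
        else:
            hi = mid
    return lo, probes

keys = list(range(1, 1_000_001))          # one million sequential primary keys
rows = [f"row-{k}" for k in keys]

row = find_row(keys, rows, 424_242)
position, probes = count_probes(keys, 424_242)
```

With a million keys the loop in count_probes never needs more than 20 comparisons, while a scan of an unindexed flat file may have to read all million entries.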

In bioinformatics, even when databases are used, data is exchanged by generating huge flat files that need to be read in full by the third party, and which therefore need a primary key as well. Normally those files have well-defined indexes or accession numbers. They're called unique identifiers and are supposed to uniquely identify each row (a.k.a. entry).

But primary keys in databases are per-table objects and shouldn't even be exposed to the general public, because they have no meaning at all: a key is just an index. The database uses the index internally and only exposes the data required by the query, so in that case there is no need for a public unique identifier at all.

Unfortunately, even today, when most biological data is stored in databases, they still use the flat-file concept for data exchange and thus need to keep the unique identifier. The big problem with the unique identifier is that there is no way to have a standard repository and still be agile. If every biological database holding information about proteins or DNA needed to contact a central repository to ask for a new identifier, it would take hours or even days until they could insert anything new, and that clearly does not scale at all.

What happened over all these years was the creation of countless ad-hoc unique identifiers that couldn't even be unique within their own data set, nor identify a single entity over time. Some databases even have more than one non-unique non-identifier, to complicate things a bit. But sharing data without any kind of identifier would be a killer, because there would be no way of identifying the same information across different databases.

Unique identifiers must be unique and identify the same thing in both space and time, which means that you should never change your identifier, ever. But what if you make a mistake? What if you create several things and in the end figure out that they were all the same thing? Because science is an evolving process, mistakes can happen at any stage.

It's not easy to solve this problem, and adding more bureaucracy can kill the whole process, but there simply is no way of assuring uniqueness and consistency when you have no control over the data. It gets worse when the data is not even compatible: you cannot make any assumption whatsoever about what to expect from the remote database, so you must write one rule for every connection between your database and the others (cross-references), and every time a unique identifier changes in the remote database it needs to be changed in *all* connected databases.

Universal data

The unique identifier problem is practically unsolvable. Even though you can come up with several solutions, and with international organizations to help, they'll just add bureaucracy, inconsistencies and more time to make decisions. Taking the easy path in this case is not easier: one must change the very nature of biological data and come up with a data model that fits every single piece of data available in the world. It's clear that the more generic your data model is, the slower it'll run, and the universally generic data model is the worst case possible.

One way to achieve it is to create a table with three columns: id, key, value. You don't need any other table or constraint, but you also can't do many smart queries on it: some will return in exponential time while others are just theoretically impossible. There are other data models that solve the query problem, most notably RDF (the semantic data model), but they still don't solve the speed issues.
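
As a sketch of what the three-column table looks like in practice, here it is in an in-memory SQLite database through Python's sqlite3 module. The table name facts, the attribute names and the sample proteins are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The universally generic model: every fact is one (id, key, value) row.
conn.execute("CREATE TABLE facts (id TEXT, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("P1", "name", "protein A"),
        ("P1", "organism", "human"),
        ("P1", "length", "120"),
        ("P2", "name", "protein B"),
        ("P2", "organism", "mouse"),
        ("P2", "length", "300"),
        ("P3", "name", "protein C"),
        ("P3", "organism", "human"),
        ("P3", "length", "450"),
    ],
)

# "Names of human entries longer than 200" needs one self-join per
# attribute, and the number stored as text has to be cast back by hand.
rows = conn.execute(
    """
    SELECT n.value
    FROM facts AS o
    JOIN facts AS l ON l.id = o.id AND l.key = 'length'
    JOIN facts AS n ON n.id = o.id AND n.key = 'name'
    WHERE o.key = 'organism' AND o.value = 'human'
      AND CAST(l.value AS INTEGER) > 200
    ORDER BY n.value
    """
).fetchall()
names = [r[0] for r in rows]
```

In a conventional table with name, organism and length columns the same question is a single WHERE clause. Here every extra attribute in the question adds another join over the whole table, which is where the speed goes.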

Using RDF would at least solve the unique identifier problem, for every object (or entry) is uniquely referred to by a URI, and because each publisher creates URIs under a domain that it alone controls, identifiers from two different databases cannot collide. It also solves the references between different databases, as you won't need to store any external information in your own database: just point to the external URI and let their database do the rest.

One thing you just can't do with RDF is try to store everything in your own database; by its very nature it just doesn't scale. Instead, you need to keep only the links and retrieve any external data on demand. Keeping data distributed is the only way to keep the search fast, because each dataset is smaller, and the identifiers unique, because each dataset controls its own uniqueness.
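
A toy sketch of that arrangement in plain Python, with no RDF library and no network: two datasets hold only their own triples, the protein dataset stores just a link to the gene, and the gene data is looked up on demand. All the URIs (under example.org), the predicate names and the REGISTRY dictionary standing in for HTTP are assumptions made for illustration; real RDF also uses URIs for predicates.

```python
# Each dataset holds only its own triples: (subject URI, predicate, object).
PROTEIN_DB = {
    ("http://proteins.example.org/entry/1", "name", "protein A"),
    ("http://proteins.example.org/entry/1", "encodedBy",
     "http://genes.example.org/gene/7"),
}
GENE_DB = {
    ("http://genes.example.org/gene/7", "symbol", "GENE7"),
    ("http://genes.example.org/gene/7", "organism", "human"),
}

# Stand-in for HTTP: maps a URI prefix to the dataset that owns it.
REGISTRY = {
    "http://proteins.example.org/": PROTEIN_DB,
    "http://genes.example.org/": GENE_DB,
}

def describe(uri):
    """Ask only the owning dataset for every predicate/object of `uri`."""
    for prefix, dataset in REGISTRY.items():
        if uri.startswith(prefix):
            return {p: o for s, p, o in dataset if s == uri}
    return {}

def follow(uri, predicate):
    """Resolve a link on demand: describe whatever `predicate` points at."""
    target = describe(uri).get(predicate)
    return describe(target) if target else {}

gene = follow("http://proteins.example.org/entry/1", "encodedBy")
```

The protein dataset never copies the gene's symbol or organism, so when the gene database corrects its own entry there is nothing to update anywhere else.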

RDF is just one example and should not be taken as the next miracle, but it does express the generality needed and in practice solves most of the common problems currently faced by bioinformatics. Furthermore, it's being actively maintained and extended by one of the most influential technology organizations, the W3C, and a huge amount of work has already been done to build storage engines, tools and libraries with reasonable performance.

It's very common for people to confuse local performance with global performance. Graph data structures (such as RDF) are locally much slower than most other data structures because the algorithms used to search for connections need to be much more generic. So the same search that takes 1 second in a database can take hours in a graph, and that's quite often the reason why people just give up on such data formats.

But what most people don't see is that, because a graph can connect anything to everything else, you can enhance (annotate) your data much more easily and clearly than in most other data formats, and it'll be universal. It means that your queries will be much slower in the beginning, but with time you'll not only get more results, you'll get better ones. It's worth waiting 10 times longer if the results are 100 times better, or if you sometimes get results that would otherwise be impossible to get (due to data structure constraints).

However, there is no guarantee that graph structures will give you better results; it clearly depends on what you do with them and what you want from them. Normal data like “employee / boss / salary” is better off in a relational database, but when you have the problems that bioinformatics has, the additional work might very well pay off, and things may even become far simpler in the future, when every database has the same data model (with a different structure).

A few good reasons

As we have seen, there are a few good reasons to follow standards. The fundamental ones are:

  1. to focus on your own core business
  2. to avoid re-inventing the broken wheel
  3. to evolve together with technology
  4. to produce scalable non-redundant data

It is possible to reach those targets without following standards, but your success will be limited in time or space or both. If every team does it their own way, they can at best reach only a local success and will have serious problems when communicating with others. If you re-invent technologies for one project you'll end up re-inventing them for subsequent projects as well, and your results will never be easily comparable with others'. If you focus too much on the accessories, you can't spend much time on what you committed to do in the first place.

Some things become standards because there wasn't anything else like them at the time they were invented, and they don't always keep that status because the architecture is still valid or effective. Normally in the industry they get replaced by newer technology, like LCD replaced CRT, DVD replaced VHS, CD replaced vinyl and so on. Normally it's worth the money spent, as the money made afterwards is huge, but in the industry the argument is never quality or just doing things right.

In academia, however, things are free (as in speech) so money is not an argument, but doing things right and increasing the quality of the available knowledge is always worth it, as it'll feed new ideas and often create new areas of research. The problem arises when two academic fields find their way together, as information systems and biology did, and the quality of one field is driven by the other.

Software should follow and create software standards, should follow and augment software quality assurance and should be implemented by software engineers and experts in information systems, independent of where and how they're used.



 
softeng/standards.txt · Last modified: 05 09 2007 19:15 (external edit)
 