Programming Languages

Choosing a programming language is not trivial. Many people learn a particular programming language and consider themselves programmers, but that's just not true. A programmer is someone who understands logic, mathematics, code and, more importantly, problem solving. Solving a problem doesn't depend on the language it's solved in, and programs shouldn't care either, but there is a catch when it comes to programs.

Some spoken languages are more compact, others more rhythmic or flexible. The same happens with programming languages: some are designed to teach, others to be very efficient, others to be expressive or to model complex data structures, and so on. Often there is more than one programming language for the same purpose. Choosing the right programming language for the right program can completely change the cost of maintenance and future development, so the choice shouldn't be based on the languages you're used to but on the ones that best match the problem.

In the beginning, bioinformaticians used to learn Perl, but now they're more inclined to learn Java, and you'll see loads of Perl/Java code for biological research. That's one of the big problems of bioinformatics: we're not teaching them the right lessons, we're just giving them the tools. I wouldn't expect anyone to prepare sushi correctly just because they have a sharp knife.

Low-level languages

When developing device drivers or very low-level libraries that deal with a huge amount of I/O and need to respond in real time, you must use low-to-mid-level programming languages. The biggest problem with low-level languages is portability, across time as well as across platforms: software and hardware evolve, so the libraries will have to change one day or another. Because of that, unless device drivers are really your core business, assembly is out of the question.

You won't lose too much performance or real-time behaviour by going to C or C++ (although C++ has some issues with I/O), and the big deal is that virtually every other language can communicate directly with C. Anything else would just be too slow to be worth it, so no, Java and Perl are out of the question.

High-Performance languages

Several programs in bioinformatics have to deal with loads of data, think for loads of hours and write to loads of databases and other files, and that takes a lot of time. Parallelization, good numerical libraries and efficient I/O are some of the most important features the standard library of a programming language must have, but there is more. Languages need room for manual optimization of the internals, as this can account for speed-ups of more than 10 times, but that comes at a cost: expressiveness. Expressive languages can be compact in lines of code, but they make it very difficult to manually optimize loops and assertions when you know the data structure better than the compiler (or interpreter).
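
As a small illustration of knowing the data better than the compiler (or interpreter), here is a minimal sketch in Python. The function names and the data are hypothetical. A generic loop has to inspect every element to count the values below a threshold; if you happen to know the list is already sorted, a binary search gives the same answer in logarithmic time. It is a sketch of the idea, not a benchmark.

```python
from bisect import bisect_left

def count_below_generic(values, threshold):
    """Works on any list: inspects every element."""
    count = 0
    for v in values:
        if v < threshold:
            count += 1
    return count

def count_below_sorted(sorted_values, threshold):
    """Valid only if the caller guarantees the list is sorted ascending."""
    # bisect_left returns the index of the first element >= threshold,
    # which equals the number of elements strictly below it.
    return bisect_left(sorted_values, threshold)

scores = [1, 3, 3, 7, 9, 12]  # hypothetical, already sorted data
print(count_below_generic(scores, 7), count_below_sorted(scores, 7))
```

No compiler can safely make that substitution for you, because nothing in the code says the list is sorted; only the programmer knows.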

Another common problem is that on-the-fly interpretation (including virtual machines) can slow down algorithms too much and even (again) prevent the parser from optimizing the code. Programs in some interpreted languages, like Perl, can contain errors (a call to a misspelled function, for instance) and still run without problems for a long time, as long as that particular line is not run by the interpreter.
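
A minimal sketch of that last point, using Python rather than Perl (the function and the misspelled name are made up for the example). The mistake here is a call to an undefined name rather than a true syntax error, which Python would reject when compiling the file; the effect is the same, though: the program runs happily until the faulty line is finally reached.

```python
def report(values, verbose=False):
    total = sum(values)
    if verbose:
        # Bug on purpose: 'averge' is a misspelled, undefined name.
        # Python only notices when this line is actually executed.
        print("mean:", averge(values))
    return total

# Runs fine: the faulty branch is never reached.
print(report([1, 2, 3]))

# The same code fails as soon as the branch is taken.
try:
    report([1, 2, 3], verbose=True)
except NameError as err:
    print("failed only now:", err)
```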

The most used languages for high-performance computing, especially in the scientific community, are Fortran and C/C++. Fortran is the best, as it allows plenty of room for manual optimization, the compiler can assume behaviours that a C++ or Java compiler can't (because of the language specifications) and it also has lots and lots of extremely fast parallel libraries to run on massive clusters. There is also the special-purpose High Performance Fortran, an extension of Fortran with even more fine-tuned libraries and a special syntax to ease parallel development.

C and C++ also have some very good numerical libraries, parallel execution, message passing interfaces, a very efficient standard library and plenty of room for manual fine-tuning.

Graphical languages

You can write graphical programs in any language, of course, but it is easier in some than in others. Note that I'm not talking about games or 3D applications, where graphical performance is a must; this is about displaying results and graphics, and helping the user understand the results better by placing them in meaningful places, etc.

Today, most of this is done on the web. Desktop applications are just not that portable and generic, so web interfaces with plain old HTML are the way to go, especially now that, with Web 2.0 and its nice Ajax on-demand queries, DHTML rendering, XHTML meta-information and so on, they're becoming more powerful than most desktop user interfaces.

The common web languages in use today are Java and PHP, but the latter is a very lousy programming language on top of a very weak infrastructure (the parser), and it simply doesn't have as many good solutions for the common problems as Java does. On the other hand, Java requires a bit more infrastructure and bureaucracy and is a bit slower for common uses, but with time and the right setup Java is more stable and reliable, has more features and plug-ins, and offers more intelligent solutions from planning through development to deployment.

Concurrent development languages

This is seldom the case in bioinformatics, but sometimes there are lots of people developing the same program. That raises problems of concurrency, which should be handled by the revision control system, and of style, but much worse things can happen. Two developers could write the same method in two different places because the code is too big or because there is no good communication between them; one developer could change code previously written by another and alter its behaviour, completely breaking the system; and so on.

Languages like Perl and C just don't impose any requirements whatsoever on the development structure, so there have to be very strict rules and policies, enforced every day on every developer, and that is too tiring. What you need is a language that is a bit more bureaucratic from the start, where not only the syntax but the whole infrastructure is bureaucratic, so that if two developers want to build the same method they will go to the same place (and eventually find the other method already there).

The most bureaucratic language in existence today is Java. If you take two programs for the same purpose written by two completely different developers, it's very likely that the solutions, the structure and even the method names will be the same. In big projects, with thousands of classes, this problem becomes a feature and is very desirable.

Scripting languages

There are hundreds of scripting languages and all of them are quite nice but you need to define the purpose and scope before deciding which one to use.

Automation

For simple automation, basic shell scripts in Bash or Ksh are enough. They provide the basic programming structures, can make use of the extensive, programmer-friendly Unix tools (GNU and others) and run on any Unix environment with virtually no change whatsoever. The only things you must worry about are command syntax and options. For instance, be explicit when you're using the advanced features of a GNU program: GNU systems (like Linux) normally symlink the GNU tools to the standard Unix names (such as gmake to make), so scripts written on GNU/Linux might not run on a Sparc or Tru64 system. On the other hand, if you explicitly call gmake whenever you're using GNU extensions, you ensure the correct program will be called on a non-GNU system.
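
The same idea can be sketched outside the shell. The Python snippet below is only an illustration (the helper names `is_gnu_make` and `find_gnu_make` are made up for this page): it prefers the explicit gmake name and, as an extra safety net, checks that the chosen command really identifies itself as GNU Make before trusting it with GNU extensions.

```python
import shutil
import subprocess

def is_gnu_make(command):
    """True if `command --version` identifies itself as GNU Make."""
    try:
        result = subprocess.run([command, "--version"],
                                capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.stdout.startswith("GNU Make")

def find_gnu_make(candidates=("gmake", "make"),
                  which=shutil.which, probe=is_gnu_make):
    """Return the first candidate that is installed and is GNU make, else None."""
    for name in candidates:
        if which(name) is not None and probe(name):
            return name
    return None

# Prints the command name to use on this machine, or None if GNU make is absent.
print(find_gnu_make())
```

The `which` and `probe` parameters exist only so the lookup can be exercised without depending on what is installed on a given machine.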

Shell scripts are rather limited in functionality and cumbersome; medium-sized tasks can take lots of lines. Also, because almost everything is executed by spawning an external program, they tend to be slower than other scripting languages. The real advantage of shell scripts is to simplify program calls and check for basic errors. Programs can get too generic, requiring several command-line options to fully specify all possible behaviours, so a shell script can be quite handy for specifying them for you in your most common cases.

Programming

When you need a bit more than just IF/ELSE, other scripting languages may be required. Perl and Python are two very good generic scripting languages. They can be very powerful, have lots of modules and are very easy to program in, but there are some things that should be avoided.

First, they are scripting languages and not programming languages. Although you can write entire programs in them, their rules are very loose, the syntax can become quite cryptic and maintenance can become a nightmare. Perl grew from a practical extraction language, with which you could run a powerful awk without forking or do some advanced math without calling bc, into a major revolution on the web and later in bioinformatics.

Object orientation, exceptions and generic programming were added to Perl as hacks, one on top of the other, and should not be considered proper programming or good practice; they are what they were: hacks.

Second, even without forking for every instruction as the shell does, the performance is still lousy. Python was created after Perl and has had object orientation and other interesting stuff built in from birth, so the syntax is a bit cleaner, but it's still an interpreted language. Both are much more expressive than shell, C++ or Java, but that comes at a cost: no fine-tuned optimization. So they're still scripting languages, built to take automation to a new level, not to downgrade programming.

Database programming

Most professional databases today have some sort of embedded procedural language. Because the code runs inside the database, index searches and temporary data storage are all done on the server, where several optimizations are possible. But these languages are generally very limited in structure and, just like shell scripts, can become a nightmare if the task is too big.

But writing everything in external code is bad too: you spend so much time transferring data back and forth and converting data types that the gain in programming power is lost in performance. Not only that, most programming languages have database connectors that require a very complex setup for every query (prepare, bind variables, execute, get results), and the code ends up just as cluttered.
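
To give an idea of that per-query ceremony, here is a minimal sketch in Python using the standard sqlite3 module. An in-memory SQLite database stands in for a real server such as Oracle or MySQL, and the gene table and its contents are invented for the example. Python's DB-API folds prepare and bind into a single execute call, so other connectors will typically need more steps than this.

```python
import sqlite3

# In-memory SQLite stands in for a real database server (assumption for the sketch).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE gene (name TEXT, length INTEGER)")  # hypothetical table
cur.executemany("INSERT INTO gene VALUES (?, ?)",
                [("abc1", 1200), ("xyz2", 300), ("foo3", 4500)])
conn.commit()

def genes_longer_than(connection, min_length):
    """Return the names of genes longer than min_length, sorted by name."""
    cursor = connection.cursor()
    # Prepare, bind and execute: the '?' placeholder is bound to min_length.
    cursor.execute("SELECT name FROM gene WHERE length > ? ORDER BY name",
                   (min_length,))
    # Get results: fetch the rows and convert them to plain Python values.
    names = [row[0] for row in cursor.fetchall()]
    cursor.close()
    return names

print(genes_longer_than(conn, 1000))
```

Even in this friendly case, one simple question to the database costs a cursor, a placeholder, a fetch and a conversion, and every row crosses the boundary between the database and the program.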

The prize for bad database libraries goes to Oracle with C/C++: the OCI library. It is so horrible that it was necessary to develop a new language, called Pro*C, to help with Oracle queries, but the result was even worse.

Good database environments are found in Perl and Java. Although Python has an easy way to deal with queries and result sets, the good Oracle libraries are deprecated and not much work has been done in that direction since. MySQL has drivers for all languages, and they're quite good, easy and fast.



 
softeng/programming_languages.txt · Last modified: 05 09 2007 19:15 (external edit)
 