Code organization

In the early 80s the Revision Control System (RCS) was developed to help programmers manage changes to their code. Organization remained a problem, since RCS couldn't control files across multiple directories in an integrated way, so a bit later (still in the 80s) CVS, a much more powerful version control system, was developed. In the following years many other revision control programs were written, both commercial and open source, completely revolutionizing software management. Yet after all those years, code kept in RCS, or under no revision control at all, is far more common than it should be.

The lack of revision control is common for scripts, as they're often seen as quick hacks and are commonly put in the working directory used for production tasks. Also, most quick hacks refer to other quick hacks by full name (i.e. hard-coded paths), which in turn refer to other scripts and so on. Because there is no revision control, developers start creating backups whenever they need to change something, and creativity plays a very important role, even for temporary files: .bak, .old, .renato, .renato.old, .20070916, .temp etc. The possibilities are endless. Worse, sometimes a change doesn't work and the developer goes back to using the old file, so the version that is actually in production is my_program.sh.renato.old.20030105.

Another common situation: whenever you have programs and data in the same directory you can't just clean it, or the scripts would go too. Also, when problems happen in production, you do whatever it takes to fix them quickly, which normally involves manually re-running the same tasks the script was doing and re-creating the same files, with the creativity for extensions mentioned above. Now picture a directory with old data, new data, temporary data, old programs, new programs and temporary hacks, as well as logs, text files explaining how to run things (several versions, in no particular order) and dumps of debug runs once in a while.

Filesystems don't grow easily; they normally get replaced by new ones, faster and bigger. Because bioinformatics normally handles huge amounts of data, several filesystems have to be used at the same time to serve different groups doing different jobs. It's easy to see that you end up with directories on all the different filesystems, sym-links from each to all the others, and variables and macros created to refer to them in a friendlier way, until it becomes nearly impossible to know which program is doing what, when and how.

Picture those huge directories with everything in them, no revision control, spread all over the filesystems, with sym-links and macros on top of more sym-links and environment variables. Now imagine the sysadmins announcing the decommissioning of half of those filesystems: how would you know what does what? It's as bad as it seems.

Right things in the right place

As redundant as it may seem: code with code, data with data, logs with logs.

Every project or team should have its own root directory (a bit like chroot), and inside it there should be source trees, data directories, temporary storage and logs. A good structure to follow is the Filesystem Hierarchy Standard (FHS), which is pretty much what most Unix distributions use for their filesystems.

Normally data directories are the biggest. Some hold temporary working data, others archives of previous runs, others official releases and external data. All of them should be stored under one single virtual directory, with the space managed by sym-linking to the actual filesystems. The directory structure should be easy to understand and well documented, in manuals AND in readme files in the root directory, and there should be environment variables/macros, included by anyone running the programs, that refer at least to the root directory.

Code directories should be in the home directory of each developer and should never be used for production, as they may change frequently during development, and running from them means falling back into the same problems as with RCS. Some sort of revision control must be used, and automatic installation scripts should copy the scripts, compile the programs and install them to the production binary directories.
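
An install step along those lines could look like the sketch below. The function name install_scripts, the .sh filter and the example paths are assumptions made for illustration; a real installer would also compile programs and would probably be driven by the build system.

```shell
#!/bin/sh
# Sketch: copy scripts from a checked-out source tree to the production
# bin directory. install_scripts is a hypothetical helper name.
install_scripts() {
    src_dir=$1     # checked-out source directory
    bin_dir=$2     # production bin directory
    mkdir -p "$bin_dir" || return 1
    for script in "$src_dir"/*.sh; do
        [ -f "$script" ] || continue      # nothing matched, or not a file
        cp "$script" "$bin_dir/" || return 1
        chmod 755 "$bin_dir/$(basename "$script")" || return 1
    done
}

# Example run in a throw-away directory so nothing real is touched.
work=$(mktemp -d)
mkdir -p "$work/checkout/project2/production"
printf '#!/bin/sh\necho weekly run\n' > "$work/checkout/project2/production/weekly.sh"
install_scripts "$work/checkout/project2/production" "$work/teamA/bin"
sh "$work/teamA/bin/weekly.sh"
```

Production then only ever runs what is in the bin directory, and the developer's checkout can change freely.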

Logs and temporary storage should be kept in a different directory, recycled and archived separately from data and code. A sample of such a structure is described below:

$root = All projects' root.

Home dirs: $root/teamA/homes/[users]
 Bin dirs: $root/teamA/bin/
Data dirs: $root/teamA/srv/[projects]
Logs dirs: $root/teamA/var/log/
Shrd dirs: $root/teamA/share/[libraries, fonts]
Conf dirs: $root/teamA/etc/[projects]

and so on...

As in the Unix filesystem, you don't need to store all those directories on the same mount point: you can mount $root/teamA/ on one filesystem and the big ones on their own, like $root/teamA/srv/, $root/teamA/var etc. Any good Unix administrator should be able to do it fairly easily, and because it's standard it's not difficult for later administrators to understand and maintain.
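
The sample layout can be created with a few lines of shell. The sketch below is only an illustration: the function name make_team_root is hypothetical, and the list of sub-directories simply mirrors the sample structure above.

```shell
#!/bin/sh
# Sketch: create the per-team skeleton from the sample layout.
# make_team_root is a hypothetical helper name, not an existing tool.
make_team_root() {
    all_root=$1    # the root of all projects ($root in the text)
    team=$2        # team name, e.g. teamA
    for dir in homes bin srv var/log share etc; do
        mkdir -p "$all_root/$team/$dir" || return 1
    done
}

# Example run in a throw-away directory so nothing real is touched.
demo_root=$(mktemp -d)
make_team_root "$demo_root" teamA
ls "$demo_root/teamA"
```

Because mkdir -p leaves existing directories alone, the same function can be re-run after an administrator has mounted srv or var on a separate filesystem.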

Code directory structure

If your project is to build a single program, using a single programming language in a standard way, it's not very difficult to organize your code. Most of the structure has already been thought out in previous projects, so you'll quite often see C++ programs using Autoconf/Automake, with a Makefile in the root directory and one or more source directories further divided into sub-directories if needed. Java programs tend to spread a bit more over the filesystem, with fewer sources per directory (called packages) and lots of otherwise empty directories that provide the name resolution for the packages.

But when it comes to production programs, where you'll probably deal with an awful lot of programming languages, everyone has their own particular way of organizing code. Some separate by programming language and put the project names inside, others separate by project and put the programming languages inside. Most will even put all compiled source under src and all scripts under script and call it done. It doesn't really matter which structure you follow, as long as you keep to the same standard everywhere.

But if you're using the chroot structure per team / per project, separating your code per project seems more logical. Also, inside the project, having directories per programming language doesn't seem right because programs interact with others irrespective of their language, so grouping by functionality makes more sense. A possible structure for code would be:

$cvs = The root of your own CVS (or other version control software) tree

# Shared libraries and tools
$cvs/shared/lib/[libs]
$cvs/shared/tools/[tools]
$cvs/shared/env/[environment settings]

# Project 1 is a big program
$cvs/project1/src/
$cvs/project1/test/
$cvs/project1/build/
$cvs/project1/conf/

# Project 2 runs Project 1 every week
$cvs/project2/lib/[internal libs]
$cvs/project2/tools/[tools]
$cvs/project2/production/[scripts]

You can even organize your projects into bundles if you feel it appropriate. In the case above, project2 is the production side of project1 and runs it every week: it handles preparation and the run, archives all logs, moves the result data to the correct places etc. They could be in the same bundle, so:

# Project 1 is a big program
$cvs/bundle1/project1/src/
$cvs/bundle1/project1/test/
$cvs/bundle1/project1/build/

# Project 2 runs Project 1 every week
$cvs/bundle1/project2/lib/[internal libs]
$cvs/bundle1/project2/tools/[tools]
$cvs/bundle1/project2/production/[scripts]

And whenever you check out bundle1 you get the whole thing.

It's important to keep the directories as clean as possible and structured in a tree without repetitions. A good example is the Linux kernel source tree: it has all filesystem code inside fs, all drivers under drivers and so on. CVS is filesystem based, and that helps a lot in organizing code in a tree structure. The rule is to always avoid structuring with names instead of directories, like:

# BAD
/fs_ext/
/fs_ext_drivers/
/fs_ext_utils/
/fs_reiser/
/fs_reiser_interface/
(...)
# GOOD
/fs/ext/
/fs/ext/drivers/
/fs/ext/utils/
/fs/reiser/
/fs/reiser/interface/
(...)

Environment and configurations

When creating programs, especially for production purposes, ensuring the correct environment is fundamental. A few steps can ensure that the same configuration is used by all your programs and scripts, but you must separate global configurations (project-wide) from specific configurations (program-specific).

Any program can make use of environment variables, so those should be used for global configurations. This is why they belong in a general source directory ($cvs/shared/env). Those files, generally shell scripts, must be included by every shell script you have and also by your initialization scripts (.bashrc, .tcshrc and so on) so you don't need to source them manually every time.
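
A minimal sketch of this pattern follows. The file name team.env, the variable names TEAM_ROOT, TEAM_BIN, TEAM_DATA and TEAM_LOG, and the default path /projects/teamA are all made up for the example; in the layout above the file would be kept under $cvs/shared/env and installed somewhere every script can reach.

```shell
#!/bin/sh
# Sketch: one shared environment file, sourced by every script.
# File name, variable names and default path are hypothetical.
env_dir=$(mktemp -d)

cat > "$env_dir/team.env" <<'EOF'
# Global (project-wide) settings; edit here, nowhere else.
TEAM_ROOT=${TEAM_ROOT:-/projects/teamA}
TEAM_BIN=$TEAM_ROOT/bin
TEAM_DATA=$TEAM_ROOT/srv
TEAM_LOG=$TEAM_ROOT/var/log
export TEAM_ROOT TEAM_BIN TEAM_DATA TEAM_LOG
EOF

# Every production script starts by sourcing the shared file ...
. "$env_dir/team.env"

# ... and then refers to locations only through the variables.
echo "logs go to $TEAM_LOG"
```

Adding the same dot line to .bashrc gives interactive shells the same settings. csh-family shells cannot read this syntax, so .tcshrc would need an equivalent file written with setenv.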

Apart from that, some general configurations can be stored in text files that can be read by any program, in formats such as “key=value”, INI and properties files. The more you centralise your configuration and environments, the easier they are to maintain and change whenever needed. Because most environments will be running some sort of network filesystem, the availability of the configurations and the environment is not an issue.
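
As an illustration of how little is needed to read such a file, here is a sketch of a “key=value” lookup in shell. The helper name get_config, the file name project1.conf and its keys are hypothetical; INI sections and properties-style escapes are not handled.

```shell
#!/bin/sh
# Sketch: read one value from a "key=value" file.
# get_config is a hypothetical helper; it skips comment lines and
# returns the last match if a key is repeated.
get_config() {
    cfg_file=$1
    cfg_key=$2
    # Split on the first "=" only, so values may contain "=".
    awk -v key="$cfg_key" '
        /^[ \t]*#/ { next }
        {
            pos = index($0, "=")
            if (pos == 0) next
            k = substr($0, 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            if (k == key) value = substr($0, pos + 1)
        }
        END { if (value != "") print value }
    ' "$cfg_file"
}

# Example configuration in a throw-away directory.
conf_dir=$(mktemp -d)
cat > "$conf_dir/project1.conf" <<'EOF'
# program-specific settings
threads=4
release_url=http://example.org/data?version=2
EOF

get_config "$conf_dir/project1.conf" threads
get_config "$conf_dir/project1.conf" release_url
```

The same file can be parsed just as easily from other languages, which is the point of keeping the format this simple.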

Shared code

Any code shared between two or more projects must be stored outside of any project's directory. This is why there is a $cvs/shared/lib/ and a $cvs/shared/tools/. Those directories must be organized by tool and library name (instead of project or language), and thorough documentation must be available. JavaDoc and Doxygen are good ways of documenting your libraries.

Testability is much more crucial for shared libraries than for user programs, because many user programs will use the same library. Creating one test case for every feature that every user program needs from every library is a must, and maintaining this structure is the recipe for tranquillity in future deployments.
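
Even for shared shell code a test case can be very small. The sketch below assumes a made-up shared function, to_upper, and a tiny hand-written check helper; neither comes from an existing library, and a real project might prefer a proper test framework.

```shell
#!/bin/sh
# Sketch: a shared shell function plus a tiny test for it.
# to_upper and check are hypothetical names for illustration.

# Part 1: would live in $cvs/shared/lib
to_upper() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Part 2: would live next to it as the test script
failures=0
check() {
    # usage: check DESCRIPTION EXPECTED ACTUAL
    if [ "$2" = "$3" ]; then
        echo "ok   - $1"
    else
        echo "FAIL - $1 (expected '$2', got '$3')"
        failures=$((failures + 1))
    fi
}

check "lower case is converted" "CHR1" "$(to_upper chr1)"
check "upper case is unchanged" "CHR1" "$(to_upper CHR1)"
check "empty input stays empty" ""     "$(to_upper "")"

echo "$failures failure(s)"
```

Each check line records one feature that some user program relies on, so a failing line points directly at what a library change has broken.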



 
softeng/code_organization.txt · Last modified: 05 09 2007 19:15 (external edit)
 