Architecture

The architecture should be designed with performance in mind from the very beginning. That does not mean deciding which sorting algorithm to use; it means first deciding whether you need the data sorted at all.

Data

The first problem to solve is where to store your data. Using a database for configuration and temporary data is rarely wise: such data normally lives in small, simple files where indexing and complex relationships are of little help. Storing complex core data, which usually has more relationships than meet the eye, in flat files is the opposite mistake.

If your storage has no infrastructure at all (i.e. flat files), you need to write a good deal of code to maintain the files and to index, query, split and relate them to other data. This can account for much more development time and maintenance cost than the rest of your project's code.

Some time ago, at the dawn of database engines, it was said that everything should be in a database, including temporary files and configuration. This is clearly over-engineering. It applies only in very specific cases (where filesystem semantics vary or deployment is complicated); in the majority of cases it does not.

Databases also offer query languages for more meaningful retrieval and control over the data, which clearly pay off when the cost of maintaining the data itself is going to be high. Primary keys, foreign keys, triggers, functions and procedures are a great help because they work on the data structure itself, so you don't have to write loads of scripts running as cron jobs to maintain data in a file.
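As a rough illustration of keys and triggers doing the maintenance work, here is a minimal sketch using Python's built-in sqlite3 module. The tables (customer, purchase, audit_log) and the function names are invented for the example; any relational engine would serve the same purpose.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical customer/purchase tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customer (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE purchase (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(id),
            amount      REAL NOT NULL
        );
        CREATE TABLE audit_log (message TEXT NOT NULL);
        -- The trigger keeps the log current without any external script.
        CREATE TRIGGER log_purchase AFTER INSERT ON purchase
        BEGIN
            INSERT INTO audit_log VALUES ('purchase ' || NEW.id || ' recorded');
        END;
    """)
    return conn


def try_orphan_purchase(conn: sqlite3.Connection) -> bool:
    """Return True if the database rejected a purchase for a missing customer."""
    try:
        conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (999, 10.0)")
    except sqlite3.IntegrityError:
        return True
    return False
```

The integrity check and the log entry both live next to the data, so no cron job has to scan a file looking for orphaned records.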

In some cases temporary storage is best kept in the database, for instance when the output of some queries is the input of others. Temporary (heap) tables and views are perfect for this. On the other hand, files are much easier to copy to remote places, keep under revision control and edit by hand, which makes them perfect for communication (XML over HTTP) and configuration (text) files.
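The query-feeding-query case might look like the sketch below, again with Python's sqlite3 module. The purchase table (customer, amount) and the function top_customers are assumptions made up for the illustration.

```python
import sqlite3


def top_customers(conn: sqlite3.Connection, minimum_total: float) -> list[tuple[str, float]]:
    """Two-step query that passes its intermediate result through a temporary table."""
    # Step 1: the output of the first query is stored in a temporary table...
    conn.execute("DROP TABLE IF EXISTS temp.customer_total")
    conn.execute("""
        CREATE TEMP TABLE customer_total AS
        SELECT customer, SUM(amount) AS total
        FROM purchase
        GROUP BY customer
    """)
    # Step 2: ...and becomes the input of the second query.
    return conn.execute(
        "SELECT customer, total FROM customer_total "
        "WHERE total >= ? ORDER BY total DESC",
        (minimum_total,),
    ).fetchall()
```

The temporary table belongs to the connection and disappears when it closes, so there is no scratch file to clean up afterwards.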

Modularization

Separating classes and methods into modules is generally a good idea, but over-engineered code normalization can lead to disastrous situations. The Java programming language is a good (or rather bad) example of that: everything is a module or a class, and every class derives from the class Object, which brings its own methods along. This is especially bad for type conversion, I/O and deriving from the basic classes. For instance, the class FileReader extends InputStreamReader, which extends Reader, which extends Object; if you build your own MyFileReader (as you're probably encouraged to), your class hierarchy will have five generations.

This is not a big problem at runtime (apart from object size and type casts) but is particularly bad when compiling. Lots of additional checks must be done to ensure that none of your classes conflict and that the data you're trying to send to a stream actually matches one of its implementations. Not to mention the complexity in programming: you never know which class will actually fit in a container, and sometimes they fit without your consent. There are just too many possibilities.

The language syntax is very restrictive, which helps a bit, and most development environments (IDEs) today help a lot with those problems by checking additional constraints and giving advice. But that should never be the role of an IDE: it means that it's much more difficult to produce good-quality Java code without an IDE than with one.

Nevertheless, modularization is good and should be encouraged. Its benefits extend to more than just grouping similar concepts in the same block. It helps unit testing a lot, because the tests are grouped the same way and you don't have to test several different units when you make a local change. Compilation after a change will also be faster, as fewer libraries are affected by the same code change.

There are also some hidden optimizations in hardware. Libraries need to be fast, and by placing related code close together you help the CPU optimize execution for you: if the methods are close enough, the code is likely to be in the instruction cache already (pre-fetched or recently used) instead of having to be loaded again on every call. So the idea is to put methods that execute at similar times together.

High-Cohesion, Low-Coupling

In a broader sense, modularization increases the cohesion of software and decreases coupling between unrelated parts. When methods are aggregated into classes and classes into modules, interfaces become necessary to enable communication between modules. Where before you could access data and functions directly in global scope, with modularization you need to write interfaces to access them, which means you can protect your data and the correct execution of the code.
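A small sketch of what such an interface buys you, in Python. The Inventory class and its methods are hypothetical; note that the leading underscore marks internal state by convention only, since Python does not enforce privacy.

```python
class Inventory:
    """Hypothetical module that owns its data and exposes a narrow interface."""

    def __init__(self) -> None:
        # Internal state: callers are expected to go through the methods below.
        self._stock: dict[str, int] = {}

    def add(self, item: str, quantity: int) -> None:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._stock[item] = self._stock.get(item, 0) + quantity

    def remove(self, item: str, quantity: int) -> None:
        available = self._stock.get(item, 0)
        if quantity <= 0 or quantity > available:
            raise ValueError("cannot remove more than is in stock")
        self._stock[item] = available - quantity

    def available(self, item: str) -> int:
        return self._stock.get(item, 0)
```

With a global dictionary any caller could set a negative stock level; here every change passes through a check, so the invariant (stock never below zero) is maintained in one place.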

Intermediate results

With global access, users can change internal properties while your program is running, and if that's required for correct execution you end up putting unnecessary power in your users' hands. You also have to rely on the quality of their software, as any bug in their code can pass on to yours and you might not be prepared to handle it.

This often happens when the procedure is a list of smaller procedures and the schedule is tight. There is a tendency to allow external users to access intermediate results before the whole procedure has finished. If a later step then proves the data wrong, all previous data must be invalidated, and that invalidation has to be propagated to every user who received intermediate data.
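One way to resist that tendency is to make the result available only after the last step has succeeded. The sketch below is an illustration under that assumption; Procedure, scale and reject_negative are invented names.

```python
from typing import Callable, Optional

Step = Callable[[list[float]], list[float]]


class Procedure:
    """Runs a list of smaller steps and publishes the result only at the end."""

    def __init__(self, steps: list[Step]) -> None:
        self._steps = steps
        self._published: Optional[list[float]] = None

    def run(self, data: list[float]) -> None:
        working = list(data)  # intermediate state stays private
        for step in self._steps:
            working = step(working)  # a failing step raises; nothing is published
        self._published = working

    def result(self) -> list[float]:
        if self._published is None:
            raise RuntimeError("procedure has not finished; no result available")
        return list(self._published)


def scale(values: list[float]) -> list[float]:
    return [v * 2 for v in values]


def reject_negative(values: list[float]) -> list[float]:
    if any(v < 0 for v in values):
        raise ValueError("negative value found by a late validation step")
    return values
```

Because nothing leaves the object before the final step, a late failure needs no invalidation messages: external users never saw the bad data in the first place.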

Decoupling procedures

This is not only harder to manage; it can also take much longer and requires much tighter control from all the parties involved, which only increases the coupling of schedules. Once you're in that intrinsically complicated multi-schedule, it's even harder to get out and see the solutions clearly, as every move you make seems only to make things worse.

Weird as it may seem, optimizations can also make things worse. Once you manage to optimize some parts, the schedule automatically shrinks or more procedures get fitted in (now that you've got the time), and it can be even more complicated to get out of that new situation.

Coupled procedures can also be an excuse for not optimizing. Since you have access to early data you don't need to make your procedure faster, and you feel very smart because you're running lots of things in parallel, which normally means optimization. But whenever one of them fails you understand why you shouldn't rely on intermediate results. Not only do you have to wait again for new data, but you were counting on exactly the time available to run it, and depending on how long it takes to recreate your input data there may not be enough time left.

To decouple procedures you need the right attitude before optimization, and you need to understand that parallelization only means optimization when you're solving a single problem at a time (or when you have a huge asynchronous, on-demand infrastructure). Once you have ensured your procedures are cohesive, external users will begin to complain about the performance of their own procedures. That is the time to start optimizing, because now performance really is the problem.

Distributed execution

Some programs need to scale better than others, and some parts of a program need to scale better than others. Also, sometimes the data you need is not local, or you don't know what to do with the data and need some remote program to tell you.

Distributed execution, even when it is not parallel, needs special care during the design phase. It's a very sensitive part of the plan, as third-party teams are generally involved and much is at stake: unreliable or slow communication might invalidate the whole project.

First, you need to define what is remote and what is local for all the parties involved, and calculate the costs (average and maximum, never minimum) of transferring data from one place to the other. Inter-site costs are normally too high to allow more than one transfer per transaction, so never rely on more. Avoid at all costs synchronous communication where the whole program blocks while waiting for off-site data; look for good asynchronous libraries, as they usually pay off.
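A minimal sketch of non-blocking remote calls with Python's asyncio. The function fetch_remote is a stand-in that only sleeps to simulate latency, and the names and delays are assumptions; a real program would call an actual asynchronous client library at that point.

```python
import asyncio


async def fetch_remote(name: str, delay: float) -> str:
    """Stand-in for an off-site request; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"{name}:done"


async def gather_with_deadline(requests: dict[str, float], timeout: float) -> dict[str, str]:
    """Issue all requests at once and cap each one at the same maximum cost."""

    async def one(name: str, delay: float) -> tuple[str, str]:
        try:
            return name, await asyncio.wait_for(fetch_remote(name, delay), timeout)
        except asyncio.TimeoutError:
            return name, "timeout"

    pairs = await asyncio.gather(*(one(n, d) for n, d in requests.items()))
    return dict(pairs)
```

Calling asyncio.run(gather_with_deadline({"a": 0.01, "b": 0.5}, timeout=0.1)) lets both requests wait concurrently, and the timeout makes the maximum cost explicit instead of leaving the program blocked on the slowest site.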

Then you need to measure which part takes considerably more time to run than the rest and prepare that part to scale. It's normally a better idea to scale parts separately than to parallelize the whole program. It consumes much less memory and CPU time, and the parts can be scattered throughout a cluster of any size without much trouble. Remember that communication between the local parts of your program will usually run over gigabit Ethernet interfaces or faster, which scale much more easily than inter-site connections. Take that into account when designing a scalable solution.
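Scaling one part rather than the whole program can be sketched as follows. The stages parse and slow_lookup are hypothetical, and the sleep stands in for a wait on a slow resource. Threads are enough here because the slow stage only waits; a CPU-bound stage would need processes or separate machines instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parse(record: str) -> int:
    """Cheap stage: not worth distributing."""
    return int(record)


def slow_lookup(value: int) -> int:
    """Expensive stage: the sleep simulates waiting on a slow resource."""
    time.sleep(0.01)
    return value * value


def run_pipeline(records: list[str], workers: int = 4) -> int:
    values = [parse(r) for r in records]  # stays sequential
    with ThreadPoolExecutor(max_workers=workers) as pool:
        squares = list(pool.map(slow_lookup, values))  # only this stage is scaled
    return sum(squares)
```

The rest of the program is untouched, so the worker pool can later be swapped for a cluster-wide queue without rewriting the cheap stages.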

Last, be sure to use standard (or third-party) communication libraries: they're too complex to develop in-house and most libraries solved all the major problems long ago. Communication will most definitely not be your core business, so let the experts do their job. MPI is the de facto standard for intra-site communication in Fortran and C/C++, and Java has the standard JMS API, which is also very good. For external communication HTTP is still the best and most reliable channel. It is possible to implement asynchronous communication using HTTP only, and it's not too complicated to develop in-house in case there is no good standard to use.

Containers and Algorithms

After all the high-level decisions have been made, it's time to define containers and algorithms. If you use the standard libraries (like the STL, or Perl's and Java's native ones), the same algorithms apply to most containers in the same way, so the big decision is, in fact, about containers.
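The same idea shown with Python's built-ins rather than the STL: one function works unchanged on several container types, so only the choice of container differs. The sample data and the function summarize are made up for the illustration.

```python
from collections import deque

readings_list = [7, 3, 9, 3]
readings_tuple = (7, 3, 9, 3)
readings_deque = deque([7, 3, 9, 3])
readings_set = {7, 3, 9}  # a set drops the duplicate 3


def summarize(container) -> tuple[int, int, list[int]]:
    """min, max and sorted accept any iterable, whatever the container type."""
    return min(container), max(container), sorted(container)
```

Switching from a list to a set changes what is stored (duplicates disappear) and how fast membership tests are, but not a single line of the algorithm code.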

Some decisions are easy, like using a set instead of a list for indexes, because sets are ordered and both insertion and retrieval are logarithmic in time. Others are not so clear, like whether to use vectors or native arrays when storing lists of numbers. Scott Meyers' excellent Effective (.*) books discuss how to choose the best approach with C++ and STL containers; they're not just rules to follow but fundamental knowledge to carry with you when making such decisions. You shouldn't go that deep in this phase yet or you'll end up optimizing prematurely, but you need that knowledge to make wise high-level code decisions in the first development phases.
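To make the ordered-index idea concrete, here is an approximation in Python. Be aware that this is a substitute technique: Python's standard library has no tree-based ordered set like the STL's, so the sketch keeps a sorted list and uses the bisect module. Lookup is logarithmic, but insertion is linear because elements have to be shifted. SortedIndex is a hypothetical name.

```python
import bisect


class SortedIndex:
    """Ordered index of unique integer keys kept in a sorted list."""

    def __init__(self) -> None:
        self._keys: list[int] = []

    def insert(self, key: int) -> None:
        pos = bisect.bisect_left(self._keys, key)  # binary search: O(log n)
        if pos == len(self._keys) or self._keys[pos] != key:
            self._keys.insert(pos, key)  # shifting elements: O(n)

    def __contains__(self, key: int) -> bool:
        pos = bisect.bisect_left(self._keys, key)
        return pos < len(self._keys) and self._keys[pos] == key

    def range(self, low: int, high: int) -> list[int]:
        """Keys with low <= key <= high, in ascending order."""
        start = bisect.bisect_left(self._keys, low)
        end = bisect.bisect_right(self._keys, high)
        return self._keys[start:end]
```

The ordering is what makes range queries cheap, which is the main reason to prefer an ordered container over a plain list for an index.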

As discussed earlier, Java's native containers have a performance problem: you might end up with too many levels of inheritance and mess up your code. Nevertheless, the first shot should always be standard and native. Later on, in the profiling phase (much later on, really), you can change to optimized versions if the trade-off is really worth it.

The standard libraries normally also provide very efficient implementations of the common algorithms like searching, pattern matching, sorting, trees, graphs etc. Again, always use them in the first phase, even if that's your core business. It may seem like a contradiction, but here's the catch:

When developing in groups, it's a good idea to give the core business part to the best engineers you have and the rest to the less experienced. While developing the core business, the experts will build unit tests, performance tests, regression tests and everything else needed to ensure the library is good enough for its purpose. Meanwhile the other engineers will be building the interfaces, usability tests, reports, etc., but they can't wait until the experts have a fully functional library to start their part. The best thing to do is to write a basic functional method, or return some dummy results, to test the rest. It'll then be easier to test both parts separately and take care of integration problems only in a later phase.
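The dummy-results approach could look like the sketch below. Everything in it is hypothetical: a RouteFinder interface that the experts will eventually implement, a DummyRouteFinder placeholder, and a report function written by the second team against the interface only.

```python
from typing import Protocol


class RouteFinder(Protocol):
    """Interface agreed on by both teams before the real algorithm exists."""

    def shortest_route(self, origin: str, destination: str) -> list[str]: ...


class DummyRouteFinder:
    """Placeholder: returns a fixed, obviously fake direct route."""

    def shortest_route(self, origin: str, destination: str) -> list[str]:
        return [origin, destination]


def format_report(finder: RouteFinder, origin: str, destination: str) -> str:
    """Report code depends only on the interface, not on the implementation."""
    route = finder.shortest_route(origin, destination)
    return f"{len(route) - 1} hop(s): " + " -> ".join(route)
```

The report team can test format_report(DummyRouteFinder(), "A", "B") today; when the real implementation arrives it is passed in the same way, and only integration problems remain to be sorted out.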

There is another situation: sometimes your core business is not the algorithm itself but the proper use of traditional algorithms in a certain way to extract information from the data you have. For this purpose you had better use the standard implementations first and profile at the end instead of trying to build everything from scratch. This applies to experts too.



 
softeng/architecture.txt · Last modified: 05 09 2007 19:15 (external edit)
 