header image
Gtk example
September 26th, 2009 under Devel, OSS, rengolin, Software, Unix/Linux. [ Comments: none ]

Gtk, the graphical interface behind Gnome, is very simple to use. It doesn’t have an all-in-one IDE such as KDevelop, which is very powerful and complete, but it features a simple and functional interface designer called Glade. Once you have the widgets and signals done, filling the blanks is easy.

As an example, I wrote a simple dice throwing application, which took me about an hour from install Glade to publish it on the website. Basically, my route was to apt-get install glade, open it and create a few widgets, assign some callbacks (signals) and generate the C source code.

After that, the file src/callbacks.c contain all the signal handlers to which you have to implement. Adding just a bit of code and browsing this tutorial to get the function names was enough to get it running.

Glade generates all autoconf/automake files, so it was extremely easy to compile and run the mock window right at the beginning. The rest of the code I’ve added was even less code than I would add if doing a console based application to do just the same. Also, because of the code generation, I was afraid it’d replace my already changed callbacks.c when I changed the layout. Luckily, I was really pleased to see that Glade was smart enough not to mess up with my changes.

My example is not particularly good looking (I’m terrible with design), but that wasn’t the intention anyway. It’s been 7 years since the last time I’ve built graphical interfaces myself and I’ve never did anything with Gtk before, so it shows how easy it is to use the library.

Just bear in mind a few concepts of GUI design and you’ll have very little problems:

  1. Widget arrangement is not normally fixed by default (to allow window resize). So workout how tables, frames, boxes and panes work (which is a pain) or use fixed position and disallow window resize (as I did),
  2. Widgets don’t do anything by themselves, you need to assign them callbacks. Most signals have meaningful names (resize, toggle, set focus, etc), so it’s not difficult to find them and create callbacks for them,
  3. Side effects (numbers appearing at the press of a button, for instance) are not easily done without global variables, so don’t be picky on that from start. Work your way towards a global context later on when the interface is stable and working (I didn’t even bother)

If you’re looking for a much better dice rolling program for Linux, consider using rolldice, probably available via your package manager.


SLC 0.2.0
September 12th, 2009 under Algorithms, Devel, OSS, rengolin. [ Comments: none ]

My pet compiler is now sufficiently stable for me to advertise it as a product. It should deal well with the most common cases if you follow the syntax, as there are some tests to assure minimum functionality.

The language is very simple, called “State Language“. This language has only global variables and static states. The first state is executed first, all the rest only if you call goto state;. If you exit a state without branching, the program quits. This behaviour is consistent with the State Pattern and that’s why implemented this way. You can solve any computer problem using state machines, therefore this new language should be able to tackle them all.

The expressions are very simple, only binary operations, no precedence. Grouping is done with brackets and only the four basic arithmetic operations can be used. This is intentional, as I don’t want the expression evaluator to be so complex that the code will be difficult to understand.

As every code I write on my spare time, this one has an educational purpose. I learn by writing and hopefully teach by providing the source, comments and posts, and by letting it available on the internet so people can find it through search engines.

It should work on any platform you can compile to (currently only Linux and Mac binaries provided), but the binary is still huge (several megabytes) because they contain all LLVM libraries statically linked to it.

I’m still working on it and will update the status here at every .0 release. I hope to have binary operations for if statements, print string and all PHI calculated for the next release.


The LLVM compilation infrastructure
August 25th, 2009 under Algorithms, Devel, rengolin, Software. [ Comments: none ]

I’ve been playing with LLVM (Low-Level Virtual Machine) lately and have produced a simple compiler for a simple language.

The LLVM compilation infrastructure (much more than a simple compiler or virtual machine), is a collection of libraries, methods and programs that allows one to create a simple, robust and very powerful compilers, virtual machines and run-time optimizations.

As GCC, it’s roughly separated into three layers: the front-end, which parses the files and produce intermediate representation (IR), the independent optimization layer, which acts on the language-independent IR and the back-end, which turns the IR into something executable.

The main difference is that, unlike GCC, LLVM is extremely generic. While GCC is struggling to fit broader languages inside the strongly C-oriented IR, LLVM was created with a very extensible IR, with a lot of information on how to represent a plethora of languages (procedural, object-oriented, functional etcetera). This IR also carries information about possible optimizations, like GCC’s but to a deeper level.

Another very important difference is that, in the back-end, not only code generators to many platforms are available, but Just-In-Time compilers (somewhat like JavaScript), so you can run, change, re-compile and run again, without even quitting your program.

The middle-layer is where the generic optimizations are done on the IR, so it’s language-independent (as all languages wil convert to IR). But that doesn’t mean that optimizations are done only on that step. All first-class compilers have strong optimizations from the time it opens the file until it finishes writing the binary.

Parser optimizations normally include useless code removal, constant expression folding, among others, while the most important optimizations on the back-end involve instruction replacement, aggressive register allocation and abuse of hardware features (such as special registers and caches).

But the LLVM goes beyond that, it optimizes during run-time, even after the program is installed on the user machine. LLVM holds information (and the IR) together with the binary. When the program is executed, it profiles automatically and, when the computer is idle, it optimizes the code and re-compile it. This optimization is per-user and means that two copies of the same software will be quite different from each other, depending on the user’s use of it. Chris Lattner‘s paper about it is very enlightening.

There are quite a few very important people and projects already using LLVM, and although there is still a lot of work to do, the project is mature enough to be used in production environments or even completely replace other solutions.

If you are interested in compilers, I suggest you take a look on their website… It’s, at least, mind opening.


FSF Settles Suit Against Cisco
May 20th, 2009 under Devel, Digital Rights, OSS, rengolin, Unix/Linux. [ Comments: none ]

The long dispute with Cisco has finally come to an agreement. For me, that means two things: first, they’re not trolls sucking money from the big corps for stupid patent infringement, as some might fear. Second, they’re very patient, understanding and sometimes a bit too naive.

Why the fear?

When building embedded systems or when you’re too close to the hardware (such as Cisco) you may take a wise decision to use open source software, as it’s quite likely to be stable and taken care by a good bunch of good people. Even though there are several ways of doing it independently, so your software is not virally infected by the GPL, it’s not always possible and you may have to re-invent the wheel because of that.

It’s not only GPL, patents can also cause a whole lot of damage, and it seems that TomTom has decided to go head first with the Linux community.

So, although the fear is understandable, it’s more of a hysteria than based on actual facts. The FSF hasn’t had much to show on court, and that adds up to the uncertainty of the lawyers, but it’s in cases like the Cisco that they show a much higher maturity that most companies have shown recently, even mature companies like Microsoft.

Richard Stallman

The FSF is not only Stallman. Even though he’s the boss, the organization is a large list of people, sponsors, advisers (and now interns). One thing is to fear what RMS will do when he finds you using GPL in your kitchen scale, but a completely different matter is what the FSF (as an organization) does.

The Cisco case has been going for several years. They offered help, they’ve asked politely, they’ve warned about the potential dangers and so on. A lot has been made before they have actually filled the suit, and they’ve settled it nicely. This shows that they’re not just waiting the next infringement to get you down, they actually care about their (and your) freedom.

The day the FSF starts acting stupid is the day people will drive away. It’s not like Microsoft that you have no option, there’s plenty of options out there, software, licences, partners, advisers, programmers, etc. GNU/Linux is not the decent open source operating system, the BSDs are as good, sometimes better, especially in the embedded case.

The year of Linux

Every year since 1995 is the year of Linux. For me it always was, but I can’t say the same for the rest of the world. Recently, Linux (and other open source software) has played an important role in defining the future of mankind and more and more the Linux community feels that it’s their sweat and blood.

There is a great chance it’ll become the platform of all things in a very short time-frame. Cars, mobile phones, PDAs, netbooks, laptops, desktops, servers, clusters, spaceships. One platform to rule them all and in the darkness bind them, but if they play dumb, their glory might never see daylight.

Lots of people disagree with the new revisions of the GPL license, they feel it bites the hand that feeds it. Many companies feed back open source regularly and that kinda broke the synergy. I personally think that it’s excellent for some cases, but not for all. For instance, development tools should not be restricted, especially when it comes to platforms they can’t reach. Opening the platform is an obvious way around it, but not everything can be exposed and they can’t figure out every implementation detail.

Drivers might also have trouble with GPLv3 for the same reason. Again, there are ways around it, the FSF recently opened a backdoor to develop proprietary plug-ins if they’re blessed, but that might not be suitable for every case.

Solution?

Sorry, not today. Stick to FreeBSD if you can’t cope with GPLv3, find a way to co-exist with the GCC exception and provide the source code of what you have to. If it’s not your core business, you could donate your code to the community and make it GPL too and treat your program as enabling technology, of course, providing your code doesn’t expose any patent or trade secret.

So, well, yeah. Each case is a different case, that’s the problem of being in the long tail.


MySQL down the drain?
April 20th, 2009 under Devel, OSS, Politics, rengolin. [ Comments: 4 ]

Almost 10 years ago, MySQL became a great open source database, part of the LAMP platform (Perl, not PHP) and had everything to compete with the big players in the next few years.

It was then that they have done major releases, with a huge set of new features each, almost once a year. The community was happy using, developing and integrating with other products. But it was around 2005 that the things started going bad…

Back in 2005, when I was still in the loop, I have to say that I wasn’t impressed with the progress that the database had. I wasn’t also impressed with the new view the board gave to big companies (such as Yahoo!) on what was a good bet and what wasn’t.

After release 5.0 (still the production release, irrespective of what Sun says) there wasn’t a major development until Sun acquired MySQL and only then they’ve released 5.1 which they better shouldn’t.

In the old days, MySQL became famous by not implementing foreign keys and transactions, something that every other database had, because of speed issues. That decision became the core of the company and allowed other storage engines (such as InnoDB and BerkeleyDB which had those features) to be integrated, making it very easy to plan your database, using only the features you needed where you needed.

Who’s to blame?

I’m not sure it has something to do with Oracle buying InnoDB and Sleepycat (and now buying Sun, which owns MySQL). Even with all the politics of Oracle slowly buying MySQL in pieces, I don’t believe it’s the whole story. I see much more of an internal conflict and a lack of vision (probably for the lack of guts to keep taking weird decisions and succeeding) than anything else.

Now, MySQL is going down the same drain InnoDB and Sleepycat went, but with a twist: the source code is still GPL. Sun screwed up MySQL in a way I thought it wasn’t possible, Oracle will do it much more efficiently, even if they still play as good guys, it is definitely the end.

Don’t take my word only, my good friend and MySQL guru Jeremy Cole is taking himself out of the loop to avoid the useless politics. Steven (Computerworld) also cannot see how any of the involved companies will get anything in return of this deal.

Is there a light at the end?

Could Monty’s fork become a new MySQL without all the fuss? Could he, the odd guy with odd ideas, put MySQL on the map again? I do hope so, but that will cost MySQL the hall of fame. They’ll need to start over again and eventually fail once they’re there again and restart…

It’ll be fun to watch, at least MySQL had a GPL license which always ease forks and future development. Long live the open source revolution!

UPDATE:

Two excellent articles about the same issue from The Register and Ars Technica.


When refactoring goes wrong
April 15th, 2009 under Devel, rengolin. [ Comments: none ]

Recently I had to implement a very simple feature that would cross the barrier between a few components. As any good software, the communication between the components is done via public interfaces, and that case wasn’t an exception.

Unfortunately, some core interfaces would need to be changed and I knew I was looking for trouble. Nevertheless, I started it anyway and, in the beginning, it was not as bad as I thought. Lots of changes, of course, but nothing too complex. But the devil is in the details…

Each refactoring pointed out to another refactoring needed, which, if I implemented the first I’d either need to do the second or have to hack it, so I did it. What happened is that it didn’t stop in the second, it went on and on. Each refactoring uncovered another and another. In the end, the state of the program and the unit tests were hardly stable.

I’ve reached a cycle, where refactoring A would break B, B would break C and C would break A again in a different way. The snake was biting its own tail…

That taught me some very important lessons:

  1. Sometimes a simple refactoring can cost you the week and still have to be rolled back. In this case I believe I couldn’t have done differently, I had no way to know how bad it was before actually start poking around the code.
  2. When you face this situation, the best thing you can do is take notes on your changes and roll back. Trying to fix it, especially when you’ve changed quite a lot of unit tests, is recipe to disaster.
  3. Examine your notes and decide which refactoring need to be done first. It’ll probably be backwards to what you had to do in the first refactoring. When in doubt, start from the last and go up the stack.
  4. Never do more than one refactoring at a time. When faced with this situation, stop, take notes, roll back and do the last refactoring first. Test everything and commit before you start the next step.

The last item is especially true when more developers are actively changing the code. This will give them time to adapt their changes to yours and adapt your changes to theirs.

Those lessons I’ve learned, I believe, are irrespective of the version control you use. I know that GIT has some pretty impressive conflict resolutions when merging your code but I doubt it’ll successfully merge high-level concepts (such as object orientation principles and design patterns). If you can’t tell if the test is right or wrong, how could the version control?


Ad infinitum
February 12th, 2009 under Algorithms, Devel, Life, OSS, Physics, rengolin, World. [ Comments: none ]

Quality is fundamental in any job, and software is no exception. Although fairly good software is relatively easy to do, really good software is an art that few can truly reach.

While in some places you see a complete lack of understanding about the minimal standards of software development, in others you see it in excess. It’s no good either. In the end, as we all know, the only thing that prevails is common sense. Quality management, all sorts of tests and refactoring is fundamental to the agile development, but being agile doesn’t mean being time-proof.

One might argue that, if you keep on refactoring your code, one day it’ll be perfect. That if you have unit tests, regression tests, usability test (and they’re also being constantly refactored), you won’t be able to revive old bugs. That if you have a team always testing your programs, building a huge infrastructure to assure everything is user proof, users will never get a product they can’t handle. It won’t happen.

It’s like general relativity, the more speed you get, the heavier you become and it gets more difficult to get more speed. Unlike physics, though, there is a clear ceiling to your growth curve, from where you fall rather than stabilize. It’s the moment when you have to let go, take out what you’ve learned and start all over again, probably making the same mistakes and certainly making new ones.

Cost

It’s all about cost analysis. It’s not just money, it’s also about time, passion, hobbies. It’s about what you’re going to show your children when they grow up. You don’t have much time (they grow pretty fast!), so you need to be efficient.

Being efficient is quite different on achieving the best quality possible, and being efficient locally can also be very deceiving. Hacking your way through every problem, unworried about the near future is one way of screwing up things pretty badly, but being agile can lead you to the same places, just over prettier roads.

When the team is bigger than one person, you can’t possibly know everything that is going on. You trust other peoples judgements, you understand things differently and you end up assuming too much about some of the things. Those little things add up to the amount of tests and refactoring you have to run for each and every little change and your system will indubitably cripple up to a halt.

Time

For some, time is money. For me, it’s much more than that. I won’t have time to do everything I want, so I better choose wisely putting all correct weights on the things I love or must do. We’re not alone, nor is all we do for ourselves, so it’s pretty clear that we all want our things to last.

Time, for software, is not a trivial concept. Some good software don’t even get the chance while some really bad things are still being massively used. Take the OS/2 vs. Windows case. But also some good software (or algorithms or protocols) have proven to be much more valuable and stable than anyone ever predicted. Take the IPv4 networking and the Unix operating system (with new clothes nowadays) as examples.

We desperately need to move to IPv6 but there’s a big fear. Some people are advocating for decades now that Unix is already decades deprecated and still it’s by far the best operating system we have available today. Is it really necessary to deprecate Unix? Is hardware really ready to take the best out of micro-kernel written in functional programming languages?

For how long does a software lives, then?

It depends on so many things that it’s impossible to answer that question, but there are some general rules:

  • Is it well written enough to be easy to enhance to users’ request? AND
  • Is it stable enough that won’t drive people away due to constant errors? AND
  • Does it really makes the difference to people’s lives? AND
  • Are people constantly being reminded that your software exists (both intentionally and unintentionally)? AND
  • Isn’t there something else much better? AND
  • Is the community big enough to make migration difficult?

If you answered no to two or more questions, be sure to review your strategy, you might already be loosing users.

There is another path you might find your answers:

  • Is the code so bad that no one (not even its creator) understand it anymore? OR
  • The dependency chain is so unbearably complicated, recursive and fails (or works) sporadically? OR
  • The creator left the company/group and won’t give a blimp to answer your emails? OR
  • You’re relying on closed-source/proprietary libraries/programs/operating systems, or they have no support anymore? OR
  • Your library or operating system has no support anymore?

If you answered yes to two or more questions, be sure to review your strategy, you might already be on a one-way dead-end.

Ad infinitum

One thing is for sure, the only thing that is really unlimited is stupidity. There are some things that are infinite, but limited. Take a sphere, you can walk on a great circle until the end of all universes and you won’t reach the end, but the sphere is limited in radius, thus, size. Things are, ultimately, limited in the number of dimensions they’re unlimited.

Stupidity in unlimitedly unlimited. If the universe really has 10 dimensions, stupidity has 11. Or more. The only thing that will endure, when the last creature of the last planet of the last galaxy is alive is his/her own stupidity. It’ll probably have the chance to propagate itself and the universe for another age, but it won’t.

In software, then, bugs are bound to happen. Bad design has to take part and there will be a time when you have to leave your software rest in peace. Use your time in a more creative way because for you, there is no infinite time or resources. Your children (and other people’s children too) will grow quick and deprecate you.


Closed source development
January 28th, 2009 under Devel, Digital Rights, OSS, rengolin. [ Comments: none ]

While closed source development has its niche (and a very important one), it does feel a bit weird.

I’m now working on low-level development (debuggers) at ARM, one of the things I like most but also a rare thing to find good quality open source development (with the exception of the gnu tools, of course). Of course there is a portion of your work that goes back to the community (via open standards, limited support for the open tools) but it’s not easy to find a job to write code exclusively to the gdb or gcc.

What I’m finding weirder is the fact that the documentation you need is seldom on the Internet (Google or usenet). The good side is that the guys that created the standards and tools are at your doorstep, so it’s quite easy to get hold of them in case you need something off the charts. But that’s normally true with open source as well.

The other weird thing is knowing what you can tell and what you can’t. I have no idea of what part of my current project is public so I just don’t talk about anything of it. But I think that’s just a matter of getting used to, just like I did before. Besides, albeit at EBI I could even show my (or anybody else’s) source code, I don’t think that anybody ever cared that much.

At last, licences. It’s so easy when you develop GPL or LGPL (or similar). Just write whatever you want, use whatever library you need and put a GPL3 tag on your code. That’s it. Simple as that. Now I have to think what would be the impact of that library on the license of what I write, and that’s something I didn’t want to care…

Also, if a document is GPL-ed, you have to GPL it too. If it’s version 3, everything you write (including company’s previous ideas) become GPLv3 as well. That’s a big nuisance. I do understand GPLv3 for code, even apply that to my own source code, but it does annoy a lot when applied to documents.

Although weird for some reasons, it’s not bad at all. I have many more reasons to love my new job. Excellent team, great environment and an impressive code quality, which for me, is a must.


Who’s afraid of the big bad code?
January 14th, 2009 under Articles, Devel, InfoSec, Politics, rengolin. [ Comments: none ]

What would Bruce Schneier say about the magic list that the NSA is putting together with Microsoft and Symantec of the 25 biggest errors in code that normally lead to a security flaw.

Don’t get me wrong, putting out a list of bad practices is a fantastic job, that’s for sure. It makes programmers more aware of the dangers, and as the article says itself, newbies can learn from experience before getting into a new field.

But the way that (lay) people take it makes it so magical that the practical side of such list is greatly reduced.

Order and size of the list

I understand that the order must have some sense, but which? Is it ordered by number of attacks in the last 12 months? Or by the sum of all reported losses caused by them? Or by number of such errors found in common code (on those companies’ code, of course)? Or by any other subjective “importance” factor from a bunch of “Security Experts”?

Also, why 25? Why not 30? Who says that the 25th is so important to show up in the list and not the 26th?

Real-world

We programmers know about most of them, know the problems they pose and normally how to fix them. We often want to fix them, but that normally requires some refactoring and now it’s time to implement those features that our client needs for the demo, right? We can think about that later… can we? Will we?

Than, NSA decides to make this a priority for the country and claim it as a national security problem. Big companies like fancy terms, and would strive to adopt any new standard that shows up in the market.

Then, comes down the VP of engineering and say:

“We need to make sure every programmer knows how to write code that is free of the top 25 errors.”

Done, he can put the GIF image from the NSA saying his company’s software is secure against all odds, according to the NSA and DHS.

Now, coders and technicians, tell me: Would any editor, IDE or compiler ever be able to spot those errors with 100% accuracy?

“Then we need to make sure every programming team has processes in place to find and fix these problems [in existing code] and has the tools needed to verify their code is as free of these errors,”

Of course not, but they will try, and Microsoft will put a beta on Visual C++ and other companies will tell their clients that their software is being tested with the new product and the clients will buy, after all, who are them to say anything about that matter?

Protect against who?

Now, after so much time and effort, 30+ companies and government departments working hard to come up with a (quite good) list of the most common errors that lead to security flaws for what?

“The real dedicated serial attacker will probably find a way in even if all these errors were removed. But a high school hacker with malicious intent – ankle-biters if you will – would be deterred from breaking in.”

WHAT?!?! All that to stop script-kids? For heavens’ sake, I thought they were serious on that… Well, maybe I expected too much from the NSA… again…

(Note: quotes from original article, ipsis litteris)


On Workflows, State Machines and Distributed Services
September 21st, 2008 under Devel, Distributed, rengolin. [ Comments: none ]

I’ve been working with workflow pipelines, directly and indirectly, for quite a while now and one thing that is clearly visible is that, the way most people do it, it doesn’t scale.

Workflows (aka. Pipelines)

If you have a sequence of, say, 10 processes in a row (or graph) of dependencies and you need to run your data through each and every single one in the correct order to get the expected result, the first thing that comes to mind is a workflow. Workflows focus on data flow rather than on decisions, so there is little you can do with the data in case it doesn’t fit in your model. The result of such inconsistency generally falls in two categories: discard until further notice or re-run the current procedure until the data is correct.

Workflows are normally created ad-hoc, from a few steps. But things always grow out of proportions. If you’re lucky, these few steps would be wrapped into scripts as things grow, and you end up with a workflow of workflows, instead of a huge list of steps to take.

The benefit of a workflow approach is that you can run it at your own pace and assure that the data is still intact after each step. But the downfall is that it’s too easy to incorrectly blame the last step for some problem on your data. The status of the data could have been deteriorating over the time and the current state was the one that picked that up. Also, checking your data after each step is a very boring and nasty job and no one can guarantees that you’ll pick up every single bug anyway.

It becomes worse when the data volume increases to a level you can’t just look it up anymore. You’ll have to write syntax checkers for the intermediate results, you’ll end up with thousands of logs and scripts just to keep the workflow running. This is the third stage: meta-workflow for the workflow of workflows.

But this is not all, the worse problem is still to come… Your data is increasing by the minute, you’ll have to split it up one day and rather sooner than later. But, as long as you have to manually feed your workflow (and hope the intermediate checks work fine), you’ll have to split the data manually. If your case is simple and you can just run multiple copies of each step in parallel with each chunk of data, you’re in heaven (or rather, hell for the future). But that’s not always the case…

Now, you’ve reached a point where you have to maintain a meta-workflow for your workflow or workflows and manually manage the parallelism and collisions and checks of your data only to find out at the end that a particular piece of code was ignoring an horrendous bug in your data when it was already public.

If you want to add a new feature or change the order of two processes… well… 100mg of prozac and good luck!

Refine your Workflow

Step 1: Get rid of manual steps. The first rule is that there can be no manual step, ever. If the final result is wrong, you turn on the debugger, read the logs, find the problem, fix it and turn it off again. If you can’t afford to have wrong data live than write better checkers or rather reduce the complexity of your data.

One way to reduce the complexity is to split the workflow into smaller independent workflows, which each one generate only a fraction of your final data. If you have a mission critical environment, you better off with 1/10th of it broken than with the whole. Nevertheless, try to reduce the complexity on your pipeline, data structure and dependencies. When you have no change to do, re-think about the whole workflow, I’m sure you’ll find lots of problems every iteration.

Step 2: Unitize each step. The important rule here is: each step must process one, and only one, piece of data. How does it help? Scalability.

Consider a multiple data workflow (fig. 1 below), where you have to send the whole thing through, every time. If one of the process is much slower than the others, you’ll have to split your data for that particular step and join it again for the others. Splitting your data once at the beginning and running multiple pipelines at the same time is a nightmare as you’ll have to deal with the scrambled error messages yourself, especially if you still have manual checks around.

Multiple data Workflow
Figure 1: Split/Join is required each parallel step

On the other hand, if only one unit passes through each step (fig. 2), there is no need to split or join them and you can run as many parallel processes as you want.

Single data Workflow
Figure 2: No need to split/join

Step 3: Use simple and stupid automatic checks. If possible, don’t code anything at all.

If data must be identical on two sides, run a checksum (CRC sucks, MD5 is good). If the syntax needs to be correct, run a syntax checker, preferably a schema/ontology based automatic check. If your file is too complex or specially crafted so you need a special syntax check, re-write your file to use standards (XML and RDF are good).

Another important point on automatic checking is that you don’t have to check your emails waiting for an error message. When the subject contains the error message is already a pain, but when you have to grep for errors inside the body of the message? Oh god! I’ve lost a few lives already because of that…

Only mail when a problem occurs, only send the message related to the specific problem. It’s ok to send a weekly or monthly report just in case the automatic checks miss something. Go on and check the data for yourself once in a while and don’t worry, if things really screw up your users will let you know!

Automation

But, what’s the benefit of allowing your pipeline to automatically compute individual values and check for consistency if you still have to push the buttons? What you want now is a way of having more time to look at the pipeline flowing and fix architectural problems (and longer tea time breaks), rather than putting down the fire all the time. To calculate how many buttons you’ll press just multiply the number of data blocks you have by the number of steps… It’s a loooong way…

If still, you like pressing buttons, that’s ok. Just skip step 2 above and all will be fine. Otherwise, keep reading…

To automate your workflow you have two choices: either you fire one complete workflow for each data block or do it like a workflow of data through different services.

Complete Workflow: State Machines

If your data is small, seldom or you just like the idea, you can use a State machine to build a complete workflow for each block of data. The concept is rather simple, you receive the data and fire it through the first state. The machine will carry on, sending the changed data through all necessary states and, at the end, you’ll have your final data in-place, checked and correct.

UML is pretty good on defining state machines. For instance, you can use a state diagram to describe how your workflow is organised, class diagrams to show how each process is constructed and sequence diagrams to describe how processes talk to each other (preferably using a single technology). With UML, you can generate code and vice-versa, making it very practical for live changes and documentation purposes.

The State Design Pattern allows you to have a very simple model (each state is of the same type) with only one point of decision where to go next (when changing states): the state itself. This gives you the power to change the connections between the states easily and with very (very) little work. It’ll also save you a lot on prozac.

If you got this far you’re really interested on workflows or state machines, so I assume you also have a workflow of your own. If you do, and it’s a mess, I also believe that you absolutely don’t want to re-code all your programs just to use UML diagrams, queues and state machines. But you don’t need to.

Most programming languages allow a shell to be created and an arbitrary command to be executed. You can then manage the inter-process administration (creating/copying files, fifos, flags, etc), execute the process and, at the end, check the data and choose the next step (based on the current state of your data).

This methodology is simple, powerful and straight-forward, but it comes with a price. When you got too many data blocks flowing through, you end up with lots of copies of the same process being created and destroyed all the time. You can, however let the machine running and only provide data blocks, but still this doesn’t scale as we wanted on step 2 above.

Layered Workflow: Distributed Services

Now comes the holy (but very complex) grail of workflows. If your data is huge, constant flowing, CPU demanding and with awkward steps in between, you need to program thinking on parallelism. The idea is not complex, but the implementation can be monstrous.

On figure 2 above, you have three processes, A, B and C, running in sequence, and process B had two copies running because it took twice as long as A and C. It’s that simple: the more it takes to finish, more copies you run in parallel to make the flow constant. It’s like sewage pipes, rain water can flow in small pipes, but house waste will need much bigger ones. but later on, when you filter the rubbish, you can use small pipes back again.

So, what’s so hard on implementing this scenario? Well, first you have to take into account that those processes will be competing for resources. If they’re on the same machine, CPU will be a problem. If you have a dual core, the four processes above will share CPU, not to mention memory, cache, bus etc. If you use a cluster, they’ll all compete for network bandwidth and space on shared filesystems.

So, the general guidelines for designing robust distributed automatic workflows are:

  • Use layered state architecture. Design your machine in layers, separate the layers into machines or groups of machines and put a queue or a load-balancer in between each layer (state). This will allow you to scale much easier as you can add more hardware to a specific layer without impacting on others. It also allows you to switch off defective machines or do any maintenance on them with zero down-time.
  • One process per core. Don’t spawn more than one process per CPU as this will impact in performance in more ways than you can probably imagine. It’s just not worth it. Reduce the number of processes or steps or just buy more machines.
  • Use generic interfaces. Use the same queue / load-balancer for all state changes and, if possible, make their interfaces (protocols) identical so the previous state doesn’t need to know what’s on the next and you can change from one to another with zero cost. Also, make the states implement the same interface in the case you don’t need queues or load-balancers for a particular state.
  • Include monitors and health checks in your design. With such complex architecture it’s quite easy to ignore machines or processes failing. Separate reports into INFO, WARNING and ERROR and give them priorities or different colours on a web interface and mail or SMS only the errors to you.

As you can see, by providing a layered load-balancing, you’re getting performance and high-availability for free!

Every time data piles up in one layer, just increase the number of processes on it. If a machine breaks, it’ll be off the rotation (automatic if using queues, ping-driven or MAC-driven for load-balancers). Updating the operating system, your software or anything on the machine is just a matter of taking it out, updating, testing and deploying back again. Your service will never be off-line.

Of course, to get this benefit for real you have to remove all single-points-of-failure, which is rarely possible. But to a given degree, you can get high-performance, high-availability, load-balancing and scalability at a reasonable low cost.

The initial cost is big, though. Designing such complex network and its sub-systems, providing all safe-checks and organizing a big cluster is not an easy (nor cheap) task and definitely not done by inexperienced software engineers. But once it’s done, it’s done.

More Info

Wikipedia is a wonderful source of information, especially for the computer science field. Search for workflows, inter-process communication, queues, load-balancers, commodity clusters, process-driven applications, message passing interfaces (MPI, PVM) and some functional programming like Erlang and Scala.

It’s also a good idea to look for UML tools that generates code and vice-versa, RDF ontologies and SPARQL queries. XML and XSD is also very good for validating data formats. If you haven’t yet, take a good look on design patterns, especially the State Pattern.

Bioinformatics and internet companies have a particular affection to workflows (hence my experience) so you may find numerous examples on both fields.


« Previous entries Next entries »