As well as any other optimization, you need to think about I/O bottlenecks during architecture phase but some communication details are not always obvious before they become a problem. There are a few obvious problems listed in the previous chapter related to distributed communication but there are also some local optimizations that are some other local optimizations that are very easy and can increase a lot the speed of a program.
The first rule for logs is: logs are for debugging purposes not production. This is why some languages like Java take a great care on keeping the stack information during execution so you can know exactly where it failed and then, later in debug environment, turn on full logs and run with the same input to get the problem. The time spent on storing the information is minimal compared to the time spent on writing it to a disk so the gains are considerable, plus you just need to write logs that are really necessary (errors and critical paths).
The Apache web server is a good example of that. There are several logs Apache can write but you can control which logs and which information per log, saving a lot of unnecessary I/O operations and disk space. The Apache access log shows important information about who accessed which page in your website and the error log is the standard error output of all programs executed by it, so you can print all error messages, stack traces and core dumps to it and investigate later.
Using standard logging libraries like Log4J (Java) and log4cpp (C++) you can set the level each particular message is and set the general minimal level to log. To be very simplistic, lets suppose you have three levels: ERROR, WARN and INFO. If the global log level is ERROR, it means that only ERROR messages will be shown, if it's WARN both ERROR and WARN messages and if it's INFO all three. A very simple example in generic code:
LogClass log; log.setLevel(WARN); // First message not shown log.write(INFO, "Life is hard"); // The warning is shown log.write(WARN, "Beware of the dog"); // Errors are always shown (in our three-level example) log.write (ERROR, "Major disruptions in the M25, reroute");
You can always use a command line arguments to set the logging level, but what if you want to log everything in one part of the program and nothing (or just errors) in all the rest? It's easy if you add another argument saying which part to be logged and even which level for each part if you do need this fine-grained control.
// Default level in production is ERROR but arguments can override defaultLogLevel = getArgument(logLevel) || ERROR; // Create object with log level LogClass log; log.setLevel(defaultLogLevel); // Normal code, INFO not written log.write(INFO, "Life is hard"); // Critical part of the code, change level and go back when finished log.setLevel(WARN); log.write(WARN, "Beware of the dog"); log.setLevel(defaultLogLevel); // Errors are always shown (in our three-level example) log.write (ERROR, "Major disruptions in the M25, reroute");
I/O redirection is universal and most programs are written to read from the standard input and to write to standard output and standard error handles. It makes possible for you to cascade the output of programs to others and create a complex data flow without a single line of code.
cat file | grep -i foo | awk '{print $1}' | sort -u > another_file
In the example above we're getting the first field (before the first empty space) of the lines containing 'foo' from 'file', sorting and removing the duplicated entries and writing that to 'another_file'. To write a program to do that you would have to define which containers to use and write a routine to get the text before the first empty space and that would take much more time that writing the line above.
But this power comes with a price: performance. Because processes are in different contexts, every time the data goes from one program to another the data must be copied to a different area of the memory, cache pages will be invalidated and the two process will compete for CPU and memory access. So, if the files are small and you're going to do it only once (this is important), using I/O redirection is probably the best option.
If the line above is part of a production schedule you must put it in a script. This is mandatory because most of the time this kind of trick is followed by manual checks of errors and return values. If you do it manually (and if you're organized) you'll end up writing a documentation on how to check for errors and increase the amount of manual work even more and worse, increasing the probability or more errors and making it much more difficult for new staff to run the same task.
Even in shell scripts we use a lot of redirection and that's not a problem providing your files are small and will remain small for a long time. Also, it's very important to note that writing real code (ie. core business) in shell scripts is a bad move, so this tricks should only be used in accessories or wrappers. If the automation is part of your core business choose a better scripting language that can hold more structured data and have better flow control like Perl or Python.
When one process produce some data that will later be used by other process not exactly in the same pipeline you need to create temporary files. The performance problems are even worse than with I/O redirection because on top of all that it also involves disk write and read, which are much slower than memory or CPU operations. Therefore, temporary files should be kept small at all costs. Avoid them if possible by rearranging the tasks' order or writing programs (not scripts) to do the same job without the temporary files.
Another common use of temporary files is during data split for parallel processing. The easy way to do it is to split the files beforehand into manageable chunks and run the program on each chunk on each node. This is easy but you spend a lot of time splitting the file and a lot of disk space to hold the temporary files (double the space as the sum of chunks is the same size of the whole).
There are a few ways of avoiding this and the most simple is by reading the file using offsets. If, instead of copying chunks to the disk, your splitting program reads the file and store only the offsets in the file corresponding to the chunks you can later on read the file from A to B only on each chunk. Because reading the file does not change it you can do it in parallel even if the chunks overlap on some regions.
$ split --chunks=30 --type=offset big_file $ read --chunk=1 big_file | my_program & $ read --chunk=2 big_file | my_program & (...)
Another solution if you don't have a big cluster available is to control the offsets without quitting the program. This can be done by splitting and forking or threading the execution for multi-core machines or using Message Passing Interfaces for multi-node clusters. Always analyze the amount of data going through the network to see if it'll impact more than splitting or not, the answer might not be obvious in all cases.
class chunk {
long from;
long to;
};
class chunks {
list chunk;
};
class results {
string list result;
};
// Read file, getting start and stop byte offsets
chunks = split(big_file);
foreach n (chunks) {
thread[n] = new readThread();
result[n] = thread.run(chunk[n]);
}
// Wait for all threads to finish and
final_result = reduce(results);
Another was of avoiding unnecessary I/O is to use queues. Instead of reading a lot of data and storing in temporary files you could separate the reader from the algorithm. You can write one program reading the data and putting them in a queue, connected to a socket or FIFO and whenever any other process connects to that socket the reader prints one block (entry) of data. If your consumer takes much more time than the producer you can have many consumers for a single producer.
This not only saves time and space by not writing temporary files but also scales independently when the time taken for different parts of the execution are too different and you can optimize the resources in different ways. If you're reading from the disk, normally fewer readers than consumers are necessary because disk is usually quite fast but it may be the opposite with network reading. Because network latency (especially over the internet) can be quite high you spend much time waiting for the information, so if you parallelize the waiting you only have to wait once for all resources. Web browsers do it to increase page load speed.
The implementation is rather simple. Grab a generic queue manager or write one using standard containers and algorithms from your preferred programming language and attach your blocks of code to them. Normally those libraries open network or Unix sockets so instead of writing to disk make your application write to a socket and assure you do it atomically, so every entry is within a block that will be retrieved later by the consumer. You can use the queue's own boundaries or teach the queue (if you did it yourself) to understand the boundaries of your data. The second solution is less generic but requires less format conversions that can save a few seconds.
The producer is quite simple, it only have to manage how to extract information from the media. It can be from a database, standard input, network or more complex media. The good thing about separating producers from consumers is that the I/O logic is normally very specific to the kind of media and that mixed with your core business could obfuscate the code.
while (entry = readFromStdin()) {
socket.print(entry);
}
To implement the queue you need to separated parts, one that listen to the socket and enqueue the messages and another, completely independent that dequeue whenever someone asks on the other side. You can achieve this by running in different threads or processes (fork) or by using select systemcall on sockets to manage more than one connection simultaneously.
// Enqueue
while (entry = socketIN.read()) {
queue(entry);
}
// Dequeue
while (socketOUT.read() == "GET") {
if (queue.notEmpty()) {
socketOUT.print(dequeue());
} else {
socketOUT.print("EMTPY");
}
}
The consumer is also quite simple:
// Main loop
while (true) {
entry = server.get();
if (entry == EMPTY) {
sleep 1 or break or something smarter;
}
// Do whatever you want with your data
...
}
The scratch of a protocol above is just a very bad example, you need to carefully write all states of this state machine and implement that with great detail, state machines are really complicated and can lead to severe unexpected results. To understand better how queues work please refer to a proper document or book on the subject.
The idea is to write/install a queue connector library (push and pull) over sockets (network, Unix) and a queue manager. You'll need one queue manager between each part of your program to scale independently from the others, parts that wouldn't benefit from independent scalability could be tied up together or connected directly.