Profile is the art of finding which part of your code is slower than it should be. It's an art because it's not exact: first, it's not only obvious what a part is and how to enclosure it in one unit. Second, the more precise you get your data worse your code looks like. You need to trade precision for usability and defining the threshold for each developer, for each program is not exact.
Finding the slowest parts of the program is normally ok but there are some parts that are inherently slower (like disk access or network) and there isn't much to do unless you parallelize. So the whole point is making the list of timings, number of calls, memory usage and importance of each part of your code and start optimizing the top entries.
The most obvious way of measuring running time is by getting the time just before and just after the big process you have and take the difference. This method is very flexible, you can put the time calls at any point (on any function) so you'll measure exactly what you want to and nothing more, so you get the exact time spent. This is the best approach when you have different algorithms to solve a problem and you're trying to define which one is the fastest.
This method is normally used in tests when defining the critical line but can also be used in emergencies when the code doesn't have any kind of profiling techniques or tools and for that they must be immediately erased afterwards. If you still want/need a good profiler find a statistical tool.
As you're probably expecting the discrete profiling is very simple:
start = getTimeStamp(); obj.method(data); stop = getTimeStamp(); print "The method took " + (stop - start) + " seconds to run\n";
However, there are a few things to consider when testing performance with the discrete profiling method, most of them quite simple and obvious. First, put the start measurement exactly before the start of your algorithm and not the whole block especially when defining complex data types (such as big matrices or big heavy objects) or when expecting input or sending output that might get locked somewhere in the way.
// Many long initializations
long start, stop;
MyClass obj;
obj.initMatrix(200, 200, 200, 200);
data = db.query("select * from table limit 1000000")
// Running
start = getTimeStamp();
obj.method(data);
stop = getTimeStamp();
// Results
print "The method took " + (stop - start) + " seconds to run\n";
There is another important consideration to take when comparing two algorithms: if the preparation of data is different, they must be taken into account. In the example above there was a big matrix initialization and a select in a database, both can be quite costy. Suppose your second algorithm doesn't depend on the database results and a third that only depends on it or in still other data like a tree or data from a network connection. All of this takes time, different time for each task that must be added up to the running time.
So the rule is: if it's used by all should be external to the time counting, internal otherwise.
// Many long initializations
long start, stop;
MyClass obj;
obj.initMatrix(200, 200, 200, 200);
// Running test #1
start = getTimeStamp();
obj.methodOne();
stop = getTimeStamp();
print "The method #1 took " + (stop - start) + " seconds to run\n";
// Running test #2
start = getTimeStamp();
data = db.query("select * from table limit 1000000")
obj.methodTwo(data);
stop = getTimeStamp();
print "The method #2 took " + (stop - start) + " seconds to run\n";
Some clock functions don't return the time in seconds (or milliseconds) so you have to do the math, which is generally quite simple. In C the function is called gettimeofday and the trick goes like this:
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
int
main () {
struct timeval tv;
double start, stop;
gettimeofday(&tv, NULL);
start = ( (double) tv.tv_sec + (double) tv.tv_usec / 10e6 );
sleep(1);
gettimeofday(&tv, NULL);
stop = ( (double) tv.tv_sec + (double) tv.tv_usec / 10e6 );
printf ("Elapsed time is %lf\n", (stop - start));
return 0;
}
If you execute this code you'll probably have results like this:
$ ./elapsed Elapsed time is 1.000694
Where the 0.000694 seconds are due to calculations of the start variable and the overhead for calling functions, but it can be considered noise even for small times like one second. You could have used two timeval structures to get the time and do the math at the end too, but the difference is not that big:
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
int
main () {
struct timeval tv_start, tv_stop;
double start, stop;
gettimeofday(&tv_start, NULL);
sleep(1);
gettimeofday(&tv_stop, NULL);
start = ( (double) tv_start.tv_sec + (double) tv_start.tv_usec / 10e6 );
stop = ( (double) tv_stop.tv_sec + (double) tv_stop.tv_usec / 10e6 );
printf ("Elapsed time is %lf\n", (stop - start));
return 0;
}
$ ./elapsed Elapsed time is 1.000469
This approach is not suitable for bottlenecks analysis, though. Bottlenecks are usually hidden in the implementation details and you don't know exactly in which function or which combination of methods it can happen so there is no right place to put the time calls to effectively identify problems.
It may work reasonably for bottleneck analysis when there is one big main function calling several smaller ones and you need to know which part is taking more time. But from a quality point of view, big main functions are bad for maintenance and time calls pollute the code and deviate the attention from the algorithm. In emergencies I have used temporary time calls within function calls and it worked, we found the problematic part and fixed, but later we removed all time calls and left a comment describing the optimization.
For a more long term solution you would have to put a time call as the first and last thing on each method or function and keep this counter global or return the value every time and keep track of hundreds of counters. Again, this would deviate the attention, pollute the code and even make it slower than expected. Because the time call is quite fast it wouldn't have much effect on slow and seldom called methods but would be a huge impact for the faster and often called methods, changing the execution profile of your program and yielding wrong results.
Statistical analysis is done by running the program and once in a while checking in which function the program is. At the end you have a list of time spent on each function and the number of times that particular function was called. With this you find out which are the most called functions (most popular) and the bottlenecks (slowest).
You may question this approach as imprecise and interference prone and that's one of the biggest motives why we've discarded the previous approach as a decent profiler tool. That's a good questions and the answer is simple: This method, although non-deterministic, is very precise and interfere less with the program than the previous one. Also, you don't have to add a single line to your program to have the statistical data so not cluttering your code nor making you spend more time profiling than coding.
Statistical profilers work by changing the way your functions are executed. Each function will increment a counter every time they're called and keep the information of which function called it so the profiler can build the caller/callee graph and know how many times each function was called. The profiler will also poke the program (normally via profil systemcall, but sometimes using timer signals) around 100 times per second, so you get around 10% of precision or better. For a program running around 100 seconds it drops to 1%.
Because the statistical part is only to build the histogram of execution time (not caller/callee graph or call count list), 100 times per second is much more than enough to catch the worse problems and because you shouldn't care about functions that spend less than 1% of running time, they're not worth optimizing anyway. If your program is too quick you can always combine several runs to get the expected error value.
Most programming languages have statistical profilers and the functionality is pretty much the same. One of the oldest profiles still in use today is the gprof for the C language. Because the three phases are clear: compiling, executing and getting the graphs are well separated, I'll use it as an example of profiling. The others should follow the same steps, probably missing a few to make it simpler to use.
You need to inform your program that it must include the counters for every function, so gcc should be aware of it via the option -pg:
$ gcc -o my_prog -pg my_prog.c
When you execute the program it'll automatically generate a file called gmon.out in your working directory:
$ ./my_prog $ ls gmon.out gmon.out
Now for what really matters, the statistics:
$ gprof -pb int Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 59.59 2.22 2.22 20032769 0.00 0.00 f_plain 18.25 2.90 0.68 20004097 0.00 0.00 f_trig 10.74 3.30 0.40 2 0.20 0.93 simpson 9.13 3.64 0.34 2 0.17 0.90 trapezoid 2.42 3.73 0.09 main 0.00 3.73 0.00 27 0.00 0.00 romberg 0.00 3.73 0.00 9 0.00 0.00 to_sec 0.00 3.73 0.00 6 0.00 0.61 integrate 0.00 3.73 0.00 1 0.00 3.64 test
This is the result of a program to test the performance of three different numerical integration methods: trapezoid, simpson and romberg. All of them use the plain and trigonometric functions and the quality is the same (ie. they stop whenever the error is below a certain threshold).
As you can see both test functions (plain and trig) account for almost 80% of the total time! So, if there is something to optimize it's them. Looking deeper at the functions we see that there isn't much more to do, maybe changing the two calls to sin in the second function but still, the f_trig only account to 18% of the running time, so the evil part is sqrt:
double f_plain (double x) {
// [x(1-x^2)]^1/2
return (x*sqrt((1-x*x)));
}
double f_trig (double x) {
// substitute x = con x on the above function
return (cos(x)*sin(x)*sin(x));
}
We could search for better implementations of sqrt outside the C standard library or rearrange the equation to get rid of the root but lets trust the libc and focus on our program for a while. What you can do is to avoid calling those functions all the time.
The “total s/call” is the average of time spent inside that function and its children. In the case of trapezoid, romberg and simpson it also includes f_plain and f_trig. So checking their timings you'll find out that the time spent by romberg is zero.
It's also called 27 times because it's iterative (could be recursive too) and internally it calls f_trig or f_plain only once per call so we're sure that the romberg function called the very expensive functions in our program exactly 27 times out of more than 20million times!
By analysing the profiler output you found a few things. First, f_plain and f_trig are very expensive and are called millions of times so they're the best candidates for optimization, but looking a bit deeper to the other functions you found out that romberg only calls them 27 times instead of 20million.
By using only the romberg method your program would take milliseconds instead of seconds to run and you'd probably find other bottlenecks, but in the order of milliseconds you shouldn't be worried about bottlenecks, unless your program (or library) is called millions of times too, than you can go and search for a better implementation of sqrt.
As well as CPU, memory plays a very important role on program execution. In fact it's even more crucial for lack of CPU will make the program run slower and lack of memory will make the program break instantly. Therefore, memory usage monitoring and profiling is also fundamental for the safety of the program.
Some programming languages use dynamic allocation (like C and C++) but the control of the clean up is up to you. Not only cleaning objects after using them but keeping in mind the amount of objects you'll have in the average and worst case scenario comparing with the amount of memory you have available in the lowest configuration possible. Unrestricted object creation will lead to memory starvation and not cleaning unused object will lead to memory leak that can, with time, lead to memory starvation as well.
The best tool for all kinds of memory check (and thread check too) is Valgrind. It'll help you assess how much memory you're using, how much was not cleaned or cleaned twice and even if you're using memory you can't access. This is just the MemCheck, Valgrind has lots of other tools and I greatly encourage you to investigate further. Note that, because it runs on the binary you can profile programs in any programming languages, even virtual-machine based or interpreted.
The most basic check Valgrind does is the memory usage and memory leak check. My integral program didn't have any memory leak so I'll use the simple example from Valgrind's documentation:
#include <stdlib.h>
void f(void)
{
int* x = malloc(10 * sizeof(int));
x[10] = 0; // problem 1: heap block overrun
} // problem 2: memory leak -- x not freed
int main(void)
{
f();
return 0;
}
Running valgrind on it we get the report of the two problems:
$ valgrind ./valgrind (...) ==6407== Invalid write of size 4 ==6407== at 0x804838F: f (valgrind.c:6) ==6407== by 0x80483AC: main (valgrind.c:11) ==6407== Address 0x4185050 is 0 bytes after a block of size 40 alloc'd ==6407== at 0x4022765: malloc (vg_replace_malloc.c:149) ==6407== by 0x8048385: f (valgrind.c:5) ==6407== by 0x80483AC: main (valgrind.c:11) ==6407== ==6407== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 11 from 1) ==6407== malloc/free: in use at exit: 40 bytes in 1 blocks. ==6407== malloc/free: 1 allocs, 0 frees, 40 bytes allocated. ==6407== For counts of detected errors, rerun with: -v ==6407== searching for pointers to 1 not-freed blocks. ==6407== checked 59,676 bytes. ==6407== ==6407== LEAK SUMMARY: ==6407== definitely lost: 40 bytes in 1 blocks. ==6407== possibly lost: 0 bytes in 0 blocks. ==6407== still reachable: 0 bytes in 0 blocks. ==6407== suppressed: 0 bytes in 0 blocks. ==6407== Rerun with --leak-check=full to see details of leaked memory.
First of all check the last paragraphs, you'll see ERROR and LEAK summaries, if everything is zero you can safely ignore the rest. In this case, however, Valgrind detected one error and one leak, as expected, and it explains a bit better in the first paragraph:
==6407== Invalid write of size 4 ==6407== at 0x804838F: f (valgrind.c:6) ==6407== by 0x80483AC: main (valgrind.c:11)
“Invalid write” because you can't write on x[10], remember that C arrays goes from 0 to N-1, so the maximum index is 9. By accessing the position 10 (11th from the start) you're accessing another part of the memory in the heap that might be another variable, rubbish or nothing at all, but because the heap belongs to your process the compiler won't complain (gcc doesn't even when using option -Wall). Checking if the data is in range before accessing the array is always a good idea when you can't control which position will be accessed (when sent as a parameter for example).
The leak report is a bit more cryptic because Valgrind can't tell exactly where to clean the allocated memory (sometimes not even we can) but it knows where it was allocated in the first place:
==6407== Address 0x4185050 is 0 bytes after a block of size 40 alloc'd ==6407== at 0x4022765: malloc (vg_replace_malloc.c:149) ==6407== by 0x8048385: f (valgrind.c:5) ==6407== by 0x80483AC: main (valgrind.c:11)
It was a malloced memory called from function f at valgrind.c line 5, we don't need much more than that. Delete the memory allocated whenever you know it's not going to be used anymore.
Cache miss is an important information when optimizing in a deeper level. After you've made sure your memory usage is correct and there is no memory leaks in your program you can start worrying about the size of your structures and objects and the amount of instructions you're throwing away. You can get the cache miss by running Valgrind with –tool=cachegrind option:
$ valgrind --tool=cachegrind ./valgrind (...) ==6893== I refs: 180,203 ==6893== I1 misses: 1,187 ==6893== L2i misses: 663 ==6893== I1 miss rate: 0.65% ==6893== L2i miss rate: 0.36% ==6893== ==6893== D refs: 88,584 (67,475 rd + 21,109 wr) ==6893== D1 misses: 2,042 ( 1,837 rd + 205 wr) ==6893== L2d misses: 1,571 ( 1,396 rd + 175 wr) ==6893== D1 miss rate: 2.3% ( 2.7% + 0.9% ) ==6893== L2d miss rate: 1.7% ( 2.0% + 0.8% ) ==6893== ==6893== L2 refs: 3,229 ( 3,024 rd + 205 wr) ==6893== L2 misses: 2,234 ( 2,059 rd + 175 wr) ==6893== L2 miss rate: 0.8% ( 0.8% + 0.8% )
Take a special note on D1 miss rate and L2d miss rate, they show the rate which the CPU is not getting the data from the cache (L1 and L2 respectively). It should be as low as possible but it's impossible to be zero, after all you must grab it from the memory at least one for every data. A rate of 2.3% for L1 and 1.7% for L2 is acceptable, it also says that most of your data fit in L1 and you didn't have to retrieve it from L@ or the main memory. But if your data is too big to fit in L1 you'd have a result like this:
$ valgrind --tool=cachegrind ./cache_miss (...) ==6902== I refs: 24,128,443 ==6902== I1 misses: 1,090 ==6902== L2i misses: 605 ==6902== I1 miss rate: 0.00% ==6902== L2i miss rate: 0.00% ==6902== ==6902== D refs: 15,065,103 (13,047,092 rd + 2,018,011 wr) ==6902== D1 misses: 2,062,716 ( 1,001,032 rd + 1,061,684 wr) ==6902== L2d misses: 127,895 ( 64,235 rd + 63,660 wr) ==6902== D1 miss rate: 13.6% ( 7.6% + 52.6% ) ==6902== L2d miss rate: 0.8% ( 0.4% + 3.1% ) ==6902== ==6902== L2 refs: 2,063,806 ( 1,002,122 rd + 1,061,684 wr) ==6902== L2 misses: 128,500 ( 64,840 rd + 63,660 wr) ==6902== L2 miss rate: 0.3% ( 0.1% + 3.1% )
With 13.6% for L1 and only 0.8% for L2, which means that the data fit in L2 but not entirely on L1, meaning that the program had to replace the L1 cache more than it should. Part of the cache miss was also because the access was not sequential, so the CPU had to throw away pages of L1 cache and get them again, spending precious time getting the same information again from L2. By changing the loop to sequential access we can save half of the cache misses:
// From for (i=0; i<MAX; i++) for (j=0; j<MAX; j++) m2[i][j] = m1[j][i]; // J - I // To for (i=0; i<MAX; i++) for (j=0; j<MAX; j++) m2[i][j] = m1[i][j]; // I - J
$ valgrind --tool=cachegrind ./cache_miss (...) ==6914== D refs: 15,065,103 (13,047,092 rd + 2,018,011 wr) ==6914== D1 misses: 1,125,218 ( 63,538 rd + 1,061,680 wr) ==6914== L2d misses: 126,888 ( 63,228 rd + 63,660 wr) ==6914== D1 miss rate: 7.4% ( 0.4% + 52.6% ) ==6914== L2d miss rate: 0.8% ( 0.4% + 3.1% ) (...)