Once your software is written, tested and reasonably documented, it's time to fine-tune it. There are several techniques, and I'll try to cover the most important ones: architecture, I/O optimization, profiling and parallelization.
Architecture optimization should be considered from the beginning of the design phase, especially memory-efficient containers and algorithms, and modularization. Good decisions made early greatly reduce development time, later mistakes and refactoring. But don't go too deep: leave the other optimizations for later, because the lower the level, the easier it is to optimize without refactoring.
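To give an idea of what a memory-efficient container can mean in practice, here is a minimal sketch in Python (my choice of language for illustration, the same idea applies elsewhere). It compares a plain list of float objects with a packed `array` of C doubles from the standard library. The helper names `list_footprint` and `array_footprint` are made up for this example, and the numbers from `sys.getsizeof` are approximate and specific to the interpreter you run.

```python
import sys
from array import array


def list_footprint(values):
    """Approximate bytes used by a list of floats: the list itself plus every float object."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


def array_footprint(values):
    """Bytes used by a packed array of C doubles holding the same values."""
    return sys.getsizeof(array("d", values))


samples = [float(i) for i in range(100_000)]
list_bytes = list_footprint(samples)
array_bytes = array_footprint(samples)

print(f"list of floats : {list_bytes} bytes")
print(f"array of double: {array_bytes} bytes")
```

Choosing the packed container at design time costs nothing, while swapping it in later may touch every function that handles the data.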
I/O optimizations are easier to spot than algorithmic problems, and sometimes they can increase performance a lot. Normally that's the case for massively parallel applications, where reducing interprocess communication, log levels, and disk and network I/O saves a lot.
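Two of the cheapest I/O savings, lowering the log level and batching output, can be sketched as follows. This is only an illustration under my own assumptions: the logger name `simulation`, the functions `expensive_summary`, `step` and `run`, and the fake state dictionary are all hypothetical.

```python
import io
import logging

logger = logging.getLogger("simulation")  # hypothetical logger name


def expensive_summary(state):
    """Stand-in for a costly formatting step we only want to pay for when debugging."""
    return ", ".join(f"{key}={value}" for key, value in sorted(state.items()))


def step(state, pending):
    # Skip the expensive formatting entirely unless DEBUG logging is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("state: %s", expensive_summary(state))
    # Collect the output in memory instead of writing it right away.
    pending.append(f"{state['step']}\n")


def run(n_steps, out):
    pending = []
    for i in range(n_steps):
        step({"step": i, "energy": i * 0.5}, pending)
    # One write call for the whole run instead of one per step.
    out.write("".join(pending))


out = io.StringIO()  # stands in for a file or a socket
run(1000, out)
```

With a real file or network connection the single `write` replaces a thousand small ones, and with the log level above DEBUG the summary string is never even built.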
In more specific parts of the program, or in very specific programs, profiling is often necessary to remove the most obvious problems, which are generally related to poor use of memory and CPU. This is the kind of optimization you should not attempt until the code is really giving you problems, or until the program's performance is still not enough after all the other optimizations.
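As a rough sketch of what a profiling session looks like, the snippet below uses Python's standard `cProfile` and `pstats` modules to find where the time goes. The functions `slow_sum_of_squares`, `workload` and `profile` are made up for the example; in real life you would profile your own entry point.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def workload():
    return [slow_sum_of_squares(20_000) for _ in range(50)]


def profile(func):
    """Run func under cProfile and return its result plus a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func()
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by cumulative time and keep only the five most expensive entries.
    stats.sort_stats("cumulative").print_stats(5)
    return result, report.getvalue()


result, report = profile(workload)
print(report)
```

The report lists call counts and time per function, which tells you which function deserves attention before you touch any code.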
Lastly, parallelization is important during all phases of development and should always be considered. Now that all new desktops and laptops are sold with two or more processors, parallelization is not just an option. Clusters are the cheapest way of scaling, and programs should be able to use their full power instead of relying on faster vector machines, which usually cost orders of magnitude more than simple commodity clusters.
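For the multi-processor desktop case, a minimal sketch of splitting a CPU-bound job across processes with Python's standard `concurrent.futures` module could look like this. The names `sum_of_squares`, `split_range` and `parallel_sum_of_squares` are hypothetical, and the sketch only covers a single machine: spreading work across cluster nodes would need a message-passing layer such as MPI on top of the same idea.

```python
import concurrent.futures as cf


def sum_of_squares(bounds):
    """Work done by one worker: sum i*i over a half-open range."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def split_range(n, parts):
    """Split range(n) into `parts` contiguous (start, stop) chunks of near-equal size."""
    size, extra = divmod(n, parts)
    chunks, start = [], 0
    for p in range(parts):
        stop = start + size + (1 if p < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def parallel_sum_of_squares(n, workers=2, executor_cls=cf.ProcessPoolExecutor):
    """Compute sum(i*i for i in range(n)) by giving each worker one chunk."""
    chunks = split_range(n, workers)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on start-up.
    print(parallel_sum_of_squares(1_000_000, workers=2))
```

The work is split into independent chunks that share no state, so the only communication is sending the bounds out and the partial sums back. The `executor_cls` parameter is there so the same code can run on a thread pool when the work is I/O-bound rather than CPU-bound.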