header image
[ # ] LLVM Vectorizer
February 12th, 2013 under Algorithms, Devel, rengolin

Now that I’m back working full-time with LLVM, it’s time to get some numbers about performance on ARM.

I’ve been digging the new LLVM loop vectorizer and I have to say, I’m impressed. The code is well structured, extensible and above all, sensible. There are lots of room for improvement, and the code is simple enough so you can do it without destroying the rest or having to re-design everything.

The main idea is that the loop vectorizer is a Loop Pass, which means that if you register this pass (automatically on -O3, or with -loop-vectorize option), the Pass Manager will run its runOnLoop(Loop*) function on every loop it finds.

The three main components are:

  1. The Loop Vectorization Legality: Basically identifies if it’s legal (not just possible) to vectorize. This includes checking if we’re dealing with an inner loop, and if it’s big enough to be worth, and making sure there aren’t any conditions that forbid vectorization, such as overlaps between reads and writes or instructions that don’t have a vector counter-part on a specific architecture. If nothing is found to be wrong, we proceed to the second phase:
  2. The Loop Vectorization Cost Model: This step will evaluate both versions of the code: scalar and vector. Since each architecture has its own vector model, it’s not possible to create a common model for all platforms, and in most cases, it’s the special behaviour that makes vectorization profitable (like 256-bits operations in AVX), so we need a bunch of cost model tables that we consult given an instruction and the types involved. Also, this model doesn’t know how the compiler will lower the scalar or vectorized instructions, so it’s mostly guess-work. If the vector cost (normalized to the vector size) is less than the scalar cost, we do:
  3. The Loop Vectorization: Which is the proper vectorization, ie. walking through the scalar basic blocks, changing the induction range and increment, creating the prologue and epilogue, promote all types to vector types and change all instructions to vector instructions, taking care to leave the interaction with the scalar registers intact. This last part is a dangerous one, since we can end up creating a lot of copies from scalar to vector registers, which is quite expensive and was not accounted for in the cost model (remember, the cost model is guess-work based).

All that happens on a new loop place-holder, and if all is well at the end, we replace the original basic blocks by the new vectorized ones.

So, the question is, how good is this? Well, depending on the problems we’re dealing with, vectorizers can considerably speed up execution. Especially iterative algorithms, with lots of loops, like matrix manipulation, linear algebra, cryptography, compression, etc. In more practical terms, anything to do with encoding and decoding media (watching or recording videos, pictures, audio), Internet telephones (compression and encryption of audio and video), and all kinds of scientific computing.

One important benchmark for that kind of workload is Linpack. Not only Linpack has many examples of loops waiting to be vectorized, but it’s also the benchmark that defines the Top500 list, which classifies the fastest computers in the world.

Benchmarks

So, both GCC and Clang now have the vectorizers turned on by default with -O3, so comparing them is as simple as compiling the programs and see them fly. But, since I’m also interested in seeing what is the performance gain with just the LLVM vectorizer, I also disabled it and ran a clang with only  -O3, no vectorizer.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 2413 0.0%
GCC -O3 vectorize 2421 0.3%
Clang -O3 vectorize 3346 38.6%

This is some statement! The GCC vectorizer exists for a lot longer than LLVM’s and has been developed by many vectorization gurus and LLVM seems to easily beat GCC in that field. But, a word of warning, Linpack is by no means representative of all use cases and user visible behaviour, and it’s very likely that GCC will beat LLVM on most other cases. Still, a reason to celebrate, I think.

This boost mean that, for many cases, not only the legality if the transformations are legal and correct (or Linpack would have gotten wrong results), but they also manage to generate faster code at no discernible cost. Of course, the theoretical limit is around 4x boost (if you manage to duplicate every single scalar instruction by a vector one and the CPU has the same behaviour about branch prediction and cache, etc), so one could expect a slightly higher number, something on the order of 2x better.

It depends on the computation density we’re talking about. Linpack tests specifically the inner loops of matrix manipulation, so I’d expect a much higher ratio of improvement, something around 3x or even closer to 4x. VoIP calls, watching films and listening to MP3 are also good examples of densely packet computation, but since we’re usually running those application on a multi-task operating system, you’ll rarely see improvements higher than 2x. But general applications rarely spend that much time on inner loops (mostly waiting for user input and then doing a bunch of unrelated operations, hardly vectorizeable).

Another important aspect of vectorization is that it saves a lot of battery juice. MP3 decoding doesn’t really matter if you finish in 10 or 5 seconds, as long as the music doesn’t stop to buffer. But taking 5 seconds instead of 10 means that on the other 5 seconds the CPU can reduce its voltage and save battery. This is especially important in mobile devices.

What about ARM code?

Now that we know the vectorizer works well, and the cost model is reasonably accurate, how does it compare on ARM CPUs?

It seems that the grass is not so green on this side, at least not at the moment. I have reports that on ARM it also reached the 40% boost similar to Intel, but what I saw was a different picture altogether.

On a Samsung Chromebook (Cortex-A15) I got:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 796 0.0%
GCC -O3 vectorize 736 -8.5%
Clang -O3 vectorize 773 -2.9%

The performance regression can be explained by the amount of scalar code intermixed with vector code inside the inner loops as a result of shuffles (movement of data within the vector registers and between scalar and vector registers) not being lowered correctly. This most likely happens because the LLVM back-end relies a lot on pattern-matching for instruction selection (a good thing), but the vectorizers might not be producing the shuffles in the right pattern, as expected by each back-end.

This can be fixed by tweaking the cost model to penalize shuffles, but it’d be good to see if those shuffles aren’t just mismatched against the patterns that the back-end is expecting. We will investigate and report back.

Update

Got results for single precision floating point, which show a greater improvement on both Intel and ARM.

On x86_64 Intel (Core i7-3632QM), I got these results:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 2530 0.0%
GCC -O3 vectorize 3484 37.7%
Clang -O3 vectorize 3996 57.9%

On a Samsung Chromebook (Cortex-A15) I got:

Compiler Opt Avg. MFLOPS Diff
Clang -O3 867 0.0%
GCC -O3 vectorize 788 -9.1%
Clang -O3 vectorize 1324 52.7%

Which goes on to show that the vectorizer is, indeed, working well for ARM, but the costs of using the VFP/NEON pipeline outweigh the benefits. Remember than NEON vectors are only 128-bits wide and VFP only 64-bit wide, and NEON has no double precision floating point operations, so they’ll only do one double precision floating point operations per cycle, so the theoretical maximum depends on the speed of the soft-fp libraries.

So, in the future, what we need to be working is the cost model, to make sure we don’t regress in performance, and try to get better algorithms when lowering vector code (both by making sure we match the patterns that the back-end is expecting, and by just finding better ways of vectorizing the same loops).

Conclusion

Without further benchmarks it’s hard to come to a final conclusion, but it’s looking good, that’s for sure. Since Linpack is part of the standard LLVM test-suite benchmarks, fixing this and running it regularly on ARM will at least avoid any further regressions… Now it’s time to get our hands dirty!

 


Read the Comments

[ # 1700 ] Comment from Thomas C. [May 7, 2013, 4:32 AM]

“The GCC vectorizer exists for a lot longer than LLVM’s and has been developed by many vectorization gurus”

In fact, 3-4 years ago qualcomm hired (almost?) everyone that worked on the GCC vectorizer and they’re no longer permitted to contribute to GCC because of the stupid GPL. :(

[ # 1701 ] Comment from Eamon Nerbonne [May 7, 2013, 10:41 AM]

This is promising! But to better understand how meaningful your comparison is: what were the exact versions of clang and gcc used, and what was the exact compiler command line used?

Write a comment





License
Creative Commons License
We Support

WWF

National Autistic Society

Royal Society for the Prevention of Cruelty to Animals

DefectiveByDesign.org

End Software Patents

See Also
Disclaimer

The information in this weblog is provided “AS IS” with no warranties, and confers no rights.

This weblog does not represent the thoughts, intentions, plans or strategies of our employers. It is solely our opinion.

Feel free to challenge and disagree, and do not take any of it personally. It is not intended to harm or offend.

We will easily back down on our strong opinions by presentation of facts and proofs, not beliefs or myths. Be sensible.

Recent Posts