Nibble Stew: Optimizing code: even the simplest things are unbelievably complex

In the previous post we looked at optimizing this simple function.

uint64_t result = 0;
for(size_t i=0; i<bufsize; i++) {
    if(buf[i] >= 128) {
        result += buf[i];
    }
}

We shall now do more measurements with real-world hardware and compilers. The algorithms we use are the following:

simple: the above loop as-is
lookup: create a lookup table where entries less than 128 have value zero and the rest have the same value as the index
bit fiddling: convert the if into a branchless bitmask operation
partition: run std::partition on the data and only add the elements before the partition point
zeroing: go over the data and set values not matching to zero, then add all
bucket: keep an array of 256 entries (one per possible byte value) and count the number of times each number appears
multiply: convert if to a multiplication by 0 or 1, then add
parallel add: add several chars in a single packed 64 bit addition
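
As a rough illustration of the non-mutating variants in the list above, here is a minimal sketch. It is written from the one-line descriptions only, not taken from the repo, so the function names (sum_simple, sum_lookup, sum_bitfiddle, sum_multiply, sum_bucket) and all details are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// simple: the loop from the post, wrapped in a function.
uint64_t sum_simple(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; i++) {
        if (buf[i] >= 128) {
            result += buf[i];
        }
    }
    return result;
}

// lookup: table[v] is 0 for v < 128 and v itself otherwise.
uint64_t sum_lookup(const uint8_t *buf, size_t bufsize) {
    std::array<uint8_t, 256> table{};
    for (int v = 128; v < 256; v++) {
        table[v] = static_cast<uint8_t>(v);
    }
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; i++) {
        result += table[buf[i]];
    }
    return result;
}

// bit fiddling: bit 7 is set exactly when the value is >= 128, so
// stretch that bit into an all-ones or all-zeros byte mask.
uint64_t sum_bitfiddle(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; i++) {
        const uint8_t mask = static_cast<uint8_t>(0u - (buf[i] >> 7));
        result += buf[i] & mask;
    }
    return result;
}

// multiply: the comparison yields 0 or 1, which scales the value.
uint64_t sum_multiply(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; i++) {
        result += static_cast<uint64_t>(buf[i]) * (buf[i] >= 128);
    }
    return result;
}

// bucket: count occurrences of every byte value, then weight the counts.
uint64_t sum_bucket(const uint8_t *buf, size_t bufsize) {
    std::array<uint64_t, 256> counts{};
    for (size_t i = 0; i < bufsize; i++) {
        counts[buf[i]]++;
    }
    uint64_t result = 0;
    for (uint64_t v = 128; v < 256; v++) {
        result += v * counts[v];
    }
    return result;
}
```

All five functions should return the same value for the same buffer; they differ only in how the comparison is expressed, which is exactly what the measurements probe.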

Those interested in the actual implementations can look them up in the repo.
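
The two variants that modify the input buffer (partition and zeroing) might look roughly like the following. This is again a sketch based on the descriptions in the list, with hypothetical function names, and not the code from the repo.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>

// partition: move all values >= 128 to the front, then add only that part.
// Note that this reorders the caller's buffer.
uint64_t sum_partition(uint8_t *buf, size_t bufsize) {
    uint8_t *mid = std::partition(buf, buf + bufsize,
                                  [](uint8_t v) { return v >= 128; });
    return std::accumulate(buf, mid, uint64_t{0});
}

// zeroing: overwrite values that do not match with zero, then add everything.
// Note that this destroys the small values in the caller's buffer.
uint64_t sum_zeroing(uint8_t *buf, size_t bufsize) {
    for (size_t i = 0; i < bufsize; i++) {
        if (buf[i] < 128) {
            buf[i] = 0;
        }
    }
    return std::accumulate(buf, buf + bufsize, uint64_t{0});
}
```

Passing uint64_t{0} as the initial value makes std::accumulate add in 64 bits; with a plain 0 the sum would be computed as an int.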

The hardware and compilers used are the following:

Raspberry Pi, Raspbian, GCC 4.9.2, Clang 3.5.0
Ubuntu zesty, GCC 6.3.0, Clang 4.0.0
MacBook Pro i5, Xcode 8
Windows 10, Visual Studio 2017, run in VirtualBox

The test suite runs all available compilers with a selection of optimization types and CPU features (SSE, AVX, Neon etc.) and measures the times taken.

The results

Let's start by looking at the simplest build setup.

This seems quite reasonable. Parallel addition is the fastest, the others are roughly as fast as each other, and the two algorithms that reorder the input array are the slowest. For comparison, the Raspberry Pi looks like this:

Everything is much flatter, as one would expect. Since everything is going smoothly, let's look at the first measurement again, except this time we sort the input data before evaluating. One would expect that the simple loop becomes faster because the branch predictor has an easier task, that partitioning becomes faster, and that nothing becomes noticeably slower.

Well ... ummm ... one out of three ain't bad, I guess. At this point I should probably confess that I don't have a proper grasp on why these things are happening. Any speculation to follow might be completely wrong. The bucket slowdown is the easier of the two to explain. Since the input is sorted, consecutive iterations of the loop write to the same memory address, which leads to contention: each increment has to wait for the previous one on that address. When the data is random, each iteration writes to a random location, which leads to fewer collisions.
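
If that speculation is right, one commonly used way to reduce the collisions is to spread the counts over several independent tables, so that neighbouring iterations never update the same counter. The sketch below illustrates the idea; the function name and the choice of four tables are assumptions, and this variant was not part of the measurements.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Bucket counting with several tables. Element i goes to table i % kTables,
// so consecutive equal values increment different memory locations.
uint64_t sum_bucket_split(const uint8_t *buf, size_t bufsize) {
    constexpr size_t kTables = 4;  // arbitrary choice for illustration
    std::array<std::array<uint64_t, 256>, kTables> counts{};
    size_t i = 0;
    for (; i + kTables <= bufsize; i += kTables) {
        for (size_t t = 0; t < kTables; t++) {
            counts[t][buf[i + t]]++;
        }
    }
    // Leftover elements when bufsize is not a multiple of kTables.
    for (; i < bufsize; i++) {
        counts[0][buf[i]]++;
    }
    // Merge the tables and weight each count by its byte value.
    uint64_t result = 0;
    for (size_t v = 128; v < 256; v++) {
        uint64_t n = 0;
        for (size_t t = 0; t < kTables; t++) {
            n += counts[t][v];
        }
        result += v * n;
    }
    return result;
}
```

Whether this actually helps on sorted input would have to be measured on each platform; given the rest of the results, assuming it does would be unwise.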

The simple loop may fail to get faster because the processor evaluates both branches of the if clause in any case, so better branch prediction does not matter. On the other hand Visual Studio does this:

Bucket is slower on sorted data, as above, but the simple loop is an order of magnitude slower on unsorted data. Ideas on what could be causing that are welcome.
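
For anyone who wants to poke at the sorted versus unsorted difference on their own machine, a minimal timing harness could look like the following. It is not the test suite used for these measurements; the names (sum_over_128, make_input, time_sum), the fixed seed and the single-run timing are all simplifying assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// The simple loop from the post, operating on a vector.
uint64_t sum_over_128(const std::vector<uint8_t> &buf) {
    uint64_t result = 0;
    for (size_t i = 0; i < buf.size(); i++) {
        if (buf[i] >= 128) {
            result += buf[i];
        }
    }
    return result;
}

// Random bytes from a fixed seed, optionally sorted. Because the seed is
// fixed, the sorted and unsorted buffers hold the same values.
std::vector<uint8_t> make_input(size_t n, bool sorted) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    std::vector<uint8_t> buf(n);
    for (auto &v : buf) {
        v = static_cast<uint8_t>(dist(rng));
    }
    if (sorted) {
        std::sort(buf.begin(), buf.end());
    }
    return buf;
}

struct Timing {
    uint64_t sum;
    double microseconds;
};

// Times a single pass over the buffer. The sum is returned as well so the
// compiler cannot discard the loop as unused work.
Timing time_sum(const std::vector<uint8_t> &buf) {
    const auto start = std::chrono::steady_clock::now();
    const uint64_t sum = sum_over_128(buf);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return {sum, elapsed.count()};
}
```

Calling time_sum on make_input(n, false) and make_input(n, true) gives two timings with identical sums. A single run is noisy, so repeating the call and taking the minimum or median gives a more trustworthy number.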

What is the fastest combination?

The fastest combination for each hardware platform is the following.

Ubuntu: bit fiddle, g++, release build, -msse, unsorted
Raspi: bit fiddle, g++, release build, -mfpu=neon, sorted
OSX: simple loop, Clang++, debugoptimized build, -msse4.2, sorted
VS2017: lut, debugoptimized build, unsorted

This is basically random. There does not seem to be any one algorithm that is consistently the fastest; every one of them is noticeably slower than the others under some circumstances. Even weirder, things that you would expect to be straightforward and true are not. Here are some things to scratch your head over:

AVX instructions are never the fastest, and on an i7 the fastest is plain old SSE (for the record MMX was not tested)
With Clang, enabling Neon instructions makes everything a lot slower
On the Raspberry Pi doing a read only table lookup using Neon is slower than with regular instructions
On an i7 multiplication is sometimes faster than arithmetic shifting

6 comments:

Unknown, June 8, 2017 at 8:44 PM
Another possible implementation - bucketing the sum:

uint64_t result[2];
result[0] = result[1] = 0;
for(size_t i=0; i<bufsize; i++) {
    result[buf[i]/128]++;
}
return result[1];

Billy O'Neal, June 9, 2017 at 5:08 AM
debugoptimized build <-- Why not /O2 ?

Martin, June 14, 2017 at 11:47 AM
I'm looking at parallel_add_lookup. You take a pointer to uint8_t, cast it to pointer to uint64_t and then dereference that pointer. Isn't that undefined behavior?

See https://blog.regehr.org/archives/1307, in particular under Chunking Optimizations Are Broken.
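
On the pointer cast question: a packed 64 bit load can be written without the cast by copying the bytes with std::memcpy, which is well defined and which compilers typically turn into a single load. The sketch below shows one possible packed addition built on that. It is not the repo's parallel_add_lookup; the names load_u64 and sum_packed and the masking scheme are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads 8 bytes into an integer without casting the pointer.
inline uint64_t load_u64(const uint8_t *p) {
    uint64_t x;
    std::memcpy(&x, p, sizeof(x));
    return x;
}

uint64_t sum_packed(const uint8_t *buf, size_t bufsize) {
    constexpr uint64_t kHighBits = 0x8080808080808080ULL;
    constexpr uint64_t kEvenBytes = 0x00FF00FF00FF00FFULL;
    constexpr uint64_t kLaneOnes = 0x0001000100010001ULL;
    uint64_t result = 0;
    size_t i = 0;
    for (; i + 8 <= bufsize; i += 8) {
        uint64_t x = load_u64(buf + i);
        // 0xFF in every byte whose top bit is set (value >= 128), else 0x00.
        const uint64_t keep = ((x & kHighBits) >> 7) * 0xFF;
        x &= keep;
        // One packed addition: four 16 bit lanes, each the sum of two bytes.
        const uint64_t pairs = (x & kEvenBytes) + ((x >> 8) & kEvenBytes);
        // Fold the four lanes into the top 16 bits; at most 8 * 255 = 2040,
        // so nothing overflows a lane.
        result += (pairs * kLaneOnes) >> 48;
    }
    // Tail: fewer than 8 bytes left, handle them one at a time.
    for (; i < bufsize; i++) {
        if (buf[i] >= 128) {
            result += buf[i];
        }
    }
    return result;
}
```

The result does not depend on byte order, since the bytes are only masked and summed, never interpreted by position.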

Thursday, June 8, 2017
