Tuesday, July 18, 2017

Experiment: binary size reduction by using common function tails

In embedded development the most important feature of any program is its size. The raw performance does not usually matter that much, but size does. A program that is even one byte larger than available flash size is useless.

GCC, Clang and other free compilers do an admirable job of creating small executables when asked to with the -Os compiler switch. However there are still optimizations that could be added. Suppose we have two functions that look like this:

int func1();
int func2();
int func3();

int funca() {
  int i = 0;
  i+=func2();
  return i+func1();
}

int funcb() {
  int i = 1;
  i+=func3();
  return i+func1();
}

They would get compiled into the following asm on x86-64:

funca():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 0
        call    func2()
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret
funcb():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 1
        call    func3()
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret

If you look carefully, the last 7 instructions on both of these functions are identical. In fact the code above can be rewritten to this:

funca():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 0
        call    func2()
common_tail:
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret
funcb():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 1
        call    func3()
        jmp common_tail

Depending on your point of view this can be seen as either a cool hack or an insult to everything that is good and proper in the world. funcb does an unconditional jump into the body of an unrelated function. The reason this works is that both functions end in a ret instruction, which pops a return address off the stack and jumps to it (that is, back to whoever called the current function). Since the two code sequences are identical, they can be collapsed into one. This is an optimization that can only be done at the assembly level, because C prohibits gotos between functions.

How much does this save?

To test this I wrote a simple Python script that parses assembly output, finds the ends of functions and replaces common tails with jumps as described above. It uses a simple heuristic and only does the reduction if there are three or more common instructions. Then I ran it on the assembly output of SQLite's "amalgamation" source file. That resulted in reductions such as this one:

Ltail_packer_57:
        setne   %al
Ltail_packer_1270:
        andb    $1, %al
        movzbl  %al, %eax
        popq    %rbp
        retq

This function tail is used in two different ways, sometimes with the setne instruction and sometimes without. In total the asm file contained 1801 functions. Out of those, 1522 could be dedupped. The most common removals looked like this:

       addq    $48, %rsp
       popq    %rbp
       retq

That is, the common function suffix. Interestingly, when the dedupped asm is compiled, the output is about 10 kB bigger than without dedupping. The original code was 987 kB. I did not measure where the difference comes from. It could be because the extra labels need extra metadata, or because the jmp instruction takes more space than the instructions it replaces, since a long jump needs a 32-bit offset. A smarter implementation would try to minimize jump distances so that they fit in an 8-bit offset and thus in the short two-byte jmp encoding instead of the five-byte one.
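
For reference, the tail-matching heuristic the script uses boils down to something like the following sketch. The actual tool is a Python script; this C++ rendering and its names are purely illustrative.

#include <cstddef>
#include <string>
#include <vector>

// Number of identical trailing instructions shared by two functions, where
// each function is represented as a list of normalized instruction strings.
std::size_t common_tail_length(const std::vector<std::string> &a,
                               const std::vector<std::string> &b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() &&
           a[a.size() - 1 - n] == b[b.size() - 1 - n]) {
        ++n;
    }
    return n;
}

// The reduction is only applied when at least three instructions are shared,
// mirroring the heuristic described above.
bool worth_deduplicating(const std::vector<std::string> &a,
                         const std::vector<std::string> &b) {
    return common_tail_length(a, b) >= 3;
}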

Is this actually worth doing?

On x86, probably not: those machines have a lot of RAM to spare and the people running them usually care mostly about raw performance. The x86 instruction set is also quite compact because it uses a variable-length encoding. The situation is different on ARM and other embedded platforms. They have fewer, simpler instructions and a fixed encoding size (usually 32 bits). This means longer instruction sequences, which gives more potential for size reductions. Some embedded compilers do this optimization, so The Real World would seem to indicate that it is worth it.

I wanted to run the test on ARM assembly as well, but parsing it for function tails is much more difficult than for x86 asm, so I gave up. Knowing the real benefits would thus require comments from an actual compiler engineer. I don't even pretend to be one on the Internet, so I just filed a feature request about this on the Clang bug tracker.

Wednesday, July 12, 2017

Is every build system using Ninja just as fast as every other?

One of the most common arguments against Meson is that "it is only fast because it uses Ninja rather than Make, using any other Ninja build generator would be just as fast". This is always stated as fact without any supporting evidence or measurements. But is this really the case? Let's find out.

For testing one needs a project that has both CMake and Meson build definitions. I'm not aware of any so I created one myself. I took the source code of the Mediascanner 2 project, which is using CMake and converted it to use Meson. This project was chosen solely based on the fact that I wrote the original CMake definitions ages ago so I should have a fairly good understanding of the code base. The project itself is a fairly typical small-to-medium project written in C++ with a handful of system dependencies.

Compiling and running the project on a regular laptop gives a fairly straightforward answer. Both build systems are roughly as fast. So, case closed then?

Well no, not really. The project is small and machines today are very fast so a similar result is not very surprising. For a better test we would need to either convert a much larger project or use a slower machine. The former is not really feasible, but the latter can be achieved simply by running the tests on a Raspberry Pi 2. This is a fairly good real world test, as compiling programs of this size on a raspi is a common task.

The measurements

The tests were run on a Raspberry Pi 2 running Jessie with a Sid chroot. The CMake version was the latest in Sid whereas Meson trunk was used. Each measurement was done several times and the fastest time was always chosen. If you try to replicate these results yourself, note that there is a lot of variance between consecutive runs, so you have to run them many times. The source code can be found in this repository.

The first measurement is how long running the first configuration step takes.

CMake takes 12 seconds whereas Meson gets by with only four. This is fairly surprising, as CMake is a C++ executable whereas Meson is implemented in Python, so one would expect the former to be faster. The configuration step is run seldom, though, so it's ultimately not that interesting. Most of the time is spent on full builds, so let's look at those next.

CMake takes 2 minutes and 21 seconds to do a full build. Meson is 31 seconds, or roughly 20%, faster, clocking in at 1 minute 50 seconds. Both systems build the same files and have the same number of build steps, 63, as reported by Ninja.

Finally let's look at the most common task during development: incremental compilation. This is measured by touching one source file and running Ninja again.

In this case Meson is almost 50% faster than CMake. This is due to the smart link skipping functionality that we stole from LibreOffice (who stole it from Chromium :). This improvement can have a big impact during day to day development, as it allows faster iteration cycles.

Conclusions

Ninja is awesome but not made of magic. The quality of the generated Ninja file has a direct impact on build times. Based on the experiments run here, Meson consistently performs faster than CMake.

Monday, June 26, 2017

Writing your own simple GUI SSH client

Most Unix users are accustomed to using SSH from the command line. On Windows and other platforms GUI tools are popular, and they can do some nice tricks such as opening graphical file transfer windows and initiating port forwards on an existing connection. You can do all the same things with the command line client, but you have to specify everything you want to use when first opening the connection.

This got me curious. How many lines of code would one need to build a GUI client that does all that on Linux? The answer turned out to be around 1500 lines of code, whose job is mostly to glue together the libvte terminal emulator widget and the libssh network library. This is what it looks like:

Port forwardings can be opened at any time. This allows you to, for example, forward HTTP traffic through your own proxy server to get around draconian firewalls.
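
Under the hood, opening such a forward on an already-connected libssh session comes down to opening a direct-tcpip channel. The sketch below shows roughly what that looks like; the function name is made up and error reporting is omitted, but the libssh calls are the real API.

#include <libssh/libssh.h>

// Opens a direct-tcpip channel that tunnels traffic to remote_host:remote_port
// through an existing, already authenticated SSH session. The GUI accepts
// connections on a local listening socket and pumps bytes through a channel
// like this with ssh_channel_read()/ssh_channel_write().
ssh_channel open_forward_channel(ssh_session session,
                                 const char *remote_host, int remote_port,
                                 const char *origin_host, int origin_port) {
    ssh_channel channel = ssh_channel_new(session);
    if (channel == nullptr) {
        return nullptr;
    }
    if (ssh_channel_open_forward(channel, remote_host, remote_port,
                                 origin_host, origin_port) != SSH_OK) {
        ssh_channel_free(channel);
        return nullptr;
    }
    return channel;
}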
File transfers also work.
A lot of things do not work, such as reverse port forwards or changing the remote directory in file transfers. Some key combinations with ctrl/alt/etc. modifiers can't be sent over the terminal. I don't really know why, but it seems the VTE terminal does some internal state tracking to know which modifiers are active, and there does not seem to be a way to smuggle the corresponding raw keycodes out; it sends them directly to the terminal it usually controls. I also did not find an easy way of getting the full keyboard state from raw GTK+ events.

Thursday, June 15, 2017

Of XML, GUIDs and installers

Every now and then I have to use Windows. I'd really like to use Emacs for editing so I need to install it. Unfortunately the official releases from the FSF are zip archives that you need to manually unpack, add shortcuts and all that stuff. This gets tedious and it would be nice if there were MSI installer packages to do all that for you. So I figured I'd create some.

Conceptually what the installer needs to do is the following:

  1. Create the c:\Program Files\GNU Emacs directory
  2. Unzip contents of the official release zip file in said directory
Seems simple, right?

There are two main tools for creating installers. The first one is NSIS, but it only creates exe installers, not MSI. It also has a scripting language designed by a shaman who has licked a few frogs too many. So let's ignore that.

The other tool is WiX. It creates nice installers but it is Enterprise. Very, Very Enterprise. And XML. But mostly Enterprise. For starters the installer needs to have a GUID (basically a 128 bit random number). It also needs an "upgrade GUID". And a "package GUID". The first two must be the same over all installer versions but the latter must not be.

Having conjured the necessary number of GUIDs (but not too many), you are ready to tackle copying files around. As you probably guessed, each file needs its own GUID. But if you thought that each one would require an XML element of its own, you are sadly mistaken. Every file needs two nested XML elements. The files also need a container. And a container-container.

Did I mention that the documentation consists almost entirely of XML fragments, making it all but impossible to tell which tag should follow which and which ones should be nested?

The Emacs distribution has a lot of files, so obviously writing the XML by hand is out of the question. Unfortunately the WiX toolkit ships a helper utility called heat to generate the XML from a directory tree. Yes, I really did say unfortunately. While the tool pretends to work, in reality it does not. Following the documentation you might try doing something like this (simplified for purposes of exposition):

heat <directory with unpacked files> <output xml file> other_flags

This does create an installer which installs all the files. But it also does a little bit extra. If your unpack directory was called unpack, then the files will be installed to c:\Program Files\GNU Emacs\unpack. How might we get rid of that extra directory segment? The correct answer is that the system is optimized for in-source Visual Studio builds and trying to do anything else is doomed to fail. But let's try it anyway.

First one might look for a command line switch, like patch's -p1, to discard the first path segment. It does not seem to exist.

Next one might try to be clever and cd inside the unpack dir and do this (again simplified):

heat . <output xml file> other_flags

The system reports no error but the output is identical to the previous attempt. The tool helpfully notices that you are using a period for the directory name, looks up in the parent directory what that directory is actually called, and substitutes it back in. Because!

Since there are only a few directories in the top level one might try something along the lines of:

heat dir1 dir2 dir3 dir4 <output xml file> other args

Which again succeeds without error. However the output installer only has files from dir1. The tool parses all the input and then dutifully throws away all entries but the first without so much as a warning.

The way to make this work is to generate the XML files and then monkey patch all paths in them before passing them on to the installer generator. For That Is the Enterprise Way! Just be glad there are no files at the root directory.

Join us next time to learn how a randomly generated daddy GUID comes together with a randomly generated mommy GUID to produce a new entry in the start menu.

Is this available somewhere?

The script is here. There are no downloadable binaries because I don't have the resources to maintain those. It would be cool if someone (possibly even the FSF) would provide them.

Wednesday, June 7, 2017

Optimizing code: even the simplest things are unbelievably complex

In the previous post we looked at optimizing this simple piece of code.

uint64_t result = 0;
for(size_t i=0; i<bufsize; i++) {
  if(buf[i] >= 128) {
    result += buf[i];
  }
}

We shall now do more measurements with real world hardware and compilers. The algorithms we use are the following:

  • simple: the above loop as-is
  • lookup: create a lookup table where entries less than 128 have value zero and the rest have the same value as the index
  • bit fiddling: convert the if into a branchless bitmask operation
  • partition: run std::partition on the data and only add the first half
  • zeroing: go over the data and set values not matching to zero, then add all
  • bucket: keep an array of 256 entries and count the number of times each byte value appears
  • multiply: convert if to a multiplication by 0 or 1, then add
  • parallel add: add several chars in a single packed 64 bit addition
Those interested in the actual implementations should look them up in the repo.
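
To give a flavour of these variants, here is a rough sketch of the multiply version (illustrative only, not the exact code from the repo): the comparison evaluates to 0 or 1, so the branch becomes a multiplication.

#include <cstddef>
#include <cstdint>

uint64_t sum_multiply(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i) {
        // (buf[i] >= 128) is 0 or 1, so this adds either 0 or buf[i].
        result += buf[i] * uint64_t(buf[i] >= 128);
    }
    return result;
}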

The hardware used is the following:

  • Raspberry Pi, Raspbian, GCC 4.9.2, Clang 3.5.0
  • Ubuntu zesty, GCC 6.3.0, Clang 4.0.0
  • Macbook Pro i5, XCode 8
  • Windows 10, Visual Studio 2017, run in VirtualBox
The test suite runs all available compilers with a selection of optimization types, CPU features (SSE, AVX, Neon etc) and measures the times taken.

The results


Let's start by looking at the simplest build setup.

This seems quite reasonable. Parallel addition is the fastest, the others are roughly equal, and the two algorithms that reorder the input array are the slowest. For comparison Raspberry Pi looks like this:
Everything is much flatter as one would expect. Since everything is going smoothly, let's look at the first measurement again, except this time we sort the input data before evaluating. One would expect that the simple loop becomes faster because the branch predictor has an easier task, partitioning becomes faster and nothing becomes noticeably slower.
Well ... ummm ... one out of three ain't bad, I guess. At this point I should probably confess that I don't have a proper grasp on why these things happen, so any speculation that follows might be completely wrong. The slowdown in bucket is the easier of the two to explain. Since the input is sorted, consecutive iterations of the loop attempt to write to the same memory address, which leads to contention. When the data is random, each iteration writes to a random location, which leads to fewer collisions.

The reason the simple loop does not get faster may be that the processor evaluates both sides of the if in any case, so better branch prediction does not matter. On the other hand Visual Studio does this:

Bucket is slower for sorted data, as above, but the simple loop is an order of magnitude slower on unsorted data. Ideas on what could cause that are welcome.

What is the fastest combination?

The fastest combination for each hardware platform is the following.
  • Ubuntu: bit fiddle, g++, release build, -msse, unsorted
  • Raspi: bit fiddle, g++, release build, -mfpu=neon, sorted
  • OSX: simple loop, Clang++, debugoptimized build, -msse4.2, sorted
  • VS2017: lut, debugoptimized build, unsorted
This is basically random. There does not seem to be any one algorithm that is consistently the fastest, every one of them is noticeably slower than others under some circumstances. Even weirder, things that you would expect to be straightforward and true are not. Here are some things to scratch your head over:
  • AVX instructions are never the fastest, and on an i7 the fastest is plain old SSE (for the record MMX was not tested)
  • With Clang, enabling Neon instructions makes everything a lot slower
  • On the Raspberry Pi doing a read only table lookup using Neon is slower than with regular instructions
  • On an i7 multiplication is sometimes faster than arithmetic shifting

Wednesday, May 31, 2017

Gee, optimization sure is hard

In a recent Reddit discussion the following piece of code was presented:

for (unsigned c = 0; c < arraySize; ++c) {
    if (data[c] >= 128)
        sum += data[c];
}

This code snippet seems optimal, but is it? There is a hard-to-predict branch and a data dependency, both of which can cause slowdowns. To see if this can be implemented faster, I created a test repo with a bunch of alternative implementations. Let's see how they stack up.

The implementations

The simplest is the simple loop as described above.

A version using a lookup table has a helper table of 256 entries. Entries for values greater than or equal to 128 hold the value itself and entries for values smaller than 128 hold zero. The body of the loop then becomes result += lut[buf[i]], which is branch free.

The bit fiddling approach goes through the values one by one and first calculates a mask value, b >> 7, where b is the byte reinterpreted as a signed 8-bit integer. This is an arithmetic right shift whose outcome is either all zeros or all ones, depending on whether the value is less than 128. Then we add the value to the result by ANDing it with the mask. The core of this loop is result += buf[i] & (((int8_t)buf[i]) >> 7).
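
Wrapped into a full function, the bit fiddling variant might look like this (a sketch assuming an unsigned byte buffer, not the exact code from the repo):

#include <cstddef>
#include <cstdint>

uint64_t sum_bitfiddle(const uint8_t *buf, size_t bufsize) {
    uint64_t result = 0;
    for (size_t i = 0; i < bufsize; ++i) {
        // ((int8_t)buf[i]) >> 7 is 0 for bytes < 128 and all ones for bytes
        // >= 128, so the AND either zeroes the byte or keeps it unchanged.
        result += buf[i] & (((int8_t)buf[i]) >> 7);
    }
    return result;
}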

The partitioning approach divides the data array with std::partition and then adds only the partition that contains the values we want.

The zeroing approach goes over the array and sets all values that are less than 128 to zero. Then it goes over it again and adds all entries unconditionally. This goes over the data twice (and mutates it) but there are no data dependencies.

The bucket approach has an array of 256 entries. The loop goes over the data and increments the count for each input byte like this: ++counts[buf[i]]. The result is then obtained by going over entries 128-255 and evaluating result += i*counts[i].
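
A sketch of the bucket variant under the same assumptions:

#include <cstddef>
#include <cstdint>

uint64_t sum_bucket(const uint8_t *buf, size_t bufsize) {
    uint64_t counts[256] = {};
    for (size_t i = 0; i < bufsize; ++i) {
        ++counts[buf[i]];  // histogram of the input bytes
    }
    uint64_t result = 0;
    for (uint64_t v = 128; v < 256; ++v) {
        result += v * counts[v];  // only the values we care about
    }
    return result;
}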

So which one is the fastest?

Well that depends. A lot. On pretty much everything.

On x86_64, lookup table and bucket are the fastest, but only when using -O2. And GCC. For some reason Clang can't optimize the bucket version and it is consistently slower. With -O3, bit fiddling becomes faster.

On a Raspberry Pi and -O2, the fastest are bit fiddling and bucket. But when using -O3, the simple loop is the fastest by a noticeable margin.

There does not seem to be a version that is consistently the fastest. However the versions that mutate their input array (partition and zeroing) are consistently the slowest.

What should I do to get teh fastest codez for my program?

This is a difficult question and depends on many things. The most straightforward solution is to write the simplest code you possibly can. It is easiest to understand and modify and is also the simplest for the compilers to optimize for the largest number of targets.

After that find out if you have a bottleneck. Measure. Measure again. Write an optimized version if you really need it. Measure. Measure with many different compilers and platforms. Keep measuring and fixing issues until the project reaches end of life.

Monday, May 15, 2017

Emulating the Rust borrow checker with C++ part II: the borrowining

The most perceptive among you might have noticed that the last blog post did not actually do any borrowing, only single owner semantics with moves. Let's fix that. But first a caveat.

As far as I can tell, it is not possible to fully emulate Rust's borrow checker at compile time with C++. You can get pretty close (for some features at least), but there is some runtime overhead and violations are detected only at runtime, not at compile time. See the end of this post for a description of why that is. Still, it's better than nothing.

At the core is a resource whose ownership we want to track. We want to allow either a single accessor that can alter the object or many read-only accessors. We put the item inside a class that looks like this:

template<typename T>
class Owner final {
 private:
    T i;
    int roBorrows = 0;
    bool rwBorrow = false;
<rest of class omitted for brevity>

The holder object does not give out pointers or references to the underlying object. Instead you can only request either a read-only or a read-write (mutable) borrow of it. The read-only function looks like this (the rw version is almost identical):

template<typename T>
RoBorrow<T> Owner<T>::borrowReadOnly() {
    if(rwBorrow) {
        throw std::runtime_error("Tried to create read only borrow when a rw borrow exists.");
    } else {
        return ::RoBorrow<T>(this); // increments borrow count
    }
}

Creating a read-only borrow succeeds only if there are no rw borrows. This code throws an exception, but it could also call std::terminate or something else. A rw borrow fails in a similar way if there are any ro borrows.
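
For completeness, the rw counterpart presumably looks something like this, following the description above (the exact names and error handling are assumptions on my part):

template<typename T>
RwBorrow<T> Owner<T>::borrowReadWrite() {
    if(rwBorrow || roBorrows > 0) {
        throw std::runtime_error("Tried to create a rw borrow when other borrows exist.");
    } else {
        return ::RwBorrow<T>(this); // marks the rw borrow as taken
    }
}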

The interesting piece is the implementation of the proxy object RoBorrow<T>. It is the kind of move-only type that was described in part I. When it goes out of scope its destructor decrements the owner's borrow count:

~RoBorrow() { if (o) { o->decrementRoBorrows();} }

The magic lies in the conversion operator that is slightly different than the original version:

operator const T&() const { return owner->i; }

The conversion operator only gives out a const reference to the underlying object, which means only const operations can be invoked on it. Obviously this does not work for e.g. raw file descriptors, because a "const int" does not protect the state of the kernel fd object. In order to call non-const functions you first need to get an RwBorrow, whose conversion operator gives a non-const reference, and Owner will only provide one if there are no other borrows outstanding.
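
Putting the pieces together, the whole proxy might look roughly like this. This is a sketch only: the helper names incrementRoBorrows/decrementRoBorrows and the assumption that Owner<T> befriends RoBorrow<T> are mine, and the real class may differ in the details.

template<typename T>
class RoBorrow final {
private:
    Owner<T> *o = nullptr;

public:
    // Assumes Owner<T> declares RoBorrow<T> as a friend (or exposes the
    // counting helpers and the wrapped object in some other way).
    explicit RoBorrow(Owner<T> *owner) : o(owner) { o->incrementRoBorrows(); }

    // Move-only, just like the single-owner type from part I; copying
    // would make the borrow count meaningless.
    RoBorrow(const RoBorrow&) = delete;
    RoBorrow& operator=(const RoBorrow&) = delete;
    RoBorrow(RoBorrow &&other) noexcept : o(other.o) { other.o = nullptr; }

    ~RoBorrow() { if (o) { o->decrementRoBorrows(); } }

    // Only hands out a const reference, so only const operations compile.
    operator const T&() const { return o->i; }
};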

When using this mechanism the following code fails (at runtime):

 auto b1 = owner.borrowReadOnly();
 auto b2 = owner.borrowReadWrite();
 
But this code works:

{
    auto b1 = owner.borrowReadOnly();
    auto b2 = owner.borrowReadOnly();
}
{
    auto b3 = owner.borrowReadWrite();
}
{
    auto b4 = owner.borrowReadOnly();
}

because the borrows go out of scope and are destroyed, decrementing the borrow counts back to zero.

Why can't this be done at compile time?

I'm not a TMP specialist, so maybe it can, but as far as I know it is not possible. I'd love to be proven wrong on this one. The issue is due to limitations of constexpr, which can be distilled into this dummy function:

template<typename T>
constexpr void foo(bool b) {
    T x;
    constexpr int refcount = 1;
    if constexpr(b) {       // does not compile: b is not a compile-time constant
        // do something
        --refcount;         // does not compile: a constexpr variable can not be mutated
    } else {
        // do something else
        --refcount;
    }
    static_assert(refcount == 0);
}

The static assert is obviously true no matter which code path is executed but the compiler can't prove that. First of all the if clause does not compile because its argument is not a compile-time constant. Second of all, the refcount decrements do not compile because constexpr variables can not be mutated during evaluation. You can create a new variable with the value x-1 but it would go out of scope when the subblocks end and there is no phi-node -like concept to get the value out to the outer scope. Finally, destructors can not be constexpr so even if we could mutate the count variable, it would not be done automatically (even though in theory the compiler has all the information it needs to do that).