tiistai 18. heinäkuuta 2017

Experiment: binary size reduction by using common function tails

In embedded development the most important feature of any program is its size. The raw performance does not usually matter that much, but size does. A program that is even one byte larger than available flash size is useless.

GCC, Clang and other free compilers do an admirable job of creating small executables when asked to with the -Os compiler switch. However, there are still optimizations that could be added. Suppose we have two functions that look like this:

int func1();
int func2();
int func3();

int funca() {
  int i = 0;
  i+=func2();
  return i+func1();
}

int funcb() {
  int i = 1;
  i+=func3();
  return i+func1();
}

They would get compiled into the following asm on x86-64:

funca():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 0
        call    func2()
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret
funcb():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 1
        call    func3()
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret

If you look carefully, the last 7 instructions of both of these functions are identical. In fact, the code above can be rewritten like this:

funca():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 0
        call    func2()
common_tail:
        add     DWORD PTR [rbp-4], eax
        call    func1()
        mov     edx, eax
        mov     eax, DWORD PTR [rbp-4]
        add     eax, edx
        leave
        ret
funcb():
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], 1
        call    func3()
        jmp common_tail

Depending on your point of view this can be seen either as a cool hack or as an insult to everything that is good and proper in the world. funcb does an unconditional jump into the body of an unrelated function. The reason this works is that we know both functions end in a ret instruction, which pops a return address from the stack and jumps to it (that is, back into the function that called the current one). Since both code sequences are identical, they can be collapsed into one. This is an optimization that can only be done at the assembly level, because C prohibits gotos between functions.

How much does this save?

To test this I wrote a simple Python script that parses assembly output, finds the ends of functions and replaces common tails with jumps as described above. It uses a simple heuristic and only does the reduction if there are three or more common instructions. Then I ran it on the assembly output of SQLite's "amalgamation" source file. That resulted in reductions such as this one:

Ltail_packer_57:
        setne   %al
Ltail_packer_1270:
        andb    $1, %al
        movzbl  %al, %eax
        popq    %rbp
        retq

This function tail is being used in two different ways, sometimes with the setne instruction and sometimes without. In total the asm file contained 1801 functions. Out of those, 1522 could be dedupped. The most common removals looked like this:

       addq    $48, %rsp
       popq    %rbp
       retq

That is, the common function suffix. Interestingly, when the dedupped asm is compiled, the output is about 10 kB bigger than without dedupping. The original code was 987 kB. I did not measure where the difference comes from. It could be because the extra labels need extra metadata, or because the jmp instruction takes more space than the instructions it replaces: a jmp with a 32 bit offset is five bytes, while the three instruction suffix above is only six, so each such replacement saves almost nothing. A smarter implementation would try to minimize the jump distance so that it fits in an 8 bit displacement, which x86-64 can encode as a two byte short jump instead of the five byte near form.
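
For the curious, the core of such a pass can be sketched in a few dozen lines of Python. This is not the script I used, just a rough illustration of the idea: it assumes one instruction or label per line, treats a bare ret/retq as the end of a function and makes up its own label names.

#!/usr/bin/env python3
# Rough sketch of the tail merging pass, not the actual measurement script.
import sys

MIN_TAIL = 3  # only merge tails of at least three instructions

def split_into_functions(lines):
    # Split the assembly into chunks, each ending at a ret instruction.
    chunks, current = [], []
    for line in lines:
        current.append(line)
        if line.strip() in ('ret', 'retq'):
            chunks.append(current)
            current = []
    if current:  # trailing directives after the last ret
        chunks.append(current)
    return chunks

def main():
    funcs = split_into_functions(sys.stdin.read().splitlines())
    seen = {}     # tail -> (function index, start line) of first occurrence
    borrows = {}  # function index -> (cut point, label to jump to)
    labels = {}   # (function index, line) -> label to emit at that line
    label_count = 0
    # Pass 1: for each function, find the longest tail that has already
    # appeared in an earlier function and arrange a shared label for it.
    for i, body in enumerate(funcs):
        for start in range(len(body)):
            tail = tuple(line.strip() for line in body[start:])
            if len(tail) < MIN_TAIL:
                break
            if tail in seen:
                donor = seen[tail]
                if donor not in labels:
                    labels[donor] = 'Ltail_%d' % label_count
                    label_count += 1
                borrows[i] = (start, labels[donor])
                break
        for start in range(len(body) - MIN_TAIL + 1):
            tail = tuple(line.strip() for line in body[start:])
            seen.setdefault(tail, (i, start))
    # Pass 2: emit the rewritten assembly with duplicated tails replaced by jumps.
    out = []
    for i, body in enumerate(funcs):
        cut = borrows[i][0] if i in borrows else len(body)
        for lineno, line in enumerate(body[:cut]):
            if (i, lineno) in labels:
                out.append(labels[(i, lineno)] + ':')
            out.append(line)
        if i in borrows:
            out.append('        jmp ' + borrows[i][1])
    sys.stdout.write('\n'.join(out) + '\n')

if __name__ == '__main__':
    main()

A real pass would additionally have to cope with assembler directives, comments and data interleaved with the code, which is where most of the fiddly work lies.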

Is this actually worth doing?

On x86 probably not, because the machines have a lot of RAM to spare and the people running them usually care mostly about raw performance. The x86 instruction set is also quite compact because it uses a variable length encoding. The situation is different on ARM and other embedded platforms. They have fewer instructions and a fixed encoding size (usually 32 bits), which means longer instruction sequences and therefore more potential for size reductions. Some embedded compilers do this optimization, so The Real World would seem to indicate that it is worth it.

I wanted to run the test on ARM assembly as well, but parsing it for function tails is much more difficult than for x86 asm, so I just gave up. Knowing the real benefits would thus require comments from an actual compiler engineer. I don't even pretend to be one on the Internet, so I just filed a feature request about this in the Clang bug tracker.

keskiviikko 12. heinäkuuta 2017

Is every build system using Ninja just as fast as every other?

One of the most common arguments against Meson is that "it is only fast because it uses Ninja rather than Make, using any other Ninja build generator would be just as fast". This is always stated as fact without any supporting evidence or measurements. But is this really the case? Let's find out.

For testing one needs a project that has both CMake and Meson build definitions. I'm not aware of any so I created one myself. I took the source code of the Mediascanner 2 project, which is using CMake and converted it to use Meson. This project was chosen solely based on the fact that I wrote the original CMake definitions ages ago so I should have a fairly good understanding of the code base. The project itself is a fairly typical small-to-medium project written in C++ with a handful of system dependencies.

Compiling and running the project on a regular laptop gives a fairly straightforward answer: both build systems are roughly equally fast. So, case closed then?

Well no, not really. The project is small and machines today are very fast so a similar result is not very surprising. For a better test we would need to either convert a much larger project or use a slower machine. The former is not really feasible, but the latter can be achieved simply by running the tests on a Raspberry Pi 2. This is a fairly good real world test, as compiling programs of this size on a raspi is a common task.

The measurements

The tests were run on a Raspberry Pi 2 running Jessie with a Sid chroot. The CMake version was the latest in Sid, whereas for Meson trunk was used. Each measurement was done several times and the fastest time was always chosen. If you try to replicate these results yourself, note that there is a lot of variance between consecutive runs, so you have to run them many times. The source code can be found in this repository.
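
The measurement loop is essentially the following. This is only a sketch: the build directory name, the file being touched and the exact Meson invocation are placeholders rather than the real setup, and the CMake side is timed the same way with a cmake -G Ninja call in place of the Meson one.

#!/usr/bin/env python3
# Sketch of the measurement procedure: run each step several times and
# keep the fastest time. Paths and commands below are placeholders.
import shutil
import subprocess
import time

RUNS = 5
SOURCE_DIR = 'mediascanner2'                  # hypothetical checkout location
TOUCH_FILE = SOURCE_DIR + '/src/somefile.cc'  # hypothetical source file

def timed(cmd, prepare=None):
    # Return the fastest wall clock time of RUNS runs of cmd.
    best = float('inf')
    for _ in range(RUNS):
        if prepare:
            prepare()
        start = time.monotonic()
        subprocess.check_call(cmd)
        best = min(best, time.monotonic() - start)
    return best

def wipe_build_dir():
    shutil.rmtree('builddir', ignore_errors=True)

def wipe_and_configure():
    wipe_build_dir()
    subprocess.check_call(['meson', 'builddir', SOURCE_DIR])

def touch_one_file():
    subprocess.check_call(['touch', TOUCH_FILE])

print('configure:   %.1f s' % timed(['meson', 'builddir', SOURCE_DIR],
                                     prepare=wipe_build_dir))
print('full build:  %.1f s' % timed(['ninja', '-C', 'builddir'],
                                     prepare=wipe_and_configure))
subprocess.check_call(['ninja', '-C', 'builddir'])  # ensure the tree is up to date
print('incremental: %.1f s' % timed(['ninja', '-C', 'builddir'],
                                     prepare=touch_one_file))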

The first measurement is how long running the first configuration step takes.

CMake takes 12 seconds whereas Meson gets by with only four. This is fairly surprising, as CMake is a C++ executable whereas Meson is implemented in Python, so one would expect the former to be faster. The configuration step is run seldom, though, so it's ultimately not that interesting. Most of the time is probably spent on full builds, so let's look at that next.

CMake takes 2 minutes and 21 seconds to do a full build. Meson is 31 seconds, or about 20%, faster, clocking in at 1 minute 50 seconds. Both systems build the same files and have the same number of build steps, 63, as reported by Ninja.

Finally, let's look at the most common task during development: incremental compilation. This is measured by touching one source file and running Ninja.

In this case Meson is almost 50% faster than CMake. This is due to the smart link skipping functionality that we stole from LibreOffice (who stole it from Chromium :). Roughly speaking, a shared library's list of exported symbols is written to a separate file when it is built, and targets linking against it are relinked only if that list has changed. This improvement can have a big impact during day-to-day development, as it allows faster iteration cycles.

Conclusions

Ninja is awesome, but it is not made of magic: the quality of the generated Ninja file has a direct impact on build times. Based on the experiments run here, it seems that Meson performs consistently faster than CMake.