Sunday, January 29, 2017

Testing exception vs error code behaviour with real world code

Our previous looks at exception performance have used microbenchmarks and generated code (here and here). These are fun and informative tests but ultimately flawed. What actually matters is performance on real world code. The straightforward way of measuring this is to convert a program using error codes into one using exceptions. This is harder than it seems.

The problem is that most code using error codes is not exception safe so this change can not be done easily (I tried, would not recommend). Fortunately there is a simple solution: going the other way. A fully exception safe code base can be easily converted into one using GLib style error objects with the following simple steps:

  1. replace every instance of throw with a call to a function such as Error* create_error(const char *msg)
  2. replace all catch statements with equivalent error object handling code
  3. alter the function signature of every function that creates or uses error objects from void func() to void func(Error **e)
  4. change every call site to pass an error object and check whether it is non-null after the call, returning immediately if it is
  5. repeat steps 3 and 4 until the compiler errors go away
This implements the exception control flow with explicit hand-written code.
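
To make the conversion concrete, here is a minimal sketch of the before and after pattern. The names (parse, read_header, File) and the stub definitions are made up for illustration; Error and create_error are as described in step 1, and the real examples from Parzip follow below.

#include <stdexcept>

struct Error { const char *msg; };
Error* create_error(const char *msg) { return new Error{msg}; }
struct File { bool has_magic_number() const { return true; } };
void read_header(File &) {}           /* exception version helper */
void read_header(File &, Error **) {} /* error object version helper */

/* Before: exception version. */
void parse(File &f) {
    if(!f.has_magic_number()) {
        throw std::runtime_error("Bad magic number.");
    }
    read_header(f);
}

/* After: the same function converted to GLib style error objects. */
void parse(File &f, Error **e) {
    if(!f.has_magic_number()) {
        *e = create_error("Bad magic number.");
        return;
    }
    read_header(f, e);
    if(*e) {
        return;
    }
}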

For testing we used Parzip, a high performance PkZip implementation. For simplicity we removed all parallel code and also the parts that deal with Zip file creation. Thus we ended up with a simple single threaded Zip file unpacker. The full code is available in this repository.

Source code size

LOC is a poor metric in general but illuminating in this case, because it directly shows the difference between the two approaches. The measurement was done with wc, and if we ignore the license header on each file, the implementation using exceptions has 1651 lines while the one with error objects has 1971 lines. In other words the error code version has roughly 19% more lines of source code than the equivalent code using exceptions.

Looking at the code it is easy to see where this difference comes from. As an extreme example, here is a function that reads data from a file with endianness swapping:

localheader read_local_entry(File &f) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le();
    h.gp_bitflag = f.read16le();
    h.compression = f.read16le();
    h.last_mod_time = f.read16le();
    h.last_mod_date = f.read16le();
    h.crc32 = f.read32le();

    /* And so on for a while. */
    return h;
}

And this is how it looks when using error objects:

localheader read_local_entry(File &f, Error **e) {
    localheader h;
    uint16_t fname_length, extra_length;
    h.needed_version = f.read16le(e);
    if(*e) {
        return h;
    }
    h.gp_bitflag = f.read16le(e);
    if(*e) {
        return h;
    }
    h.compression = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_time = f.read16le(e);
    if(*e) {
        return h;
    }
    h.last_mod_date = f.read16le(e);
    if(*e) {
        return h;
    }
    h.crc32 = f.read32le(e);
    if(*e) {
        return h;
    }

    /* And so on and so on. */
    return h;
}

This example nicely shows the way that exceptions can make code more readable. The former sample is simple, straightforward, linear and understandable code. The latter is not. It looks like idiomatic Go (their words, but also mine). Intermixing the happy path with error handling makes the code quite convoluted.

Binary size

We tested the size of the generated code with GCC 6.2.0, Clang 3.8.1 and Clang trunk. We built the code on 64 bit Ubuntu 16.10 using -g -O2, and the error code version was built both with and without -fno-exceptions.

The results are, well, weird. Building with -fno-exceptions reduces code size noticeably in every case, but the other parts are odd. When not using -fno-exceptions, the code that uses exceptions is within a few dozen bytes of the size of the code that uses error objects. GCC even manages to make the error code version smaller, despite the above-mentioned 19% extra lines of code. Some of this may be caused by the fact that -fno-exceptions links in less stuff from the C++ standard library. This was not researched further, though.

Looking at the ELF tables we find that the extra size taken by exceptions goes into the text segment rather than the data segment. One would have expected it to go into unwind tables in a read-only data segment instead of text (which houses runnable code).

Things get even weirder with noexcept specifications. We added those to all functions in the exception code base that do not take an error object pointer in the error code version (meaning they can never throw). We then measured the sizes again and found that the resulting binaries had exactly the same size. On all three compilers. One would have expected some change (smaller exception unwind tables, for instance), but no.
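
To make the noexcept change concrete, this is the kind of annotation that was added (a sketch with a made-up function name, not the actual Parzip code); anything that does I/O and can therefore throw was left alone:

#include <cstdint>

/* A pure computation that can never throw gets the specifier. */
uint32_t combine_dos_datetime(uint16_t date, uint16_t time) noexcept {
    return (uint32_t(date) << 16) | time;
}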

What have we learned from all this?

Mainly that things behave in more complex ways than one would expect. The binary sizes especially are unexpected, particularly the way Clang produces the same binary size for both error objects and exceptions. Even with this contradictory information we can draw the following conclusions:
  • using exceptions can cause a noticeable reduction in lines of code compared to the equivalent functionality using error objects
  • compiling with -fno-exceptions can reduce binary size by 10-15%
Other than that, these measurements really raise more questions than they answer. Why does noexcept not change the output size? Why does Clang produce the same code size with exceptions and error objects? Are exception performance and code size fully optimized, or could they be improved further (this area has probably not had as much work done on it because many compiler projects do not use exceptions internally)?

Sunday, January 22, 2017

A Python extension module using C, C++, FORTRAN and Rust

One of the advantages of using a general build system is that combining code written in different languages is easy. To demonstrate this I wrote a simple Python module called Polysnake. It compiles and links four different compiled languages into one shared extension module. The languages are C, C++, FORTRAN and Rust.

The build system setup consists of five declarations in Meson. It should be doable in four, but Rust is a bit special. Ignoring the boilerplate the core business logic looks like this:

rustlib = static_library('func', 'func.rs')

py3_mod.extension_module('polysnake',
  'polysnake.c',
  'func.cpp',
  'ffunc.f90',
  link_with : rustlib,
  dependencies : py3_dep)

The code is available on Github.
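
The glue that makes this work is that every piece talks through the plain C ABI. As an illustration (a sketch with made-up function names, not the exact code in the repo), the C++ part might look like the following, with the FORTRAN and Rust sources doing the equivalent via bind(C) and #[no_mangle] extern "C":

/* func.cpp (sketch): expose a C++ function with C linkage so the
 * C extension module (polysnake.c) can call it directly. */
#include <string>

extern "C" const char* cpp_line() {
    /* Static so the pointer remains valid after the call returns. */
    static std::string line("This line is created in C++.\n");
    return line.c_str();
}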

Compiling and running

Compiling is done using the default Meson commands:

meson build
ninja -C build

Once built the main script can be run with this command:

PYTHONPATH=build ./main.py

The script just calls into the module and prints the string it returns. The output looks like this:

Combining many languages is simple.

This line is created in C.
This line is created in FORTRAN.
This line is created in C++.
This line is created in Rust.

Why not COBOL?

Saturday, January 14, 2017

Using a smart card for decryption from the command line

In Finland the national ID card is a fully featured smart card, much like in Belgium and several other countries. It can be used for all sorts of cool things, like SSH logins. Unfortunately using this card for HW crypto operations is not easy and there is not a lot of easily digestible documentation available. Here's how you would do some simple operations from the command line.

The commands here should work on most other smart cards with very few changes. Note that these are just examples; they are not hardened for security at all. In production you'd use the libraries directly, keep all data in memory rather than putting it in temp files, and so on.

First you plug in the device and card and check the status:

$ pkcs15-tool -k

Private RSA Key [todentamis- ja salausavain]
Object Flags   : [0x1], private
Usage          : [0x26], decrypt, sign, unwrap
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:01;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 01
ID             : 45
MD:guid        : XXXXXXXXXXXX

Private RSA Key [allekirjoitusavain]
Object Flags   : [0x1], private
Usage          : [0x200], nonRepudiation
Access Flags   : [0x1D], sensitive, alwaysSensitive, neverExtract, local
Access Rules   : execute:02;
ModLength      : 2048
Key ref        : 0 (0x0)
Native         : yes
Path           : XXXXXXXXXXXX
Auth ID        : 02
ID             : 46
MD:guid        : XXXXXXXXXXXX

This card has two keys. We use the first one whose usage is "decrypt". Its ID number is 45. Now we need to extract the public key from the card:

$ pkcs15-tool --read-public-key 45 > mykey.pub

Next we generate some data and encrypt it with the public key. The important thing to note here is that plain RSA can only encrypt a small amount of data at a time: at most the modulus size minus padding overhead, which for this 2048-bit key with PKCS#1 v1.5 padding is 245 bytes.

$ echo secret message > original

Then we encrypt this message with OpenSSL using the extracted public key.

$ openssl rsautl -encrypt -inkey mykey.pub -pubin -in original -out encrypted

The file encrypted now contains the encrypted message. The only way to decrypt it is to transfer the data to the smart card for decryption, because the private key can only be accessed inside the card. This is achieved with the following command.

$ pkcs11-tool --decrypt -v --input-file encrypted --output-file decrypted --id 45 -l -m RSA-PKCS --pin 1234

After this the decrypted contents have been written to the file decrypted. The important bit here is -m RSA-PKCS, which specifies the exact form of RSA to use. There are several and if you use the wrong one the output will be random bytes without any errors or warnings.
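
Doing the same from a program instead of the command line means calling the PKCS#11 API directly. Below is a rough sketch of that; the module path, PIN, key ID and buffer sizes are assumptions matching the examples above, the header location varies between distributions, and real code would of course handle errors and the PIN properly.

#include <p11-kit/pkcs11.h> /* header location varies between distributions */
#include <dlfcn.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

static void ck(CK_RV rv, const char *what) {
    if(rv != CKR_OK) {
        fprintf(stderr, "%s failed: 0x%lx\n", what, (unsigned long)rv);
        exit(1);
    }
}

int main() {
    /* Load the OpenSC PKCS#11 module (path is an assumption). */
    void *mod = dlopen("/usr/lib/x86_64-linux-gnu/opensc-pkcs11.so", RTLD_NOW);
    if(!mod) {
        fprintf(stderr, "Could not load PKCS#11 module.\n");
        return 1;
    }
    auto get_list = (CK_C_GetFunctionList)dlsym(mod, "C_GetFunctionList");
    CK_FUNCTION_LIST_PTR p;
    ck(get_list(&p), "C_GetFunctionList");
    ck(p->C_Initialize(nullptr), "C_Initialize");

    /* Use the first slot that has a token (the card) present. */
    CK_ULONG nslots = 0;
    ck(p->C_GetSlotList(CK_TRUE, nullptr, &nslots), "C_GetSlotList");
    std::vector<CK_SLOT_ID> slots(nslots);
    ck(p->C_GetSlotList(CK_TRUE, slots.data(), &nslots), "C_GetSlotList");

    CK_SESSION_HANDLE s;
    ck(p->C_OpenSession(slots[0], CKF_SERIAL_SESSION, nullptr, nullptr, &s),
       "C_OpenSession");
    ck(p->C_Login(s, CKU_USER, (CK_UTF8CHAR_PTR)"1234", 4), "C_Login");

    /* Find the private key with ID 0x45, the decryption key listed above. */
    CK_OBJECT_CLASS cls = CKO_PRIVATE_KEY;
    CK_BYTE id = 0x45;
    CK_ATTRIBUTE tmpl[] = {
        {CKA_CLASS, &cls, sizeof(cls)},
        {CKA_ID, &id, sizeof(id)},
    };
    ck(p->C_FindObjectsInit(s, tmpl, 2), "C_FindObjectsInit");
    CK_OBJECT_HANDLE key;
    CK_ULONG nfound = 0;
    ck(p->C_FindObjects(s, &key, 1, &nfound), "C_FindObjects");
    p->C_FindObjectsFinal(s);
    if(nfound != 1) {
        fprintf(stderr, "Key not found.\n");
        return 1;
    }

    /* Read the encrypted blob and ask the card to decrypt it with RSA-PKCS. */
    FILE *f = fopen("encrypted", "rb");
    std::vector<CK_BYTE> in(256);
    in.resize(fread(in.data(), 1, in.size(), f));
    fclose(f);
    CK_MECHANISM mech = {CKM_RSA_PKCS, nullptr, 0};
    ck(p->C_DecryptInit(s, &mech, key), "C_DecryptInit");
    std::vector<CK_BYTE> out(256);
    CK_ULONG outlen = out.size();
    ck(p->C_Decrypt(s, in.data(), in.size(), out.data(), &outlen), "C_Decrypt");
    fwrite(out.data(), 1, outlen, stdout);

    p->C_Logout(s);
    p->C_CloseSession(s);
    p->C_Finalize(nullptr);
    return 0;
}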

What can this be used for?

Passwordless full disk encryption is one potential use. Combining a card reader with a Raspberry Pi and an electrical lock makes for a pretty spiffy DIY physical access control system.

Sunday, January 8, 2017

Measuring execution performance of C++ exceptions vs error codes

In an earlier blog post we found out that C++ exceptions produce smaller executables than corresponding code using error codes. The most common comment to it was that executable size is not that important, performance is what matters. Since we have the tooling, let's do perf measurements as well.

Test setup

We generate code that calls a few other functions that call other functions and so on, until we reach the bottommost function. In that function we either raise an exception (or produce an error object) or return success. This gives us two parameters to vary: the depth of the call hierarchy and the percentage of calls that produce an error.

The code itself is simple. The C++ version looks like this:

int func0() {
    int num = random() % 5;
    if(num == 0) {
        return func1();
    }
    if(num == 1) {
        return func2();
    }
    if(num == 2) {
        return func3();
    }
    if(num == 3) {
        return func4();
    }
    return func5();
}

We generate a random number and based on that call a function deeper in the hierarchy. This means that every time we call the test it goes from the top function to the final function along a random path. We use the C random number generator and seed it with a known value so both the C and C++ versions take the same call chain path, even though they are different binaries. Note that on average this function calls a function whose number is its own number plus three, so if we generate 100 functions the call stack is on average 100/3 = 33 entries deep when the error is generated.

The plain C version that uses error objects is almost identical:

int func0(struct Error **error) {
    int num = random() % 5;
    int res;
    if(num == 0) {
        res = func1(error);
    } else if(num == 1) {
        res = func2(error);
    } else if(num == 2) {
        res = func3(error);
    } else if(num == 3) {
        res = func4(error);
    } else {
        res = func5(error);
    }
    /* This is a no-op in this specific code but is here to simulate
     * real code that would check the error and based on that
     * do something. We must have a branch on the error condition,
     * because that is what real world code would have as well.
     */
    if(*error) {
        return -1;
    }

    return res;
}

The code for generating these test programs and running the measurements can be found in this Github repo.
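
The bottommost functions are where the failure actually originates. They presumably look something like this in the two variants (a sketch, not the exact generated code; ERROR_RATE stands in for whatever per-call error percentage the generator was given):

#include <cstdlib>
#include <stdexcept>

struct Error { const char *msg; };
Error* create_error(const char *msg) { return new Error{msg}; }

const int ERROR_RATE = 2; /* percent of calls that fail, an assumed parameter */

/* Exception variant: the leaf either throws or returns success. */
int leaf_func() {
    if(random() % 100 < ERROR_RATE) {
        throw std::runtime_error("leaf failure");
    }
    return 0;
}

/* Error object variant: the leaf either fills in the error or returns success. */
int leaf_func(struct Error **error) {
    if(random() % 100 < ERROR_RATE) {
        *error = create_error("leaf failure");
        return -1;
    }
    return 0;
}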

Results

We do not care so much about how long the executables take to run, only which one of them runs faster. Thus we generate output results that look like this (Ubuntu 16.10, GCC 6.2, Intel i7-3770):

CCeCCCCCCC
CCCCCCccec
eEcCCCeECE
cceEEeECCe
EEEcEEEeCE
EEEEEeccEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

In this diagram C means that plain C error codes are faster and E that exceptions are faster for that particular pair of parameters. Each row represents a different number of functions. The top row has 50 functions (a call depth of roughly 16) and each row adds 50, so the bottommost row has 500 functions.

The columns represent different error rates. The leftmost column has 0% errors, the next one has 1% and so on. A capital letter means that the runtime difference was bigger than 10%. Each executable was executed five times and the fastest run was taken in each case. Full results look like the following:

GCC 6.2      Clang 3.8.1  Clang 4.0 trunk

CCeCCCCCCC   EecCcCcCcC   ECcECcEeCC
CCCCCCccec   CceEEeEcEC   EEeEECEcCC
eEcCCCeECE   EEEEEECEeE   EEEECECeCC
cceEEeECCe   EcEeEECEEE   EEEEceeEeC
EEEcEEEeCE   EEEeEEEEEe   EEEEEEEEEE
EEEEEeccEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE
EEEEEEEEEE   EEEEEEEEEE   EEEEEEEEEE


Immediately we see that once the stack depth grows above a certain size (here 200/3 = 66), exceptions are always faster. This is not very interesting, because call stacks are usually not this deep (enterprise Java notwithstanding). For lower depths there is a lot of noise, especially for GCC. However, we can see that for small error rates exceptions are sometimes faster, especially with Clang. Because of the noise we redid the measurements several times. The individual numbers changed but the overall shape stayed the same: small error rates seem to favor exceptions, but as the error rate grows error codes become faster.

On a Raspberry Pi the results look like this:

GCC 4.9.2

eeccCCCCCC
EeeeccccCC
Eeeeeecccc
Eeeeeeeccc
EEeeeeeeec
EEEEEeeeee
EEEEEEEEee
EEEEEEEEEE
EEEEEEEEEE
EEEEEEEEEE

Here the results are a lot clearer and more consistent, probably due to the simpler architecture of the ARM processor. Exceptions are faster except for small call depths and large error rates. With 50 functions, error objects only become noticeably faster at a 4% error rate.

Finally let's do the same measurements on an i5-6600K.

GCC 4.8.4    GCC 5.4.0   Clang 3.{4, 6, 8}

ECCCCCCCCCC  eeccCCCCCC  EeeccCCCCC
EecCCCCCCCC  EEEeeeeccc  EEEEEeeeee
EEecCCCCCCC  EEEEEEeeee  EEEEEEEEEE
EEEeecCCCCC  EEEEEEEEEe  EEEEEEEEEE
EEEEeecccCC  EEEEEEEEEE  EEEEEEEEEE
EEEEEeeeccc  EEEEEEEEEE  EEEEEEEEEE
EEEEEEeeeec  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEeee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEee  EEEEEEEEEE  EEEEEEEEEE
EEEEEEEEEEe  EEEEEEEEEE  EEEEEEEEEE

Here we see that the compiler used can make a massive difference. GCC 4.8.4 favors error codes but 5.4.0 is the opposite, as are all versions of Clang (which produce identical results). On this platform exceptions seem to be even more performant than on the Pi.

The top left corner seems to be the most interesting part of this entire diagram, as it represents shallow call chains with few errors. This is similar to most real world code used in, for example, XML or JSON parsing. Let's do some more measurements focusing on this area.

i5               Raspberry Pi

GCC    Clang     GCC    Clang
5.4.0  3.8       4.9.2  3.5.0

cCCCC  eCCCC     eCCCC  eCCCC
ecCCC  EeCCC     ecCCC  ecCCC
ecCCC  EeccC     eccCC  eccCC
eccCC  EEecc     eecCC  eecCC
eeccC  EEEee     eeccC  eeccC


In these measurements the number of functions goes from 10 on the top row to 50 on the bottom row in increments of 10, and the columns go from a 0% error rate on the left to 4% on the right in increments of one percentage point. Here we see that exceptions keep their performance advantage when there are no errors, especially when using Clang, but error codes get better as soon as the error rate goes up even a little.

Conclusions

Whenever a discussion on C++ exceptions occurs, there is That Guy who comes in and says "C++ exceptions are slow, don't use them". At this point the discussion usually breaks down into fighting about whether exceptions are fast or not. These discussions are ultimately pointless, because asking whether exceptions are "fast" is the wrong question.

The proper question to ask is "under which circumstances are exceptions faster". As we have seen here, the answer is surprisingly complex. It depends on many factors, including the platform, compiler, code size and error rate. Blanket statements about the performance of X as compared to Y are usually simplistic, misleading and just plain wrong. Such is the case here as well.

(Incidentally if someone can explain why i7 has those weird performance fluctuations, please post in the comments.)

Thanks to Juhani Simola for running the i5 measurements.

Wednesday, January 4, 2017

Clarifications to xz compression article

My previous article on beating the compression performance of xz went a bit viral. Comments on various discussion forums seemed to bring up the same, slightly erroneous, points so here are a few clarifications.

Random access to files is a useless feature

For most use cases this is probably true. But that was not the main point of the experiment. The point was to enable parallel compression without making the final size much bigger. Even more important was to enable parallel decompression. Very few file formats support that without losing compression performance. As a classic example, PKZIP supports parallel decompression but its archives are a lot bigger than compressed tar files.

Tarballs are just fine and need no changes

Many people were of the opinion that tarballs are "good enough for source distribution". This is probably correct. Unfortunately when people hear the word "tar" they immediately think of source tarballs. That is only a small fraction of tar usage in the world. Many file formats have tar files (or equivalent) hidden inside them. Thus all operations on said files are serial and can use only one CPU at a time.

As an example, the deb file format consists of a bunch of metadata and an embedded tar file. RPM is the same, except that it has a cpio archive rather than a tar file. This means that installing updates is an inherently serial operation. Update packages can only be installed one at a time, and because they are in a tar-like format, only one CPU can be used to process them. This is particularly slow on platforms such as the Raspberry Pi that have underpowered multicore CPUs.

When running workloads such as a CI builder, there are multiple steps that are serial when using tar-like file formats. As an example pbuilder has the following steps:
  • unpack base image <- serial
  • install updates <- serial
  • install dependencies <- serial
  • unpack source <- serial
  • compile <- parallel
  • compress final result <- serial
If you are building Libreoffice, Firefox or Chromium, the compilation step dominates. For smaller packages the bottleneck may be elsewhere. As an extreme example, the Debian package build of Meson installs circa 1.5 GB of dependency packages but the actual build consists of a Python distutils invocation that takes all of one second (running the test suite does take a while, though).

This is not suitable for production due to X

It was never meant to be. It was merely an experiment to see whether a better file format is possible.

Pack/unpack operations are bound by disk, not by CPU

This is true on some platforms but not all of them. LZMA compression is almost always CPU bound even on the fastest CPUs and so is decompression if you have a fast SSD. Disks are getting faster all the time, CPUs are not, so CPUs will become even more of a bottleneck in the future.

What we have is good enough, the benefits are not worth an overhaul and even if they were inertia would prevent uptake

Maybe. But experiments are fun to do regardless.

Monday, January 2, 2017

Beating the compression performance of xz, or has the time come to dump tar?

Tar is one of the basic tools of Unix. It has been in use, basically unchanged, since 1979. It does one thing and it does it quite well. At its core it is a stream format for describing files; it consists of consecutive pairs of file metadata and content. This is a simple and reliable format, but it has a big downside: individual entries cannot be accessed without processing the entire file from the beginning. Wikipedia has further details.

Compression makes this even worse. Accessing byte n requires unpacking every byte from the beginning of the file. This is unfortunate in itself, but it has an even bigger downside: compression and decompression become inherently serial operations that cannot be run in parallel. This was not really an issue in the seventies, but nowadays even a Raspberry Pi has four cores. Using only one seems wasteful.

There are some attempts to work around this, such as pigz, but they just chop the full file into constant sized blocks and compress them in parallel. Even in this case parallel decompression is not possible.

Why is tar used then?

In addition to inertia, tar has one major feature on its side: it compresses really well. Modern compressors like LZMA (and even zlib) like having a lot of data to work with in order to achieve high compression ratios, and tar clumps everything into one file, which is good for compressors. Let's examine how much. For testing we took the Linux kernel 4.9 source tree. We first recompressed it with xz using the same encoder settings as for the other systems. The unpacked source is 762 MB and the compressed size is 92 MB, which is our baseline.

The obvious comparison to tar + xz is a zip file using LZMA compression. There is an implementation called Parzip that supports parallel LZMA compression out of the box. When run it produces a zip file that is 164 megabytes in size. It is easy to see why tar + xz is so popular. The main point of compression is to save space and having a file that is almost twice the size is unacceptable.

One inefficiency of PKZIP is that the metadata format is both messy and uncompressed. Creating a new file format that stores the index as a single compressed lump at the end of the file is straightforward. The code for this (and the rest of the examples listed here) can be found in the jpak Github repo. Doing this creates a file that is 158 megabytes in size. This is better, but the compression ratio is still unusably bad.
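
The overall layout is roughly the following (a sketch of the idea, not jpak's actual on-disk format): entry data first, a single compressed index lump at the end, and a fixed trailer that tells where the index starts.

#include <cstdint>

/* One record per archived file, stored inside the compressed index lump. */
struct IndexEntry {
    uint64_t offset;            /* where this entry's compressed data starts */
    uint64_t compressed_size;
    uint64_t uncompressed_size;
    /* name, permissions, timestamps and so on */
};

/* The last bytes of the archive. */
struct Trailer {
    uint64_t index_offset;      /* start of the compressed index lump */
    uint64_t index_size;        /* its compressed size */
    char magic[8];
};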

If we wish to preserve perfect O(1) random access to the compressed data, this is probably the best we can do. But if we loosen the requirements a bit we can get further. An archive basically consists of a sequence of files of different sizes. If we start from the top, pick files until their total uncompressed size reaches some predetermined limit, concatenate them into a single clump and compress that, we can give the compressor large chunks of data at a time and, in addition, compress the different clumps in parallel. To access a random file we need to decompress only the clump it is in, not the entire archive from the beginning. This makes the operation take O(clumpsize) time.
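
In code the clumping logic is roughly this (names and types are made up for illustration; the real implementation is in the jpak repo):

#include <cstdint>
#include <string>
#include <vector>

struct FileEntry {
    std::string path;
    uint64_t size; /* uncompressed size */
};

/* Group consecutive files into clumps of at most clump_limit uncompressed
 * bytes. Each clump is concatenated and compressed independently, which is
 * what makes parallel (de)compression and O(clumpsize) random access possible. */
std::vector<std::vector<FileEntry>>
build_clumps(const std::vector<FileEntry> &files, uint64_t clump_limit) {
    std::vector<std::vector<FileEntry>> clumps;
    std::vector<FileEntry> current;
    uint64_t current_size = 0;
    for(const auto &f : files) {
        if(!current.empty() && current_size + f.size > clump_limit) {
            clumps.push_back(std::move(current));
            current.clear();
            current_size = 0;
        }
        current.push_back(f);
        current_size += f.size;
    }
    if(!current.empty()) {
        clumps.push_back(std::move(current));
    }
    return clumps;
}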

If we do this using a clump size of 1 MB (uncompressed) the result is a 102 MB file. This is a big improvement but still lags behind tar + xz. Increasing the clump size to 10 MB produces a 93 MB file. Upping the size to 100 MB (which means that the archive will consist of 8 clumps) produces a 91 MB file. This is smaller than what xz produces.


Wait, what? How can it be smaller than tar+xz?

The key lies in the tar file format. It interleaves file data and metadata. Compressors do not really like this, because the two usually have different statistics. Putting similar data next to each other improves compression ratios. Jpak stores the file metadata in a structure-of-arrays formation. That is, each metadata entry is a structure with properties A, B, C and so on. Jpak first writes property A of all entries, then property B of all entries, and so on. This layout compresses better than tar's intermixed layout. Or that's the working theory at the moment; there has not been a thorough analysis.
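
To make the layout difference concrete, here is a sketch (illustrative only, not jpak's actual index encoding) of writing the same metadata in the two orders:

#include <cstdint>
#include <string>
#include <vector>

struct Meta {
    std::string name;
    uint64_t size;
    uint32_t mode;
};

/* Tar-style, array-of-structures: every entry's fields are interleaved. */
void write_aos(std::string &out, const std::vector<Meta> &entries) {
    for(const auto &e : entries) {
        out += e.name + '\0';
        out += std::to_string(e.size) + '\0';
        out += std::to_string(e.mode) + '\0';
    }
}

/* Jpak-style, structure-of-arrays: all names, then all sizes, then all modes.
 * Similar data ends up next to similar data, which compresses better. */
void write_soa(std::string &out, const std::vector<Meta> &entries) {
    for(const auto &e : entries) out += e.name + '\0';
    for(const auto &e : entries) out += std::to_string(e.size) + '\0';
    for(const auto &e : entries) out += std::to_string(e.mode) + '\0';
}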

It should be noted that the actual compression method is the same in every case: LZMA as provided by liblzma. The only difference is in how the data is laid out in the file. Changing the layout allows you to do the compression and decompression in parallel, get a random access index to your data and achieve as good or even better compression ratios, with the downside being slightly slower access to individual files.

The small print

The current implementation does not actually do the operations in parallel (though Parzip does). The code has not been tested very thoroughly, so it might fail in many interesting ways. Do not use it to store any data you care about. The current implementation stores user and group metadata only as numbers rather than as strings, so it is not exactly equivalent to tar (though user/group strings are usually the same for most files, so they compress really well). Other metadata bits may be missing as well.