Python Cryptography Testing and Coverage

Written by Dominik Pantůček on 2024-09-26

python crypt4gh

Working on a cryptographic library comes with certain challenges - like testing all the functions with (ideally all) incorrect inputs. In our recent project we used Python's testing framework pytest, with its support for coverage reporting, to make sure we handle various - but very specific - kinds of corrupted input.

Crypt4GH is a cryptographic container primarily designed to hold genomic data. The container starts with a static header consisting of the following fields:

8 bytes containing "crypt4gh" magic word
4 bytes little-endian unsigned integer with format version (which should always be 1 at the time of writing)
4 bytes little-endian unsigned integer with the number of header packets that follow
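The three fields above map directly onto a struct format string. The following is a minimal sketch, not the code of any particular library; the names STATIC_HEADER and parse_static_header are made up for illustration.

```python
import struct

# "<" = little-endian without padding, "8s" = 8 raw bytes,
# "I" = 4-byte unsigned integer (used twice: version, packet count)
STATIC_HEADER = struct.Struct("<8sII")


def parse_static_header(data: bytes) -> tuple[bytes, int, int]:
    """Return (magic, version, packet_count) from the first 16 bytes."""
    return STATIC_HEADER.unpack_from(data)


# A hand-built header: version 1, two header packets to follow.
sample = STATIC_HEADER.pack(b"crypt4gh", 1, 2)
magic, version, packet_count = parse_static_header(sample)
```

If the input is shorter than 16 bytes, unpack_from raises struct.error, which is already the first corrupted-input case a test should exercise.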

Each header packet consists of:

4 bytes little-endian unsigned integer with the packet size in bytes (including these 4 bytes)
packet payload
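Because the size field counts its own 4 bytes, a reader has to subtract them before reading the payload. A rough sketch of such a reader follows; the function name and the choice of exceptions are assumptions made for this example only.

```python
import io
import struct
from typing import BinaryIO, Iterator


def read_header_packets(stream: BinaryIO, count: int) -> Iterator[bytes]:
    """Yield the (still encrypted) payload of each of `count` header packets."""
    for _ in range(count):
        raw_length = stream.read(4)
        if len(raw_length) != 4:
            raise EOFError("truncated packet length")
        (length,) = struct.unpack("<I", raw_length)
        if length < 4:
            raise ValueError("packet size smaller than its own size field")
        # The size includes the 4 bytes of the size field itself.
        payload = stream.read(length - 4)
        if len(payload) != length - 4:
            raise EOFError("truncated packet payload")
        yield payload


# Two packets: a 3-byte payload and an empty one.
stream = io.BytesIO(struct.pack("<I", 7) + b"abc" + struct.pack("<I", 4))
packets = list(read_header_packets(stream, 2))
```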

After all header packets, the data blocks follow. However, the data blocks are of no interest to us today. We will focus on the keys needed to read the packet payloads, which are - as expected - encrypted.

The algorithm used for header packet encryption requires compatible keys to be used. For the particular elliptic curve used, the private key is just a 256-bit number (although in reality not all of its bits are used as-is) and the public key is a point on the curve. In this case only the X coordinate is needed, and therefore the public key is also a 256-bit number.

When a key is stored in a file, it is typically stored in an ASCII-armored envelope. This is true both for the Crypt4GH reference implementation key format and for compatible SSH keys. In the native format the same type of envelope is used for both the public and the private key; for SSH, however, the public key may be stored in a shorter single-line form (the one used by SSH's authorized_keys file).

If the private key is stored in cleartext, the software just needs to load the 256-bit number and work with it. However, the key should be protected by symmetric encryption with a passphrase-derived key, using one of the supported key derivation functions (KDFs).

The native format for private keys must use one of the following KDFs:

The derived symmetric key is then used to encrypt the actual private key using the chacha20-poly1305-ietf cipher with a random 96-bit initialization vector (IV), and the encrypted payload is supplemented by a 128-bit message authentication code (MAC). Unlocking a given key is therefore complete only when the key derived from the provided passphrase leads to a successful MAC verification of the decrypted payload.
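The passphrase-to-key step can be illustrated with the standard library alone. The sketch below uses PBKDF2-HMAC-SHA256 purely as a stand-in because hashlib ships it; the salt size and round count are invented values, and the chacha20-poly1305 decryption itself is left out since it needs a third-party library.

```python
import hashlib
import os


def derive_key(passphrase: bytes, salt: bytes, rounds: int) -> bytes:
    """Derive a 32-byte symmetric key from a passphrase (stand-in KDF)."""
    return hashlib.pbkdf2_hmac("sha256", passphrase, salt, rounds, dklen=32)


salt = os.urandom(16)  # assumed size; stored next to the encrypted key
key = derive_key(b"correct horse", salt, 100_000)
wrong = derive_key(b"incorrect horse", salt, 100_000)
```

A wrong passphrase still yields a perfectly valid-looking 32-byte key. Only the MAC check during decryption reveals the mismatch, which is why that failure needs its own test.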

When working with a software implementation of asymmetric keys, there are already possible issues that should be addressed regardless of the actual serialization format used:

public key data must be exactly 32 bytes long
you must not attempt creating shared symmetric keys without having the private key available
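Both rules translate into explicit checks, each with its own test. This is only a sketch: SoftwareKey, KeyException and require_private are hypothetical names, not the API of our library.

```python
from typing import Optional


class KeyException(Exception):
    """Hypothetical exception type for unusable keys."""


class SoftwareKey:
    """Holds a 32-byte public key and, optionally, the private key."""

    def __init__(self, public: bytes, private: Optional[bytes] = None):
        if len(public) != 32:
            raise KeyException("public key must be exactly 32 bytes long")
        self.public = public
        self.private = private

    def require_private(self) -> bytes:
        """Return the private key or fail before any key exchange starts."""
        if self.private is None:
            raise KeyException("private key needed to derive shared keys")
        return self.private


def test_short_public_key_is_rejected():
    try:
        SoftwareKey(b"\x00" * 31)
    except KeyException:
        return
    raise AssertionError("a 31-byte public key was accepted")


def test_public_only_key_cannot_derive_shared_keys():
    try:
        SoftwareKey(b"\x00" * 32).require_private()
    except KeyException:
        return
    raise AssertionError("shared key derivation allowed without private key")
```

The tests use plain try/except so the sketch runs without pytest installed. Under pytest, each body shrinks to a `with pytest.raises(KeyException):` block.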

For the reference implementation's key file format, c4gh, the error landscape grows significantly:

ASCII-armored envelope is incomplete or has improper format
the binary contents may be truncated
there may be an incorrect magic byte sequence
unknown KDF may be used
local implementation of certain KDF may not be available
any problems with passphrase getter callback are fatal for encrypted private keys
the key may be encrypted using unsupported symmetric cipher
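The first few items in this list lend themselves to table-driven tests: start from one valid key file and derive each corrupted variant from it. In the sketch below the envelope marker text, the magic value b"c4gh-v1" and every function name are assumptions made for illustration.

```python
import base64
import binascii

MAGIC = b"c4gh-v1"  # assumed magic prefix of the binary key blob


class KeyFormatError(Exception):
    """Hypothetical error for malformed key files."""


def unarmor(text: str) -> bytes:
    """Strip the BEGIN/END lines and decode the base64 body between them."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if len(lines) < 3:
        raise KeyFormatError("incomplete envelope")
    if not (lines[0].startswith("-----BEGIN ")
            and lines[-1].startswith("-----END ")):
        raise KeyFormatError("missing BEGIN/END markers")
    try:
        return base64.b64decode("".join(lines[1:-1]), validate=True)
    except binascii.Error as exc:
        raise KeyFormatError("body is not valid base64") from exc


def load_key_blob(text: str) -> bytes:
    """Return the bytes following the magic, or raise KeyFormatError."""
    blob = unarmor(text)
    if len(blob) < len(MAGIC):
        raise KeyFormatError("truncated binary contents")
    if not blob.startswith(MAGIC):
        raise KeyFormatError("incorrect magic byte sequence")
    return blob[len(MAGIC):]


def armor(blob: bytes) -> str:
    body = base64.b64encode(blob).decode("ascii")
    return f"-----BEGIN TEST KEY-----\n{body}\n-----END TEST KEY-----\n"


GOOD = armor(MAGIC + b"payload")
CORRUPTED = {
    "no end marker": GOOD.rsplit("-----END", 1)[0],
    "bad base64": GOOD.replace("\n", "\n!", 1),
    "truncated blob": armor(MAGIC[:3]),
    "wrong magic": armor(b"c4gh-v2" + b"payload"),
}


def rejected(text: str) -> bool:
    try:
        load_key_blob(text)
    except KeyFormatError:
        return True
    return False
```

With pytest, the CORRUPTED table would typically feed `@pytest.mark.parametrize`, so that each corrupted variant shows up as a separate test case in the report.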

The container header consists of three static fields followed by a variable number of packets. Even with the static fields, problems may arise:

invalid magic bytes can be present
unsupported format version may be signalled
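Giving each of these failures its own exception type makes it easy to see in the coverage report which branch a test actually hit. A minimal sketch, with hypothetical exception and function names:

```python
import struct


class HeaderError(Exception):
    """Hypothetical base class for static header problems."""


class TruncatedHeader(HeaderError):
    pass


class InvalidMagic(HeaderError):
    pass


class UnsupportedVersion(HeaderError):
    pass


def check_static_header(data: bytes) -> int:
    """Validate the 16 static bytes and return the header packet count."""
    if len(data) < 16:
        raise TruncatedHeader("static header needs 16 bytes")
    magic, version, count = struct.unpack_from("<8sII", data)
    if magic != b"crypt4gh":
        raise InvalidMagic("magic bytes do not match")
    if version != 1:
        raise UnsupportedVersion(f"unsupported format version {version}")
    return count


def raised_by(data: bytes):
    """Return the exception class raised for `data`, or None if it is valid."""
    try:
        check_static_header(data)
    except HeaderError as exc:
        return type(exc)
    return None


VALID = struct.pack("<8sII", b"crypt4gh", 1, 0)
BAD_MAGIC = b"crypt5gh" + VALID[8:]
BAD_VERSION = VALID[:8] + struct.pack("<I", 2) + VALID[12:]
```

Three corrupted inputs, derived from one valid header, are enough to reach every raise statement in the function.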

However, header packets give us the most problems to cover:

(un)usability of the reader key (for a software implementation this should not happen after all the checks covered above, however for hardware implementations...)
the packet may be truncated (the whole container may be)
unsupported encryption method might be used
it can be an unknown packet type
unknown data encryption algorithm may be indicated
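The last two items concern the payload after it has been decrypted. The sketch below assumes a particular layout based on our reading of the Crypt4GH specification: a 4-byte little-endian packet type (0 for data encryption parameters, 1 for an edit list), followed for type 0 by a 4-byte method identifier (0 for chacha20-poly1305) and a 32-byte key. Treat these constants and all names as assumptions to be checked against the specification.

```python
import struct

PACKET_DATA_ENCRYPTION = 0   # assumed packet type id
PACKET_EDIT_LIST = 1         # assumed packet type id
METHOD_CHACHA20_POLY1305 = 0  # assumed data encryption method id


class PacketError(Exception):
    """Hypothetical error for unusable header packets."""


def parse_decrypted_packet(payload: bytes) -> dict:
    """Classify an already decrypted header packet payload."""
    if len(payload) < 4:
        raise PacketError("truncated packet")
    (packet_type,) = struct.unpack_from("<I", payload)
    if packet_type == PACKET_DATA_ENCRYPTION:
        if len(payload) != 4 + 4 + 32:
            raise PacketError("truncated data encryption packet")
        (method,) = struct.unpack_from("<I", payload, 4)
        if method != METHOD_CHACHA20_POLY1305:
            raise PacketError(f"unknown data encryption algorithm {method}")
        return {"type": "data_encryption", "key": payload[8:]}
    if packet_type == PACKET_EDIT_LIST:
        return {"type": "edit_list", "raw": payload[4:]}
    raise PacketError(f"unknown packet type {packet_type}")


good = struct.pack("<II", 0, 0) + bytes(32)
parsed = parse_decrypted_packet(good)
```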

For data blocks it is rather easy, as the only exceptional state is a missing data encryption key (DEK) for decrypting a given data block - making such a block simply unreadable without further consequences.

In order to test for all these cryptographic gotchas, we "only" had to make sure we have them identified with corresponding code paths. Making sure that all of them are covered is then just a matter of checking that the given parts of the source code have 100% test coverage.

The pytest-cov plugin for pytest provides exactly this functionality. Running it with the coverage report enabled for a given package and creating the output report in HTML format for readability is as simple as:

pytest --cov=oarepo_c4gh tests --cov-report html

This was yet another pleasant surprise in the world of Python. Although the language itself has some questionable features (one-line lambdas being the most annoying one), it can be very useful nevertheless.

Hope you enjoyed this one again and as always - see ya next time!