Beating QOI - Part 2

June 01, 2024 - images rust

This is the second part of a series on how the png crate was able to achieve performance on-par or better than QOI, while remaining fully compatible with the PNG standard.

This post will cover the optimizations used by the PNG encoder to achieve QOI-class performance. The PNG format is more complex than QOI, but it is still possible to achieve competitive performance by leveraging SIMD instructions and other optimizations.

Filtering

The first step of encoding a PNG is to filter each pixel based on the preceding pixel values. This is a prediction step: pixel values are predicted from previously seen values, and only the difference is stored. Each row of the image shares the same filter, of which there are four to choose from (a scalar sketch follows the list):

  1. Up filtering: each pixel value is predicted to be the same as the one directly above it.
  2. Sub filtering: each pixel is predicted to be the same as the pixel directly left of it.
  3. Average filtering: each pixel is predicted to be the average of the pixel above and the pixel to the left.
  4. Paeth filtering: each pixel channel is predicted to be either the same as the pixel above, the pixel to the left, or the pixel above and to the left. The choice is made based on comparing those three values.
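
As a reference, here is a minimal scalar sketch of the predictors (our own illustration, not the crate's code; `x` is the byte being filtered, `a` the corresponding byte one pixel to the left, `b` the byte above, and `c` the byte above-left, with out-of-bounds neighbors treated as zero):

```rust
// Scalar sketch of PNG filtering for one byte. PNG applies filters per
// byte with wrapping arithmetic. Filter type codes follow the PNG spec:
// 0 = None, 1 = Sub, 2 = Up, 3 = Average, 4 = Paeth.
fn filter_byte(filter: u8, x: u8, a: u8, b: u8, c: u8) -> u8 {
    match filter {
        1 => x.wrapping_sub(a),                                 // Sub
        2 => x.wrapping_sub(b),                                 // Up
        3 => x.wrapping_sub(((a as u16 + b as u16) / 2) as u8), // Average
        4 => x.wrapping_sub(paeth(a, b, c)),                    // Paeth
        _ => x,                                                 // None
    }
}

// The Paeth predictor as specified, with explicit branches.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let p = a as i16 + b as i16 - c as i16;
    let (pa, pb, pc) = (
        (p - a as i16).abs(),
        (p - b as i16).abs(),
        (p - c as i16).abs(),
    );
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}
```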

PNG also allows for no filtering, but experiments found that filtering produced a better compression ratio on average, even when the heuristics we were using indicated that the "no filter" option should perform better. Thus the PNG encoder always selects one of the four filters.

Of the four, the Paeth filter is the most difficult to vectorize, because it is not a simple arithmetic function of the neighboring pixel values: selecting the predictor involves branching on comparisons between them.

A solution to this challenge is to use an alternative, branchless formulation of the Paeth filter, which allows the filter to be computed for multiple pixels at once using SIMD instructions.
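
Sketched in scalar form (our own illustration, not the crate's actual code), each comparison becomes an all-zeros or all-ones mask, and the predictor is selected by masking instead of branching, exactly what SIMD compare and blend instructions do across a whole vector of pixels:

```rust
// Branchless Paeth: comparisons become 0x0000/0xFFFF masks, and the
// predictor is selected by masking rather than branching.
fn paeth_branchless(a: u8, b: u8, c: u8) -> u8 {
    let (a16, b16, c16) = (a as i16, b as i16, c as i16);
    // Equivalent to the spec's |p - a|, |p - b|, |p - c| with p = a + b - c.
    let pa = (b16 - c16).abs();
    let pb = (a16 - c16).abs();
    let pc = (a16 + b16 - 2 * c16).abs();

    // 0xFFFF when the condition holds, 0x0000 otherwise.
    let use_a = (-(((pa <= pb) & (pa <= pc)) as i16)) as u16;
    let use_b = (-((pb <= pc) as i16)) as u16;

    // Blend: a if use_a, else b if use_b, else c.
    let bc = (use_b & b as u16) | (!use_b & c as u16);
    ((use_a & a as u16) | (!use_a & bc)) as u8
}
```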

Run length encoding

After filtering, a large fraction of the image data is zero or near zero. In fact, on the test corpus, approximately 50% of the bytes are zeros. This makes run-length encoding (RLE) an effective way to compress the image data. The RLE step replaces runs of repeated values with a single value and a count of how many times it is repeated. In principle, runs of any byte value could be replaced, but we opt to only replace runs of zeros.

In order to detect runs of repeated zero bytes, the image data is first split into 8-byte chunks, each interpreted as a 64-bit integer. This allows the CPU to compare eight bytes at once. If all eight bytes are zero, the run is extended into the next 8-byte chunk. If not, we use count_leading_zeros to compute the number of zero bytes at the start of the chunk; this is a fast operation implemented as a single instruction on modern CPUs. If there are at least four consecutive zero bytes, a run is written to the output stream.
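
A sketch of the chunk test (the helper name is ours): interpreting the chunk as a big-endian integer makes "zero bytes at the start of the chunk" the same as "leading zero bits of the integer":

```rust
// Count how many bytes at the start of an 8-byte chunk are zero.
// Big-endian interpretation puts the chunk's first byte in the integer's
// high-order bits, so leading zero bits map directly onto leading zero bytes.
fn leading_zero_bytes(chunk: [u8; 8]) -> u32 {
    let v = u64::from_be_bytes(chunk);
    if v == 0 {
        8 // entire chunk is zero; the run may continue into the next chunk
    } else {
        v.leading_zeros() / 8 // single LZCNT-style instruction on modern CPUs
    }
}
```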

Entropy coding

If the current chunk is not a zero run, then the next step is to apply entropy coding to the filtered image data. This is where the actual compression happens. In PNG, Huffman compression is used as the entropy coding scheme. Normally, the Huffman codes are determined by the frequency of each value in the image data, but this is not strictly necessary. Instead, we can use a fixed set of Huffman codes, because after the filtering step the frequency distribution of the data is actually quite similar from one image to the next.

The Huffman coding step also looks like a good candidate for SIMD optimization, because each byte can be handled via a simple table lookup: the table is indexed by the value to be encoded, and the result is the Huffman code. This lookup can be done for multiple values in parallel. However, it turns out that scalar code is actually faster than SIMD code for this step, because the specific SIMD instructions available on x86 are not well-suited to this kind of lookup.

The scalar code can do four array lookups for Huffman codes and four lookups of their lengths without any data dependencies between them. Then a small amount of bit manipulation combines those four codes into a single value of between 4 and 48 bits that gets pushed into the output buffer. Two sets of lookups are required to compress each 8-byte chunk.
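
The inner loop looks roughly like this (a sketch under our own assumptions about table layout and bit-buffer handling, not the crate's actual internals):

```rust
// Pack four Huffman codes into the output bit buffer. codes[b] holds the
// code bits for byte value b (LSB-first, as DEFLATE expects) and lens[b]
// its length in bits; the eight lookups are independent of one another.
fn write_four(
    bytes: [u8; 4],
    codes: &[u64; 256],
    lens: &[u32; 256],
    bitbuf: &mut u64,
    bitcount: &mut u32,
    out: &mut Vec<u8>,
) {
    let (c0, c1, c2, c3) = (
        codes[bytes[0] as usize],
        codes[bytes[1] as usize],
        codes[bytes[2] as usize],
        codes[bytes[3] as usize],
    );
    let (l0, l1, l2, l3) = (
        lens[bytes[0] as usize],
        lens[bytes[1] as usize],
        lens[bytes[2] as usize],
        lens[bytes[3] as usize],
    );

    // Combine the four codes into a single value of 4 to 48 bits.
    let combined = c0 | (c1 << l0) | (c2 << (l0 + l1)) | (c3 << (l0 + l1 + l2));
    let total = l0 + l1 + l2 + l3;

    // Push into the bit buffer, flushing whole bytes as they fill up.
    *bitbuf |= combined << *bitcount;
    *bitcount += total;
    while *bitcount >= 8 {
        out.push(*bitbuf as u8);
        *bitbuf >>= 8;
        *bitcount -= 8;
    }
}
```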

Checksums

The final step of encoding a PNG is to compute and append two checksums to the image data. In theory they guard against corruption of the image data, but file corruption is rare these days, so they mostly just add overhead. Thankfully, both are cheap to compute.

Adler-32

An Adler-32 checksum is taken of the uncompressed data after filtering. Adler-32 is a simple checksum that is fast to compute, but not as robust as other checksums. The usual description of Adler-32 is as a "rolling checksum" computed one byte at a time.

However, the excellent simd-adler32 crate provides a SIMD implementation of Adler-32 that is much faster than a scalar one. It uses an algorithmic trick to compute the checksum 32 or 64 bytes at a time, rather than one byte at a time. The key observation is that the seemingly recursive formula for Adler-32 can be rewritten as a weighted sum of the input bytes modulo the prime 65521.
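
For illustration (this shows the identity, not the crate's actual code): with n input bytes, the running sum b unrolls into a weighted sum where the first byte has weight n, the second n − 1, and so on. Each term depends only on its own byte, which is what makes the computation vectorizable:

```rust
const MOD: u32 = 65521; // largest prime below 2^16

// Byte-at-a-time definition: a is a running sum of the bytes, and b is
// a running sum of the a values.
fn adler32_serial(data: &[u8]) -> u32 {
    let (mut a, mut b) = (1u32, 0u32);
    for &d in data {
        a = (a + d as u32) % MOD;
        b = (b + a) % MOD;
    }
    (b << 16) | a
}

// Equivalent closed form: b is just a weighted sum of the bytes. Every
// term is independent, so SIMD lanes can accumulate them in parallel.
// (A real implementation reduces modulo MOD periodically to avoid
// overflow instead of accumulating in wide integers like this sketch.)
fn adler32_weighted(data: &[u8]) -> u32 {
    let n = data.len() as u64;
    let mut a = 1u64;
    let mut b = n; // the initial a = 1 contributes once per byte
    for (i, &d) in data.iter().enumerate() {
        a += d as u64;
        b += (n - i as u64) * d as u64;
    }
    ((b % MOD as u64) as u32) << 16 | (a % MOD as u64) as u32
}
```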

CRC-32

For the CRC-32 checksum, recent CPUs provide instructions (such as carry-less multiplication) that make computing a CRC over a block of data fast. It is slower to compute than Adler-32, but still quite fast. We rely on the crc32fast crate, which takes advantage of these instructions where available.
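
Using the crate is straightforward; in PNG, each chunk's CRC covers the chunk type code and the chunk data:

```rust
use crc32fast::Hasher;

// Compute the CRC-32 appended to a PNG chunk. Per the PNG spec, the CRC
// covers the chunk type and chunk data, but not the length field.
fn png_chunk_crc(chunk_type: &[u8; 4], data: &[u8]) -> u32 {
    let mut hasher = Hasher::new();
    hasher.update(chunk_type);
    hasher.update(data);
    hasher.finalize()
}
```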

Multithreading

NOTE: The following is a work in progress, but with very promising results so far.

A further optimization is to parallelize the encoding of multiple rows of pixels. Since PNG encoding is lossless and our encoder doesn't use back-references to previous rows (or any back-references at all), each row can be encoded independently.

The rayon crate is used to parallelize the encoding of rows. Each row is a separate task, and the rayon runtime will automatically distribute the tasks across the available CPU cores.

One complication is that, due to the variable length of each Huffman code, rows may not start and end at byte boundaries. This is solved by compressing each row into a temporary buffer, and then accumulating the compressed data into the final output buffer.
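
Put together, the structure looks roughly like this (`compress_row` is a hypothetical stand-in for the per-row filter + compress step, not the crate's API):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for filtering and compressing a single row.
fn compress_row(row: &[u8]) -> Vec<u8> {
    row.to_vec()
}

// Each row becomes a separate task; `collect` preserves row order.
fn encode_rows(rows: &[Vec<u8>]) -> Vec<u8> {
    let per_row: Vec<Vec<u8>> = rows
        .par_iter()
        .map(|row| compress_row(row))
        .collect();

    // Accumulate into the final stream. Because Huffman codes are not
    // byte-aligned, the real accumulation step must splice each row's
    // trailing partial byte onto the next row's output at the bit level.
    let mut out = Vec::new();
    for buf in &per_row {
        out.extend_from_slice(buf);
    }
    out
}
```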

Since rows are filtered and compressed independently, the Adler-32 checksums for each row are computed in parallel as well. Unlike with a cryptographic hash, it is possible to compute the Adler-32 checksum of two concatenated blocks of data from the checksums of the individual blocks (plus the length of the second block). This allows the per-row checksums to be computed in parallel and combined at the end.
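
A sketch of the combine step (the same identity zlib's adler32_combine uses): the a sums add directly, minus the duplicated initial 1, while the b sum picks up len2 copies of the first block's final offset:

```rust
const MOD: u32 = 65521;

// Combine the Adler-32 checksums of two blocks into the checksum of
// their concatenation. Each checksum packs b in the high 16 bits and a
// in the low 16 bits.
fn adler32_combine(adler1: u32, adler2: u32, len2: u64) -> u32 {
    let rem = (len2 % MOD as u64) as u32;
    let (a1, b1) = (adler1 & 0xffff, adler1 >> 16);
    let (a2, b2) = (adler2 & 0xffff, adler2 >> 16);

    // a = a1 + a2 - 1 (mod MOD): the sums add, but a2 starts from 1 again.
    let a = (a1 + a2 + MOD - 1) % MOD;
    // b = b1 + b2 + len2 * (a1 - 1) (mod MOD): every position in the
    // second block sees its running a offset by (a1 - 1).
    let b = (b1 + b2 + rem * ((a1 + MOD - 1) % MOD)) % MOD;

    (b << 16) | a
}
```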