LZ4 Compression

LZ4 is a compression scheme based on the LZ77 compression algorithm. It is a byte-oriented encoding that achieves compression by replacing byte sequences that occurred recently in the input stream with shorter symbols. Unlike Cascaded compression, LZ4 compression depends less on the input dataset having numerical patterns, and is therefore better suited to inputs such as character strings. A detailed description of the LZ4 compression scheme is available on the LZ4 compression GitHub page.

LZ4 performs the encoding by scanning the dataset while maintaining a hash table of recently seen byte sequences. When the sequence at the current position is found in the hash table, the encoder emits a short reference back to the earlier occurrence instead of the bytes themselves. When it is not found, the bytes are emitted as literals and the sequence is placed in the hash table (possibly evicting an existing entry). The sizes of the lookback window and the hash table are parameters that affect both the compression ratio and performance.
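
The scan described above can be illustrated with a toy LZ77-style encoder. This is a minimal sketch, not the LZ4 format and not the nvCOMP implementation: the names `toy_compress`, `toy_decompress`, `MIN_MATCH`, and `WINDOW` are hypothetical, the output is a list of Python tuples rather than a packed byte stream, and a Python `dict` stands in for the fixed-size hash table (so an entry is only replaced when the same sequence is seen again, rather than evicted on a hash collision).

```python
MIN_MATCH = 4      # assumed shortest sequence worth referencing
WINDOW = 65535     # assumed maximum look-back distance


def toy_compress(data: bytes):
    """Return tokens: ('lit', byte_value) or ('match', offset, length)."""
    table = {}     # MIN_MATCH-byte sequence -> most recent start position
    tokens = []
    i, n = 0, len(data)
    while i < n:
        key = data[i:i + MIN_MATCH]
        candidate = None
        if len(key) == MIN_MATCH:
            candidate = table.get(key)
            table[key] = i                      # remember the newest occurrence
        if candidate is not None and i - candidate <= WINDOW:
            # Extend the match as far as the bytes keep agreeing.
            length = MIN_MATCH
            while i + length < n and data[candidate + length] == data[i + length]:
                length += 1
            tokens.append(("match", i - candidate, length))
            i += length
        else:
            tokens.append(("lit", data[i]))     # no earlier occurrence: emit a literal
            i += 1
    return tokens


def toy_decompress(tokens) -> bytes:
    """Rebuild the input by replaying literals and back-references."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, offset, length = token
            start = len(out) - offset
            # Copy byte by byte so overlapping matches work.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)
```

In this sketch, repetitive input collapses into a few literals followed by long back-references, while input with no repeated 4-byte sequences is emitted entirely as literals.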

To efficiently parallelize LZ4, we divide the data into a series of blocks and compress (or decompress) the blocks concurrently. The size of each block is an input parameter to both the C and C++ API calls of the nvCOMP LZ4 compressor, and it affects both performance and compression ratio: smaller blocks generally expose more parallelism, while larger blocks generally give the compressor more history to match against.
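
The block-splitting idea can be sketched on the CPU as follows. This is only an illustration under stated assumptions: it does not use the nvCOMP API, the helper names (`split_blocks`, `compress_blocks`, `decompress_blocks`) are hypothetical, and `zlib` from the Python standard library stands in for LZ4 so that the example has no external dependencies.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data: bytes, block_size: int):
    """Cut the input into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data: bytes, block_size: int):
    """Compress every block independently (zlib is a stand-in for LZ4)."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor() as pool:
        # pool.map preserves block order, which decompression relies on.
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(compressed_blocks) -> bytes:
    """Decompress every block independently and concatenate the results."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, compressed_blocks))


def compressed_size(data: bytes, block_size: int) -> int:
    """Total size of all compressed blocks for a given block size."""
    return sum(len(block) for block in compress_blocks(data, block_size))
```

Because each block is compressed without reference to any other, repeated content that spans a block boundary cannot be matched. Calling `compressed_size` on the same input with a small and a large `block_size` shows the effect of the block size on the compression ratio.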