GDeflate#

GDeflate is a new compression stream format that closely matches the DEFLATE format. The key difference lies in the way the bits in the compressed bitstream are stored. The GDeflate stream is essentially a bit swizzled version of any DEFLATE format. The GDeflate stream is essentially a bit swizzled version of any DEFLATE stream where the swizzling is done in a particular way to extract 32 way parallelism. This means that GDeflate can get very high decompression throughput on the GPU while still maintaining the exact same compression ratio of DEFLATE (with some small caveats about end effects).

Quick introduction to DEFLATE#

DEFLATE is a compression format that uses LZ77 compression and further compresses that using Huffman coding. Literals and match copy lengths use a single Huffman tree and match copy distances use a separate Huffman tree. DEFLATE allows for three types of data blocks that make up the stream:

  1. Dynamic Huffman compressed blocks of data where the Huffman trees are part of the compressed stream

  2. Static Huffman compressed blocks where the Huffman trees are defined in the RFC and so don’t need to be included in the compressed stream and

  3. Stored blocks where the data is uncompressed.

Decompressing DEFLATE streams requires first decoding the Huffman encoded bitstream. Since the code lengths of each symbol in the stream are not known apriori, the stream has to be decoded serially and it is very hard to parallelize this decoding process. This is the issue that GDeflate addresses.

For a complete description of the DEFLATE compression format, refer to RFC 1951.

GDeflate stream definition#

GDeflate extracts parallelism from a DEFLATE stream by swizzling the bits in the bitstream based on a state transition idea. The final stream is a serialization of 32 independent sub-streams allowing for up to 32 way parallelism during decompression. The literal or length+distance bits of a regular DEFLATE stream are interleaved across the 32 sub-streams in groups. Each group has <= 32 literal or length+distance codes. If a sub-stream is assigned a length+distance code in group N, then it does not get assigned any bits in group N+1. All bits required for a single match copy (length + distance) are always assigned to the same sub-stream.

The serialization of these sub-streams is defined by the decompression state. When decompressing the stream with multiple threads, each thread maintains a 64 bit state that holds the next bits in its sub-stream. When the thread decodes a symbol, the bits used are invalidated from the thread’s state. The thread loads new data in 32 bit / 4 byte chunks whenever it has <= 32 valid bits left in its state. This load condition along with the sub-stream definition then gives the serialization order for the final GDeflate stream. When decompressing the stream, threads read data from their local state and decode Huffman symbols. Then all threads invalidate the used bits. Threads that then have <= 32 valid bits left load data in 4 byte chunks from the stream in the order that the threads are numbered (thread j after thread i for j > i).

The figure below shows a schematic of how a DEFLATE block is swizzled to get the GDeflate block. To make the schematic more readable, this swizzling idea is shown with 4 way parallelism instead of 32.

GDeflate swizzling schematic

This swizzling idea is also extended to the data describing the Huffman trees in the headers of dynamic Huffman blocks as well as the code lengths for the code length alphabet. The section below describes this in more detail.

GDeflate swizzling order#

Initialization#

At initialization, each thread loads 4 bytes of data from the stream and populates 32 bits of its 64 bit state.

Block header#

Each block contains a 3 bit header. These bits are always assigned to thread 0.

Dynamic Huffman compressed blocks#

14 bit header#

Dynamic Huffman compressed blocks have an additional 14 bit header. These bits are always assigned to thread 0. These bits define the integers HLIT, HDIST and HCLEN.

Code lengths for the code length alphabet#

Code lengths for the code length alphabet are 3 bits each, HCLEN+4 in number and are assigned to 32 threads in a round-robin fashion starting with thread 0. This assignment is in the DEFLATE defined order: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15.

Code lengths#

There are (HLIT + 257) + (HDIST + 1) code lengths that define the literal/length and distance Huffman trees. These are variable length Huffman encoded codes themselves and are assigned to 32 threads in a round-robin fashion starting with thread 0. Since the code lengths are run-length encoded, there may be fewer codes than the total number of code lengths.

Literal, length and distance symbols#

The bit assignment order for symbols is described in rounds. In each round, each thread decodes either a literal/length symbol or a distance symbol.

In the first round, each thread is assigned a literal/length code starting from thread 0.

If a thread gets a length code, the corresponding distance code is scheduled for the next round instead of assigning it a new literal/length code.

All threads that did not get a length code get assigned a new literal/length code in the next round.

Symbol 256 ends the literal/length codes.

There is a last round after encountering the end of block symbol (256) to process pending distance codes.

The table below shows symbols decoded by each thread for the example in the schematic. L denotes a literal/length symbol and D denotes a distance symbol.

Round

T0

T1

T2

T3

0

L

L

L

L

1

L

D

L

D

2

L

L

D

L

3

D

Static Huffman compressed blocks#

Static Huffman compressed blocks have pre-defined Huffman trees as described by DEFLATE. Threads only need to read literal, length and distance symbols from the stream and the format is the same as in dynamic Huffman compressed blocks.

Stored blocks#

Data#

Data in stored blocks are just single bytes and these bytes are assigned to threads in a round-robin fashion starting with thread 0.

Ending the stream#

The GDeflate stream requires zero padding at the end of the stream to guarantee that threads do not read data beyond the last valid bits that can corrupt the Huffman decode process. This is most easily done by padding the stream with 128 bytes of zeros but can be made more compact if possible.