Release notes#

nvcomp 4.2.0#

New features#

Added support for Blackwell HW Decompress Engine for Snappy, Gzip, and Deflate
Deflate and GDeflate compression now support chunk sizes larger than 64 KB

Bug Fixes#

Fixed issue in ZSTD compression that resulted in “unspecified launch error” when presented with very small buffers
Fixed the HLIF not raising exceptions in all failure cases

Known issues#

Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream
Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values
The Bitcomp low-level batched decompression function is not fully asynchronous
Gzip low-level interface only provides decompression
The device API only supports the LZ4/ANS format
Zstd decompression fails when decompressing buffers that were compressed at compression level 18 or higher with the zstd library version 1.5.6. To work around the problem temporarily, provide 1.5x the scratch size reported by nvcompBatchedZstdDecompressGetTempSize to nvcompBatchedZstdDecompressAsync. Please file an nvBug.
nvCOMP C++ APIs on Linux can only be used with GCC >=9.x compilers
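
The padding arithmetic for the Zstd scratch workaround above can be sketched as follows. This is illustrative Python only: `padded_temp_bytes` is a hypothetical helper, `reported_bytes` stands in for the value obtained from nvcompBatchedZstdDecompressGetTempSize, and no nvCOMP call is made.

```python
def padded_temp_bytes(reported_bytes: int, num: int = 3, den: int = 2) -> int:
    """Scale a reported scratch size by num/den (1.5x by default), rounding up.

    Hypothetical helper: `reported_bytes` is assumed to be the size that the
    temp-size query returned.
    """
    if reported_bytes < 0:
        raise ValueError("scratch size must be non-negative")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (reported_bytes * num + den - 1) // den
```

The result would then be used as the allocation size and as the temp-size argument passed to the decompression call.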

nvcomp 4.1.1#

Bug Fixes#

nvCOMP ZSTD compression exhibited failures / data corruption in the unlikely case where a ZSTD block contained only zero literals. Fixed by adding RLE literal support as required by the format.
Fixed bug in Deflate and Gzip uncompressed data size computation when non-compressed blocks (btype=00) were present in the deflate stream (compressed data).
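
For context on the btype=00 fix above: RFC 1951 stores each non-compressed block as a 3-bit header, padding to the next byte boundary, a 16-bit LEN, its one's complement NLEN, and then LEN raw bytes. The sketch below sums LEN over a raw deflate stream that contains only stored blocks. It is not nvCOMP's implementation; the function names are hypothetical, and zlib at level 0 is used only as a convenient way to produce such a stream.

```python
import struct
import zlib


def stored_blocks_uncompressed_size(raw_deflate: bytes) -> int:
    """Sum LEN over a raw deflate stream made only of stored (BTYPE=00) blocks."""
    pos = 0
    total = 0
    while True:
        if pos + 5 > len(raw_deflate):
            raise ValueError("stream ended before a final block")
        header = raw_deflate[pos]
        bfinal = header & 1           # bit 0: last-block flag
        btype = (header >> 1) & 3     # bits 1-2: block type
        if btype != 0:
            raise ValueError("this sketch only handles stored blocks")
        # The rest of the header byte is padding; LEN and NLEN follow as
        # little-endian 16-bit values.
        length, nlength = struct.unpack_from("<HH", raw_deflate, pos + 1)
        if length ^ nlength != 0xFFFF:
            raise ValueError("LEN/NLEN mismatch")
        total += length
        pos += 5 + length             # stored data ends on a byte boundary
        if bfinal:
            return total


def make_stored_stream(data: bytes) -> bytes:
    """Produce a raw deflate stream of stored blocks (zlib level 0, wbits=-15)."""
    comp = zlib.compressobj(0, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()
```

A stream that mixes stored and Huffman-coded blocks needs a full bit-level parser, because later block headers no longer start on byte boundaries.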

nvcomp 4.1.0#

New features#

Fine-grained LLIF buffer alignment querying through nvcompBatched<alg>CompressGetRequiredAlignments and nvcompBatched<alg>DecompressRequiredAlignments
Enabled level 0 compression (Huffman only) for Deflate
Custom-allocator support in the Python interface through the set_*_allocator family of functions
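
Once alignment requirements have been queried, checking or rounding a buffer address against them is plain integer arithmetic. The sketch below shows that arithmetic only. `is_aligned` and `align_up` are hypothetical helpers, and the alignment values in the usage note are made up for illustration, not taken from nvCOMP.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """Return True if `address` is a multiple of `alignment`."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return address % alignment == 0


def align_up(address: int, alignment: int) -> int:
    """Round `address` up to the next multiple of a power-of-two alignment."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    # Adding (alignment - 1) and clearing the low bits rounds upward.
    return (address + alignment - 1) & ~(alignment - 1)
```

For example, with an assumed requirement of 8 bytes, `align_up(13, 8)` gives 16, while an address of 16 is left unchanged.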

Bug Fixes#

Fixed a memory leak in the Python interface
Fixed a bug in the Snappy decompressor that caused off-by-one token counts
Made GDeflate compression more RFC-1951-compliant by always producing headers with at most 286 literal-length codelengths in dynamic Huffman mode
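
As background for the 286 limit above: in a plain RFC 1951 stream, a dynamic Huffman block stores the number of literal/length codes as a 5-bit HLIT field equal to the count minus 257, so valid counts range from 257 to 286 even though the field could encode up to 288. The sketch below reads that field from the first byte of a raw deflate block. The function name is hypothetical, and GDeflate's own stream layout differs and is not parsed here.

```python
from typing import Optional


def dynamic_block_litlen_count(raw_deflate: bytes) -> Optional[int]:
    """Return the literal/length code count if the first block is dynamic.

    Returns None when the first block is stored or uses fixed Huffman codes.
    """
    if not raw_deflate:
        raise ValueError("empty stream")
    first = raw_deflate[0]
    # Deflate packs fields starting at the least significant bit:
    # bit 0 = BFINAL, bits 1-2 = BTYPE, bits 3-7 = HLIT.
    btype = (first >> 1) & 0b11
    if btype != 0b10:
        return None
    hlit = (first >> 3) & 0b11111
    return hlit + 257


def litlen_count_is_valid(count: int) -> bool:
    """RFC 1951 allows between 257 and 286 literal/length codes."""
    return 257 <= count <= 286
```

A header byte of 0xED, for instance, decodes to a final dynamic block with HLIT = 29, which is the maximum valid count of 286.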

Performance Optimizations#

Significant speedup in Bitcomp decompression using nvcomp HLIF – 7-8x for very small files (speedup observed on H100, A100 and L40 GPUs), and 1.3-1.5x for larger files on some GPUs (speedup observed on L40).

Known issues#

Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream
Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values
The Bitcomp low-level batched decompression function is not fully asynchronous
Gzip low-level interface only provides decompression
The device API only supports the LZ4/ANS format

nvcomp 4.0.1#

New Features#

Removed hard dependency of nvCOMP on the CUDA driver (libcuda.so on Linux and nvcuda64.dll on Windows) being present on the system
Python API now throws exceptions upon encountering CUDA Driver API problems
Added support for large internal element counts (INT_MAX+) in Deflate/GDeflate’s Optimal Parse

Bug Fixes#

Fixed a bug in Deflate/Gzip which caused occasional data corruption during decompression

Known issues#

Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream
Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values
The Bitcomp low-level batched decompression function is not fully asynchronous
Gzip low-level interface only provides decompression
The device API only supports the LZ4/ANS format

nvcomp 4.0.0#

New features#

Python API
Replaced spdlog by culiblogger and fmt for logging
Changed Deflate/GDeflate compression levels; levels 0-5 are now supported
- Level 0: Huffman only, no LZ. Currently unsupported on Deflate.
- Level 1: Default, same as 3.0
- Level 2: Achieves compression ratios that exceed zlib level 1. Up to 27% better ratio than level 1
- Level 3: Placeholder, equivalent to level 2
- Level 4: achieves similar compression ratio to zlib level 6
- Level 5: achieves similar compression ratio to zlib level 9
HLIF can now work on batches, not only on a single buffer
HLIF can now compress data without chunking and without nvcomp header (with option to store just uncompressed size)
Merged libnvcomp* shared library files into a single libnvcomp file
Shared library files now include the major version in their names
Added LZ4 device-side API
Added “float16” mode to ANS for better ratios/performance with float16/bfloat16 data
Changed all low-level API function parameters to be named and documented more consistently
Updated many internal functions to use cuda::std::atomic values in place of volatile
ZSTD compression can now handle chunks up to (2GB - 1)

Bug Fixes#

Fixed a bug in the deflate decompressor which caused accuracy errors when copying uncompressed chunks
Fixed an HLIF encoding error when input data size is smaller than chunk size
Fixed a crash in cascaded compression for at least 2 delta passes and at least 1 RLE pass on highly compressible data
Fixed a runtime bug in LZ4 with multi-byte (e.g. int) data types
Fixed a runtime bug in Zstd which originated from a race condition during decoding
Fixed GPU buffer over-addressing problem in Bitcomp, LZ4, Snappy, and Zstd
Added HLIF constructors and functions without redundant device_id parameter
Fixed a case where the ANS HLIF was assuming device 0 for checking feature support
Fixed some cases where errors were logged to stdout, regardless of logging options

Performance Optimizations#

ZSTD Decompression up to 2x faster on T4, ~20% faster on H100 and others
Optimized Deflate/GDeflate Optimal Parse, up to ~10% faster on H100

Known issues#

Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream
Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values
The Bitcomp low-level batched decompression function is not fully asynchronous
Gzip low-level interface only provides decompression
The device API only supports the LZ4/ANS format
Deflate and Gzip might corrupt data during decompression. For the time being, an external checksum or CRC verification is recommended when using the low-level interface (LLIF), whereas the high-level interface (HLIF) can internally compute and verify checksums with the ComputeAndVerify checksum option.
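
The external verification recommended above for the LLIF can be as simple as recording a CRC32 of the original data and comparing it after decompression. The sketch below assumes the checksum is kept alongside the compressed data by the application. `verify_crc32` is a hypothetical helper, and zlib on the CPU stands in for the GPU compression and decompression steps.

```python
import zlib


def verify_crc32(decompressed: bytes, expected_crc: int) -> None:
    """Raise if the decompressed bytes do not match the recorded CRC32."""
    actual = zlib.crc32(decompressed)
    if actual != expected_crc:
        raise ValueError(
            f"CRC mismatch: expected {expected_crc:#010x}, got {actual:#010x}"
        )


def roundtrip_with_check(original: bytes) -> bytes:
    """Compress, decompress, and verify; zlib is a stand-in for the GPU path."""
    recorded_crc = zlib.crc32(original)     # computed before compression
    compressed = zlib.compress(original)    # stand-in for LLIF compression
    restored = zlib.decompress(compressed)  # stand-in for LLIF decompression
    verify_crc32(restored, recorded_crc)
    return restored
```

In a batched setting the same check would be applied per chunk, with one recorded checksum for each.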

nvcomp 3.0.6#

Bug Fixes#

Fixed a bug (introduced in 3.0.0) that resulted in ZSTD decompression errors.

nvcomp 3.0.5#

Bug Fixes#

Fixed a bug that caused compute-sanitizer memcheck failures in Snappy decompression.

nvcomp 3.0.4#

Bug Fixes#

Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.
Fixed a bug (introduced in 3.0.0) that caused incompatibility with CPU decompressors for ZSTD

nvcomp 3.0.3 (2023-10-06)#

Bug Fixes#

Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

nvcomp 3.0.2 (2023-08-28)#

Bug Fixes#

Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

nvcomp 3.0.1 (2023-08-08)#

Bug Fixes#

Remove unnecessary nvml dependency added in 3.0.0

nvcomp 3.0.0 (2023-07-03)#

New features#

Added nvcomp*RequiredAlignment constant variables for each compressor
Low-level batched functions now return nvcompErrorAlignment if device buffers aren’t sufficiently aligned
Added HLIF for ZSTD, Deflate. Updated HLIF design such that HLIF now dispatches to LLIF.
Introduced device-side API. Currently limited to the ANS format
Added support for logging using NVCOMP_LOG_LEVEL (0-5) and NVCOMP_LOG_FILE environment variables.
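
A minimal sketch of preparing the two logging environment variables for a child process is shown below. `nvcomp_logging_env` is a hypothetical helper; only the variable names and the 0-5 range come from the note above, and the meaning of each level is not specified here.

```python
import os
from typing import Dict, Mapping, Optional


def nvcomp_logging_env(
    level: int,
    log_file: Optional[str] = None,
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of an environment with the nvCOMP logging variables set."""
    if not isinstance(level, int) or not 0 <= level <= 5:
        raise ValueError("NVCOMP_LOG_LEVEL must be an integer from 0 to 5")
    env = dict(os.environ if base is None else base)
    env["NVCOMP_LOG_LEVEL"] = str(level)
    if log_file is not None:
        env["NVCOMP_LOG_FILE"] = log_file
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching an application that links against nvCOMP.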

Performance Optimizations#

Optimize zSTD decompression. Up to 2.2x faster on H100 and 1.5x faster on A100
Optimize LZ4 decompression. Up to 1.4x faster on H100 and 1.4x faster on A100.
Optimize Snappy decompression. Up to 1.3x faster on H100 and 1.9x faster on A100.
Optimize Bitcomp decompression (standard algo). Up to 2x faster and more consistent across datasets
Improve ZSTD compression ratio by up to 5% on 64 KB chunks, 30% on 512 KB chunks to closely match CPU L1 Compression.

nvcomp 2.6.1 (2023-02-03)#

Bug fixes#

Fixed a bug that caused non-deterministic decompression accuracy failures in ZSTD
Added support for Ada (sm89) GPUs
Fixed inconsistent compression stream format on some datasets when using GDeflate high-compression algorithm.

nvcomp 2.6.0 (2023-01-16)#

New features#

Added new nvcompBatched*CompressGetTempSizeEx API to allow less pessimistic scratch allocation requirement in many cases.
Further reduced zstd compression scratch requirement. For very large batches, in conjunction with the new extended API, the scratch allocation is now ~1.5x the total uncompressed size of the batch.

nvcomp 2.5.1 (2023-01-09)#

Bug fixes#

Improved GDeflate decompression throughput by up to 2x, fixing perf regression in 2.5.0
Fixed issue where some uses of CUB and Thrust in nvCOMP weren’t namespaced
Fixed bug, introduced in 2.5.0, in ZSTD decompression of large frames produced by the CPU compressor

nvcomp 2.5.0 (2022-12-16)#

New features#

Added Standard CRC32 support and its LLAPI.
Added Gzip batched decompression LL APIs, including APIs for getting the decompressed size.
Added independent bitcomp.h header to access full feature set of bitcomp compressor
Added doc directory in nvcomp package containing the documentation files
Increased zStandard maximum compression chunk size from 64 KB to 16 MB
Improved zStandard decompression throughput by up to 2x on small batches and 40% on large batches
Added nvcomp*CompressionMaxAllowedChunkSize constant variables for each compressor
Updated GDeflate stream format to make it compatible with the GDeflate compression standard in NVIDIA RTX IO and Microsoft DirectStorage 1.1.
Updated GDeflate to support 64 KB dictionary window which allows a higher compression ratio.
Updated GDeflate CPU implementation to use the open source libdeflate repo: https://github.com/NVIDIA/libdeflate
Added initial support for SM90

Bug fixes#

Fixed memcheck failure in Snappy compression
Fixed deflate compression issue related to very small chunk sizes
Fixed handling of zero-byte chunks in ANS, Bitcomp, Cascaded, Deflate, and GDeflate compressors
Fixed bug in Bitcomp where the maximum compressed size was slightly underestimated.

nvcomp 2.4.1 (2022-10-06)#

New features#

The Deflate batched decompression API can now accept nullptr for actual_decompressed_bytes.

Bug fixes#

Fixed incorrect behavior, failure, or crash when using duplicates feature (-x <count>) of the low-level “chunked” benchmarks.
Updated deflate_cpu_compression example to use the correct APIs.
The Deflate batched decompression API can now work on uncompressed data chunks larger than 64 KB.
Fixed correctness / stability issue in compute capability 6.1

nvcomp 2.4.0 (2022-09-23)#

New features#

Added support for ZSTD compression to LL API
Early Access Linux SBSA binaries.

Bug fixes#

Fixed issue where cascaded compressor bitpack wasn’t considering unsigned data type, causing suboptimal compression ratio
Fixed cmake problem where we stated wrong version compatibility

Performance Optimizations#

Optimized GDeflate high-compression mode. Up to 2x faster.
Optimized ZSTD decompression. Up to 1.2x faster.
Optimized Deflate decompression. Up to 1.5x faster.
Optimized ANS compression. Strong scaling allows for up to 7x higher compression and decompression throughput for files on the order of a few MB in size. Decompression throughput is improved by at least 20% on all tested files.

nvcomp 2.3.3 (2022-07-20)#

Bug Fixes#

Add missing nvcompBatchedDeflateDecompressGetTempSizeEx API
Fixed minor correctness issue in deflate compression.
Fixed cmake problem that caused an unnecessary implied cudart_static dependency

Performance Optimizations#

Optimized nvcompBatchedDeflateGetDecompressSizeAsync. Now 2-3x faster on A100.

nvcomp 2.3.2 (2022-06-24)#

Bug Fixes#

Fixed various bugs in ZSTD decompression implementation
Fixed an issue where Deflate-compressed output could not be correctly decompressed by zlib::inflate().

nvcomp 2.3.1 (2022-06-15)#

Bug Fixes#

Fixed various bugs in ZSTD decompression implementation
Fixed various bugs in ANS compression implementation
Fix hang in GDeflate high-compression mode for large files
Fix bug in library build that required dynamic link to cudart.

Interface Changes#

Added new API, nvcompBatched<Format>DecompressGetTempSizeEx(). This provides an optional capability for providing the total decompressed size to the API, which for some formats can dramatically reduce the required temp size.

nvcomp 2.3.0 (2022-04-29)#

New features#

Support ZSTD decompression in the LLIF
Deflate support (RFC 1951)
Modified-CRC32 checksum support added to HLIF. Includes optional verification of HLIF-compressed buffers intended for error detection

Bug fixes#

Added Pascal GPU architecture support for all compressors

Performance Optimizations#

Performance optimizations in ANS compression / decompression, leading to ~100% speedup in compression and ~50% speedup in decompression
Developed algorithmic improvements to GDeflate’s high-compression mode. This is now 30-40x faster on average while producing the same output as the previous version

Infrastructure#

Improvements to the benchmarking interface for LLIF – common argument APIs

nvcomp 2.2.0 (2022-02-07)#

New features#

Entropy-only mode for GDeflate
New high-level interface
Windows support
Support for GPU-accelerated ANS

Interface Changes#

High-level interface#

High-level interface is now standardized across compressor formats.
This interface provides a single nvcompManagerBase object that can do compression and decompression. Users can now decompress nvcomp-compressed files without knowing how they were compressed. The interface also can manage scratch space and splitting the input buffer into independent chunks for parallel processing.
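
The idea of splitting an input buffer into independent chunks can be illustrated on the CPU. The sketch below is not the nvcompManagerBase API. The function names are hypothetical, and zlib stands in for the GPU compressor purely to show that each chunk is compressed and decompressed without reference to the others.

```python
import zlib
from typing import List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut `data` into consecutive pieces of at most `chunk_size` bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def compress_chunks(chunks: List[bytes]) -> List[bytes]:
    """Compress every chunk independently (zlib as a CPU stand-in)."""
    return [zlib.compress(chunk) for chunk in chunks]


def decompress_chunks(compressed: List[bytes]) -> bytes:
    """Decompress each chunk on its own and concatenate the results."""
    return b"".join(zlib.decompress(chunk) for chunk in compressed)
```

Because no chunk depends on another, the per-chunk work can run in parallel, which is what makes this layout a good fit for a GPU.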

API Consolidation#

nvCOMP now supports only the low-level batch API and the new high-level interface

nvcomp 2.1.0 (2021-10-28)#

New features#

New release of low-level batched API for Cascaded and Bitcomp methods.
New high-throughput and high-compression-ratio GPU compressors in GDeflate

Interface Changes#

Update batched/low-level compression interfaces to take an options parameter, to allow configuring future compression algorithms.
Update batched/low-level decompression interfaces to output the decompressed size (or 0 if an error occurs).
Add bounds checking to batched/low-level decompression routines, such that if an invalid compressed data stream is provided, 0 will be written for the output size, rather than generating an illegal memory access.
Fix LZ4 to support chunk sizes < 32 KB.
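
Since a decompressed size of 0 signals an error in the batched decompression interfaces described above, a caller can scan the reported sizes after copying them back to the host. The sketch below assumes the sizes are already available as a Python list and that the expected size of each chunk is known. `failed_chunk_indices` is a hypothetical helper.

```python
from typing import List, Sequence


def failed_chunk_indices(
    reported_sizes: Sequence[int], expected_sizes: Sequence[int]
) -> List[int]:
    """Return indices of chunks that reported size 0 but expected output.

    Chunks whose expected size is 0 are not flagged, because an empty chunk
    legitimately decompresses to zero bytes.
    """
    if len(reported_sizes) != len(expected_sizes):
        raise ValueError("size lists must have the same length")
    return [
        index
        for index, (reported, expected) in enumerate(
            zip(reported_sizes, expected_sizes)
        )
        if reported == 0 and expected != 0
    ]
```

For example, reported sizes `[4096, 0, 0]` against expected sizes `[4096, 4096, 0]` flag only index 1.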

Performance Optimizations#

Improve performance of Snappy compression by ~10% in some configurations.
Add an optimization to the LZ4 compressor based on specification of input data as char, short, or int, rather than just treating the input as raw bytes.
Optimization to reduce the LZ hash table size when compressing smaller chunks.
Improved compression performance in GDeflate with the high-throughput option
Improved decompression performance in GDeflate (10-75% depending on the dataset)

Bug Fixes#

Fix LZ4 CPU compression example.
Fix temp allocation size bug in benchmark_template_chunked.

Infrastructure#

Update CMakeLists to compile nvcomp with -fPIC enabled.
Add a new script for benchmarking compression algorithms.
Add unit tests for the Snappy decompressor that test decompression of validly formatted files that the nvcomp compressor would not generate due to its configuration.
Update CMakeLists to suppress warnings about missing nvcomp external dependencies when the user didn’t indicate they wanted to include them.
Update CMakeLists to allow install into include folder that the user does not have ownership of.

nvcomp 2.0.2 (2021-06-30)#

Add example lz4_cpu_decompression to compress on the GPU with nvCOMP and decompress on the CPU with liblz4.
Add CMake option for building a static library.
Fix bug in LZ4 compression kernel to comply with LZ4 end of block restrictions.
Fix temp allocation size bug in benchmark_lz4_chunked.

nvcomp 2.0.1 (2021-06-08)#

Improve CMake setup for using nvCOMP as a submodule. This includes marking dependencies as PRIVATE, and adding options for building examples, tests, and benchmarks (e.g., -DBUILD_EXAMPLES=ON, -DBUILD_TESTS=ON, and -DBUILD_BENCHMARKS=ON).
Fix double free error in benchmark_snappy_synth.
Fix copy direction in Cascaded compression when the output size is on the GPU.
Improve testing coverage.
Mark the generic decompression interfaces defined in include/nvcomp.h as deprecated.

nvcomp 2.0.0 (2021-04-28)#

Replace previous C and C++ APIs.
Added Snappy compression (batched interface).
Added support for using Bitcomp and GDeflate external compressors.
Added /examples folder demonstrating use cases that interface with CPU implementations of LZ4 and GDeflate, as well as GPU Direct Storage.
Improve support for Windows in benchmark implementations.
Made usage of std::uniform_int_distribution<> in the benchmarks conform to the C++14 standard.
Fix issue in Cascaded compression when using the default configuration (‘auto’), for small inputs.

nvcomp 1.2.3 (2021-04-07)#

Fix bug in LZ4 compression kernel for the Pascal architecture.

nvcomp 1.2.2 (2021-02-08)#

Fix linking errors in Clang++.
Fix error being incorrectly returned by Cascaded compression when output memory was initialized to all -1’s.
Fix C++17 style static assert.
Fix prematurely freeing memory in Cascaded compression.
Fix input format and usage messaging for benchmarks.

nvcomp 1.2.1 (2020-12-21)#

Fix compile error and unit tests for cascaded selector.

nvcomp 1.2.0 (2020-12-19)#

Add the Cascaded Selector and Cascaded Auto set of interfaces for automatically configuring cascaded compression.
Generally improve error handling and messaging.
Update CMake configuration to support CCache.

nvcomp 1.1.1 (2020-12-02)#

Add all-gather benchmark.
Add sm80 target if CUDA version is 11 or greater.

nvcomp 1.1.0 (2020-10-05)#

Add batch C interface for LZ4, allowing compressing/decompressing multiple inputs at once.
Significantly improve performance of LZ4 compression.

nvcomp 1.0.2 (2020-08-12)#

Fix metadata freeing for LZ4, to avoid possible mismatch of new[] and delete.

nvcomp 1.0.1 (2020-08-07)#

Fixed naming of nvcompLZ4CompressX functions in include/lz4.h, to have the nvcomp prefix.
Changed CascadedMetadata::Header struct initialization to work around internal compiler error.

nvcomp 1.0.0 (2020-07-31)#

Initial public release.