Release notes#

nvcomp 4.2.0#

New features#

  • Added support for Blackwell HW Decompress Engine for Snappy, Gzip, and Deflate

  • Deflate and GDeflate compression now support chunk sizes larger than 64KB

Bug Fixes#

  • Fixed issue in ZSTD compression that resulted in “unspecified launch error” when presented with very small buffers

  • Fixed failure cases in which the HLIF did not raise an exception

Known issues#

  • Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream

  • Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values

  • The Bitcomp low-level batched decompression function is not fully asynchronous

  • Gzip low-level interface only provides decompression

  • The device API only supports the LZ4/ANS format

  • Zstd decompression fails when decompressing buffers compressed with compression level 18 and higher using the zstd library version 1.5.6. To work around the problem temporarily, you can provide 1.5x the scratch required by nvcompBatchedZstdDecompressGetTempSize to nvcompBatchedZstdDecompressAsync. Please file an nvBug.

  • nvCOMP C++ APIs on Linux can only be used with GCC >=9.x compilers
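
The Zstd workaround in the list above comes down to padding the reported scratch size by 1.5x before allocating. The following is a minimal sketch of that arithmetic only, assuming the required size has already been obtained from nvcompBatchedZstdDecompressGetTempSize. The helper name `padded_temp_bytes` and the rounding to a multiple of 8 bytes are illustrative assumptions, not part of nvCOMP.

```python
def padded_temp_bytes(required_bytes: int, granularity: int = 8) -> int:
    """Return 1.5x required_bytes, rounded up to a multiple of granularity.

    Hypothetical helper: required_bytes stands for the value reported by
    nvcompBatchedZstdDecompressGetTempSize. The granularity is an assumption
    chosen for illustration, not an nvCOMP requirement.
    """
    if required_bytes < 0 or granularity <= 0:
        raise ValueError("sizes must be non-negative and granularity positive")
    # Integer ceiling of required_bytes * 3 / 2 (avoids float rounding).
    scaled = (required_bytes * 3 + 1) // 2
    # Round up to the next multiple of granularity.
    return (scaled + granularity - 1) // granularity * granularity


# Example: a reported requirement of 1000 bytes becomes a 1504-byte allocation.
temp_bytes = padded_temp_bytes(1000)
```

The padded value would then be used both for the allocation and as the temp size passed to nvcompBatchedZstdDecompressAsync.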

nvcomp 4.1.1#

Bug Fixes#

  • nvCOMP ZSTD compression exhibited failures / data corruption in the unlikely case where a ZSTD block contained only zero literals. Fixed by adding RLE literal support as required by the format.

  • Fixed bug in Deflate and Gzip uncompressed data size computation when non-compressed blocks (btype=00) were present in the deflate stream (compressed data).

nvcomp 4.1.0#

New features#

  • Fine-grained LLIF buffer alignment querying through nvcompBatched<alg>CompressGetRequiredAlignments and nvcompBatched<alg>DecompressRequiredAlignments

  • Enabled level 0 compression (Huffman only) for Deflate

  • Custom-allocator support in the Python interface through the set_*_allocator family of functions
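
Alignment requirements such as those reported by the query functions above are usually applied in one of two ways: checking a buffer address, or rounding an offset up inside a larger allocation. The sketch below shows both on plain integers. The helper names and the example alignment of 8 bytes are assumptions for illustration; the real values come from the library.

```python
def is_aligned(address: int, alignment: int) -> bool:
    """Return True if address is a multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return address % alignment == 0


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (offset + alignment - 1) // alignment * alignment


# Hypothetical requirement of 8 bytes for an input buffer.
required = 8
next_chunk_offset = align_up(4100, required)  # 4104
```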

Bug Fixes#

  • Fixed a memory leak in the Python interface

  • Fixed a bug in the Snappy decompressor that caused off-by-one token counts

  • Made GDeflate compression more RFC-1951-compliant by always producing headers with at most 286 literal-length codelengths in dynamic Huffman mode

Performance Optimizations#

  • Significant speedup in Bitcomp decompression using nvcomp HLIF – 7-8x for very small files (speedup observed on H100, A100 and L40 GPUs), and 1.3-1.5x for larger files on some GPUs (speedup observed on L40).

Known issues#

  • Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream

  • Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values

  • The Bitcomp low-level batched decompression function is not fully asynchronous

  • Gzip low-level interface only provides decompression

  • The device API only supports the LZ4/ANS format

nvcomp 4.0.1#

New Features#

  • Removed hard dependency of nvCOMP on the CUDA driver (libcuda.so on Linux and nvcuda64.dll on Windows) being present on the system

  • Python API now throws exceptions upon encountering CUDA Driver API problems

  • Added support for large internal element counts (INT_MAX+) in Deflate/GDeflate’s Optimal Parse

Bug Fixes#

  • Fixed a bug in Deflate/Gzip which caused occasional data corruption during decompression

Known issues#

  • Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream

  • Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values

  • The Bitcomp low-level batched decompression function is not fully asynchronous

  • Gzip low-level interface only provides decompression

  • The device API only supports the LZ4/ANS format

nvcomp 4.0.0#

New features#

  • Python API

  • Replaced spdlog with culiblogger and fmt for logging

  • Changed Deflate/GDeflate compression modes; levels 0-5 are now supported

    • Level 0: Huffman only, no LZ. Currently unsupported on Deflate.

    • Level 1: Default, same as 3.0

    • Level 2: Achieves compression ratios that exceed zlib level 1. Up to 27% better ratio than level 1

    • Level 3: Placeholder, equivalent to level 2

    • Level 4: Achieves similar compression ratio to zlib level 6

    • Level 5: Achieves similar compression ratio to zlib level 9

  • HLIF can now work on batches, not only on a single buffer

  • HLIF can now compress data without chunking and without nvcomp header (with option to store just uncompressed size)

  • Merged libnvcomp* shared library files into a single libnvcomp file

  • Shared library file names now include the major version

  • Added LZ4 device-side API

  • Added “float16” mode to ANS for better ratios/performance with float16/bfloat16 data

  • Changed all low-level API function parameters to be named and documented more consistently

  • Updated many internal functions to use cuda::std::atomic values in place of volatile

  • ZSTD compression can now handle chunks up to (2GB - 1)

Bug Fixes#

  • Fixed a bug in the deflate decompressor which caused accuracy errors when copying uncompressed chunks

  • Fixed an HLIF encoding error when input data size is smaller than chunk size

  • Fixed a crash in cascaded compression when using at least 2 delta passes and at least 1 RLE pass on highly compressible data

  • Fixed a runtime bug in LZ4 with multi-byte (e.g. int) data types

  • Fixed a runtime bug in Zstd which originated from a race condition during decoding

  • Fixed GPU buffer over-addressing problem in Bitcomp, LZ4, Snappy, and Zstd

  • Added HLIF constructors and functions without redundant device_id parameter

  • Fixed a case where the ANS HLIF was assuming device 0 for checking feature support

  • Fixed some cases where errors were logged to stdout, regardless of logging options

Performance Optimizations#

  • ZSTD Decompression up to 2x faster on T4, ~20% faster on H100 and others

  • Optimized Deflate/GDeflate Optimal Parse, up to ~10% faster on H100

Known issues#

  • Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream

  • Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values

  • The Bitcomp low-level batched decompression function is not fully asynchronous

  • Gzip low-level interface only provides decompression

  • The device API only supports the LZ4/ANS format

  • Deflate and GZip might corrupt data during decompression. For the time being, while using the low-level interface (LLIF), an external checksum or CRC verification is recommended, whereas the high-level interface (HLIF) can internally compute and verify checksums with the ComputeAndVerify checksum option.
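
The external checksum recommended above for the LLIF can be as simple as recording a CRC32 per chunk before compression and comparing it after decompression. The sketch below does this on the host with Python's `zlib.crc32`. The CPU `zlib` round trip is only a stand-in for the GPU codec, and the helper names are hypothetical.

```python
import zlib


def chunk_checksums(chunks):
    """Compute a CRC32 for every uncompressed chunk."""
    return [zlib.crc32(chunk) & 0xFFFFFFFF for chunk in chunks]


def find_corrupt_chunks(expected_crcs, decompressed_chunks):
    """Return indices of chunks whose CRC32 no longer matches."""
    return [
        index
        for index, (crc, chunk) in enumerate(zip(expected_crcs, decompressed_chunks))
        if zlib.crc32(chunk) & 0xFFFFFFFF != crc
    ]


original = [b"alpha" * 100, b"beta" * 100, b"gamma" * 100]
crcs = chunk_checksums(original)

# Stand-in for a compress/decompress round trip (assumption: CPU zlib here).
round_tripped = [zlib.decompress(zlib.compress(chunk)) for chunk in original]

# Simulate corruption of the last byte of chunk 1.
damaged = list(round_tripped)
damaged[1] = damaged[1][:-1] + b"X"

clean_report = find_corrupt_chunks(crcs, round_tripped)
damaged_report = find_corrupt_chunks(crcs, damaged)
```

Storing the checksums alongside the compressed chunks keeps verification independent of the decompressor that produced the output.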

nvcomp 3.0.6#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that resulted in ZSTD decompression errors.

nvcomp 3.0.5#

Bug Fixes#

  • Fixed a bug that caused compute-sanitizer memcheck failures in Snappy decompression.

nvcomp 3.0.4#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

  • Fixed a bug (introduced in 3.0.0) that caused incompatibility with CPU decompressors for ZSTD

nvcomp 3.0.3 (2023-10-06)#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

nvcomp 3.0.2 (2023-08-28)#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

nvcomp 3.0.1 (2023-08-08)#

Bug Fixes#

  • Remove unnecessary nvml dependency added in 3.0.0

nvcomp 3.0.0 (2023-07-03)#

New features#

  • Added nvcomp*RequiredAlignment constant variables for each compressor

  • Low-level batched functions now return nvcompErrorAlignment if device buffers aren’t sufficiently aligned

  • Added HLIF for ZSTD and Deflate. Updated HLIF design such that HLIF now dispatches to LLIF.

  • Introduced device-side API. Currently limited to the ANS format

  • Added support for logging using NVCOMP_LOG_LEVEL (0-5) and NVCOMP_LOG_FILE environment variables.
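
The logging variables above are read from the process environment, so one way to enable logging for a single run is to build the child environment explicitly. This is a sketch only: the helper name `nvcomp_log_env` and the log file name are hypothetical, and it assumes the variables are set before the application that loads nvCOMP starts.

```python
import os


def nvcomp_log_env(level: int, log_file: str, base_env=None) -> dict:
    """Return a copy of the environment with nvCOMP logging variables set."""
    if not 0 <= level <= 5:
        raise ValueError("NVCOMP_LOG_LEVEL must be in the range 0-5")
    # Copy so the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["NVCOMP_LOG_LEVEL"] = str(level)
    env["NVCOMP_LOG_FILE"] = log_file
    return env


# Built from an empty base here so the result is easy to inspect.
env = nvcomp_log_env(3, "nvcomp.log", base_env={})
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the application.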

Performance Optimizations#

  • Optimize zSTD decompression. Up to 2.2x faster on H100 and 1.5x faster on A100

  • Optimize LZ4 decompression. Up to 1.4x faster on H100 and 1.4x faster on A100.

  • Optimize Snappy decompression. Up to 1.3x faster on H100 and 1.9x faster on A100.

  • Optimize Bitcomp decompression (standard algo). Up to 2x faster and more consistent across datasets

  • Improve ZSTD compression ratio by up to 5% on 64 KB chunks, 30% on 512 KB chunks to closely match CPU L1 Compression.

nvcomp 2.6.1 (2023-02-03)#

Bug fixes#

  • Fixed a bug that caused non-deterministic decompression accuracy failures in ZSTD

  • Added support for Ada (sm89) GPUs

  • Fixed inconsistent compression stream format on some datasets when using GDeflate high-compression algorithm.

nvcomp 2.6.0 (2023-01-16)#

New features#

  • Added new nvcompBatched*CompressGetTempSizeEx API to allow less pessimistic scratch allocation requirement in many cases.

  • Further reduced zstd compression scratch requirement. For very large batches, in conjunction with the new extended API, the scratch allocation is now ~1.5x the total uncompressed size of the batch.

nvcomp 2.5.1 (2023-01-09)#

Bug fixes#

  • Improved GDeflate decompression throughput by up to 2x, fixing perf regression in 2.5.0

  • Fixed issue where some uses of CUB and Thrust in nvCOMP weren’t namespaced

  • Fixed bug, introduced in 2.5.0, in ZSTD decompression of large frames produced by the CPU compressor

nvcomp 2.5.0 (2022-12-16)#

New features#

  • Added Standard CRC32 support and its LLAPI.

  • Added Gzip batched decompression LL APIs, including APIs for getting the decompressed size.

  • Added independent bitcomp.h header to access full feature set of bitcomp compressor

  • Added doc directory in nvcomp package containing the documentation files

  • Increased zStandard maximum compression chunk size from 64 KB to 16 MB

  • Improved zStandard decompression throughput by up to 2x on small batches and 40% on large batches

  • Added nvcomp*CompressionMaxAllowedChunkSize constant variables for each compressor

  • Updated GDeflate stream format to make it compatible with the GDeflate compression standard in NVIDIA RTX IO and Microsoft DirectStorage 1.1.

  • Updated GDeflate to support 64 KB dictionary window which allows a higher compression ratio.

  • Updated GDeflate CPU implementation to use the open source libdeflate repo: https://github.com/NVIDIA/libdeflate

  • Added initial support for SM90

Bug fixes#

  • Fixed memcheck failure in Snappy compression

  • Fixed deflate compression issue related to very small chunk sizes

  • Fixed handling of zero-byte chunks in ANS, Bitcomp, Cascaded, Deflate, and GDeflate compressors

  • Fixed bug in Bitcomp where the maximum compressed size was slightly underestimated.

nvcomp 2.4.1 (2022-10-06)#

New features#

  • The Deflate batched decompression API can now accept nullptr for actual_decompressed_bytes.

Bug fixes#

  • Fixed incorrect behavior, failure, or crash when using duplicates feature (-x <count>) of the low-level “chunked” benchmarks.

  • Updated deflate_cpu_compression example to use the correct APIs.

  • The Deflate batched decompression API can now work on uncompressed data chunks larger than 64KB.

  • Fixed correctness / stability issue in compute capability 6.1

nvcomp 2.4.0 (2022-09-23)#

New features#

  • Added support for ZSTD compression to LL API

  • Early Access Linux SBSA binaries.

Bug fixes#

  • Fixed issue where cascaded compressor bitpack wasn’t considering unsigned data type, causing suboptimal compression ratio

  • Fixed a CMake problem where the wrong version compatibility was stated

Performance Optimizations#

  • Optimized GDeflate high-compression mode. Up to 2x faster.

  • Optimized ZSTD decompression. Up to 1.2x faster.

  • Optimized Deflate decompression. Up to 1.5x faster.

  • Optimized ANS compression. Strong scaling allows for up to 7x higher compression and decompression throughput for files on the order of a few MB in size. Decompression throughput is improved by at least 20% on all tested files.

nvcomp 2.3.3 (2022-07-20)#

Bug Fixes#

  • Add missing nvcompBatchedDeflateDecompressGetTempSizeEx API

  • Fixed minor correctness issue in deflate compression.

  • Fixed cmake problem that caused an unnecessary implied cudart_static dependency

Performance Optimizations#

  • Optimized nvcompBatchedDeflateGetDecompressSizeAsync. Now 2-3x faster on A100.

nvcomp 2.3.2 (2022-06-24)#

Bug Fixes#

  • Fixed various bugs in ZSTD decompression implementation

  • Fixed an issue where Deflate-compressed output could not be decompressed correctly by zlib::inflate().

nvcomp 2.3.1 (2022-06-15)#

Bug Fixes#

  • Fixed various bugs in ZSTD decompression implementation

  • Fixed various bugs in ANS compression implementation

  • Fix hang in GDeflate high-compression mode for large files

  • Fix bug in library build that required dynamic link to cudart.

Interface Changes#

  • Added new API, nvcompBatched<Format>DecompressGetTempSizeEx(). This provides an optional capability for providing the total decompressed size to the API, which for some formats can dramatically reduce the required temp size.

nvcomp 2.3.0 (2022-04-29)#

New features#

  • Support ZSTD decompression in the LLIF

  • Deflate support (RFC 1951)

  • Modified-CRC32 checksum support added to HLIF. Includes optional verification of HLIF-compressed buffers intended for error detection

Bug fixes#

  • Added Pascal GPU architecture support for all compressors

Performance Optimizations#

  • Performance optimizations in ANS compression / decompression, leading to ~100% speedup in compression and ~50% speedup in decompression

  • Developed algorithmic improvements to GDeflate’s high-compression mode. This is now 30-40x faster on average while producing the same output as the previous version

Infrastructure#

  • Improvements to the benchmarking interface for LLIF – common argument APIs

nvcomp 2.2.0 (2022-02-07)#

New features#

  • Entropy-only mode for GDeflate

  • New high-level interface

  • Windows support

  • Support for GPU-accelerated ANS

Interface Changes#

High-level interface#

  • High-level interface is now standardized across compressor formats.

  • This interface provides a single nvcompManagerBase object that can do compression and decompression. Users can now decompress nvcomp-compressed files without knowing how they were compressed. The interface also can manage scratch space and splitting the input buffer into independent chunks for parallel processing.

API Consolidation#

  • nvCOMP now supports only the low-level batch API and the new high-level interface

nvcomp 2.1.0 (2021-10-28)#

New features#

  • New release of low-level batched API for Cascaded and Bitcomp methods.

  • New high-throughput and high-compression-ratio GPU compressors in GDeflate

Interface Changes#

  • Update batched/low-level compression interfaces to take an options parameter, to allow configuring future compression algorithms.

  • Update batched/low-level decompression interfaces to output the decompressed size (or 0 if an error occurs).

  • Add bounds checking to batched/low-level decompression routines, such that if an invalid compressed data stream is provided, 0 will be written for the output size, rather than generating an illegal memory access.

  • Fix LZ4 to support chunk sizes < 32 KB.
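
Because the batched decompression routines described above report a size of 0 for a chunk that could not be decoded, a caller can scan the reported sizes once they have been copied back to the host. The sketch below works on a plain Python list. The helper name `failed_chunks` is hypothetical, and it assumes no chunk is legitimately empty, since an empty chunk would also report 0.

```python
def failed_chunks(decompressed_sizes):
    """Return indices of chunks whose reported decompressed size is 0."""
    return [index for index, size in enumerate(decompressed_sizes) if size == 0]


# Example sizes as they might look after a host copy; chunk 1 failed.
sizes = [65536, 0, 65536, 1024]
bad = failed_chunks(sizes)
```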

Performance Optimizations#

  • Improve performance of Snappy compression by ~10% in some configurations.

  • Add an optimization to the LZ4 compressor based on specification of input data as char, short, or int, rather than just treating the input as raw bytes.

  • Optimization to reduce the LZ hash table size when compressing smaller chunks.

  • Improved compression performance in GDeflate with the high-throughput option

  • Improved decompression performance in GDeflate (10-75% depending on the dataset)

Bug Fixes#

  • Fix LZ4 CPU compression example.

  • Fix temp allocation size bug in benchmark_template_chunked.

Infrastructure#

  • Update CMakeLists to compile nvcomp with -fPIC enabled.

  • Add a new script for benchmarking compression algorithms.

  • Add unit tests for the Snappy decompressor that test decompression of legally formatted files that won’t be generated by the nvcomp compressor due to configuration.

  • Update CMakeLists to suppress warnings about missing nvcomp external dependencies when the user didn’t indicate they wanted to include them.

  • Update CMakeLists to allow install into include folder that the user does not have ownership of.

nvcomp 2.0.2 (2021-06-30)#

  • Add example lz4_cpu_decompression to compress on the GPU with nvCOMP and decompress on the CPU with liblz4.

  • Add CMake option for building a static library.

  • Fix bug in LZ4 compression kernel to comply with LZ4 end of block restrictions.

  • Fix temp allocation size bug in benchmark_lz4_chunked.

nvcomp 2.0.1 (2021-06-08)#

  • Improve CMake setup for using nvCOMP as a submodule. This includes marking dependencies as PRIVATE, and adding options for building examples, tests, and benchmarks (e.g., -DBUILD_EXAMPLES=ON, -DBUILD_TESTS=ON, and -DBUILD_BENCHMARKS=ON).

  • Fix double free error in benchmark_snappy_synth.

  • Fix copy direction in Cascaded compression when the output size is on the GPU.

  • Improve testing coverage.

  • Mark the generic decompression interfaces defined in include/nvcomp.h as deprecated.

nvcomp 2.0.0 (2021-04-28)#

  • Replace previous C and C++ APIs.

  • Added Snappy compression (batched interface).

  • Added support for using Bitcomp and GDeflate external compressors.

  • Added /examples folder demonstrating use cases that interface with CPU implementations of LZ4 and GDeflate, as well as GPU Direct Storage.

  • Improve support for Windows in benchmark implementations.

  • Made usage of std::uniform_int_distribution<> in the benchmarks conform to the C++14 standard.

  • Fix issue in Cascaded compression when using the default configuration (‘auto’), for small inputs.

nvcomp 1.2.3 (2021-04-07)#

  • Fix bug in LZ4 compression kernel for the Pascal architecture.

nvcomp 1.2.2 (2021-02-08)#

  • Fix linking errors in Clang++.

  • Fix error being incorrectly returned by Cascaded compression when output memory was initialized to all -1’s.

  • Fix C++17 style static assert.

  • Fix prematurely freeing memory in Cascaded compression.

  • Fix input format and usage messaging for benchmarks.

nvcomp 1.2.1 (2020-12-21)#

  • Fix compile error and unit tests for cascaded selector.

nvcomp 1.2.0 (2020-12-19)#

  • Add the Cascaded Selector and Cascaded Auto set of interfaces for automatically configuring cascaded compression.

  • Generally improve error handling and messaging.

  • Update CMake configuration to support CCache.

nvcomp 1.1.1 (2020-12-02)#

  • Add all-gather benchmark.

  • Add sm80 target if CUDA version is 11 or greater.

nvcomp 1.1.0 (2020-10-05)#

  • Add batch C interface for LZ4, allowing compressing/decompressing multiple inputs at once.

  • Significantly improve performance of LZ4 compression.

nvcomp 1.0.2 (2020-08-12)#

  • Fix metadata freeing for LZ4, to avoid possible mismatch of new[] and delete.

nvcomp 1.0.1 (2020-08-07)#

  • Fixed naming of nvcompLZ4CompressX functions in include/lz4.h, to have the nvcomp prefix.

  • Changed CascadedMetadata::Header struct initialization to work around internal compiler error.

nvcomp 1.0.0 (2020-07-31)#

  • Initial public release.