Release notes#

nvcomp 4.0.1#

New Features#

  • Removed hard dependency of nvCOMP on the CUDA driver (libcuda.so on Linux and nvcuda64.dll on Windows) being present on the system

  • Python API now throws exceptions upon encountering CUDA Driver API problems

  • Added support for large internal element counts (INT_MAX+) in Deflate/GDeflate’s Optimal Parse

Bug Fixes#

  • Fixed a bug in Deflate/Gzip which caused occasional data corruption during decompression

Known issues#

  • Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream

  • Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values

  • The Bitcomp low-level batched decompression function is not fully asynchronous

  • Gzip low-level interface only provides decompression

  • The device API only supports the LZ4/ANS format

nvcomp 4.0.0#

New features#

  • Python API

  • Replaced spdlog by culiblogger and fmt for logging

  • Changed deflate/gdeflate compression modes, now support 0-5

    • Level 0: Huffman only, no LZ. Currently unsupported on Deflate.

    • Level 1: Default, same as 3.0

    • Level 2: Achieves compression ratios that exceed zlib level 1. Up to 27% better ratio than level 1

    • Level 3: Placeholder, equivalent to level 2

    • Level 4: achieves similar compression ratio to zlib level 6

    • Level 5: achieves similar compression ratio to zlib level 9

  • HLIF can now work on batches, not only on a single buffer

  • HLIF can now compress data without chunking and without nvcomp header (with option to store just uncompressed size)

  • Merged libnvcomp* shared library files into a single libnvcomp file

  • Shared library files have now major version in the name

  • Added LZ4 device-side API

  • Added “float16” mode to ANS for better ratios/performance with float16/bfloat16 data

  • Changed all low-level API function parameters to be named and documented more consistently

  • Updated many internal functions to use cuda::std::atomic values in place of volatile

  • ZSTD compression can now handle chunks up to (2GB - 1)

Bug Fixes#

  • Fixed a bug in the deflate decompressor which caused accuracy errors when copying uncompressed chunks

  • Fixed an HLIF encoding error when input data size is smaller than chunk size

  • Fixed a crash in cascaded compression for at least 2 delta passes and at least 1 RLE pass on highly compressible data

  • Fixed a runtime bug in LZ4 with multi-btye (e.g. int) data types

  • Fixed a runtime bug in Zstd which originated from a race condition during decoding

  • Fixed GPU buffer over-addressing problem in Bitcomp, LZ4, Snappy, and Zstd

  • Added HLIF constructors and functions without redundant device_id parameter

  • Fixed a case where the ANS HLIF was assuming device 0 for checking feature support

  • Fixed some cases where errors were logged to stdout, regardless of logging options

Performance Optimizations#

  • ZSTD Decompression up to 2x faster on T4, ~20% faster on H100 and others

  • Optimized Deflate/GDeflate Optimal Parse, up to ~10% faster on H100

Known issues#

  • Cascaded, GDeflate, zStandard, Deflate, Gzip and Bitcomp decompressors can only operate on valid input data (data that was compressed using the same compressor). Other decompressors can sometimes detect errors in the compressed stream

  • Cascaded, zStandard and Bitcomp batched decompression C APIs cannot currently accept nullptr for actual_decompressed_bytes or device_statuses values. Deflate and Gzip cannot accept nullptr for device_statuses values

  • The Bitcomp low-level batched decompression function is not fully asynchronous

  • Gzip low-level interface only provides decompression

  • The device API only supports the LZ4/ANS format

  • Deflate and GZip might corrupt data during decompression. For the time being, while using the low-level interface (LLIF), an external checksum or CRC verification is recommended, whereas the high-level interface (HLIF) can internally compute and verify checksums with the ComputeAndVerify checksum option.

nvcomp 3.0.6#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that resulted in ZSTD decompression errors

nvcomp 3.0.5#

Bug Fixes#

  • Fixed a bug that caused compute-sanitizer memcheck failures in Snappy decompression

nvcomp 3.0.6#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that resulted in ZSTD decompression errors.

nvcomp 3.0.5#

Bug Fixes#

  • Fixed a bug that caused compute-sanitizer memcheck failures in Snappy decompression.

nvcomp 3.0.4#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

  • Fixed a bug (introduced in 3.0.0) that caused incompatibility with CPU decompressors for ZSTD

nvcomp 3.0.3 (2023-10-06)#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

nvcomp 3.0.2 (2023-08-28)#

Bug Fixes#

  • Fixed a bug (introduced in 3.0.0) that caused incorrect snappy decompression in some cases.

nvcomp 3.0.1 (2023-08-08)#

Bug Fixes#

  • Remove unnecessary nvml dependency added in 3.0.0

nvcomp 3.0.0 (2023-07-03)#

New features#

  • Added nvcomp*RequiredAlignment constant variables for each compressor

  • Low-level batched functions now return nvcompErrorAlignment if device buffers aren’t sufficiently aligned

  • Added HLIF for ZSTD, Deflate. Updated HLIF design such that HLIF now dispatches to LLIF.

  • Introduced device-side API. Currently limited to the ANS format

  • Added support for logging using NVCOMP_LOG_LEVEL (0-5) and NVCOMP_LOG_FILE environment variables.

Performance Optimizations#

  • Optimize zSTD decompression. Up to 2.2x faster on H100 and 1.5x faster on A100

  • Optimize LZ4 decompression. Up to 1.4x faster on H100 and 1.4x faster on A100.

  • Optimize Snappy decompression. Up to 1.3x faster on H100 and 1.9x faster on A100.

  • Optimize Bitcomp decompression (standard algo). Up to 2x faster and more consistent accross datasets

  • Improve ZSTD compression ratio by up to 5% on 64 KB chunks, 30% on 512 KB chunks to closely match CPU L1 Compression.

nvcomp 2.6.1 (2023-02-03)#

Bug fixes#

  • Fixed a bug that caused non-deterministic decompression accuracy failures in ZSTD

  • Added support for Ada (sm89) GPUs

  • Fixed inconsistent compression stream format on some datasets when using GDeflate high-compression algorithm.

nvcomp 2.6.0 (2023-01-16)#

New features#

  • Added new nvcompBatched*CompressGetTempSizeEx API to allow less pessimistic scratch allocation requirement in many cases.

  • Further reduced zstd compression scratch requirement. For very large batches, in conjunction with the new extended API, the scratch allocation is now ~1.5x the total uncompressed size of the batch.

nvcomp 2.5.1 (2023-01-09)#

Bug fixes#

  • Improved GDeflate decompression throughput by up to 2x, fixing perf regression in 2.5.0

  • Fixed issue where some uses of CUB and Thrust in nvCOMP weren’t namespaced

  • Fixed bug, introduced in 2.5.0, in ZSTD decompression of large frames produced by the CPU compressor

nvcomp 2.5.0 (2022-12-16)#

New features#

  • Added Standard CRC32 support and its LLAPI.

  • Added Gzip batched decompresssion LL APIs, include getting decompression size APIs.

  • Added independent bitcomp.h header to access full feature set of bitcomp compressor

  • Added doc directory in nvcomp package containing the documentation files

  • Increased zStandard maximum compression chunk size from 64 KB to 16 MB

  • Improved zStandard decompression throughput by up to 2x on small batches and 40% on large batches

  • Added nvcomp*CompressionMaxAllowedChunkSize constant variables for each compressor

  • Updated GDeflate stream format to make it compatible with the GDeflate compression standard in NVIDIA RTX IO and Microsoft DirectStorage 1.1.

  • Updated GDeflate to support 64 KB dictionary window which allows a higher compression ratio.

  • Updated GDeflate CPU implementation to use the open source libdeflate repo: https://github.com/NVIDIA/libdeflate

  • Added initial support for SM90

Bug fixes#

  • Fixed memcheck failure in Snappy compression

  • Fixed deflate compression issue related to very small chunk sizes

  • Fixed handling of zero-byte chunks in ANS, Bitcomp, Cascaded, Deflate, and Gdeflate compressors

  • Fixed bug in Bitcomp where the maximum compressed size was slightly underestimated.

nvcomp 2.4.1 (2022-10-06)#

New features#

  • The Deflate batched decompression API can now accept nullptr for actual_decompressed_bytes.

Bug fixes#

  • Fixed incorrect behavior, failure, or crash when using duplicates feature (-x <count>) of the low-level “chunked” benchmarks.

  • Updated deflate_cpu_compression example to use the correct APIs.

  • The Deflate batched decompression API can work on uncomprressed data chunk larger than 64KB.

  • Fixed correctness / stability issue in compute capability 6.1

nvcomp 2.4.0 (2022-09-23)#

New features#

  • Added support for ZSTD compression to LL API

  • Early Access Linux SBSA binaries.

Bug fixes#

  • Fixed issue where cascaded compressor bitpack wasn’t considering unsigned data type, causing suboptimal compression ratio

  • Fixed cmake problem where we stated wrong version compatibility

Performance Optimizations#

  • Optimized GDeflate high-compression mode. Up to 2x faster.

  • Optimized ZSTD decompression. Up to 1.2x faster.

  • Optimized Deflate decompression. Up to 1.5x faster.

  • Optimized ANS compression. Strong scaling allows for up to 7x higher compression and decompression throughput for files on the order of a few MB in size. Decompression throughput is improved by at least 20% on all tested files.

nvcomp 2.3.3 (2022-07-20)#

Bug Fixes#

  • Add missing nvcompBatchedDeflateDecompressGetTempSizeEx API

  • Fixed minor correctness issue in deflate compression.

  • Fixed cmake problem that caused an unnecessary implied cudart_static dependency

Performance Optimizations#

  • Optimized nvcompBatchedDeflateGetDecompressSizeAsync. Now 2-3x faster on A100.

nvcomp 2.3.2 (2022-06-24)#

Bug Fixes#

  • Fixed various bugs in ZSTD decompression implementation

  • Fixed the issue of deflate compression could not be correctly decompressed by zlib::inflate().

nvcomp 2.3.1 (2022-06-15)#

Bug Fixes#

  • Fixed various bugs in ZSTD decompression implementation

  • Fixed various bugs in ANS compression implementation

  • Fix hang in GDeflate high-compression mode for large files

  • Fix bug in library build that required dynamic link to cudart.

Interface Changes#

  • Added new API, nvcompBatched<Format>DecompressGetTempSizeEx(). This provides an optional capability for providing the total decompressed size to the API, which for some formats can dramatically reduce the required temp size.

nvcomp 2.3.0 (2022-04-29)#

New features#

  • Support ZSTD decompression in the LLIF

  • Deflate support (RFC 1951)

  • Modified-CRC32 checksum support added to HLIF. Includes optional verification of HLIF-compressed buffers intended for error detection

Bug fixes#

  • Added Pascal GPU architecture support for all compressors

Performance Optimizations#

  • Performance optimizations in ANS compression / decompression, leading to ~100% speedup in compression and ~50% speedup in decompression

  • Developed algorithmic improvements to GDeflate’s high-compression mode. This is now 30-40x faster on average while producing the same output as the previous version

Infrastructure#

  • Improvements to the benchmarking interface for LLIF – common argument APIs

nvcomp 2.2.0 (2022-02-07)#

New features#

  • Entropy-only mode for GDeflate

  • New high-level interface

  • Windows support

  • Support for GPU-accelerated ANS

Interface Changes#

High-level interface#

  • High-level interface is now standardized across compressor formats.

  • This interface provides a single nvcompManagerBase object that can do compression and decompression. Users can now decompress nvcomp-compressed files without knowing how they were compressed. The interface also can manage scratch space and splitting the input buffer into independent chunks for parallel processing.

API Consolidation#

  • nvCOMP now supports only the low-level batch API and the new high-level interface

nvcomp 2.1.0 (2021-10-28)#

New features#

  • New release of low-level batched API for Cascaded and Bitcomp methods.

  • New high-throughput and high-compression-ratio GPU compressors in GDeflate

Interface Changes#

  • Update batched/low-level compression interfaces to take an options parameter, to allow configuring future compression algorithms.

  • Update batched/low-level decompression interfaces to output the decompressed size (or 0 if an error occurs).

  • Add bounds checking to batched/low-level decompression routines, such that if an invalid compressed data stream is provided, 0 will be written for the output size, rather than generating an illegal memory access.

  • Fix LZ4 to support chunk sizes < 32 KB.

Performance Optimizations#

  • Improve performance of Snappy compression by ~10% in some configurations.

  • Add an optimization to the LZ4 compressor based on specification of input data as char, short, or int, rather than just treating the input as raw bytes.

  • Optimization to reduce the LZ hash table size when compressing smaller chunks.

  • Improved compression performance in GDeflate with the high-throughput option

  • Improved decompression performance in GDeflate (10-75% depending on the dataset)

Bug Fixes#

  • Fix LZ4 CPU compression example.

  • Fix temp allocation size bug in benchmark_template_chunked.

Infrastructure#

  • Update CMakeLists to compile nvcomp with -fPIC enabled.

  • Add a new script for benchmarking compression algorithms.

  • Add unit tests for the Snappy decompressor that tests decompression on legally formatted files that won’t be generated by the nvcomp compressor due to configuration.

  • Update CMakeLists to suppress warnings about missing nvcomp external dependencies when the user didn’t indicate they wanted to include them.

  • Update CMakeLists to allow install into include folder that the user does not have ownership of.

nvcomp 2.0.2 (2021-06-30)#

  • Add example lz4_cpu_decompression to compress on the GPU with nvCOMP and decompress on the CPU with liblz4.

  • Add CMake option for building a static library.

  • Fix bug in LZ4 compression kernel to comply with LZ4 end of block restrictions.

  • Fix temp allocation size bug in benchmark_lz4_chunked.

nvcomp 2.0.1 (2021-06-08)#

  • Improve CMake setup for using nvCOMP as a submodule. This includes marking dependencies as PRIVATE, and adding options for building examples, tests, and benchmarks (e.g., -DBUILD_EXAMPLES=ON, -DBUILD_TESTS=ON, and -DBUILD_BENCHMARKS=ON).

  • Fix double free error in benchmark_snappy_synth.

  • Fix copy direction in Cascaded compression when the output size on the GPU.

  • Improve testing coverage.

  • Mark the generic decompression interfaces defined in include/nvcomp.h as deprecated.

nvcomp 2.0.0 (2021-04-28)#

  • Replace previous C, and C++ APIs.

  • Added Snappy compression (batched interface).

  • Added support for using Bitcomp and GDeflate external compressors.

  • Added /examples folder demonstrating use cases interface with CPU implementations of LZ4 and GDeflate, as well as GPU Direct Storage.

  • Improve support for Windows in benchmark implementations.

  • Made usage of std::uniform_int_distribution<> in the benchmarks conform to the C++14 standard.

  • Fix issue in Cascaded compression when using the default configuration (‘auto’), for small inputs.

nvcomp 1.2.3 (2021-04-07)#

  • Fix bug in LZ4 compression kernel for the Pascal architecture.

nvcomp 1.2.2 (2021-02-08)#

  • Fix linking errors in Clang++.

  • Fix error being incorrectly returned by Cascaded compression when output memory was initialized to all -1’s.

  • Fix C++17 style static assert.

  • Fix prematurely freeing memory in Cascaded compression.

  • Fix input format and usage messaging for benchmarks.

nvcomp 1.2.1 (2020-12-21)#

  • Fix compile error and unit tests for cascaded selector.

nvcomp 1.2.0 (2020-12-19)#

  • Add the Cascaded Selector and Cascaded Auto set of interfaces for automatically configuring cascaded compression.

  • Generally improve error handling and messaging.

  • Update CMake configuration to support CCache.

nvcomp 1.1.1 (2020-12-02)#

  • Add all-gather benchmark.

  • Add sm80 target if CUDA version is 11 or greater.

nvcomp 1.1.0 (2020-10-05)#

  • Add batch C interface for LZ4, allowing compressing/decompressing multiple inputs at once.

  • Significantly improve performance of LZ4 compression.

nvcomp 1.0.2 (2020-08-12)#

  • Fix metadata freeing for LZ4, to avoid possible mismatch of new[] and delete.

nvcomp 1.0.1 (2020-08-07)#

  • Fixed naming of nvcompLZ4CompressX functions in include/lz4.h, to have the nvcomp prefix.

  • Changed CascadedMetadata::Header struct initialization to work around internal compiler error.

nvcomp 1.0.0 (2020-07-31)#

  • Initial public release.