TensorFlow Wheel Platform

This TensorFlow Wheel release is intended for use on NVIDIA Ampere Architecture GPUs, NVIDIA Turing Architecture GPUs, NVIDIA Volta Architecture GPUs, and NVIDIA Pascal Architecture GPUs.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 1. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product | Package | Version
NVIDIA CUDA cuBLAS | nvidia-cublas | >=12.1
NVIDIA CUDA CUPTI | nvidia-cuda-cupti | >=12.0
NVIDIA CUDA NVCC | nvidia-cuda-nvcc | >=12.0
NVIDIA CUDA NVRTC | nvidia-cuda-nvrtc | 12.*
NVIDIA CUDA Runtime | nvidia-cuda-runtime | >=12.1
NVIDIA CUDA cuDNN | nvidia-cudnn | >=8.8
NVIDIA CUDA cuFFT | nvidia-cufft | >=11.0
NVIDIA CUDA cuRAND | nvidia-curand | >=10.3
NVIDIA CUDA cuSOLVER | nvidia-cusolver | >=11.4
NVIDIA CUDA cuSPARSE | nvidia-cusparse | >=12.0
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 | nvidia-dali-nvtf-plugin | ==1.23.0+nv23.03
NVIDIA DALI for CUDA 11.0 | nvidia-dali-cuda110 | ==1.23.0
Distributed training framework for TensorFlow, Keras, and PyTorch | nvidia-horovod | ==0.27.0+nv23.03
NVIDIA CUDA NCCL | nvidia-nccl | >=2.17
TensorBoard lets you watch Tensors Flow | nvidia-tensorboard | ==1.23.0+nv23.03
NVIDIA TensorFlow | nvidia-tensorflow | ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library | nvidia-tensorrt | ==8.5
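
The version column above mixes minimum versions (such as >=12.1) and wildcard pins (such as 12.*). As a rough illustration of how those two forms can be checked against an installed version string, here is a minimal sketch. The helper names `parse_release` and `satisfies` are hypothetical, the comparison is a simplification of what pip actually does, and exact pins written with == are deliberately not handled.

```python
import re


def parse_release(text):
    """Return the numeric components of a version string as a tuple of ints.

    A local label such as "+nv23.03" is dropped before parsing. Only purely
    numeric releases are handled; pre-release tags are out of scope here.
    """
    public = text.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public))


def satisfies(installed, spec):
    """Check an installed version against a '>=X.Y' or 'X.*' specifier."""
    spec = spec.replace(" ", "")
    have = parse_release(installed)
    if spec.startswith(">="):
        want = parse_release(spec[2:])
        # Pad both tuples with zeros so that "12" compares equal to "12.0".
        width = max(len(have), len(want))
        pad = lambda v: v + (0,) * (width - len(v))
        return pad(have) >= pad(want)
    if spec.endswith(".*"):
        want = parse_release(spec[:-2])
        # A wildcard matches when the leading components are identical.
        return have[:len(want)] == want
    raise ValueError(f"unsupported specifier: {spec!r}")


print(satisfies("12.1.3.1", ">=12.1"))  # True
print(satisfies("11.8.89", "12.*"))     # False
```

In practice pip resolves these constraints itself during installation; a check like this is only useful for quick diagnostics on an existing environment.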

Driver Requirements

Release 23.03 is based on CUDA 12.1.0, which requires NVIDIA Driver release 530 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), 510.47 (or later R510), or 525.85 (or later R525).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, R460, and R520 drivers, which are not forward-compatible with CUDA 12.0. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
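
The driver rules stated for release 23.03 can be written down as a small check. The sketch below is illustrative only: the function name `driver_ok` and the `data_center_gpu` flag are assumptions, and it encodes just the branches listed in this section, not the full CUDA compatibility matrix.

```python
# Minimum driver per branch for data center GPUs, as listed for release 23.03.
DATA_CENTER_MINIMUMS = {
    450: (450, 51),
    470: (470, 57),
    510: (510, 47),
    525: (525, 85),
}

# Any driver from this branch onward is accepted on all supported GPUs.
DEFAULT_MINIMUM_BRANCH = 530


def driver_ok(version, data_center_gpu=False):
    """Return True if a driver version string such as '470.57.02' is acceptable."""
    parts = tuple(int(p) for p in version.split("."))
    branch = parts[0]
    if branch >= DEFAULT_MINIMUM_BRANCH:
        return True
    if data_center_gpu and branch in DATA_CENTER_MINIMUMS:
        # Compare only the branch and minor number against the listed minimum.
        return parts[:2] >= DATA_CENTER_MINIMUMS[branch]
    # Unlisted branches (for example R418, R440, R460, R520) need an upgrade.
    return False


print(driver_ok("530.30.02"))                        # True
print(driver_ok("470.57.02", data_center_gpu=True))  # True
print(driver_ok("460.32.03", data_center_gpu=True))  # False
```

The installed driver version can typically be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the function as a string.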

Software Requirements

The 23.03 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against every NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • The default set of Keras optimizers is not currently compatible with Horovod; see GitHub issues [1], [2]. Reverting to the old optimizers (now available under tf.keras.optimizers.legacy, for example tf.keras.optimizers.legacy.Adam instead of tf.keras.optimizers.Adam) resolves the errors. We also have an in-flight Horovod PR 3822 that fixes more cases.

  • Some DLRM models may regress by 10-40%. We are currently investigating.

  • A known performance regression of up to 50% affects some EfficientNet models. The regression is inherited from upstream TensorFlow and is still under investigation. It will be fixed in a subsequent release.

  • The TF-TRT native segment fallback has a known issue that causes a crash. This issue occurs when you use TF-TRT to convert a model with a subgraph that is converted to TensorRT but then fails to build. Instead of falling back to native TensorFlow, TF-TRT crashes. Running export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that causes the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize cuDNN due to an out-of-memory (OOM) error. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can slow training by up to 55% for certain models, such as EfficientNet, when XLA is enabled. The root cause is under investigation and will be fixed in a future release.

  • On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.

  • Within the TF1 container on T4 GPUs, the MaskRCNN model may fail with either low accuracy or an illegal memory access. The root cause is under investigation and will be fixed in a future release.
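
Several of the workarounds in the list above are environment variables that need to be in place before the training process starts. One way to apply them from Python is to build the environment for a child process, as in the sketch below. The dictionary keys and the `build_env` helper are hypothetical names chosen for illustration; the variable names and values are the ones quoted in the list, and "ProblematicOp" remains the placeholder used there, not a real op name.

```python
import os

# Workarounds quoted from the known issues above, grouped under
# hypothetical labels so that they can be selected individually.
WORKAROUNDS = {
    "tftrt_denylist": {"TF_TRT_OP_DENYLIST": "ProblematicOp"},
    "unet3d_oom": {"TF_DEVICE_MIN_SYS_MEMORY_IN_MB": "2000"},
    "h100_nvlink": {"NCCL_PROTO": "^LL128"},
    "aarch64_libgomp": {"LD_PRELOAD": "/usr/lib/aarch64-linux-gnu/libgomp.so.1"},
}


def build_env(selected, base=None):
    """Return a copy of the environment with the selected workarounds applied."""
    env = dict(os.environ if base is None else base)
    for name in selected:
        env.update(WORKAROUNDS[name])
    return env


# Example: environment for a UNet3D run on an H100 NVLink system.
env = build_env(["unet3d_oom", "h100_nvlink"], base={})
print(env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching a training script. Variables such as LD_PRELOAD are read when a process starts, so setting them inside an already running interpreter would not have the intended effect.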

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 2. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product | Package | Version
NVIDIA CUDA cuBLAS | nvidia-cublas | >=12.0
NVIDIA CUDA CUPTI | nvidia-cuda-cupti | >=12.0
NVIDIA CUDA NVCC | nvidia-cuda-nvcc | >=12.0
NVIDIA CUDA NVRTC | nvidia-cuda-nvrtc | 12.*
NVIDIA CUDA Runtime | nvidia-cuda-runtime | >=12.0
NVIDIA CUDA cuDNN | nvidia-cudnn | >=8.7
NVIDIA CUDA cuFFT | nvidia-cufft | >=11.0
NVIDIA CUDA cuRAND | nvidia-curand | >=10.3
NVIDIA CUDA cuSOLVER | nvidia-cusolver | >=11.4
NVIDIA CUDA cuSPARSE | nvidia-cusparse | >=12.0
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 | nvidia-dali-nvtf-plugin | ==1.22.0+nv23.02
NVIDIA DALI for CUDA 11.0 | nvidia-dali-cuda110 | ==1.22.0
Distributed training framework for TensorFlow, Keras, and PyTorch | nvidia-horovod | ==0.26.1+nv23.01
NVIDIA CUDA NCCL | nvidia-nccl | >=2.16
TensorBoard lets you watch Tensors Flow | nvidia-tensorboard | ==1.22.0+nv23.02
NVIDIA TensorFlow | nvidia-tensorflow | ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library | nvidia-tensorrt | ==8.5

Driver Requirements

Release 23.02 is based on CUDA 12.0.1, which requires NVIDIA Driver release 525 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), 510.47 (or later R510), or 525.85 (or later R525).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 12.0. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 23.02 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against every NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • In 23.02 containers, certain cuDNN cases that use runtime compilation via NVRTC, particularly on ARM SBSA systems, can fail with the following:

    CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING

    A workaround for this situation is to export the following:

    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11/lib64

    This will be fixed in the next release.

  • The default set of Keras optimizers is not currently compatible with Horovod; see GitHub issues [1], [2]. Reverting to the old optimizers (now available under tf.keras.optimizers.legacy, for example tf.keras.optimizers.legacy.Adam instead of tf.keras.optimizers.Adam) resolves the errors. We also have an in-flight Horovod PR 3822 that fixes more cases.

  • Some DLRM models may regress by 10-40%. We are currently investigating.

  • A known performance regression of up to 50% affects some EfficientNet models. The regression is inherited from upstream TensorFlow and is still under investigation. It will be fixed in a subsequent release.

  • The TF-TRT native segment fallback has a known issue that causes a crash. This issue occurs when you use TF-TRT to convert a model with a subgraph that is converted to TensorRT but then fails to build. Instead of falling back to native TensorFlow, TF-TRT crashes. Running export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that causes the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize cuDNN due to an out-of-memory (OOM) error. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can slow training by up to 55% for certain models, such as EfficientNet, when XLA is enabled. The root cause is under investigation and will be fixed in a future release.

  • On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.

  • Within the TF1 container on T4 GPUs, the MaskRCNN model may fail with either low accuracy or an illegal memory access. The root cause is under investigation and will be fixed in a future release.
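
The cuDNN/NVRTC workaround in the first item of this list appends a directory to LD_LIBRARY_PATH. When the variable is initially empty, a plain shell append leaves a leading colon, which the dynamic loader treats as an empty entry. The sketch below shows one way to build the value without empty or duplicate entries; the helper name `append_library_path` is hypothetical, and the directory is the one quoted in the workaround.

```python
def append_library_path(current, new_dir):
    """Append new_dir to a colon-separated path list, skipping empties and duplicates."""
    entries = [entry for entry in current.split(":") if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return ":".join(entries)


CUDA_LIB_DIR = "/usr/local/cuda-11/lib64"  # directory quoted in the workaround

print(append_library_path("", CUDA_LIB_DIR))
print(append_library_path("/opt/lib:/usr/lib", CUDA_LIB_DIR))
```

As with other loader variables, the resulting value should be exported in the shell or passed to a child process, since the loader reads it at process start.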

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 3. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product | Package | Version
NVIDIA CUDA cuBLAS | nvidia-cublas | >=12.0
NVIDIA CUDA CUPTI | nvidia-cuda-cupti | >=12.0
NVIDIA CUDA NVCC | nvidia-cuda-nvcc | >=12.0
NVIDIA CUDA NVRTC | nvidia-cuda-nvrtc | 11.*
NVIDIA CUDA Runtime | nvidia-cuda-runtime | >=12.0
NVIDIA CUDA cuDNN | nvidia-cudnn | >=8.7
NVIDIA CUDA cuFFT | nvidia-cufft | >=11.0
NVIDIA CUDA cuRAND | nvidia-curand | >=10.3
NVIDIA CUDA cuSOLVER | nvidia-cusolver | >=11.4
NVIDIA CUDA cuSPARSE | nvidia-cusparse | >=12.0
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 | nvidia-dali-nvtf-plugin | ==1.21.0+nv23.01
NVIDIA DALI for CUDA 11.0 | nvidia-dali-cuda110 | ==1.21.0
Distributed training framework for TensorFlow, Keras, and PyTorch | nvidia-horovod | ==0.26.1+nv23.01
NVIDIA CUDA NCCL | nvidia-nccl | >=2.16
TensorBoard lets you watch Tensors Flow | nvidia-tensorboard | ==1.16.0+nv23.01
NVIDIA TensorFlow | nvidia-tensorflow | ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library | nvidia-tensorrt | ==8.5

Driver Requirements

Release 23.01 is based on CUDA 12.0.1, which requires NVIDIA Driver release 525 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), 510.47 (or later R510), or 525.85 (or later R525).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 12.0. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 23.01 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against every NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Due to a dependency issue, pip install nvidia-tensorflow[horovod] may pick up an older version of cuBLAS unless pip install nvidia-cublas-cu11~=11.8.0 is issued first.

  • If you wish to modify the source and rebuild TensorFlow, note that starting from Container Release 22.10 (TensorFlow 2.10) you will need a C++17-compatible compiler.

  • The default set of Keras optimizers is not currently compatible with Horovod; see GitHub issues [1], [2]. Reverting to the old optimizers (now available under tf.keras.optimizers.legacy, for example tf.keras.optimizers.legacy.Adam instead of tf.keras.optimizers.Adam) resolves the errors. We also have an in-flight Horovod PR 3822 that fixes more cases.

  • Some DLRM models may regress by 10-40%. We are currently investigating.

  • A known performance regression of up to 50% affects some EfficientNet models. The regression is inherited from upstream TensorFlow and is still under investigation. It will be fixed in a subsequent release.

  • The TF-TRT native segment fallback has a known issue that causes a crash. This issue occurs when you use TF-TRT to convert a model with a subgraph that is converted to TensorRT but then fails to build. Instead of falling back to native TensorFlow, TF-TRT crashes. Running export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that causes the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize cuDNN due to an out-of-memory (OOM) error. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can slow training by up to 55% for certain models, such as EfficientNet, when XLA is enabled. The root cause is under investigation and will be fixed in a future release.

  • On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.

  • Within the TF1 container on T4 GPUs, the MaskRCNN model may fail with either low accuracy or an illegal memory access. The root cause is under investigation and will be fixed in a future release.
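
The first item in the list above depends on the order of two pip installs: the cuBLAS pin has to be installed before nvidia-tensorflow[horovod]. A minimal sketch of scripting that order is shown below. The function names `install_commands` and `run_install` are hypothetical, the package requirements are the two quoted in that item, and the default is a dry run that only prints the commands.

```python
import subprocess
import sys


def install_commands(python=sys.executable):
    """Return the two pip invocations in the order the known issue requires."""
    pip = [python, "-m", "pip", "install"]
    return [
        pip + ["nvidia-cublas-cu11~=11.8.0"],   # pin cuBLAS first
        pip + ["nvidia-tensorflow[horovod]"],   # then install the wheel
    ]


def run_install(dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for command in install_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing install.
            subprocess.run(command, check=True)


run_install(dry_run=True)
```

Running the installs as separate commands, not as a single pip invocation, keeps the ordering explicit, which is the point of the workaround.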

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 4. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product | Package | Version
NVIDIA CUDA cuBLAS | nvidia-cublas | >=11.11
NVIDIA CUDA CUPTI | nvidia-cuda-cupti | >=11.8
NVIDIA CUDA NVCC | nvidia-cuda-nvcc | >=11.8
NVIDIA CUDA NVRTC | nvidia-cuda-nvrtc | 11.*
NVIDIA CUDA Runtime | nvidia-cuda-runtime | >=11.8
NVIDIA CUDA cuDNN | nvidia-cudnn | >=8.7
NVIDIA CUDA cuFFT | nvidia-cufft | >=10.9
NVIDIA CUDA cuRAND | nvidia-curand | >=10.3
NVIDIA CUDA cuSOLVER | nvidia-cusolver | >=11.4
NVIDIA CUDA cuSPARSE | nvidia-cusparse | >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 | nvidia-dali-nvtf-plugin | ==1.20.0+nv22.12
NVIDIA DALI for CUDA 11.0 | nvidia-dali-cuda110 | ==1.20.0
Distributed training framework for TensorFlow, Keras, and PyTorch | nvidia-horovod | ==0.26.1+nv22.12
NVIDIA CUDA NCCL | nvidia-nccl | >=2.15
TensorBoard lets you watch Tensors Flow | nvidia-tensorboard | ==1.16.0+nv22.12
NVIDIA TensorFlow | nvidia-tensorflow | ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library | nvidia-tensorrt | ==8.5

Driver Requirements

Release 22.12 is based on CUDA 11.8, which requires NVIDIA Driver release 520 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.7. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 22.12 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against every NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • A known performance regression of up to 50% affects some EfficientNet models. The regression is inherited from upstream TensorFlow and is still under investigation. It will be fixed in a subsequent release.

  • The TF-TRT native segment fallback has a known issue that causes a crash. This issue occurs when you use TF-TRT to convert a model with a subgraph that is converted to TensorRT but then fails to build. Instead of falling back to native TensorFlow, TF-TRT crashes. Running export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that causes the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize cuDNN due to an out-of-memory (OOM) error. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can slow training by up to 55% for certain models, such as EfficientNet, when XLA is enabled. The root cause is under investigation and will be fixed in a future release.

  • On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.

  • Within the TF1 container on T4 GPUs, the MaskRCNN model may fail with either low accuracy or an illegal memory access. The root cause is under investigation and will be fixed in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 5. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product | Package | Version
NVIDIA CUDA cuBLAS | nvidia-cublas | >=11.11
NVIDIA CUDA CUPTI | nvidia-cuda-cupti | >=11.8
NVIDIA CUDA NVCC | nvidia-cuda-nvcc | >=11.8
NVIDIA CUDA NVRTC | nvidia-cuda-nvrtc | 11.*
NVIDIA CUDA Runtime | nvidia-cuda-runtime | >=11.8
NVIDIA CUDA cuDNN | nvidia-cudnn | >=8.7
NVIDIA CUDA cuFFT | nvidia-cufft | >=10.9
NVIDIA CUDA cuRAND | nvidia-curand | >=10.3
NVIDIA CUDA cuSOLVER | nvidia-cusolver | >=11.4
NVIDIA CUDA cuSPARSE | nvidia-cusparse | >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 | nvidia-dali-nvtf-plugin | ==1.18.0+nv22.11
NVIDIA DALI for CUDA 11.0 | nvidia-dali-cuda110 | ==1.18.0
Distributed training framework for TensorFlow, Keras, and PyTorch | nvidia-horovod | ==0.26.1+nv22.11
NVIDIA CUDA NCCL | nvidia-nccl | >=2.15
TensorBoard lets you watch Tensors Flow | nvidia-tensorboard | ==1.16.0+nv22.11
NVIDIA TensorFlow | nvidia-tensorflow | ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library | nvidia-tensorrt | ==8.5

Driver Requirements

Release 22.11 is based on CUDA 11.8, which requires NVIDIA Driver release 520 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.7. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 22.11 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against every NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Certain models using ReLU activation may exhibit extreme (and easily noticeable) performance regressions. We have root-caused this to a cuDNN issue and will release the fix in 22.12.

  • Certain models may crash with an out-of-memory error. We are investigating and will fix this in 22.12.

  • A known performance regression of up to 50% affects some EfficientNet models. The regression is inherited from upstream TensorFlow and is still under investigation. It will be fixed in a subsequent release.

  • The TF-TRT native segment fallback has a known issue that causes a crash. This issue occurs when you use TF-TRT to convert a model with a subgraph that is converted to TensorRT but then fails to build. Instead of falling back to native TensorFlow, TF-TRT crashes. Running export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that causes the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize cuDNN due to an out-of-memory (OOM) error. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can slow training by up to 55% for certain models, such as EfficientNet, when XLA is enabled. The root cause is under investigation and will be fixed in a future release.

  • On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.

  • Within the TF1 container on T4 GPUs, the MaskRCNN model may fail with either low accuracy or an illegal memory access. The root cause is under investigation and will be fixed in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 6. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product | Package | Version
NVIDIA CUDA cuBLAS | nvidia-cublas | >=11.11
NVIDIA CUDA CUPTI | nvidia-cuda-cupti | >=11.8
NVIDIA CUDA NVCC | nvidia-cuda-nvcc | >=11.8
NVIDIA CUDA NVRTC | nvidia-cuda-nvrtc | 11.*
NVIDIA CUDA Runtime | nvidia-cuda-runtime | >=11.8
NVIDIA CUDA cuDNN | nvidia-cudnn | >=8.6
NVIDIA CUDA cuFFT | nvidia-cufft | >=10.9
NVIDIA CUDA cuRAND | nvidia-curand | >=10.3
NVIDIA CUDA cuSOLVER | nvidia-cusolver | >=11.4
NVIDIA CUDA cuSPARSE | nvidia-cusparse | >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 | nvidia-dali-nvtf-plugin | ==1.18.0+nv22.10
NVIDIA DALI for CUDA 11.0 | nvidia-dali-cuda110 | ==1.18.0
Distributed training framework for TensorFlow, Keras, and PyTorch | nvidia-horovod | ==0.25.0+nv22.10
NVIDIA CUDA NCCL | nvidia-nccl | >=2.15
TensorBoard lets you watch Tensors Flow | nvidia-tensorboard | ==1.16.0+nv22.10
NVIDIA TensorFlow | nvidia-tensorflow | ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library | nvidia-tensorrt | ==8.2.5

Driver Requirements

Release 22.10 is based on CUDA 11.8, which requires NVIDIA Driver release 520 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.7. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 22.10 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against every NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Some multi-GPU TF2 models (for example, EfficientNet) may crash with a segmentation fault. As a potential workaround, try increasing the host memory limit from the default of 64 GB by setting the environment variable TF_GPU_HOST_MEM_LIMIT_IN_MB=131072 (the value is in MB).

  • The TF-TRT native segment fallback has a known issue that causes a crash. This issue occurs when you use TF-TRT to convert a model with a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Running export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that causes the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO dominated CNN models, such as AlexNet and ResNet50 see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize CUDNN due to an OOM. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • There is a known performance regression in XLA that can cause performance regressions of up to 55% when training certain models such as EfficientNet with XLA enabled. The root cause is under investigation and will be fixed in a future release.

  • On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.

  • Within the TF1 container on T4 GPUs, the MaskRCNN model may fail with either low accuracy or an illegal memory access. The root cause is under investigation and will be fixed in a future release.
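
The environment-variable workarounds in the list above only take effect if they are set before TensorFlow initializes. The following is a minimal sketch, assuming the variables are set from Python rather than from the shell. The helper name `workaround_env` and its parameters are hypothetical; only the variable names and example values come from the notes above.

```python
import os

MB_PER_GB = 1024  # TF_GPU_HOST_MEM_LIMIT_IN_MB is expressed in MB


def workaround_env(host_mem_limit_gb=None, min_sys_memory_mb=None,
                   disable_nccl_ll128=False, trt_denylist_op=None):
    """Build the environment variables for the workarounds listed above.

    Hypothetical helper: only the variable names and example values are
    taken from the release notes.
    """
    env = {}
    if host_mem_limit_gb is not None:
        # 128 GB -> 131072 MB, the value suggested for multi-GPU TF2 crashes.
        env["TF_GPU_HOST_MEM_LIMIT_IN_MB"] = str(host_mem_limit_gb * MB_PER_GB)
    if min_sys_memory_mb is not None:
        # GPU memory carveout suggested for the UNet3D issue on A100.
        env["TF_DEVICE_MIN_SYS_MEMORY_IN_MB"] = str(min_sys_memory_mb)
    if disable_nccl_ll128:
        # Excludes the LL128 protocol, as suggested for 2-GPU H100 NVLink systems.
        env["NCCL_PROTO"] = "^LL128"
    if trt_denylist_op:
        env["TF_TRT_OP_DENYLIST"] = trt_denylist_op
    return env


env = workaround_env(host_mem_limit_gb=128, min_sys_memory_mb=2000)
os.environ.update(env)  # run this before `import tensorflow`
print(env)
```

Apply only the workarounds that match an issue you actually observe; each one is described as a potential mitigation, not a default setting.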

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 7. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.11
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.8
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.8
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.8
NVIDIA CUDA cuDNN nvidia-cudnn >=8.6
NVIDIA CUDA cuFFT nvidia-cufft >=10.9
NVIDIA CUDA cuRAND nvidia-curand >=10.3
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.4
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.17.0+nv22.09
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.17.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.25.0+nv22.09
NVIDIA CUDA NCCL nvidia-nccl >=2.15
TensorBoard lets you watch Tensors Flow nvidia-tensorboard ==1.16.0+nv22.09
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.5

Driver Requirements

Release 22.09 is based on CUDA 11.8, which requires NVIDIA Driver release 520 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.7. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
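
The driver rules for release 22.09 can be expressed as a small check. This is a sketch under stated assumptions: the function names are hypothetical, the driver version string is assumed to be supplied by the caller (for example, copied from the output of a driver query tool), and the thresholds are the ones quoted above for this release only.

```python
# Minimum versions per driver branch for data center GPUs (release 22.09).
DATA_CENTER_BRANCH_MINIMUMS = {450: 51, 470: 57, 510: 47}
# Any GPU is supported from this driver branch onward (CUDA 11.8).
MIN_DRIVER_MAJOR = 520


def parse_driver_version(version):
    """Return (major, minor) from a string such as '470.82.01' or '520'."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def driver_supported(version, data_center_gpu=False):
    """Apply the 22.09 driver rules to a driver version string."""
    major, minor = parse_driver_version(version)
    if major >= MIN_DRIVER_MAJOR:
        return True
    if data_center_gpu and major in DATA_CENTER_BRANCH_MINIMUMS:
        return minor >= DATA_CENTER_BRANCH_MINIMUMS[major]
    # R418, R440, R460 and other unlisted branches are not accepted here.
    return False


print(driver_supported("470.82.01", data_center_gpu=True))
print(driver_supported("460.32.03", data_center_gpu=True))
```

Other releases in these notes quote different CUDA versions and driver branches, so the two constants would need to be adjusted for each release.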

Software Requirements

The 22.09 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 22.09 is based on 1.15.5.

  • We introduced a new environment variable, TF_GRAPPLER_GRAPH_DEF_PATH, that outputs the GraphDef files before and after the TF Grappler optimizations. (For more information about these optimizations, see TensorFlow graph optimization with Grappler.) To inspect the optimized operation graph at TF runtime, specify the following:

    TF_GRAPPLER_GRAPH_DEF_PATH=/path/to/graphdef

  • We provided a visualization tool, graphdef2pydot, to convert (and compare) the given GraphDef files. It is preinstalled in the 22.09 container. For usage information, see graphdef2pydot -h.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • We introduced a feature that switches to the channels-last layout (NHWC) to take advantage of Tensor Core math. If you observe a performance regression in TF-TRT models, consider setting export TF_ENABLE_LAYOUT_NHWC=1 to check whether it recovers the lost performance.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize CUDNN due to an OOM. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can reduce performance by up to 55% when training certain models, such as EfficientNet, with XLA enabled. The root cause is under investigation and will be fixed in a future release.
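
When training is launched from a Python script rather than an interactive shell, the libgomp workaround above can be applied to the child process environment. The following is a minimal sketch, assuming the training job runs as a subprocess. The helper name and its injectable `machine` and `lib_exists` parameters are hypothetical; only the library path comes from the notes above.

```python
import os
import platform

# Path quoted in the workaround above.
LIBGOMP_PATH = "/usr/lib/aarch64-linux-gnu/libgomp.so.1"


def env_with_libgomp_preload(base_env, machine=None, lib_exists=os.path.exists):
    """Return a copy of base_env with LD_PRELOAD set on aarch64 hosts.

    `machine` and `lib_exists` can be overridden so the logic can be
    exercised on any host; by default they describe the current machine.
    """
    env = dict(base_env)
    machine = machine or platform.machine()
    if machine == "aarch64" and lib_exists(LIBGOMP_PATH):
        existing = env.get("LD_PRELOAD")
        # Keep any libraries that are already preloaded.
        env["LD_PRELOAD"] = f"{LIBGOMP_PATH}:{existing}" if existing else LIBGOMP_PATH
    return env


# Pass the result as `env=` to subprocess.run([...]) when launching training.
child_env = env_with_libgomp_preload(os.environ)
print(child_env.get("LD_PRELOAD"))
```

On hosts that are not aarch64, or where the library is missing, the environment is returned unchanged.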

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 8. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.10
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.7
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.7
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.7
NVIDIA CUDA cuDNN nvidia-cudnn >=8.5
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.4
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.16.0+nv22.08
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.16.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.25.0+nv22.08
NVIDIA CUDA NCCL nvidia-nccl >=2.12
TensorBoard lets you watch Tensors Flow nvidia-tensorboard ==1.16.0+nv22.08
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.5

Driver Requirements

Release 22.08 is based on CUDA 11.7 Update 1, which requires NVIDIA Driver release 515 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510).

The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.7. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 22.08 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 22.08 is based on 1.15.5.

  • We introduced a new environment variable, TF_ENABLE_LAYOUT_NHWC, to enforce the NHWC layout at runtime. For some FP32 models on Ampere GPUs, setting TF_ENABLE_LAYOUT_NHWC=1 may improve performance because the NHWC layout can make better use of the TF32 Tensor Cores.
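
To make the layout terminology concrete, the sketch below compares NCHW (channels first) with NHWC (channels last) using plain Python. It only illustrates how the two layouts order elements in memory; it does not show how TensorFlow implements the switch, and the function names and the example shape are hypothetical.

```python
def nchw_to_nhwc_shape(shape):
    """Reorder a (batch, channels, height, width) shape to channels-last."""
    n, c, h, w = shape
    return (n, h, w, c)


def flat_offset(index, shape):
    """Row-major offset of `index` in a dense tensor of `shape`."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


shape_nchw = (8, 3, 224, 224)
shape_nhwc = nchw_to_nhwc_shape(shape_nchw)
print(shape_nhwc)  # (8, 224, 224, 3)

# Distance in memory between channel 0 and channel 1 of the same pixel:
# NCHW places them H*W elements apart, NHWC places them next to each other.
stride_nchw = flat_offset((0, 1, 0, 0), shape_nchw) - flat_offset((0, 0, 0, 0), shape_nchw)
stride_nhwc = flat_offset((0, 0, 0, 1), shape_nhwc) - flat_offset((0, 0, 0, 0), shape_nhwc)
print(stride_nchw, stride_nhwc)  # 50176 1
```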

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize CUDNN due to an OOM. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can reduce performance by up to 55% when training certain models, such as EfficientNet, with XLA enabled. The root cause is under investigation and will be fixed in a future release.

  • TF-TRT 22.07 may fail to build TensorRT engines for HF BERT or HF BART, which may manifest as large performance regressions. Revert to the previous version, 22.06, if you see a TF-TRT warning stating that the Myelin graph could not be created, or if you see a substantial performance regression.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 9. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.10
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.7
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.7
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.7
NVIDIA CUDA cuDNN nvidia-cudnn >=8.4
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.4
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.15.0+nv22.07
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.15.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.24.3+nv22.07
NVIDIA CUDA NCCL nvidia-nccl >=2.12
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.5
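
The Version column uses pip-style version specifiers, including combined ranges such as <1.16.0,>=1.15.0. The sketch below shows how such a range can be read. It is a simplified illustration, not pip's actual resolver: it supports only the >=, < and == operators, ignores local tags such as +nv22.07, does not handle wildcards such as 11.*, and its function names are hypothetical. Real tooling would typically rely on the `packaging` library instead.

```python
def parse_version(text):
    """Turn '1.15.5' into (1, 15, 5); a local tag such as '+nv22.07' is dropped."""
    public = text.split("+")[0]
    return tuple(int(part) for part in public.split("."))


def _pad(a, b):
    """Pad the shorter tuple with zeros so that 8.4 compares like 8.4.0."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(version, spec):
    """Check `version` against a comma-separated spec such as '<1.16.0,>=1.15.0'."""
    v = parse_version(version)
    for clause in spec.split(","):
        clause = clause.strip()
        for op in (">=", "==", "<"):
            if clause.startswith(op):
                a, b = _pad(v, parse_version(clause[len(op):]))
                ok = {">=": a >= b, "==": a == b, "<": a < b}[op]
                break
        else:
            raise ValueError(f"unsupported clause: {clause}")
        if not ok:
            return False  # every clause must hold
    return True


print(satisfies("1.15.0", "<1.16.0,>=1.15.0"))  # True
print(satisfies("1.16.0", "<1.16.0,>=1.15.0"))  # False
```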

Driver Requirements

Release 22.07 is based on NVIDIA CUDA 11.7, which requires NVIDIA Driver release 515 or later.

Software Requirements

The 22.07 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize CUDNN due to an OOM. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

  • A known performance regression in XLA can reduce performance by up to 55% when training certain models, such as EfficientNet, with XLA enabled. The root cause is under investigation and will be fixed in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 10. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.10
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.7
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.7
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.7
NVIDIA CUDA cuDNN nvidia-cudnn >=8.4
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.4
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.14.0+nv22.06
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.14.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.24.3+nv22.06
NVIDIA CUDA NCCL nvidia-nccl >=2.12
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.5

Driver Requirements

Release 22.06 is based on NVIDIA CUDA 11.7, which requires NVIDIA Driver release 515 or later.

Software Requirements

The 22.06 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 22.06 is based on 1.15.5.

  • Added support for NHWC TF32 2D convolutions in XLA.
  • TensorFlow 2.9 improves the functionality of prefetch_to_device to allow for concurrent kernel execution and data transfer. To make use of this feature, ensure that your dataset pipeline ends by applying the prefetch_to_device operation as follows.

    dataset = dataset.batch(batch_size=1024)
    …
    dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/gpu:0'))

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

  • IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.

  • In some configurations, the UNet3D model on A100 fails to initialize CUDNN due to an OOM. This can be fixed by increasing the GPU memory carveout with the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB=2000.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 11. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.10
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.7
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.7
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.7
NVIDIA CUDA cuDNN nvidia-cudnn >=8.4
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.3
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.13.0+nv22.05
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.13.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.24.2+nv22.05
NVIDIA CUDA NCCL nvidia-nccl >=2.12
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.4.2

Driver Requirements

Release 22.05 is based on NVIDIA CUDA 11.7, which requires NVIDIA Driver release 515 or later.

Software Requirements

The 22.05 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 22.05 is based on 1.15.5.

  • Fixed a segfault in SparseToDense when validate_indices is false, for both TF1 and TF2.
  • Fixed XLA device indexing issue in TF2 that caused out-of-memory errors when using Horovod to distribute work to multiple GPUs.

  • Removed unneeded copies when saving resource variables. This lowers the effective memory footprint for models with large layers (e.g., embedding layers in recommender models).

  • Optimized the depthwise convolution backprop filter kernel, providing speedups of 10x to 100x over the previous implementation.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 12. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.9
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.6
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.6
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.6
NVIDIA CUDA cuDNN nvidia-cudnn >=8.4
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.3
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.12.0+nv22.04
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.12.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.24.2+nv22.04
NVIDIA CUDA NCCL nvidia-nccl >=2.12
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.4.2

Driver Requirements

Release 22.04 is based on NVIDIA CUDA 11.6.2, which requires NVIDIA Driver release 510 or later.

Software Requirements

The 22.04 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 22.04 is based on 1.15.5.

  • Container sizes were reduced by removing redundant PTX code sections.

  • Fixed the race condition in the cuDNN heuristics lookup that might sometimes lead to segmentation faults.

  • TF2 added cuTENSOR support for the einsum single label case.

  • Fixed pooling operations to support tensors with dimensions that exceed the 32-bit integer indexing.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

  • A known issue affects aarch64 libgomp, which might sometimes cause the following error:

    cannot allocate memory in static TLS block

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 13. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.8
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.6
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.6
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.6
NVIDIA CUDA cuDNN nvidia-cudnn >=8.3
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.3
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.11.1+nv22.03
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.11.1
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.23.0+nv22.03
NVIDIA CUDA NCCL nvidia-nccl >=2.11
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2

Driver Requirements

Release 22.03 is based on NVIDIA CUDA 11.6.1, which requires NVIDIA Driver release 510 or later.

Software Requirements

The 22.03 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • Fixed a bug in the XLA convolution autotuner that appeared in the 22.01-tf2 release and sometimes caused "Failed to determine best cudnn convolution algorithm: RESOURCE_EXHAUSTED" errors.

  • TensorFlow 1.15 has been patched for compatibility with numpy 1.21.1, and the numpy version has been updated to that version.

    With older numpy releases, certain matrix operations resulted in NaN and Inf values on ARM SBSA.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • A CUDNN performance regression can cause slowdowns of up to 15% in certain ResNet models. This will be fixed in a future release.

  • There is a known performance regression affecting UNet Medical 3D model training by up to 23%. This will be addressed in a future release.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 14. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.8
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.6
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.6
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.6
NVIDIA CUDA cuDNN nvidia-cudnn >=8.3
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.3
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.10.0+nv22.02
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.10.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.23.0+nv22.02
NVIDIA CUDA NCCL nvidia-nccl >=2.11
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2

Driver Requirements

Release 22.02 is based on NVIDIA CUDA 11.6.0, which requires NVIDIA Driver release 510 or later.

Software Requirements

The 22.02 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 22.02 is based on 1.15.5.

  • For TF2, added the CudnnMHA Keras op to expose CUDNN's optimized multi-head attention implementation.

  • Fixed a segmentation fault that occurred when VLOG logging was enabled in TF1.

  • Updated TF-TRT with the latest upstream changes.

  • Fixed a bug in TF2 where CUDNN's fused batch norm gradient kernels could be called when training = false.

  • Extended autotuning over CUDNN fallback engines. This change may increase the execution time of the first few iterations, but can result in substantially better engines being chosen during later iterations.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed-precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • A CUDNN performance regression can cause slowdowns of up to 15% in certain ResNet models. This will be fixed in a future release.

  • There is a known performance regression affecting UNet Medical 3D model training by up to 23%. This will be addressed in a future release.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help by preventing conversion of the op that triggers the native segment fallback.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 15. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.8
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.6
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.6
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.6
NVIDIA CUDA cuDNN nvidia-cudnn >=8.3
NVIDIA CUDA cuFFT nvidia-cufft >=10.7
NVIDIA CUDA cuRAND nvidia-curand >=10.2
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.3
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.9.0+nv22.01
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.9.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.23.0+nv22.01
NVIDIA CUDA NCCL nvidia-nccl >=2.11
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2
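
To compare an existing environment against the version column of a table like the one above, a small checker can help. This is a simplified sketch under stated assumptions: the function names are hypothetical, "==" is treated as a prefix match (so "==8.2" accepts 8.2.x), local suffixes such as "+nv22.01" are ignored, and only the specifier forms that appear in these tables are handled. It is not a full PEP 440 implementation.

```python
from importlib import metadata


def parse_version(text):
    """Convert '11.8.3.1' or '1.9.0+nv22.01' to a tuple of ints (local '+...' part ignored)."""
    public = text.strip().split("+", 1)[0]
    return tuple(int(part) for part in public.split(".") if part.isdigit())


def _pad(a, b):
    """Right-pad the shorter tuple with zeros so (1, 16) compares equal to (1, 16, 0)."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def satisfies(installed, spec):
    """Check a version against specifiers such as '>=11.8', '11.*', or '<1.16.0,>=1.15.0'."""
    have = parse_version(installed)
    for clause in spec.replace(" ", "").split(","):
        if clause.startswith(">="):
            left, right = _pad(have, parse_version(clause[2:]))
            ok = left >= right
        elif clause.startswith("=="):
            want = parse_version(clause[2:])
            ok = have[:len(want)] == want  # simplification: prefix match
        elif clause.startswith("<") and not clause.startswith("<="):
            left, right = _pad(have, parse_version(clause[1:]))
            ok = left < right
        elif clause.endswith(".*"):
            want = parse_version(clause[:-2])
            ok = have[:len(want)] == want
        else:
            raise ValueError(f"unsupported specifier: {clause!r}")
        if not ok:
            return False
    return True


def check_installed(requirements):
    """Map each package to True/False, or None when the package is not installed."""
    results = {}
    for name, spec in requirements.items():
        try:
            results[name] = satisfies(metadata.version(name), spec)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


# A few rows copied from the table above (release 22.01).
TABLE_SAMPLE = {
    "nvidia-cublas": ">=11.8",
    "nvidia-cuda-nvrtc": "11.*",
    "nvidia-tensorboard": "<1.16.0,>=1.15.0",
    "nvidia-tensorflow": "==1.15.5",
}
```

Calling check_installed(TABLE_SAMPLE) reports which of the sampled packages are present and within range in the active Python environment.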

Driver Requirements

Release 22.01 is based on NVIDIA CUDA 11.6.0, which requires NVIDIA Driver release 510 or later.

Software Requirements

The 22.01 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • A CUDNN performance regression can cause slowdowns of up to 15% in certain ResNet models. This will be fixed in a future release.

  • There is a known performance regression affecting UNet Medical 3D model training by up to 23%. This will be addressed in a future release.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that triggers the native segment fallback.

  • Debugging with TF_CPP_MIN_VLOG_LEVEL=3 can result in a segmentation fault while auto-tuning convolution algorithms. This will be fixed in the 22.02 release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 16. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas >=11.7.3.1
NVIDIA CUDA CUPTI nvidia-cublas-cupti >=11.5.57
NVIDIA CUDA NVCC nvidia-cuda-nvcc >=11.5.50
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime >=11.5.50
NVIDIA CUDA cuDNN nvidia-cudnn >=8.3.1.22
NVIDIA CUDA cuFFT nvidia-cufft >=10.6.0.54
NVIDIA CUDA cuRAND nvidia-curand >=10.2.6.48
NVIDIA CUDA cuSOLVER nvidia-cusolver >=11.2.1.48
NVIDIA CUDA cuSPARSE nvidia-cusparse >=11.7.0.31
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin ==1.8.0+nv21.12
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 ==1.8.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod ==0.22.1+nv21.12
NVIDIA CUDA NCCL nvidia-nccl ==2.11.4
TensorBoard lets you watch Tensors Flow nvidia-tensorboard <1.16.0,>=1.15.0
NVIDIA TensorFlow nvidia-tensorflow ==1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt ==8.2.1.8

Driver Requirements

Release 21.12 is based on NVIDIA CUDA 11.5.0, which requires NVIDIA Driver release 495 or later.

Software Requirements

The 21.12 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel release 21.12 is based on 1.15.5.

  • The environment variable TF_DISABLE_REDUCED_PRECISION_REDUCTION=1 can now be set to disable intermediate reductions in lower precision than the requested math type.

  • Patched the following CVEs in TensorFlow 1.15.5: CVE-2021-29571, CVE-2021-29592, CVE-2021-29601, CVE-2021-29608, CVE-2021-29609, CVE-2021-29613, CVE-2021-22876, CVE-2021-22897, CVE-2021-22898, CVE-2021-22901, CVE-2021-37636, CVE-2021-37640, CVE-2021-37642, CVE-2021-37644, CVE-2021-37646, CVE-2021-37653, CVE-2021-37660, CVE-2021-37661, CVE-2021-37668, CVE-2021-37669, CVE-2021-37670, CVE-2021-37672, CVE-2021-37673, CVE-2021-37674, CVE-2021-37675, CVE-2021-37684, CVE-2021-37686, CVE-2021-37690, CVE-2021-37691, CVE-2021-41195, CVE-2021-41196, CVE-2021-41197, CVE-2021-41198, CVE-2021-41199, CVE-2021-41200, CVE-2021-41201, CVE-2021-41202, CVE-2021-41203, CVE-2021-41204, CVE-2021-41206, CVE-2021-41207, CVE-2021-41208, CVE-2021-41213, CVE-2021-41215, CVE-2021-41216, CVE-2021-41217, CVE-2021-41218, CVE-2021-41219, CVE-2021-41221, CVE-2021-41222, CVE-2021-41223, CVE-2021-41224, CVE-2021-41225, CVE-2021-41228, CVE-2021-22922, CVE-2021-22923, CVE-2021-22924, CVE-2021-22925, CVE-2021-22926

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • A CUDNN performance regression can cause slowdowns of up to 15% in certain ResNet models. This will be fixed in a future release.

  • There is a known performance regression affecting UNet Medical 3D model training by up to 23%. This will be addressed in a future release.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that triggers the native segment fallback.

  • TensorFlow Wheel release 21.12 has a known corruption issue in its NVTX profiling markers when using the CUPTI library from CUDA Toolkit version 11.5. An updated CUPTI build, numbered 11.5.57 or higher, in CUDA 11.5 Update 1 will address this issue.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 17. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.7.3.1
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.5.*
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.5.50
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.5.50
NVIDIA CUDA cuDNN nvidia-cudnn 8.3.0.96
NVIDIA CUDA cuFFT nvidia-cufft 10.6.0.54
NVIDIA CUDA cuRAND nvidia-curand 10.2.6.48
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.2.1.48
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.7.0.31
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.7.0+nv21.11
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.7.0
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.22.1+nv21.11
NVIDIA CUDA NCCL nvidia-nccl 2.11.4
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.11
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 8.0.3.4

Driver Requirements

Release 21.11 is based on NVIDIA CUDA 11.5.0, which requires NVIDIA Driver release 495 or later.

Software Requirements

The 21.11 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • A CUDNN performance regression can cause slowdowns of up to 15% in certain ResNet models. This will be fixed in a future release.

  • There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 includes a feature that works around this issue, but TF1 does not include that feature and may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40. This will be fixed in the next version of TensorRT.

  • TF-TRT native segment fallback has a known issue that causes a crash. It occurs when TF-TRT converts a model containing a subgraph that is converted to TensorRT but fails to build at runtime. Instead of falling back to native TensorFlow, TF-TRT crashes. Setting export TF_TRT_OP_DENYLIST="ProblematicOp" can help prevent conversion of the op that triggers the native segment fallback.

  • TensorFlow Wheel release 21.11 has a known corruption issue in its NVTX profiling markers when using the CUPTI library from CUDA Toolkit version 11.5. An updated CUPTI build, numbered 11.5.57 or higher, in CUDA 11.5 Update 1 will address this issue.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 18. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.6.5.2
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.4.120
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.4.120
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.4.108
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.4.15
NVIDIA CUDA cuFFT nvidia-cufft 10.5.2.100
NVIDIA CUDA cuRAND nvidia-curand 10.2.5.120
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.2.0.120
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.6.0.120
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.6.0+nv21.10
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.6.0
NVIDIA DLprof binary installation nvidia-dlprof 1.6.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.3.2.12
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.22.1+nv21.10
NVIDIA CUDA NCCL nvidia-nccl 2.11.4
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.6.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.10
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 8.0.3.4

Driver Requirements

Release 21.10 is based on NVIDIA CUDA 11.4.2 + cuBLAS 11.6.5.2, which requires NVIDIA Driver release 470 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
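
The driver rule in the paragraph above can be expressed as a small check. This is an illustrative sketch using only the thresholds stated for release 21.10; the function name is hypothetical, and it reads the text strictly, so driver branches that the paragraph does not list (for example R465) are treated as unsupported. The CUDA Application Compatibility topic remains the authoritative list. The driver version string itself can be obtained, for example, with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```python
# Minimum driver per release branch for data center GPUs, as listed for 21.10.
DATA_CENTER_MINIMUMS = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
    460: (460, 27),
}

# Any GPU is covered from this driver branch onward.
GENERAL_MINIMUM_BRANCH = 470


def driver_supported(driver_version, data_center_gpu=False):
    """Return True if a driver such as '470.57.02' meets the 21.10 requirement."""
    parts = tuple(int(piece) for piece in driver_version.strip().split("."))
    branch = parts[0]
    if branch >= GENERAL_MINIMUM_BRANCH:
        return True
    if data_center_gpu and branch in DATA_CENTER_MINIMUMS:
        # Compare major.minor against the branch minimum, e.g. (450, 51).
        return parts[:2] >= DATA_CENTER_MINIMUMS[branch]
    return False


print(driver_supported("450.51.06", data_center_gpu=True))
```

Keeping the thresholds in a dictionary makes it straightforward to adapt the check for another release by swapping in that release's numbers.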

Software Requirements

The 21.10 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 21.10 is based on 1.15.5.

  • Improved handling of exp ops in XLA.

  • Enable pointwise row vectorization for small rows in XLA.

  • Integrate latest TF-TRT features for dynamic shape support.

  • Gemm+bias+relu cublasLt based epilogue fusion in XLA. This feature can be enabled by setting the environment variable TF_USE_CUBLASLT=1.
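
One way to try the cublasLt epilogue fusion without changing the current shell is to launch the training script in a child process that carries the variable. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the script path passed to run_with_cublaslt is whatever script you normally run.

```python
import os
import subprocess
import sys


def env_with_flags(flags, base=None):
    """Return a copy of the environment with the flags added; os.environ is not modified."""
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in flags.items()})
    return env


def run_with_cublaslt(script_args):
    """Run a Python script in a child process with TF_USE_CUBLASLT=1 set."""
    return subprocess.run(
        [sys.executable, *script_args],
        env=env_with_flags({"TF_USE_CUBLASLT": 1}),
        check=True,  # raise CalledProcessError if the script exits with an error
    )
```

For example, run_with_cublaslt(["train.py"]) would start a hypothetical train.py with the flag enabled, while the parent process keeps its original environment for comparison runs.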

NVIDIA TensorFlow Wheel Versions

Announcements

  • Starting with the 21.10 release, a beta version of the TensorFlow 1 and 2 containers is available for the ARM SBSA platform. Pulling the Docker image nvcr.io/nvidia/tensorflow:21.10-tf2-py3 on an ARM SBSA machine will automatically fetch the ARM-specific image.

  • The Tensor Core example models are no longer provided in the core container (previously shipped in /workspace/nvidia-examples). These models are still available on GitHub or as model-specific containers from the NVIDIA GPU Cloud (NGC). Some Python packages, included in previous containers to support these example models, have also been removed. Depending on specific use cases, users may need to install some packages that were previously pre-installed.

  • Support for SLURM PMI2 is deprecated and will be removed after the 21.12 release. PMIX is supported by the container, but is not supported by default in SLURM. Users depending on SLURM integration may need to configure SLURM for PMIX in the base OS as appropriate to their OS distribution (for Ubuntu 20.04, the required package is slurm-wlm-basic-plugins).

  • The nvtx-plugins utility package pre-installed in previous releases has been removed. Users depending on nvtx-plugins can install it using:

    pip install nvtx-plugins

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • The OpenSeq2Seq toolkit has been removed from the TensorFlow 1.x container.

  • There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 includes a feature that works around this issue, but TF1 does not include that feature and may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40. This will be fixed in the next version of TensorRT.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 19. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.6.1.51
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.4.120
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.4.120
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.4.108
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.4.15
NVIDIA CUDA cuFFT nvidia-cufft 10.5.2.100
NVIDIA CUDA cuRAND nvidia-curand 10.2.5.120
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.2.0.120
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.6.0.120
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.5.0+nv21.09
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.5.0
NVIDIA DLprof binary installation nvidia-dlprof 1.5.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.3.1.57
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.22.1+nv21.09
NVIDIA CUDA NCCL nvidia-nccl 2.11.4
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.5.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.09
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 8.0.3.0

Driver Requirements

Release 21.09 is based on NVIDIA CUDA 11.4.2, which requires NVIDIA Driver release 470 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 21.09 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Announcements

  • The Tensor Core example models are no longer provided in the core container (previously shipped in /workspace/nvidia-examples). These models are still available on GitHub or as model-specific containers from the NVIDIA GPU Cloud (NGC). Some Python packages, included in previous containers to support these example models, have also been removed. Depending on specific use cases, users may need to install some packages that were previously pre-installed.

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Support for SLURM PMI2 is deprecated and will be removed after the 21.12 release. PMIX is supported by the container, but is not supported by default in SLURM. Users depending on SLURM integration may need to configure SLURM for PMIX in the base OS as appropriate to their OS distribution (for Ubuntu 20.04, the required package is slurm-wlm-basic-plugins).

  • The nvtx-plugins utility package pre-installed in previous releases has been removed. Users depending on nvtx-plugins can install it with pip install nvtx-plugins.

  • For TensorFlow 1.15, TF-TRT inference throughput may regress for certain models by up to 37% compared to the 21.06-tf1 release. This will be fixed in a future release.

  • The OpenSeq2Seq toolkit has been removed from the TensorFlow 1.x container.

  • There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.08 includes a feature that works around this issue, but TF1 21.08 does not include that feature and may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40. This will be fixed in the next version of TensorRT.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 20. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.5.4.8
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.4.100
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.4.100
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.4.108
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.2.26
NVIDIA CUDA cuFFT nvidia-cufft 10.5.1.100
NVIDIA CUDA cuRAND nvidia-curand 10.2.5.100
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.2.0.100
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.6.0.100
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.4.0+nv21.08
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.4.0
NVIDIA DLprof binary installation nvidia-dlprof 1.4.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.2.4.12
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.22.1+nv21.08
NVIDIA CUDA NCCL nvidia-nccl 2.10.3
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.4.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.08
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 8.0.1.6

Driver Requirements

Release 21.08 is based on NVIDIA CUDA 11.4.1, which requires NVIDIA Driver release 470 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 21.08 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 21.08 is based on 1.15.5.

  • Experimental integration of the cuTENSOR library for einsum operations is included in the 21.08-tf2-py3 container. This should improve performance for many einsum operations. To enable it, export TF_ENABLE_CUTENSOR_EINSUM=1.

  • Added XLA feature to de-select compilation candidates based on shape inference. To enable this feature, use the environment variable TF_XLA_DO_NOT_COMPILE_POSSIBLE_DYNAMIC_OPS.

  • Bug fixes for cudaMallocAsync GPU memory allocator.

  • MKL is enabled for better performance in CPU-only workloads. To enable, set OMP_NUM_THREADS to a value >= 1.
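
The environment variables named in the list above need to be in place before TensorFlow is imported. As a hedged sketch, the helper below sets them only when the user has not already exported a value. The helper name is hypothetical, the value "1" for TF_XLA_DO_NOT_COMPILE_POSSIBLE_DYNAMIC_OPS is an assumption (the list does not state one), and "4" for OMP_NUM_THREADS is only an illustrative choice of a value >= 1.

```python
import os

FLAGS_21_08 = {
    "TF_ENABLE_CUTENSOR_EINSUM": "1",
    "TF_XLA_DO_NOT_COMPILE_POSSIBLE_DYNAMIC_OPS": "1",  # assumed value
    "OMP_NUM_THREADS": "4",  # illustrative; any value >= 1 per the list above
}


def apply_flags(flags, environ=os.environ):
    """Set each flag unless it is already defined; return the names that were set."""
    applied = []
    for name, value in flags.items():
        if name not in environ:
            environ[name] = value
            applied.append(name)
    return applied


# Run this at the top of the script, before "import tensorflow".
newly_set = apply_flags(FLAGS_21_08)
print("Flags set by this script:", newly_set)
```

Leaving existing values untouched means a value exported in the shell still takes precedence over the defaults in the script.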

NVIDIA TensorFlow Wheel Versions

Announcements

  • The Tensor Core example models are no longer provided in the core container (previously shipped in /workspace/nvidia-examples). These models are still available on GitHub or as model-specific containers from the NVIDIA GPU Cloud (NGC). Some Python packages, included in previous containers to support these example models, have also been removed. Depending on specific use cases, users may need to install some packages that were previously pre-installed.

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • The OpenSeq2Seq toolkit is deprecated and will be removed starting in the 21.09-tf1-py3 release. This only affects the TensorFlow 1.x release.

  • There is a known issue in TensorRT 8.0 regarding accuracy for a certain case of int8 inferencing on A40 and similar GPUs. The version of TF-TRT in TF2 21.08 includes a feature that works around this issue, but TF1 21.08 does not include that feature and may experience the accuracy drop for a small subset of model/data type/batch size combinations on A40. This will be fixed in the next version of TensorRT.

  • A known regression can reduce the training performance of VGG-16 by up to 12% at certain batch sizes.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 21. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.5.2.43
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.4.65
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.4.48
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.4.43
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.2.26
NVIDIA CUDA cuFFT nvidia-cufft 10.5.0.43
NVIDIA CUDA cuRAND nvidia-curand 10.2.5.43
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.2.0.43
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.6.0.43
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.3.0+nv21.07
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.3.0
NVIDIA DLprof binary installation nvidia-dlprof 1.3.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.2.1.58
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.22.1+nv21.07
NVIDIA CUDA NCCL nvidia-nccl 2.10.3
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.2.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.07
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 8.0.1.6

Driver Requirements

Release 21.07 is based on NVIDIA CUDA 11.4.0, which requires NVIDIA Driver release 470 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.

Software Requirements

The 21.07 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 21.07 is based on 1.15.5.

  • Integrate TensorRT 8 support.

  • Increase GPU memory reservation to avoid OOM errors in some cases.

  • Improve NVTX markers to include XLA cluster names.

  • Fix deadlock in XLA by backporting upstream PR 50280 to TF1 and TF2.

  • Fix bug so that CUDNN now respects TF32 disable switch.

  • TF2 implements support for embedding ops on GPU

    • SparseFillEmptyRows[Grad]

    • fp16 embedding_lookup_sparse

    • fp16 SparseSegmentSumGrad

    • SparseSegmentSum/Mean

    • SparseSegmentSum/MeanGrad

    • hash value to string

  • TF2 - Use CUDA occupancy calculator to improve performance of BiasAdd

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • TF1 and TF2 containers include a version of Django with a known vulnerability that was discovered late in our QA process. See CVE-2021-35042 for details. This will be fixed in the next release.

  • A known regression can reduce the training performance of VGG-16 by up to 12% at certain batch sizes.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 22. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.5.1.109
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.3.111
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.3.109
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.3.109
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.1.32
NVIDIA CUDA cuFFT nvidia-cufft 10.4.2.109
NVIDIA CUDA cuRAND nvidia-curand 10.2.4.109
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.1.2.109
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.6.0.109
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.2.0+nv21.06
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.2.0
NVIDIA DLprof binary installation nvidia-dlprof 1.2.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.2.1.58
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.22.0+nv21.06
NVIDIA CUDA NCCL nvidia-nccl 2.9.9
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.2.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.06
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.3.4
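
One way to confirm that an environment matches Table 22 is to compare the installed package versions against the pinned values. The following is a minimal sketch, not an NVIDIA-provided tool. It assumes Python 3.8 or later (for importlib.metadata), copies only three rows from the table, and treats a version as matching when it starts with the pinned string, on the assumption that installed wheels may carry a local suffix such as +nv21.06. The helper name check_pins is hypothetical.

```python
from importlib import metadata

# A few pins copied from Table 22; extend with the remaining rows as needed.
EXPECTED_PINS = {
    "nvidia-tensorflow": "1.15.5",
    "nvidia-cudnn": "8.2.1.32",
    "nvidia-nccl": "2.9.9",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None, matches)} for each pin."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Prefix matching is an assumption made for this sketch: it tolerates
        # local version suffixes but is looser than an exact comparison.
        matches = installed is not None and installed.startswith(expected)
        report[name] = (expected, installed, matches)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins(EXPECTED_PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, installed {installed} [{status}]")
```

A package that is not installed is reported with an installed value of None rather than raising an error, so the report can be printed in full before deciding how to react.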

Driver Requirements

Release 21.06 is based on NVIDIA CUDA 11.3.1, which requires NVIDIA Driver release 465.19.01 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
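
The driver rule above has two branches, which the following sketch spells out for release 21.06. It is illustrative only: the function names are hypothetical, the phrase "450.51 (or later R450)" is interpreted as "same major branch, at or above the stated minimum", and the check does not replace the complete list of supported drivers in the CUDA Application Compatibility topic.

```python
def parse_version(text):
    """Turn '465.19.01' into (465, 19, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Minimums stated for release 21.06.
DEFAULT_MINIMUM = parse_version("465.19.01")
DATA_CENTER_BRANCH_MINIMUMS = [
    parse_version(v) for v in ("418.40", "440.33", "450.51", "460.27")
]


def driver_is_supported(driver_version, data_center_gpu=False):
    """Apply the 21.06 driver rule to a driver version string."""
    version = parse_version(driver_version)
    if version >= DEFAULT_MINIMUM:
        return True
    if data_center_gpu:
        # Assumed reading of "(or later R450)": the driver must be on the
        # same branch as the minimum and at or above it.
        return any(
            version[0] == minimum[0] and version >= minimum
            for minimum in DATA_CENTER_BRANCH_MINIMUMS
        )
    return False
```

For example, a 450.80.02 driver is accepted only when data_center_gpu is True, while a 455-branch driver is rejected in both cases because that branch is not listed for this release.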

Software Requirements

The 21.06 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 21.06 is based on 1.15.5.
  • Fixed bug that caused XLA to initialize TensorFlow on all visible GPUs leading to OOM errors in Horovod and other multi-process configurations.

  • Fixed bug in FakeQuantizeAndDequantize op that would result in non-symmetric quantization when max=-min.

  • Implemented GPU kernels for ops common in recommender model input pipelines: SparseApplyFtrl, [Sparse]ApplyProximalAdagrad, SparseReshape, SparseToDense.

  • Vectorized GPU Gather op to improve performance.

  • Introduced env var TF_CPP_VLOG_FILENAME to direct VLOG output to a file.

  • Improved CUDNN kernel selection by switching to CUDNN_HEUR_B kernel selector.

  • Updated tensorflow-addons to r0.13.

  • Added support for FusedBatchNormGrad op to optimize side-inputs and activations.

  • Patched recently announced vulnerabilities in TF 1.15.5: CVE-2021-29591, CVE-2021-29605, CVE-2021-29606, CVE-2021-29614.
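
As an illustration of the TF_CPP_VLOG_FILENAME variable listed above, the sketch below sets it from Python. The log path and the helper name configure_vlog are hypothetical. The sketch also sets TF_CPP_MIN_VLOG_LEVEL on the assumption that VLOG output is wanted at level 1, and it assumes the safest place to set both variables is before TensorFlow is imported.

```python
import os


def configure_vlog(log_path, vlog_level=1):
    """Set logging variables; call this before TensorFlow is imported."""
    # Directs VLOG output to a file, per the 21.06 release note.
    os.environ["TF_CPP_VLOG_FILENAME"] = log_path
    # Controls how verbose the VLOG output is.
    os.environ["TF_CPP_MIN_VLOG_LEVEL"] = str(vlog_level)


configure_vlog("/tmp/tf_vlog.txt", vlog_level=1)
# import tensorflow as tf  # import only after the variables are set
```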

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • In certain cases, TensorFlow may claim too much memory on Pascal-based GPUs leading to failures due to OOM and potentially an application hang. This can be worked around by setting the environment variable TF_DEVICE_MIN_SYS_MEMORY_IN_MB to 675. This will be fixed in the 21.07 release.

  • A known regression can reduce the training performance of VGG-16 by up to 12% at certain batch sizes.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
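
The Pascal memory workaround in the first item can be applied from a training script as well as from the shell. The following is a minimal sketch with a hypothetical helper name; it assumes the variable should be in place before TensorFlow is imported and keeps any value that was already exported.

```python
import os


def reserve_system_memory(megabytes=675, environ=os.environ):
    """Set TF_DEVICE_MIN_SYS_MEMORY_IN_MB unless it is already defined."""
    # setdefault keeps a value that was exported in the shell beforehand.
    environ.setdefault("TF_DEVICE_MIN_SYS_MEMORY_IN_MB", str(megabytes))
    return environ["TF_DEVICE_MIN_SYS_MEMORY_IN_MB"]


reserve_system_memory()
# import tensorflow as tf  # import after the variable is in place
```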

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 23. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.5.1.101
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.3.58
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.3.58
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.3.58
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.0.51
NVIDIA CUDA cuFFT nvidia-cufft 10.4.2.58
NVIDIA CUDA cuRAND nvidia-curand 10.2.4.58
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.1.1.58
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.5.0.58
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.0.0+nv21.05
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.0.0
NVIDIA DLprof binary installation nvidia-dlprof 1.1.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.2.1.58
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.21.3+nv21.05
NVIDIA CUDA NCCL nvidia-nccl 2.9.8
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.2.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.4
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.3.4

Driver Requirements

Release 21.05 is based on NVIDIA CUDA 11.3.0, which requires NVIDIA Driver release 465.19.01 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.

Software Requirements

The 21.05 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Using XLA together with Horovod to parallelize training on a single node can result in out-of-memory errors. This will be fixed in a future release. As a workaround, execute the job as follows:

    XLA_FLAGS=--xla_multiheap_size_constraint_per_heap=2000000000 TF_NUM_INTEROP_THREADS=1 horovodrun -np 8 bash -c 'CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK python ...'

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

  • There is a known cuDNN performance regression of up to 45% affecting certain batch sizes of VGG-based models. This will be fixed in a later release.
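
The XLA and Horovod workaround in the first item can also be assembled from a Python launcher. The sketch below is one possible arrangement, not part of the release: the helper name build_horovod_launch and the script name train.py are hypothetical, and the sketch overwrites any XLA_FLAGS value already present in the environment.

```python
import os
import shlex


def build_horovod_launch(num_procs, training_command, base_env=None):
    """Return (argv, env) for the single-node XLA + Horovod workaround."""
    env = dict(os.environ if base_env is None else base_env)
    env["XLA_FLAGS"] = "--xla_multiheap_size_constraint_per_heap=2000000000"
    env["TF_NUM_INTEROP_THREADS"] = "1"
    # The variable reference stays literal here so that bash expands it
    # separately inside each rank, giving every process its own GPU.
    per_rank = "CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK " + training_command
    argv = ["horovodrun", "-np", str(num_procs), "bash", "-c", per_rank]
    return argv, env


argv, env = build_horovod_launch(8, "python train.py", base_env={})
print(shlex.join(argv))
# To launch for real, build with base_env=None and then run:
# subprocess.run(argv, env=env, check=True)
```

Passing the per-rank command as a single argument to bash -c plays the same role as the single quotes in the shell form: the rank variable is resolved by each worker rather than by the launching shell.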

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 24. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.5.1.101
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.3.58
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.3.58
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.3.58
NVIDIA CUDA cuDNN nvidia-cudnn 8.2.0.41
NVIDIA CUDA cuFFT nvidia-cufft 10.4.2.58
NVIDIA CUDA cuRAND nvidia-curand 10.2.4.58
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.1.1.58
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.5.0.58
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 1.0.0+nv21.04
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 1.0.0
NVIDIA DLprof binary installation nvidia-dlprof 1.1.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2021.2.1.58
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.21.3+nv21.04
NVIDIA CUDA NCCL nvidia-nccl 2.9.6
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 1.2.0
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.4
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.3.4

Driver Requirements

Release 21.04 is based on NVIDIA CUDA 11.3.0, which requires NVIDIA Driver release 465.19.01 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), or 460.27 (or later R460). The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.

Software Requirements

The 21.04 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 21.04 is based on 1.15.5.
  • Added GPU-deterministic tf.sparse.sparse_dense_matmul support (for the tf.float32 data type). When TF_DETERMINISTIC_OPS is set to "true" or "1", tf.sparse.sparse_dense_matmul operates deterministically in both the forward and backward directions.
  • Integrated CUDNN v8 API for RNN and fused conv+bias+activation ops.
  • Fixed an issue that caused OOM errors in some cases when using a batch size of 1.
  • Improved XLA handling of dynamic ops to avoid frequent recompilation.
  • Implemented XLA persistent cache.
  • Implemented custom learning rate support in Horovod.
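
To illustrate the TF_DETERMINISTIC_OPS switch described above, the following sketch enables it from Python and checks the setting. The helper name deterministic_ops_enabled is hypothetical and only mirrors the two values named in this document; it does not inspect TensorFlow itself. The commented lines assume tensors named sparse_a and dense_b that are not defined here.

```python
import os


def deterministic_ops_enabled(environ=os.environ):
    """True when TF_DETERMINISTIC_OPS holds one of the values named above."""
    return environ.get("TF_DETERMINISTIC_OPS", "") in ("true", "1")


# Set the switch before TensorFlow is imported.
os.environ["TF_DETERMINISTIC_OPS"] = "1"
print(deterministic_ops_enabled())

# import tensorflow as tf
# product = tf.sparse.sparse_dense_matmul(sparse_a, dense_b)  # tf.float32 inputs
```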

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Using XLA together with Horovod to parallelize training on a single node can result in out-of-memory errors. This will be fixed in a future release. As a workaround, execute the job as follows:

    XLA_FLAGS=--xla_multiheap_size_constraint_per_heap=2000000000 TF_NUM_INTEROP_THREADS=1 horovodrun -np 8 bash -c 'CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK python ...'

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

  • There is a known cuDNN performance regression of up to 45% affecting certain batch sizes of VGG-based models. This will be fixed in a later release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 25. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.4.1.1026
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.2.135
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.2.142
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.2.146
NVIDIA CUDA cuDNN nvidia-cudnn 8.1.1.33
NVIDIA CUDA cuFFT nvidia-cufft 10.4.0.135
NVIDIA CUDA cuRAND nvidia-curand 10.2.3.135
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.1.0.135
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.4.0.135
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.31.0+nv21.03
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 0.31.0
NVIDIA DLprof binary installation nvidia-dlprof 1.0.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.21.3+nv21.03
NVIDIA CUDA NCCL nvidia-nccl 2.8.4
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 0.12
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.3
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.2.3

Driver Requirements

Release 21.03 is based on NVIDIA CUDA 11.2.1, which requires NVIDIA Driver release 460.32.03 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.xx, 440.30, 450.51, or 455.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.

Software Requirements

The 21.03 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 21.03 is based on 1.15.5.
  • CUDA 11.2.1 including cuBLAS 11.4.1.1026
  • NVIDIA cuDNN 8.1.1
  • DALI 0.31.0
  • DLProf 1.0.0
  • Ubuntu 20.04
  • NVTX profiling annotation ranges more accurately report the execution of asynchronous operations. Note that NVTX ranges must now be explicitly enabled when profiling by setting the environment variable TF_ENABLE_NVTX_RANGES=1.
  • The CUDNN backend API is now used for convolutional ops. This provides a significant performance benefit by reducing CPU overheads of convolutions.
  • The fused Conv+Bias+Relu op regression in CUDNN has been fixed and this op has been re-enabled in both XLA and the TF grappler optimizers. This improves performance particularly for inference in convolutional models.
  • Bugs relating to auto-graph in TensorFlow 1.15 with Python 3.8 were fixed.
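
Because NVTX ranges are now opt-in, a profiling wrapper needs to pass TF_ENABLE_NVTX_RANGES to the process it launches. The sketch below shows one way to prepare that environment; the helper name profiling_env and the script name train.py are hypothetical.

```python
import os


def profiling_env(base_env=None):
    """Copy the environment and switch on NVTX range annotations."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_ENABLE_NVTX_RANGES"] = "1"
    return env


env = profiling_env()
# Pass env to the process being profiled, for example:
# subprocess.run(["python", "train.py"], env=env, check=True)
```

Copying the environment instead of modifying os.environ keeps the setting scoped to the profiled run.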

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • Using XLA together with Horovod to parallelize training on a single node can result in out-of-memory errors. This will be fixed in a future release. As a workaround, execute the job as follows:

    XLA_FLAGS=--xla_multiheap_size_constraint_per_heap=2000000000 TF_NUM_INTEROP_THREADS=1 horovodrun -np 8 bash -c 'CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK python ...'

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

  • Training the UNET3D models with a batch size of 1 can result in OOM (out-of-memory) errors in the TensorFlow 1 container. This is caused by the map_and_batch_fusion optimization in tf.data. One workaround is to add:

    if self._batch_size == 1:
        options = dataset.options()
        options.experimental_optimization.map_and_batch_fusion = False
        dataset = dataset.with_options(options)
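
The workaround above can be wrapped in a small helper so that it is easy to call from an input pipeline. This is a sketch under assumptions: the function name disable_map_and_batch_fusion is hypothetical, and dataset is expected to be a tf.data.Dataset (or any object that provides options() and with_options()).

```python
def disable_map_and_batch_fusion(dataset, batch_size):
    """Return dataset with map_and_batch_fusion turned off when batch_size is 1."""
    if batch_size != 1:
        # Larger batch sizes are not affected, so leave the dataset untouched.
        return dataset
    options = dataset.options()
    options.experimental_optimization.map_and_batch_fusion = False
    return dataset.with_options(options)
```

A call such as dataset = disable_map_and_batch_fusion(dataset, self._batch_size) would then replace the inline conditional.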

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 26. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.3.1.68
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.2.67
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.2.67
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.2.72
NVIDIA CUDA cuDNN nvidia-cudnn 8.1.0.77
NVIDIA CUDA cuFFT nvidia-cufft 10.4.0.72
NVIDIA CUDA cuRAND nvidia-curand 10.2.3.68
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.0.2.68
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.3.1.68
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.29.0+nv21.02
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 0.29.0
NVIDIA DLprof binary installation nvidia-dlprof 0.19.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.21.0+nv21.02
NVIDIA CUDA NCCL nvidia-nccl 2.8.4
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 0.11
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv21.2
NVIDIA TensorFlow nvidia-tensorflow 1.15.5
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.2.3

Driver Requirements

Release 21.02 is based on NVIDIA CUDA 11.2.0, which requires NVIDIA Driver release 460.27.04 or later. However, if you are running on Data Center GPUs (formerly Tesla), for example, T4, you may use NVIDIA driver release 418.xx, 440.30, 450.51, or 455.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.

Software Requirements

The 21.02 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, please refer to the XLA Best Practices document. It offers pointers on how to diagnose symptoms and possibly address them.

  • A regression (only observed with NVIDIA Ampere GPU architecture) in cuDNN’s fused Convolution+Bias+Activation implementation can cause performance regressions of up to 24% in some models such as UNet Medical. This will be fixed in a future cuDNN release.

  • Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.

  • A few models see performance regressions compared to the 20.08 release. Training WideAndDeep sees regressions of up to 30% on A100. In FP32, the TF1 UNet Industrial and BERT fine-tuning training regress by 10-20%. The TF2 UNet Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.

  • There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. Googlenet can be up to 20% slower on V100. And ResNet50 inferencing can be up to 30% slower on A100 and Turing GPUs.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

  • Training the UNET3D models with a batch size of 1 can result in OOM (out-of-memory) errors in the TensorFlow 1 container. This is caused by the map_and_batch_fusion optimization in tf.data. One workaround is to add:

    if self._batch_size == 1:
        options = dataset.options()
        options.experimental_optimization.map_and_batch_fusion = False
        dataset = dataset.with_options(options)

The NVIDIA container image release for TensorFlow Wheel 21.01 has been canceled. The next release will be the 21.02 release which is expected to be released at the end of February.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 27. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.3.0.106
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.1.105
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.1.105
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.1.74
NVIDIA CUDA cuDNN nvidia-cudnn 8.0.5.43
NVIDIA CUDA cuFFT nvidia-cufft 10.3.0.105
NVIDIA CUDA cuRAND nvidia-curand 10.2.2.105
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.0.1.105
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.3.0.10
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.28.0+nv20.12
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 0.28.0
NVIDIA DLprof binary installation nvidia-dlprof 0.18.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.20.2+nv20.12
NVIDIA CUDA NCCL nvidia-nccl 2.8.3
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 0.10
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv20.12
NVIDIA TensorFlow nvidia-tensorflow 1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.2.1

Driver Requirements

Release 20.12 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx, 440.xx, or 450.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.12 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

  • In certain cases, running on Pascal GPUs may result in out-of-memory errors, which may present as apparent job hangs. This can be worked around by exporting the following environment variable:

    TF_DEVICE_MIN_SYS_MEMORY_IN_MB=550

  • A regression in cuDNN’s fused Convolution+Bias+Activation implementation can cause performance regressions of up to 24% in some models such as UNet Medical. This will be fixed in a future cuDNN release.

  • Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.

  • A few models see performance regressions compared to the 20.08 release. Training WideAndDeep sees regressions of up to 30% on A100. In FP32, the TF1 UNet Industrial and BERT fine-tuning training regress by 10-20%. The TF2 UNet Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.

  • There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. Googlenet can be up to 20% slower on V100. And ResNet50 inferencing can be up to 30% slower on A100 and Turing GPUs.

  • An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

  • Training the UNET3D models with a batch size of 1 can result in OOM (out-of-memory) errors in the TensorFlow (TF1) container.
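
The lazy-compilation setting described in the Horovod and XLA item above can also be applied from Python. The following is a minimal sketch with a hypothetical helper name. It appends the flag to any TF_XLA_FLAGS value that is already set instead of overwriting it, and it assumes the variable is set before TensorFlow is imported.

```python
import os


def add_tf_xla_flag(flag, environ=os.environ):
    """Append flag to TF_XLA_FLAGS without dropping flags that are already set."""
    current = environ.get("TF_XLA_FLAGS", "").split()
    if flag not in current:
        current.append(flag)
    environ["TF_XLA_FLAGS"] = " ".join(current)
    return environ["TF_XLA_FLAGS"]


add_tf_xla_flag("--tf_xla_enable_lazy_compilation=false")
# import tensorflow as tf  # import after the flag is in place
```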

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 28. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.2.1.74
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.1.69
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.1.74
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.1.74
NVIDIA CUDA cuDNN nvidia-cudnn 8.0.4.30
NVIDIA CUDA cuFFT nvidia-cufft 10.3.0.74
NVIDIA CUDA cuRAND nvidia-curand 10.2.2.74
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.0.0.74
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.2.0.275
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.27.0+nv20.11
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 0.27.0
NVIDIA DLprof binary installation nvidia-dlprof 0.17.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.20.2+nv20.11
NVIDIA CUDA NCCL nvidia-nccl 2.8.2
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 0.9
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv20.11
NVIDIA TensorFlow nvidia-tensorflow 1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.1.6

Driver Requirements

Release 20.11 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx, 440.xx, or 450.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.11 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 20.11 is based on 1.15.4.
  • CUDA 11.1.0
  • cuDNN 8.0.4
  • TensorRT 7.2.1
  • DALI 0.27
  • DLProf 0.17.0

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

  • In certain cases running on Pascal GPUs may result in out-of-memory errors which may present as apparent job hangs. This can be worked around by exporting the following environment variable:

    TF_DEVICE_MIN_SYS_MEMORY_IN_MB=550

  • Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.

  • A few models see performance regressions compared to the 20.08 release. Training WideAndDeep regresses by up to 30% on A100. In FP32, TF1 Unet Industrial and Bert fine-tuning training regress by 10-20%. The TF2 Unet Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.

  • There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. Googlenet can be up to 20% slower on V100. ResNet50 inference can be up to 30% slower on A100 and Turing GPUs.

  • An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
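
The two environment-variable workarounds in the list above can also be applied from a Python launcher script. The following is a minimal sketch, assuming the variables are read when TensorFlow starts up, so it should run before `import tensorflow`; the helper name `apply_workarounds` is hypothetical, and the values are copied from the list above.

```python
import os

# Workaround values copied from the Known Issues list above.
WORKAROUNDS = {
    # Pascal out-of-memory errors that can look like job hangs.
    "TF_DEVICE_MIN_SYS_MEMORY_IN_MB": "550",
    # Out-of-memory condition when Horovod and XLA are both in use.
    "TF_XLA_FLAGS": "--tf_xla_enable_lazy_compilation=false",
}


def apply_workarounds(env=os.environ, names=WORKAROUNDS):
    """Set each workaround variable unless it is already set.

    Returns a dict of the variables this call actually added, so a
    launcher can log what it changed.
    """
    applied = {}
    for name, value in names.items():
        if name not in env:
            env[name] = value
            applied[name] = value
    return applied
```

Calling `apply_workarounds()` with no arguments modifies the current process environment; passing a plain dict lets you inspect the result without side effects. Values the user has already set are left untouched, which is a design choice of this sketch rather than something the release notes prescribe.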

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 29. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.2.1.74
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.1.69
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.1.74
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.1.74
NVIDIA CUDA cuDNN nvidia-cudnn 8.0.4.30
NVIDIA CUDA cuFFT nvidia-cufft 10.3.0.74
NVIDIA CUDA cuRAND nvidia-curand 10.2.2.74
NVIDIA CUDA cuSOLVER nvidia-cusolver 11.0.0.74
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.2.0.275
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.26.0+nv20.10
NVIDIA DALI for CUDA 11.0 nvidia-dali-cuda110 0.26.0
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.25.1+nv20.09
NVIDIA DLprof binary installation nvidia-dlprof 0.16.0
NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs; from large server to our smallest SoC nvidia-nsys-cli 2020.4.1.117
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.20.0+nv20.10
NVIDIA CUDA NCCL nvidia-nccl 2.7.8
DLprof TensorBoard plugin nvidia-tensorboard-plugin-dlprof 0.8
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv20.10
NVIDIA TensorFlow nvidia-tensorflow 1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.2.1.4
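
To compare an existing environment against the table above, a check along these lines may help. It is a sketch under assumptions: the helper names are hypothetical, only a few rows of the table are copied in, and it assumes the installed distributions report exactly the version strings listed (local suffixes such as `+nv20.10` included).

```python
from importlib import metadata

# A few pins copied from Table 29 (20.10); extend as needed.
EXPECTED_20_10 = {
    "nvidia-cudnn": "8.0.4.30",
    "nvidia-nccl": "2.7.8",
    "nvidia-tensorrt": "7.2.1.4",
    "nvidia-horovod": "0.20.0+nv20.10",
}


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map package -> (expected, found) for every pin that does not match."""
    mismatches = {}
    for package, want in expected.items():
        found = lookup(package)
        if found != want:
            mismatches[package] = (want, found)
    return mismatches
```

`find_mismatches(EXPECTED_20_10)` returns an empty dict when every listed package matches; a missing package shows up with `None` as the found version.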

Driver Requirements

Release 20.10 is based on NVIDIA CUDA 11.1.0, which requires NVIDIA Driver release 455 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.10 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 20.10 is based on 1.15.4.
  • CUDA 11.1.0
  • cuDNN 8.0.4
  • TensorRT 7.2.1
  • DALI 0.26
  • DLProf 0.16.0

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

  • Some image-based inference workloads see a regression of up to 50% for the smallest batch sizes. This is due to regressions in cuDNN 8.0.4, which will be addressed in a future release.
  • A few models see performance regressions compared to the 20.08 release. Training WideAndDeep regresses by up to 30% on A100. In FP32, TF1 Unet Industrial and Bert fine-tuning training regress by 10-20%. The TF2 Unet Medical and MaskRCNN models also regress by about 20% in some cases. These regressions will be addressed in a future release.
  • There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. Googlenet can be up to 20% slower on V100. ResNet50 inference can be up to 30% slower on A100 and Turing GPUs.
  • An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100 and Turing GPUs. This will be addressed in a future release.

  • There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
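
If `TF_XLA_FLAGS` already carries other flags, assigning the workaround value from the list above would overwrite them. The sketch below merges the setting instead. It assumes `TF_XLA_FLAGS` holds whitespace-separated flags; the function name is hypothetical.

```python
LAZY_FLAG = "--tf_xla_enable_lazy_compilation"


def with_lazy_compilation_disabled(current: str) -> str:
    """Return a TF_XLA_FLAGS value that disables lazy compilation.

    Other flags already present are kept in order; any existing setting
    of the lazy-compilation flag is dropped and replaced with '=false'.
    """
    kept = [
        flag for flag in current.split()
        if flag != LAZY_FLAG and not flag.startswith(LAZY_FLAG + "=")
    ]
    kept.append(LAZY_FLAG + "=false")
    return " ".join(kept)
```

A launcher could then set `os.environ["TF_XLA_FLAGS"] = with_lazy_compilation_disabled(os.environ.get("TF_XLA_FLAGS", ""))` before TensorFlow is imported.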

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 30. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
NVIDIA CUDA cuBLAS nvidia-cublas 11.2.0.252
NVIDIA CUDA CUPTI nvidia-cublas-cupti 11.0.221
NVIDIA CUDA NVCC nvidia-cuda-nvcc 11.0.221
NVIDIA CUDA NVRTC nvidia-cuda-nvrtc 11.0.*
NVIDIA CUDA Runtime nvidia-cuda-runtime 11.0.221
NVIDIA CUDA cuDNN nvidia-cudnn 8.0.4.12
NVIDIA CUDA cuFFT nvidia-cufft 10.2.1.245
NVIDIA CUDA cuRAND nvidia-curand 10.2.1.245
NVIDIA CUDA cuSOLVER nvidia-cusolver 10.6.0.245
NVIDIA CUDA cuSPARSE nvidia-cusparse 11.1.1.245
NVIDIA DALI for CUDA 11.0 nvidia-dali 0.25.1
NVIDIA DALI TensorFlow Plugin for CUDA 11.0 nvidia-dali-nvtf-plugin 0.25.1+nv20.09
Distributed training framework for TensorFlow, Keras, and PyTorch nvidia-horovod 0.19.2+nv20.09
NVIDIA CUDA NCCL nvidia-nccl 2.7.8
TensorBoard lets you watch Tensors Flow nvidia-tensorboard 1.15.0+nv20.09
NVIDIA TensorFlow nvidia-tensorflow 1.15.3
NVIDIA TensorRT, a high-performance deep learning inference library nvidia-tensorrt 7.1.3.4

Driver Requirements

Release 20.09 is based on NVIDIA CUDA 11.0.3, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.xx. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.09 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 20.09 is based on 1.15.3.
  • NVIDIA cuDNN 8.0.4
  • DALI 0.25

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

  • There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. Googlenet can be up to 20% slower on V100. ResNet50 inference can be up to 30% slower on A100 and Turing GPUs.
  • An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100 and Turing GPUs. This will be addressed in a future release.

  • There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 31. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
nvidia-cublas 11.2.0.252
nvidia-cublas-cupti 11.0.221
nvidia-cuda-nvcc 11.0.221
nvidia-cuda-nvrtc 11.0.*
nvidia-cuda-runtime 11.0.221
nvidia-cudnn 8.0.2.39
nvidia-cufft 10.2.1.245
nvidia-curand 10.2.1.245
nvidia-cusolver 10.6.0.245
nvidia-cusparse 11.1.1.245
nvidia-dali 0.24
nvidia-dali-tf-plugin 0.24
nvidia-horovod 0.19.5
nvidia-nccl 2.7.8
nvidia-tensorflow 1.15.3
nvidia-tensorrt 7.1.3.4
nvidia-cuda-nvrtc 11.0

Driver Requirements

Release 20.08 is based on NVIDIA CUDA 11.0.3, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.08 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 20.08 is based on 1.15.3.
  • Ubuntu 18.04 with July 2020 updates

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

  • The memory required to train MaskRCNN with a given batch size has increased from 20.07 to 20.08. As a result, the batch size may need to be decreased.

  • There are several known performance regressions compared to 20.07. UNet Medical and Industrial on V100 and A100 GPUs can be up to 20% slower. VGG can be up to 95% slower on A100 and 15% slower on Turing GPUs. Googlenet can be up to 20% slower on V100. ResNet50 inference can be up to 30% slower on A100 and Turing GPUs.
  • An out-of-memory condition can occur in TensorFlow (TF1) 20.08 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • There is a known performance regression of 15% compared to the 20.03 release when training the JoC V-Net Medical models with small batch size and fp32 data type on Turing GPUs. This will be addressed in a future release.

  • There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.

  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.

  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 32. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
nvidia-cublas 11.1.0
nvidia-cublas-cupti 11.0.194
nvidia-cuda-nvcc 11.0.194
nvidia-cuda-nvrtc 11.0.194
nvidia-cuda-runtime 11.0.194
nvidia-cudnn 8.0.1
nvidia-cufft 10.2.0.218
nvidia-curand 10.2.1.218
nvidia-cusolver 10.5.0.218
nvidia-cusparse 11.1.0.218
nvidia-dali 0.23
nvidia-dali-tf-plugin 0.23
nvidia-horovod 0.19.5
nvidia-nccl 2.7.6
nvidia-tensorflow 1.15.3
nvidia-tensorrt 7.1.3

Driver Requirements

Release 20.07 is based on NVIDIA CUDA 11.0.194, which requires NVIDIA Driver release 450 or later. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.07 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 20.07 is based on 1.15.3.
  • Improved XLA to avoid excessive recompilations
  • Enhancements for Automatic Mixed Precision with einsum, 3D Convolutions, and list operations
  • Improved 3D Convolutions to support NDHWC format
  • Default TF32 support

NVIDIA TensorFlow Wheel Versions

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. These models are tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following tensor core examples.

Known Issues

  • There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100. This will be addressed in a future release.

  • An out-of-memory condition can occur in TensorFlow (TF1) 20.07 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the "XLA Best Practices" section of the TensorFlow User Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • There is a known performance regression of 15% compared to the 20.03 release when training the JoC V-Net Medical models with small batch size and fp32 data type on Turing GPUs. This will be addressed in a future release.
  • There is a known performance regression of up to 60% when running inference using TF-TRT for SSD models with small batch size. This will be addressed in a future release.
  • There is a known performance regression of up to 30% when training SSD models with fp32 data type on T4 GPUs. This will be addressed in a future release.
  • There is a known issue where attempting to convert some models using TF-TRT produces an error "Failed to import metagraph". This issue is still under investigation and will be resolved in a future release.
  • There is a known performance regression of 5-15% on the VAE-CF model when using the pip wheel compared to the corresponding NGC Docker container. This will be addressed in a future release.

Dependencies of NVIDIA TensorFlow Wheel

This installation of the NVIDIA TensorFlow Wheel will include several other components from NVIDIA.

Table 33. TensorFlow Wheel compatibility with NVIDIA components
NVIDIA Product Version
nvidia-cublas 11.1.0.213
nvidia-cublas-cupti 11.0.167
nvidia-cuda-nvcc 11.0.167
nvidia-cuda-nvrtc 11.0.167
nvidia-cuda-runtime 11.0.167
nvidia-cudnn 8.0.1.13
nvidia-cufft 10.1.3.191
nvidia-curand 10.2.0.191
nvidia-cusolver 10.4.0.191
nvidia-cusparse 11.0.0.191
nvidia-dali 0.22.0
nvidia-dali-tf-plugin 0.22.0
nvidia-horovod 0.19.1
nvidia-nccl 2.7.5
nvidia-tensorflow 1.15.2+nv20.06
nvidia-tensorrt 7.1.2.8

Driver Requirements

Release 20.06 is based on NVIDIA CUDA 11.0.167, which requires NVIDIA Driver release 450.36. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.

Software Requirements

The 20.06 release of TensorFlow Wheel requires the following software to be installed:

Key Features and Enhancements

This TensorFlow Wheel release includes the following key features and enhancements.

  • TensorFlow Wheel version 20.06 is based on TensorFlow 1.15.2.
  • Integrated latest NVIDIA Deep Learning SDK to support NVIDIA A100 using CUDA 11 and cuDNN 8
  • Improved NVTX annotations for XLA clusters for use with DLProf
  • Improved XLA to avoid excessive recompilations
  • Enhancements for Automatic Mixed Precision with einsum, 3D Convolutions, and list operations
  • Improved 3D Convolutions to support NDHWC format
  • Default TF32 support
  • Ubuntu 18.04 with May 2020 updates

NVIDIA TensorFlow Wheel Versions

20.06 is the first release of the TensorFlow Wheel.

Tensor Core Examples

The tensor core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on A100 and Volta architectures, so you can get results much faster than training without Tensor Cores. These models are tested against each NVIDIA Optimized Frameworks monthly container release to ensure consistent accuracy and performance over time.

Known Issues

  • An out-of-memory condition can occur in TensorFlow (TF1) 20.06 for some models (such as ResNet-50, and ResNext) when Horovod and XLA are both in use. In XLA in TensorFlow 20.05, we added an optimization that skips compiling a cluster the very first time it is executed, which can help avoid unnecessary recompilations for models with dynamic shapes. On the other hand, for models like ResNet-50, the preferred compilation strategy is to aggressively compile clusters, as compiled clusters are executed many times. Per the XLA Best Practices Guide, running XLA with the following environment variable opts in to that strategy:

    TF_XLA_FLAGS=--tf_xla_enable_lazy_compilation=false

  • TensorFlow Wheel 20.06 NCF will not work until CuPy is updated to support CUDA 11.

  • There is a known performance regression of 10 to 30% compared to the 20.03 release when training the JoC V-Net Medical and U-Net Industrial models with small batch size on V100. This will be addressed in a future release.

© Copyright 2024, NVIDIA. Last updated on Apr 5, 2024.