TensorFlow Release 24.07

The NVIDIA container image of TensorFlow, release 24.07, is available on NGC.

Note:

Deprecation notice: As of the 23.04 release, TF1 is no longer released monthly. Known issues may be resolved in a future release based on customer demand.

Contents of the TensorFlow container

This container image includes the complete source of the NVIDIA version of TensorFlow in /opt/tensorflow. It is prebuilt and installed as a system Python module.

To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates the efficient training of convolutional neural networks (CNNs). The sample script might need to be modified to fit your application.

The container also includes the following:

Ubuntu 22.04

Note:

The 24.07-tf2-py3 container image contains Python 3.10.6.
NVIDIA CUDA 12.5.1
NVIDIA cuBLAS 12.5.3.2
cuTENSOR 2.0.2.4
NVIDIA cuDNN 9.2.1.18
NVIDIA NCCL 2.22.3
NVIDIA RAPIDS™ 24.04
Horovod 0.28.1
OpenMPI 4.1.7
OpenUCX 1.15.0
SHARP 3.0.2
GDRCopy 2.3
NVIDIA HPC-X 2.19
TensorBoard 2.13.0
rdma-core 39.0
NVIDIA TensorRT™ 10.2.0.19
TensorFlow-TensorRT (TF-TRT)
NVIDIA DALI® 1.39
Nsight Compute 2024.2.1.2
Nsight Systems 2024.4.2.133
nvImageCodec 0.2.0.7
JupyterLab 2.3.2 including:
- Jupyter-TensorBoard
- Jupyter Client 8.6.0
- Jupyter Core 5.5.0
- Jupyter Notebook 6.4.10

Driver Requirements

Release 24.07 is based on CUDA 12.5.1, which requires NVIDIA driver release 555 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later R535), or 545.23 (or later R545). The CUDA driver's compatibility package supports only particular drivers. Therefore, users should upgrade from all R418, R440, R450, R460, R510, R520, and R545 drivers, which are not forward-compatible with CUDA 12.5. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
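
The driver rules above can be checked programmatically. The following is a minimal sketch, not an NVIDIA tool: the function names and the `data_center_gpu` flag are hypothetical, and the thresholds are copied from the paragraph above. It takes the driver version as a string (for example, as reported by your driver tooling) and applies only the "555 or later" rule plus the listed data center branches. Because the paragraph also names R545 among the drivers to upgrade from, treat the result as a first-pass check and confirm against the CUDA compatibility documentation.

```python
from typing import Tuple

# Thresholds copied from the Driver Requirements text above.
DEFAULT_MIN_MAJOR = 555  # baseline for CUDA 12.5.1

# Data center GPUs only: driver branch -> minimum (major, minor) in that branch.
DATA_CENTER_MINIMUMS = {
    470: (470, 57),
    525: (525, 85),
    535: (535, 86),
    545: (545, 23),
}


def parse_driver_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '535.104.05' into (535, 104)."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def driver_is_sufficient(version: str, data_center_gpu: bool = False) -> bool:
    """Return True if the driver meets the minimums listed in the text."""
    major, minor = parse_driver_version(version)
    if major >= DEFAULT_MIN_MAJOR:
        return True
    if data_center_gpu and major in DATA_CENTER_MINIMUMS:
        return (major, minor) >= DATA_CENTER_MINIMUMS[major]
    return False
```

For example, `driver_is_sufficient("535.104.05")` is rejected for a generic GPU but accepted when `data_center_gpu=True`. Branches that the text does not list are rejected in both cases.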

Key Features and Enhancements

This TensorFlow release includes the following key features and enhancements.

TensorFlow container image version 24.07 is based on TensorFlow 2.16.1.
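
To confirm that an environment matches this release, you can compare the installed package version with the one stated above. This is a minimal sketch using only the Python standard library. The helper names are hypothetical, and two things are assumptions: that the distribution is registered under the name `tensorflow`, and that any vendor build suffix follows a `+` in the version string.

```python
from importlib import metadata
from typing import Optional

EXPECTED_TF_VERSION = "2.16.1"  # stated in this release note


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_release(found: Optional[str],
                    expected: str = EXPECTED_TF_VERSION) -> bool:
    """Compare versions, ignoring a local build suffix after '+' (assumption)."""
    if found is None:
        return False
    return found.split("+")[0] == expected


if __name__ == "__main__":
    # "tensorflow" as the distribution name is an assumption.
    version = installed_version("tensorflow")
    print("installed:", version, "matches 24.07:", matches_release(version))
```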

Announcements

Starting with the 24.03 release, NVIDIA/Transformer Engine is no longer included with NVIDIA Optimized TensorFlow containers. Transformer Engine includes FP8 support to accelerate LLM training on Ada and Hopper GPUs. We recommend using the float8 training supported in Keras via XLA (PR).
Starting with the 23.11 release, NVIDIA Optimized TensorFlow containers supporting iGPU architectures are published, and run on Jetson devices. Please refer to the Frameworks Support Matrix for information regarding which iGPU hardware/software is supported by which container.
Starting with the 23.11 release, NumPy has been updated to v1.24, which removed some deprecated APIs. The NumPy 1.24 release notes list the APIs whose deprecations have expired.
Starting with the 23.06 release, the NVIDIA Optimized Deep Learning Framework containers are no longer tested on Pascal GPU architectures.
As of the 23.04 release, TF1 is no longer released monthly. Known issues may be solved in a future release based on customer demand.
Support for Slurm PMI2 has been removed from the 22.01 release.
PMIX is supported by the container, but is not supported by default in Slurm. Users who depend on Slurm integration might need to configure Slurm for PMIX in the base OS as appropriate to their OS distribution (for Ubuntu 20.04, the required package is slurm-wlm-basic-plugins).
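
For the NumPy 1.24 update noted above, one of the removals is the set of scalar aliases such as `np.float` and `np.int`, which were deprecated in NumPy 1.20. The following is a minimal sketch of a scanner that flags those aliases in source text so that they can be replaced with the Python builtins (or an explicit type such as `np.float64`). The function name and the alias table are illustrative. The table covers only the builtin-shadowing aliases, not every API removed in 1.24, and the regular expression assumes that NumPy is imported as `np` or `numpy`.

```python
import re

# Aliases removed in NumPy 1.24, mapped to a common replacement.
REMOVED_ALIASES = {
    "bool": "bool",
    "int": "int",
    "float": "float",
    "complex": "complex",
    "object": "object",
    "str": "str",
}

# Matches np.float or numpy.int, but not np.float64, np.int_ or np.bool_,
# because the trailing \b requires the alias name to end there.
_ALIAS_PATTERN = re.compile(
    r"\b(?:np|numpy)\.(" + "|".join(REMOVED_ALIASES) + r")\b"
)


def find_removed_aliases(source: str):
    """Return (line number, matched text, suggested replacement) tuples."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in _ALIAS_PATTERN.finditer(line):
            hits.append((lineno, match.group(0), REMOVED_ALIASES[match.group(1)]))
    return hits


sample = (
    "import numpy as np\n"
    "x = np.zeros(3, dtype=np.float)\n"
    "y = np.zeros(3, dtype=np.float64)\n"
    "z = np.array([1], dtype=np.int)\n"
)
for lineno, found, replacement in find_removed_aliases(sample):
    print(f"line {lineno}: replace {found} with {replacement}")
```

A text scan like this can report false positives inside strings or comments, so review each hit before changing code.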

NVIDIA TensorFlow Container Versions

The following table shows which versions of Ubuntu, CUDA, TensorFlow, and TensorRT are supported in each of the NVIDIA containers for TensorFlow. A blank cell means that the value is the same as in the nearest row above that lists one. For older container versions, refer to the Frameworks Support Matrix.

Container Version	Ubuntu	CUDA Toolkit	TensorFlow	TensorRT
24.07	22.04	NVIDIA CUDA 12.5.1	2.16.1	TensorRT 10.2.0.19
24.06		NVIDIA CUDA 12.5.0.23	2.16.1	TensorRT 10.1.0.27
24.05		NVIDIA CUDA 12.4.1	2.15.0	TensorRT 10.0.1.6
24.04		NVIDIA CUDA 12.4.1	2.15.0	TensorRT 8.6.3
24.03		NVIDIA CUDA 12.4.0.41
24.02		NVIDIA CUDA 12.3.2
24.01			2.14.0	TensorRT 8.6.1.6
23.12
23.11		NVIDIA CUDA 12.3.0
23.10		NVIDIA CUDA 12.2.1	2.13.0
23.09		NVIDIA CUDA 12.2.1
23.08		NVIDIA CUDA 12.2.1
23.07		NVIDIA CUDA 12.1.1	2.12.0
23.06
23.05				TensorRT 8.6.1.2
23.04	20.04	NVIDIA CUDA 12.1.0		TensorRT 8.6.1
23.03		NVIDIA CUDA 12.1.0	2.11.0 1.15.5	TensorRT 8.5.3
23.02		NVIDIA CUDA 12.0.1		TensorRT 8.5.3
23.01		NVIDIA CUDA 12.0.1		TensorRT 8.5.2.2
22.12		NVIDIA CUDA 11.8.0	2.10.1 1.15.5	TensorRT 8.5.1
22.11			2.10.0 1.15.5	TensorRT 8.5.1
22.10			2.10.0 1.15.5	TensorRT 8.5 EA
22.09			2.9.1 1.15.5	TensorRT 8.5 EA
22.08		NVIDIA CUDA 11.7.1		TensorRT 8.4.2.4
22.07		NVIDIA CUDA 11.7 Update 1 Preview		TensorRT 8.4.1
22.06		NVIDIA CUDA 11.7 Update 1 Preview		TensorRT 8.2.5
22.05		NVIDIA CUDA 11.7.0	2.8.0 1.15.5	TensorRT 8.2.5
22.04		NVIDIA CUDA 11.6.2		TensorRT 8.2.4.2
22.03		NVIDIA CUDA 11.6.1		TensorRT 8.2.3
22.02		NVIDIA CUDA 11.6.0	2.7.0 1.15.5	TensorRT 8.2.3
22.01		NVIDIA CUDA 11.6.0	2.7.0 1.15.5	TensorRT 8.2.2
21.12		NVIDIA CUDA 11.5.0	2.6.2 1.15.5	TensorRT 8.2.1.8
21.11		NVIDIA CUDA 11.5.0	2.6.0 1.15.5	TensorRT 8.0.3.4 for x64 Linux TensorRT 8.0.2.2 for Arm SBSA Linux
21.10		NVIDIA CUDA 11.4.2 with cuBLAS 11.6.5.2
21.09		NVIDIA CUDA 11.4.2		TensorRT 8.0.3
21.08		NVIDIA CUDA 11.4.1	2.5.0 1.15.5	TensorRT 8.0.1.6
21.07		NVIDIA CUDA 11.4.0	2.5.0 1.15.5	TensorRT 8.0.1.6
21.06		NVIDIA CUDA 11.3.1	2.5.0 1.15.5	TensorRT 7.2.3.4
21.05		NVIDIA CUDA 11.3.0	2.4.0 1.15.5
21.04		NVIDIA CUDA 11.3.0
21.03		NVIDIA CUDA 11.2.1		TensorRT 7.2.2.3
21.02		NVIDIA CUDA 11.2.0		TensorRT 7.2.2.3+cuda11.1.0.024
20.12		NVIDIA CUDA 11.1.1	2.3.1 1.15.4	TensorRT 7.2.2
20.11	18.04	NVIDIA CUDA 11.1.0		TensorRT 7.2.1
20.10		NVIDIA CUDA 11.1.0		TensorRT 7.2.1
20.09		NVIDIA CUDA 11.0.3	2.3.0 1.15.3	TensorRT 7.1.3
20.08		NVIDIA CUDA 11.0.3	2.2.0 1.15.3
20.07		NVIDIA CUDA 11.0.194	2.2.0 1.15.3
20.06		NVIDIA CUDA 11.0.167	2.2.0 1.15.2	TensorRT 7.1.2
20.03 20.02		NVIDIA CUDA 10.2.89	2.1.0 1.15.2	TensorRT 7.0.0
20.01			2.0.0 1.15.0	TensorRT 7.0.0
19.12 19.11			2.0.0 1.15.0	TensorRT 6.0.1
19.10		NVIDIA CUDA 10.1.243	1.14.0
19.09
19.08				TensorRT 5.1.5

Tensor Core Examples

The Tensor Core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on NVIDIA Volta GPUs, so you can get results much faster than when training without Tensor Cores. Each model is tested against each monthly NGC container release to ensure consistent accuracy and performance over time.

U-Net Medical model: This model is a convolutional neural network for 2D image segmentation.
This repository contains a U-Net implementation as described in the U-Net: Convolutional Networks for Biomedical Image Segmentation paper, without any alteration.

This model script is available on GitHub and NGC.
Neural Collaborative Filtering (NCF) model: This model is a neural network that provides collaborative filtering based on implicit feedback, specifically, it provides product recommendations based on user and item interactions.
The training data for this model should contain a sequence of (user ID, item ID) pairs, each indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it.

This model script is available on GitHub and NGC.
BERT model: Bidirectional Encoder Representations from Transformers (BERT) is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. This implementation of BERT is an optimized version of Google's official implementation that leverages mixed-precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.

This model script is available on GitHub and NGC.

Known Issues

There is a known CUPTI permissions issue that is new in the 24.01 iGPU container. This issue prevents the profiler from capturing CUDA events and manifests itself as a CUPTI Runtime Error with error code 35. To work around the issue, run the following command:

rm -rf /usr/local/cuda/compat/lib.real
There is a known performance drop of about 10% with ResNet. This is under investigation.
There is a known performance drop of about 40% with EfficientDet. This is under investigation.
Several networks crash with a CUDA out-of-memory (OOM) error when used in a multi-GPU configuration with NVLS. This might be a TensorFlow memory carveout issue. Try increasing the carveout by using TF_DEVICE_MIN_SYS_MEMORY_IN_MB.
There is a known performance drop with Electra on V100 and T4 GPUs. This is under investigation.
EfficientNet is currently not compatible with Horovod. Using them together crashes the application.
There is a performance regression of up to 22% for EfficientNet training in TensorFlow that affects some GPUs (A2, A10, A40, L40, V100). This is under investigation.
The TensorFlow DLRM model may see a performance regression of up to 30% on A40 GPUs compared to the 23.05 release.
An illegal memory access violation is exposed in TensorFlow 2.12 by the Electra model as implemented in JoC.
There are performance regressions of up to 99% across all EfficientDet model configurations.
Some DLRM models may regress by 10-40%. We are currently investigating.
A known performance regression of up to 50% affects some EfficientNet models. The regression is inherited from upstream TensorFlow and is still under investigation.
The TF-TRT native segment fallback has a known issue that causes a crash.
This issue occurs when you use TF-TRT to convert a model with a subgraph that is then converted to TensorRT, but the conversion fails to build. Instead of falling back to native TensorFlow, TF-TRT will crash.

To prevent the conversion of an OP that causes a native segment fallback, use export TF_TRT_OP_DENYLIST="ProblematicOp".
A known issue affects aarch64 libgomp, which might sometimes cause "cannot allocate memory in static TLS block" errors.
The workaround is to run the following command:
```
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
```
There is a known issue in XLA that can cause performance regressions of up to 55% when training certain models, such as EfficientNet, with XLA enabled. The root cause is under investigation.
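
Two of the workarounds above rely on environment variables (TF_DEVICE_MIN_SYS_MEMORY_IN_MB and TF_TRT_OP_DENYLIST). When a job is launched from Python instead of a shell, they can be set in code before TensorFlow is imported. The following is a minimal sketch under stated assumptions: the helper name and the value 2048 are hypothetical, the right carveout size depends on your workload, and joining several op names with commas is an assumption about how TF-TRT parses the denylist.

```python
import os


def apply_workaround_env(min_sys_memory_mb=None, trt_op_denylist=(),
                         environ=os.environ):
    """Set the workaround variables in `environ` and return it.

    min_sys_memory_mb: carveout size in MB (pick a value for your workload).
    trt_op_denylist:   op names that TF-TRT should not convert.
    """
    if min_sys_memory_mb is not None:
        environ["TF_DEVICE_MIN_SYS_MEMORY_IN_MB"] = str(int(min_sys_memory_mb))
    if trt_op_denylist:
        # Comma-separated list: an assumption, verify for your TF-TRT version.
        environ["TF_TRT_OP_DENYLIST"] = ",".join(trt_op_denylist)
    return environ


if __name__ == "__main__":
    # Call this before importing tensorflow so that the settings are in
    # place when TensorFlow initializes. 2048 is a placeholder value.
    apply_workaround_env(min_sys_memory_mb=2048,
                         trt_op_denylist=["ProblematicOp"])
```

Passing a plain dictionary as `environ` lets you preview the settings without touching the real process environment.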