TensorFlow Release 22.08
The NVIDIA container image of TensorFlow, release 22.08, is available on NGC.
Contents of the TensorFlow container
This container image includes the complete source of the NVIDIA version of TensorFlow in /opt/tensorflow. It is prebuilt and installed as a system Python module.
To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates the efficient training of convolutional neural networks (CNNs). The sample script might need to be modified to fit your application. The container also includes the following:
- Ubuntu 20.04
22.08-tf2-py3 container images contain Python 3.8.
- NVIDIA CUDA® 11.7.1
- NVIDIA cuBLAS 126.96.36.199
- cuTENSOR 188.8.131.52
- NVIDIA cuDNN 184.108.40.206
- NVIDIA NCCL 2.12.12 (built with CUDA 11.7)
- NVIDIA RAPIDS™ 22.06 (Only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph)
- Horovod 0.24.3
- OpenMPI 4.1.2rc4+
- OpenUCX 1.12.0
- SHARP 2.5
- GDRCopy 2.3
- NVIDIA HPC-X 2.10
- rdma-core 36.0
- NVIDIA TensorRT™ 220.127.116.11
- TensorFlow-TensorRT (TF-TRT)
- NVIDIA DALI® 1.16.0
- Nsight Compute 2022.2.1.3
- Nsight Systems 2022.1.3.18
- JupyterLab 2.3.2 including Jupyter-TensorBoard
- XLA-Lite (TensorFlow2 only)
Release 22.08 is based on CUDA 11.7.1, which requires NVIDIA Driver release 515 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), or 510.47 (or later R510). The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.7. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
Release 22.08 supports CUDA compute capability 6.0 and later. This corresponds to GPUs in the NVIDIA Pascal, NVIDIA Volta™, NVIDIA Turing™, and NVIDIA Ampere Architecture GPU families. For a list of GPUs to which this compute capability corresponds, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
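The driver and compute capability requirements above can be expressed as a small pre-flight check. The following is a minimal sketch, not an NVIDIA tool: the function names are hypothetical, the thresholds are copied from the two paragraphs above, and any driver branch not named there (other than R515 and later) is conservatively treated as unsupported.

```python
from typing import Tuple

# Minimum versions per driver branch for data center GPUs, as quoted above.
DATA_CENTER_BRANCH_MIN = {450: (450, 51), 470: (470, 57), 510: (510, 47)}
MIN_DEFAULT_BRANCH = 515          # required by CUDA 11.7.1 on other GPUs
MIN_COMPUTE_CAPABILITY = (6, 0)   # NVIDIA Pascal and later


def parse_driver_version(version: str) -> Tuple[int, int]:
    """Return (branch, minor) from a string such as '470.57.02'."""
    parts = version.strip().split(".")
    branch = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return branch, minor


def driver_supports_release(version: str, data_center_gpu: bool = False) -> bool:
    """Check a driver version against the 22.08 requirements stated above."""
    branch, minor = parse_driver_version(version)
    if branch >= MIN_DEFAULT_BRANCH:
        return True
    if data_center_gpu and branch in DATA_CENTER_BRANCH_MIN:
        return (branch, minor) >= DATA_CENTER_BRANCH_MIN[branch]
    # R418, R440, R460, and any branch not listed above.
    return False


def gpu_supports_release(major: int, minor: int) -> bool:
    """Check a CUDA compute capability against the 6.0 minimum."""
    return (major, minor) >= MIN_COMPUTE_CAPABILITY
```

The driver version string can be read, for example, with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. For the authoritative list of supported drivers, refer to the CUDA Application Compatibility topic.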
Key Features and Enhancements
This TensorFlow release includes the following key features and enhancements.
- TensorFlow container images version 22.08 are based on TensorFlow 1.15.5 and 2.9.1.
- We introduced a new environment variable, TF_ENABLE_LAYOUT_NHWC, to enforce the NHWC layout at runtime. For some FP32 models on NVIDIA Ampere Architecture GPUs, setting TF_ENABLE_LAYOUT_NHWC=1 may improve performance by making better use of the TF32 Tensor Cores.
- Starting with the 22.05 release, the TensorFlow 1 and 2 containers are available for the Arm SBSA platform.
For example, pulling the nvcr.io/nvidia/tensorflow:22.05-tf2-py3 Docker image on an Arm SBSA machine will automatically fetch the Arm-specific image.
- Support for Slurm PMI2 has been removed from the 22.01 release.
PMIX is supported by the container, but is not supported by default in Slurm. Users who depend on Slurm integration might need to configure Slurm for PMIX in the base OS as appropriate to their OS distribution (for Ubuntu 20.04, the required package is
NVIDIA TensorFlow Container Versions
The following table shows what versions of Ubuntu, CUDA, TensorFlow, and TensorRT are supported in each of the NVIDIA containers for TensorFlow. For older container versions, refer to the Frameworks Support Matrix.
Tensor Core Examples
The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on NVIDIA Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
- U-Net Medical model: This model is a convolutional neural network for 2D image segmentation.
This repository contains a U-Net implementation as described in the U-Net: Convolutional Networks for Biomedical Image Segmentation paper, without any alteration.
- SSD320 v1.2 model: This model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as a method for detecting objects in images using a single deep neural network.
Our implementation is based on the existing model from the TensorFlow models repository.
- Neural Collaborative Filtering (NCF) model: This model is a neural network that provides collaborative filtering based on implicit feedback. Specifically, it provides product recommendations based on user and item interactions.
The training data for this model should contain a sequence of (user ID, item ID) pairs, each indicating that the specified user has interacted with the specified item, for example, by rating it or clicking on it.
- BERT model: Bidirectional Encoder Representations from Transformers (BERT) is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.
This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. This implementation is an optimized version of Google's official implementation, which leverages mixed-precision arithmetic and Tensor Cores on V100 GPUs for faster training times while maintaining target accuracy.
- U-Net Industrial Defect Segmentation model: This model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation.
U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the U-Net: Convolutional Networks for Biomedical Image Segmentation paper. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with high accuracy on the industrial anomaly dataset DAGM2007.
- GNMT v2 model: This model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper.
The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the reweighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep.
- ResNet-50 v1.5 model: This model is a modified version of the original ResNet-50 v1 model.
The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling. For example, v1 has stride = 2 in the first 1x1 convolution, and v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model:
- Data-parallel multi-GPU training with Horovod
- Tensor Cores (mixed precision) training
- Static loss scaling for Tensor Cores (mixed precision) training
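To make the ResNet-50 v1 versus v1.5 difference concrete, the sketch below traces the spatial size of a feature map through one downsampling bottleneck block (1x1, 3x3, 1x1 convolutions) under both stride placements. It is an illustration only, not code from the model repository: the helper names are hypothetical, and the 56x56 input and the padding values (0 for 1x1, 1 for 3x3) are assumptions chosen for the example.

```python
from typing import List, Sequence


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_sizes(size: int, strides: Sequence[int]) -> List[int]:
    """Spatial size after each of the three convolutions in a bottleneck block."""
    layers = [(1, 0), (3, 1), (1, 0)]  # (kernel, padding) for 1x1, 3x3, 1x1
    sizes = []
    for (kernel, padding), stride in zip(layers, strides):
        size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


V1_STRIDES = (2, 1, 1)    # v1: downsample in the first 1x1 convolution
V1_5_STRIDES = (1, 2, 1)  # v1.5: downsample in the 3x3 convolution

print(bottleneck_sizes(56, V1_STRIDES))    # [28, 28, 28]
print(bottleneck_sizes(56, V1_5_STRIDES))  # [56, 28, 28]
```

Both variants end at the same output size. In v1.5 the 3x3 convolution still sees the full-resolution input, whereas in v1 the strided 1x1 convolution skips input positions before the 3x3 convolution runs.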
Known Issues
- We have implemented a new feature called the Async Allocator, which may cause isolated issues such as hangs or crashes in multi-GPU settings, or may substantially reduce performance. If you observe any of these issues when upgrading from 22.07, consider turning off this feature by unsetting the corresponding environment variable with `unset TF_GPU_ALLOCATOR`. We are actively working to address this issue in the next release.
- TF-TRT inference performance may also be affected by the above issue, so the above workaround also applies to TF-TRT models.
- Additionally, we have introduced a feature that switches to the channel-last layout (NHWC) to harness Tensor Core math (see the Key Features and Enhancements section above). If you observe a performance regression in TF-TRT models, consider setting the environment variable with `export TF_ENABLE_LAYOUT_NHWC=1` to check whether it restores the lost performance.
- The TF-TRT native segment fallback has a known issue that causes a crash.
This issue occurs when you use TF-TRT to convert a model with a subgraph that is then converted to TensorRT, but the conversion fails to build. Instead of falling back to native TensorFlow, TF-TRT will crash.
To prevent the conversion of an OP that causes a native segment fallback, use
- A known issue affects aarch64 libgomp, which might sometimes cause "cannot allocate memory in static TLS block" errors.
The workaround is to run the following command:
- IO-dominated CNN models, such as AlexNet and ResNet50, see a ~10% performance reduction on some platforms. The regression is under investigation and will be fixed in a future release.
- In some configurations, the UNet3D model on A100 fails to initialize cuDNN due to an out-of-memory (OOM) error. This can be fixed by increasing the GPU memory carveout with the environment variable
- There is a known issue in XLA that could cause performance regressions of up to 55% as compared to the previous release; however, training with XLA is still faster than without XLA. This performance regression affects certain models such as EfficientNet with TF32 enabled. A potential workaround is disabling TF32 using the TensorFlow API. The root cause is under investigation and will be fixed in a future release.
- TF-TRT 22.07 may fail to build TensorRT engines for HF BERT or HF BART, which may manifest as large performance regressions. If you see a TF-TRT warning stating that the Myelin graph could not be created, or you see a substantial performance regression, revert to the previous version, 22.06.
- Since the 22.08 release, TF_GPU_ALLOCATOR is set to CUDA_MALLOC_ASYNC by default, which may cause severe performance regressions in some configurations (for example, BERT training in fp16 mode). If you encounter such regressions, unset the environment variable:
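The environment-variable workarounds in this list can also be applied from a Python launcher script rather than the shell. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and it assumes the variables are read when TensorFlow initializes, so the first helper should run before `import tensorflow`. Disabling TF32 for the XLA regression uses the TensorFlow 2 call `tf.config.experimental.enable_tensor_float_32_execution`.

```python
import os
from typing import MutableMapping


def apply_known_issue_workarounds(
    env: MutableMapping[str, str] = os.environ,
    *,
    disable_async_allocator: bool = True,
    force_nhwc: bool = False,
) -> None:
    """Adjust environment variables named in the known issues above.

    Assumption: call this before TensorFlow is imported or initialized.
    """
    if disable_async_allocator:
        # Equivalent to `unset TF_GPU_ALLOCATOR` in the shell.
        env.pop("TF_GPU_ALLOCATOR", None)
    if force_nhwc:
        # Equivalent to `export TF_ENABLE_LAYOUT_NHWC=1` in the shell.
        env["TF_ENABLE_LAYOUT_NHWC"] = "1"


def disable_tf32() -> None:
    """Potential workaround for the XLA regression: turn off TF32 math."""
    import tensorflow as tf  # imported lazily, after the environment is set

    tf.config.experimental.enable_tensor_float_32_execution(False)
```

A typical launcher would call `apply_known_issue_workarounds()` first and `disable_tf32()` only for models affected by the XLA regression. Measure performance before and after, because each workaround gives up an optimization that helps other models.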