TensorFlow Release 22.03

The NVIDIA container image of TensorFlow, release 22.03, is available on NGC.

Contents of the TensorFlow container

This container image includes the complete source of the NVIDIA version of TensorFlow in /opt/tensorflow. It is prebuilt and installed as a system Python module.

To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates the efficient training of convolutional neural networks (CNNs). The sample script might need to be modified to fit your application.

The container also includes the following:

Driver Requirements

Release 22.03 is based on CUDA 11.6.1, which requires NVIDIA Driver release 510 or later. However, if you are running on a Data Center GPU (for example, T4 or any other Tesla board), use NVIDIA driver release 418.40 (or later R418), 440.33 (or later R440), 450.51 (or later R450), 460.27 (or later R460), or 470.57 (or later R470). The CUDA driver's compatibility package only supports specific drivers. For a complete list of supported drivers, see CUDA Application Compatibility. For more information, see CUDA Compatibility and Upgrades and NVIDIA CUDA and Drivers Support.
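The driver rules above can be expressed as a small check. This is a minimal sketch. It assumes the driver version is available as a dotted string, for example from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The function names are hypothetical. Deciding whether a GPU counts as a Data Center GPU is left to the caller.

```python
from typing import Tuple

# Any R510 or later driver satisfies CUDA 11.6.1 (per the 22.03 notes).
GENERAL_MIN: Tuple[int, ...] = (510,)

# Data Center GPUs may instead use these older branches; each entry is the
# minimum version within that branch ("418.40 (or later R418)", and so on).
DATACENTER_BRANCH_MIN = {
    418: (418, 40),
    440: (440, 33),
    450: (450, 51),
    460: (460, 27),
    470: (470, 57),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a driver string such as '470.82.01' into (470, 82, 1)."""
    return tuple(int(part) for part in version.strip().split("."))


def driver_supported(version: str, datacenter_gpu: bool = False) -> bool:
    """Hypothetical helper: does this driver meet the 22.03 requirements?"""
    v = parse_version(version)
    if v >= GENERAL_MIN:  # tuple comparison: (510, 47, 3) >= (510,)
        return True
    if datacenter_gpu:
        branch_min = DATACENTER_BRANCH_MIN.get(v[0])
        # Branches not listed (for example, R465) are not supported.
        return branch_min is not None and v >= branch_min
    return False
```

Comparing tuples of integers avoids the pitfalls of string comparison. For example, "470.9" would otherwise sort after "470.57".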

GPU Requirements

Release 22.03 supports CUDA compute capability 6.0 and later. This corresponds to GPUs in the NVIDIA Pascal, NVIDIA Volta™, NVIDIA Turing™, and NVIDIA Ampere Architecture GPU families. For a list of GPUs to which this compute capability corresponds, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
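The requirement above amounts to comparing a GPU's (major, minor) compute capability against 6.0. Here is a minimal sketch; the function name is hypothetical.

```python
from typing import Tuple

# Release 22.03 supports compute capability 6.0 (Pascal) and later.
MIN_COMPUTE_CAPABILITY: Tuple[int, int] = (6, 0)


def compute_capability_supported(cc: Tuple[int, int]) -> bool:
    """Return True if a GPU with compute capability `cc` is supported."""
    return tuple(cc) >= MIN_COMPUTE_CAPABILITY
```

In TensorFlow 2, `tf.config.experimental.get_device_details(device)` returns a dictionary whose `"compute_capability"` entry is such a tuple. This applies to each device returned by `tf.config.list_physical_devices("GPU")`, so that value can be passed to the check directly.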

Key Features and Enhancements

This TensorFlow release includes the following key features and enhancements.

Announcements

  • NVIDIA Deep Learning Profiler (DLProf) v1.8, which was included in the 21.12 container, was the last release of DLProf.

    Starting with the 22.01 container, DLProf is no longer included, but it can still be manually installed by using a pip wheel on nvidia-pyindex.

  • Starting with the 21.10 release, a beta version of the TensorFlow 1 and 2 containers is available for the Arm SBSA platform.

    For example, pulling the nvcr.io/nvidia/tensorflow:22.02-tf2-py3 Docker image on an Arm SBSA machine automatically fetches the Arm-specific image.

  • Support for SLURM PMI2 was removed in the 22.01 release.

    PMIX is supported by the container, but is not supported by default in SLURM. Users who depend on SLURM integration might need to configure SLURM for PMIX in the base OS as appropriate to their OS distribution (for Ubuntu 20.04, the required package is slurm-wlm-basic-plugins).

Tensor Core Examples

The Tensor Core examples provided on GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision using Tensor Cores on Volta, so you can get results much faster than when training without Tensor Cores. Each model is tested against every monthly NGC container release to ensure consistent accuracy and performance over time.
  • U-Net Medical model: This model is a convolutional neural network for 2D image segmentation.

    This repository contains a U-Net implementation as described in the U-Net: Convolutional Networks for Biomedical Image Segmentation paper, without any alteration.

    This model script is available on GitHub and NGC.

  • SSD320 v1.2 model: This model is based on the SSD: Single Shot MultiBox Detector paper, which describes an SSD as a method for detecting objects in images using a single deep neural network.

    Our implementation is based on the existing model from the TensorFlow models repository.

    This model script is available on GitHub and NGC.

  • Neural Collaborative Filtering (NCF) model: This model is a neural network that provides collaborative filtering based on implicit feedback. Specifically, it provides product recommendations based on interactions between users and items.

    The training data for this model should contain a sequence of (user ID, item ID) pairs. Each pair indicates that the specified user has interacted with the specified item, for example, by rating it or clicking on it.

    This model script is available on GitHub and NGC.

  • BERT model: Bidirectional Encoder Representations from Transformers (BERT) is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

    This model is based on the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper. NVIDIA's BERT is an optimized version of Google's official implementation that leverages mixed-precision arithmetic and Tensor Cores on V100 GPUs to shorten training time while maintaining target accuracy.

    This model script is available on GitHub and NGC.

  • U-Net Industrial Defect Segmentation model: This model is adapted from the original version of the U-Net model, which is a convolutional auto-encoder for 2D image segmentation.

    U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the U-Net: Convolutional Networks for Biomedical Image Segmentation paper. This work proposes a modified version of U-Net, called TinyUNet, which performs efficiently and with high accuracy on the DAGM2007 industrial anomaly dataset.

    This model script is available on GitHub and NGC.

  • GNMT v2 model: This model is similar to the one discussed in the Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation paper.

    The most important difference between the two models is in the attention mechanism. In our model, the output from the first LSTM layer of the decoder goes into the attention module, then the re-weighted context is concatenated with inputs to all subsequent LSTM layers in the decoder at the current timestep.

    This model script is available on GitHub and NGC.

  • ResNet-50 v1.5 model: This model is a modified version of the original ResNet-50 v1 model.
    The difference between v1 and v1.5 is in the bottleneck blocks that require downsampling: v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution. The following features were implemented in this model:
    • Data-parallel multi-GPU training with Horovod
    • Tensor Cores (mixed precision) training
    • Static loss scaling for Tensor Cores (mixed precision) training
    This model script is available on GitHub and NGC.
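Static loss scaling, listed above for ResNet-50 v1.5, works in three steps:

1. It multiplies the loss by a fixed factor before backpropagation, so that small gradients survive the conversion to FP16.
2. It computes the gradients in FP16.
3. It divides the gradients by the same factor before the weight update.

The following NumPy sketch simulates only the FP16 rounding step on a hand-picked gradient vector. The scale factor of 1024 and the function name are illustrative assumptions, not the values or code used in the NVIDIA model scripts.

```python
import numpy as np

# Hypothetical static loss scale: a fixed power of two chosen before training.
LOSS_SCALE = 1024.0


def fp16_gradient(grad_fp32: np.ndarray, loss_scale: float = 1.0) -> np.ndarray:
    """Simulate one FP16 gradient computation with static loss scaling.

    Scaling the loss by `loss_scale` scales every gradient by the same factor,
    so this sketch models that by multiplying a reference FP32 gradient
    directly. The result is rounded to FP16 (where tiny values can underflow
    to zero) and then unscaled in FP32, as an optimizer would do before the
    weight update.
    """
    scaled_fp16 = (grad_fp32 * loss_scale).astype(np.float16)
    return scaled_fp16.astype(np.float32) / np.float32(loss_scale)


grads = np.array([1e-8, 3e-5, 0.5], dtype=np.float32)
print(fp16_gradient(grads))              # the 1e-8 entry underflows to 0
print(fp16_gradient(grads, LOSS_SCALE))  # the 1e-8 entry survives, approximately
```

A static scale involves a trade-off. It must be large enough to lift small gradients out of the FP16 underflow range. It must also be small enough that the largest scaled gradients stay below the FP16 maximum of 65504.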

Known Issues

Note: If you encounter functional or performance issues when XLA is enabled, refer to the XLA Best Practices document, which offers information about how to diagnose symptoms and possibly address them.
  • Experimental support for cudaGraphs in XLA has been dropped from the TensorFlow 2 release.

    This support will be reintroduced in a future release.

  • TensorFlow 2.8.0 suffers from a known performance regression of up to 60%, which has been observed for some Wide & Deep recommender models that are running under XLA.

    This issue is under investigation and will be fixed in a future release. If you notice a slowdown, a temporary workaround to improve performance is to disable XLA.

  • When you import the tensorflow_addons Python module, the following spurious warning is printed:
    UserWarning: Tensorflow Addons supports using Python ops for all Tensorflow versions above or equal to 2.5.0 and strictly below 2.8.0 (nightly versions are not supported).
    The versions of TensorFlow you are currently using is 2.8.0 and is not supported...

    This warning can be safely ignored.

  • For TensorFlow 1.15, TF-TRT inference throughput for certain models might regress by up to 37% compared with the 21.06-tf1 release.

    This issue will be fixed in a future release.

  • A cuDNN performance regression can cause slowdowns of up to 15% in certain ResNet models.

    This issue will be fixed in a future release.

  • A known performance regression reduces UNet Medical 3D model training performance by up to 23%.

    This issue will be addressed in a future release.

  • The TF-TRT native segment fallback has a known issue that causes a crash.

    This issue occurs when TF-TRT converts a subgraph of a model to TensorRT, but the TensorRT engine for that subgraph fails to build.

    Instead of falling back to native TensorFlow, TF-TRT crashes. To work around this issue, run export TF_TRT_OP_DENYLIST="ProblematicOp", replacing ProblematicOp with the op that causes the failure. This prevents TF-TRT from converting that op and triggering the native segment fallback.

  • A known issue that affects libgomp on aarch64 might cause "cannot allocate memory in static TLS block" errors.

    The workaround is to run the following command:

    export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
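The spurious tensorflow_addons version warning described above can be safely ignored. If it clutters logs, you can install a standard-library `warnings` filter before importing the module. This is a minimal sketch. The helper name `silence_tfa_version_warning` is hypothetical. The filter matches only the beginning of the message quoted above, so other warnings are left untouched.

```python
import warnings

# Beginning of the spurious message quoted in the release notes. The
# `message` argument is a regular expression matched (case-insensitively)
# against the start of the warning text.
TFA_VERSION_WARNING = r"Tensorflow Addons supports using Python ops"


def silence_tfa_version_warning() -> None:
    """Ignore only the TF Addons version-compatibility UserWarning.

    Call this before importing tensorflow_addons, because the warning is
    printed when the module is imported.
    """
    warnings.filterwarnings(
        "ignore", message=TFA_VERSION_WARNING, category=UserWarning
    )
```

To use it, call `silence_tfa_version_warning()` and then run `import tensorflow_addons as tfa` as usual. Because `filterwarnings` inserts the new rule at the front of the filter list, it takes precedence over broader filters installed earlier.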