PaddlePaddle Release 22.09

The NVIDIA container image for PaddlePaddle, release 22.09, is available on NGC.

Contents of the PaddlePaddle container

This container image includes the complete source of the NVIDIA version of PaddlePaddle in /opt/paddlepaddle. It is prebuilt and installed as a system Python module.

The container includes the following:

Ubuntu 20.04 including Python 3.8
NVIDIA CUDA 11.8.0
cuTENSOR 1.6.1.5
NVIDIA cuDNN 8.6.0.163
NVIDIA NCCL 2.15.1 (optimized for NVLink™)
rdma-core 36.0
OpenMPI 4.1.2rc4+
GDRCopy 2.3
Nsight Systems 2022.3.1.43
Nsight Compute 2022.3.0.22
NVIDIA HPC-X 2.12.1a0
TensorRT 8.5.0.12 for x64 Linux
Paddle-TRT 2.3.2
SHARP 2.6.0
DALI 1.17.0

Driver Requirements

Release 22.09 is based on CUDA 11.8.0, which requires NVIDIA Driver release 520 or later. However, if you are running on a data center GPU (for example, T4 or any other data center GPU), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), 510.47 (or later R510), or 515.65 (or later R515). The CUDA driver's compatibility package only supports particular drivers. Thus, users should upgrade from all R418, R440, and R460 drivers, which are not forward-compatible with CUDA 11.8. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
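The driver rule above can be expressed as a small check. This is a minimal sketch rather than an NVIDIA-provided tool: the function name `driver_supports_cuda_11_8` and the `data_center_gpu` flag are hypothetical, and the branch minimums are copied from the paragraph above. The version string is assumed to come from a command such as `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.

```python
# Minimum minor version per driver branch for data center GPUs,
# copied from the Driver Requirements paragraph (assumed complete).
DATA_CENTER_MINIMUMS = {450: 51, 470: 57, 510: 47, 515: 65}


def driver_supports_cuda_11_8(version: str, data_center_gpu: bool = False) -> bool:
    """Return True if a driver version such as "470.82.01" satisfies the 22.09 rule."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0

    # Release 520 or later works on any supported GPU.
    if major >= 520:
        return True
    # Older branches are only acceptable on data center GPUs.
    if not data_center_gpu:
        return False
    minimum = DATA_CENTER_MINIMUMS.get(major)
    return minimum is not None and minor >= minimum
```

For example, `driver_supports_cuda_11_8("470.82.01", data_center_gpu=True)` is accepted, while an R460 driver is rejected regardless of the GPU type.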

Key Features and Enhancements

This PaddlePaddle release includes the following key features and enhancements.

The PaddlePaddle container image version 22.09 is based on v2.3.2.

Announcements

Paddle-TRT is now included.
Paddle-TRT is the TensorRT integration for PaddlePaddle. It brings the capabilities of TensorRT to PaddlePaddle with a few lines of code in the Python and C++ APIs.
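As an illustration of the "few lines" in the Python API, the following sketch enables the TensorRT subgraph engine through Paddle Inference. The helper names `build_trt_config` and `load_predictor`, the model and parameter file paths, and the chosen workspace size and precision are assumptions for illustration, not settings recommended by this release.

```python
def build_trt_config(model_file: str, params_file: str):
    """Build a Paddle Inference config with the TensorRT subgraph engine enabled."""
    # Imported lazily so this module can be loaded without PaddlePaddle installed.
    from paddle import inference

    config = inference.Config(model_file, params_file)
    config.enable_use_gpu(256, 0)  # 256 MB initial GPU memory pool on device 0
    config.enable_tensorrt_engine(
        workspace_size=1 << 30,   # assumed 1 GiB TensorRT workspace
        max_batch_size=1,
        min_subgraph_size=3,      # only offload subgraphs with at least 3 ops
        precision_mode=inference.PrecisionType.Half,  # FP16; assumed choice
        use_static=False,
        use_calib_mode=False,
    )
    return config


def load_predictor(model_file: str, params_file: str):
    """Create a predictor that runs supported subgraphs through TensorRT."""
    from paddle import inference

    return inference.create_predictor(build_trt_config(model_file, params_file))
```

A call such as `load_predictor("model.pdmodel", "model.pdiparams")` (hypothetical file names) would return a predictor whose TensorRT-compatible subgraphs are executed by TensorRT, with the remaining operators falling back to native PaddlePaddle kernels.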

NVIDIA PaddlePaddle Container Versions

The following table shows what versions of Ubuntu, CUDA, PaddlePaddle, and TensorRT are supported in each of the NVIDIA containers for PaddlePaddle. For older container versions, refer to the Frameworks Support Matrix.

Container Version	Ubuntu	CUDA Toolkit	PaddlePaddle	TensorRT
22.09	20.04	NVIDIA CUDA 11.8.0	2.3.2	TensorRT 8.5.0.12
22.08	20.04	NVIDIA CUDA 11.7.1	2.3.1	TensorRT 8.4.2.4
22.07	20.04	NVIDIA CUDA 11.7 Update 1 Preview	2.3.0	TensorRT 8.4.1
22.06	20.04	NVIDIA CUDA 11.7 Update 1 Preview	2.2.2	TensorRT 8.2.5
22.05	20.04	NVIDIA CUDA 11.7	2.2.2	TensorRT 8.2.5

Automatic Mixed Precision (AMP)

Automatic Mixed Precision (AMP) for PaddlePaddle is available in this container through the native implementation. AMP enables users to try mixed precision training by adding only 3 lines of Python to an existing FP32 (default) script. AMP will select an optimal set of operations to cast to FP16. FP16 operations require 2X reduced memory bandwidth (resulting in a 2X speedup for bandwidth-bound operations like most pointwise ops) and 2X reduced memory storage for intermediates (reducing the overall memory consumption of your model). Additionally, GEMMs and convolutions with FP16 inputs can run on Tensor Cores, which provide an 8X increase in computational throughput over FP32 arithmetic.

For more information about AMP, see the Training With Mixed Precision Guide.
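The kind of change the AMP paragraph describes can be sketched as follows. This is an assumed, minimal dynamic-graph training step, not a script shipped in the container: the function names `train_step_amp` and `demo`, the toy model, and the initial loss scale of 1024 are illustrative choices. The AMP-specific additions are the `auto_cast` context, the loss scaling, and the scaler-driven optimizer update.

```python
def train_step_amp(model, optimizer, scaler, data):
    """Run one mixed precision training step and return the unscaled loss."""
    # Imported lazily so this module can be loaded without PaddlePaddle installed.
    import paddle

    # AMP addition: run the forward pass with eligible ops cast to FP16.
    with paddle.amp.auto_cast():
        loss = paddle.mean(model(data))

    # AMP addition: scale the loss so small FP16 gradients do not underflow.
    scaled = scaler.scale(loss)
    scaled.backward()

    # AMP addition: unscale the gradients and apply the optimizer update.
    scaler.minimize(optimizer, scaled)
    optimizer.clear_grad()
    return loss


def demo():
    """Build a toy convolution model and run a single AMP step (requires a GPU build)."""
    import paddle

    model = paddle.nn.Conv2D(3, 2, 3)
    optimizer = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    scaler = paddle.amp.GradScaler(init_loss_scaling=1024)
    return train_step_amp(model, optimizer, scaler, paddle.rand([10, 3, 32, 32]))
```

Without the three marked additions, the same step is an ordinary FP32 training step, which is why AMP can be tried on an existing script with very few changes.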

Tensor Core Examples

The Tensor Core examples provided on GitHub and NGC focus on achieving the best performance and convergence from NVIDIA Volta™ Tensor Cores by using the latest deep learning example networks and model scripts for training.

Each example model trains with mixed precision on the Tensor Cores of NVIDIA Volta and NVIDIA Turing™ GPUs, so you can get results much faster than training without Tensor Cores. Each model is tested against every monthly NGC container release to ensure consistent accuracy and performance over time.

ResNet50 v1.5 model: This model is a modified version of the regular ResNet model that was introduced in the Deep Residual Learning for Image Recognition paper.
In the bottleneck blocks that downsample, v1.5 applies stride = 2 in the 3x3 convolution, whereas the original model applies it in the first 1x1 convolution. This model script is available on GitHub.
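The effect of moving the stride can be traced with the standard convolution output-size formula. The sketch below is illustrative only: the helper names are hypothetical, and the 56x56 input size and padding of 1 for the 3x3 convolution are assumptions typical of ResNet-50, not values taken from this document.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_trace(size: int, version: str) -> list:
    """Spatial size after each conv (1x1, 3x3, 1x1) of a downsampling bottleneck."""
    # v1 strides in the first 1x1 conv; v1.5 strides in the 3x3 conv.
    stride_1x1, stride_3x3 = (2, 1) if version == "v1" else (1, 2)
    after_first = conv_out_size(size, kernel=1, stride=stride_1x1, padding=0)
    after_3x3 = conv_out_size(after_first, kernel=3, stride=stride_3x3, padding=1)
    after_last = conv_out_size(after_3x3, kernel=1, stride=1, padding=0)
    return [after_first, after_3x3, after_last]
```

With a 56x56 input, `bottleneck_trace(56, "v1")` gives `[28, 28, 28]` and `bottleneck_trace(56, "v1.5")` gives `[56, 28, 28]`. Both variants end at the same resolution, but in v1.5 the 3x3 convolution sees the full-resolution feature map, so no input positions are skipped before the spatial filter is applied.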

Known Issues

In rare cases, using the Adam optimizer with multi-threading might cause a segmentation fault. Setting the environment variable FLAGS_inner_op_parallelism to 1 disables the multi-threading feature and resolves this issue.
On H100 NVLink systems using 2 GPUs for training, certain communication patterns can trigger a corner-case bug that manifests either as a hang or as an "illegal instruction" exception. A workaround for this case is to set the environment variable NCCL_PROTO=^LL128. This issue will be addressed in an upcoming release.
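
Both workarounds above are plain environment variables, so they can be set from a launcher script. The following is a minimal sketch: the helper `apply_workarounds` and the `WORKAROUNDS` table are hypothetical names, and it assumes the variables must be in place before PaddlePaddle is imported and before NCCL is initialized.

```python
import os

# Variable names and values are taken from the Known Issues entries above.
WORKAROUNDS = {
    "FLAGS_inner_op_parallelism": "1",  # disable multi-threaded Adam
    "NCCL_PROTO": "^LL128",             # exclude the LL128 protocol
}


def apply_workarounds(env=os.environ, names=tuple(WORKAROUNDS)):
    """Set the selected workaround variables in `env` and return what was set."""
    for name in names:
        env[name] = WORKAROUNDS[name]
    return {name: env[name] for name in names}
```

Calling `apply_workarounds()` at the top of a training script, before `import paddle`, applies both settings to the current process; passing `names=("NCCL_PROTO",)` applies only the H100 NVLink workaround.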