PaddlePaddle Release 23.08

The NVIDIA container image for PaddlePaddle, release 23.08, is available on NGC.

Contents of the PaddlePaddle container

This container image includes the complete source of the NVIDIA version of PaddlePaddle in /opt/paddlepaddle. It is prebuilt and installed as a system Python module. The container includes the following:

Driver Requirements

Release 23.08 is based on CUDA 12.2.1, which requires NVIDIA Driver release 535 or later. However, if you are running on a data center GPU (for example, T4), you can use NVIDIA driver release 450.51 (or later R450), 470.57 (or later R470), 510.47 (or later R510), 525.85 (or later R525), or 535.86 (or later R535). The CUDA driver's compatibility package supports only particular drivers, so users should upgrade from all R418, R440, R460, and R520 drivers, which are not forward-compatible with CUDA 12.2. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
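The driver rules above can be sketched as a small check. This is an illustrative helper, not an NVIDIA tool: the function and table names are hypothetical, and it assumes that "release 535 or later" means any driver whose branch number is 535 or higher, so the 535.86 entry for data center GPUs is covered by that general rule.

```python
from typing import Tuple

# Minimum minor version per older driver branch that is accepted on data
# center GPUs, copied from the paragraph above (branch -> minimum minor).
DATA_CENTER_MINIMUMS = {450: 51, 470: 57, 510: 47, 525: 85}


def parse_driver_version(version: str) -> Tuple[int, int]:
    """Split a driver string such as '535.104.05' into (branch, minor)."""
    parts = version.strip().split(".")
    branch = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return branch, minor


def driver_supports_23_08(version: str, data_center_gpu: bool) -> bool:
    """Return True if the driver satisfies the stated requirements."""
    branch, minor = parse_driver_version(version)
    # Assumption: any R535 or newer branch satisfies CUDA 12.2.1 directly.
    if branch >= 535:
        return True
    # Older branches qualify only on data center GPUs, and only the listed
    # branches (R418, R440, R460, and R520 are not forward-compatible).
    if data_center_gpu and branch in DATA_CENTER_MINIMUMS:
        return minor >= DATA_CENTER_MINIMUMS[branch]
    return False


print(driver_supports_23_08("525.85.12", data_center_gpu=True))   # True
print(driver_supports_23_08("525.85.12", data_center_gpu=False))  # False
print(driver_supports_23_08("460.32.03", data_center_gpu=True))   # False
```

The installed driver version string can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed to the helper.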

Key Features and Enhancements

This PaddlePaddle release includes the following key features and enhancements.

  • The PaddlePaddle container image version 23.08 is based on v2.5.0.

Announcements

  • The cuDNN frontend has been integrated into PaddlePaddle. It can be activated by turning on the “fuse_resunit” and “fuse_dot_product_attention” flags in the “build_strategy”. The cuDNN frontend provides advanced fusion kernels that accelerate training.
  • NVIDIA/LDDL has been integrated into PaddlePaddle. Language Datasets and Data Loaders (LDDL) is a utility library that minimizes friction during dataset retrieval, preprocessing, and loading for language models. It speeds up BERT pre-training phase 2 by up to 2X. See the BERT example for details. Multi-node training results, from 1 node to 32 nodes, have been added to the BERT example.
  • NVIDIA/Transformer Engine is built on top of PaddlePaddle v2.5.0. Transformer Engine provides FP8 training to accelerate large language models (LLMs) on Ada and Hopper GPUs. The legacy version of PaddlePaddle is not supported.
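As a rough illustration of the cuDNN frontend announcement, the two flags could be switched on with a helper like the one below. This is a sketch under assumptions: the helper name is hypothetical, and it assumes the flags are exposed as boolean attributes on the build strategy object, which the text implies but does not spell out. A plain stand-in object is used so that the snippet runs without PaddlePaddle installed.

```python
from types import SimpleNamespace

# Flag names quoted in the announcement above.
CUDNN_FRONTEND_FLAGS = ("fuse_resunit", "fuse_dot_product_attention")


def enable_cudnn_frontend_fusions(build_strategy):
    """Set each cuDNN frontend flag that the object exposes.

    Returns the list of flag names that were actually turned on, so a
    caller can detect a build that does not provide them.
    """
    enabled = []
    for flag in CUDNN_FRONTEND_FLAGS:
        # Assumption: flags are plain attributes on the build strategy.
        if hasattr(build_strategy, flag):
            setattr(build_strategy, flag, True)
            enabled.append(flag)
    return enabled


# Stand-in for a real build strategy object (hypothetical, for demo only).
demo_strategy = SimpleNamespace(
    fuse_resunit=False, fuse_dot_product_attention=False
)
print(enable_cudnn_frontend_fusions(demo_strategy))
```

Inside the container, the object passed in would presumably be a `paddle.static.BuildStrategy()` instance. Checking the returned list is a cheap way to notice when a given PaddlePaddle build does not offer one of the flags.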

NVIDIA PaddlePaddle Container Versions

The following table shows what versions of Ubuntu, CUDA, PaddlePaddle, and TensorRT are supported in each of the NVIDIA containers for PaddlePaddle. For older container versions, refer to the Frameworks Support Matrix.

Automatic Mixed Precision (AMP)

Automatic Mixed Precision (AMP) for PaddlePaddle is available in this container through the native implementation. AMP enables users to try mixed precision training by adding only 3 lines of Python to an existing FP32 (default) script. AMP will select an optimal set of operations to cast to FP16. FP16 operations require 2X reduced memory bandwidth (resulting in a 2X speedup for bandwidth-bound operations like most pointwise ops) and 2X reduced memory storage for intermediates (reducing the overall memory consumption of your model). Additionally, GEMMs and convolutions with FP16 inputs can run on Tensor Cores, which provide an 8X increase in computational throughput over FP32 arithmetic.

For more information about AMP, see the Training With Mixed Precision Guide.
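The 2X storage claim for FP16 intermediates can be checked with simple arithmetic. The snippet below is a worked example only, not part of the container: the tensor shape is a hypothetical activation size chosen for illustration, and the helper name is made up. In PaddlePaddle itself, the native AMP entry points are `paddle.amp.auto_cast` and `paddle.amp.GradScaler`.

```python
import struct

# Element sizes taken from the standard library's IEEE 754 format codes.
FP32_BYTES = struct.calcsize("f")  # single precision
FP16_BYTES = struct.calcsize("e")  # half precision


def tensor_bytes(shape, bytes_per_element):
    """Return the storage needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Hypothetical intermediate activation: (batch, sequence, hidden).
activation_shape = (32, 512, 1024)

fp32_size = tensor_bytes(activation_shape, FP32_BYTES)
fp16_size = tensor_bytes(activation_shape, FP16_BYTES)

print(fp32_size // 2**20, "MiB in FP32")  # 64 MiB in FP32
print(fp16_size // 2**20, "MiB in FP16")  # 32 MiB in FP16
print(fp32_size / fp16_size)              # 2.0
```

The same factor of two applies to the bytes moved per element, which is why bandwidth-bound pointwise operations are the ones expected to benefit most directly.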

Tensor Core Examples

The Tensor Core examples provided on GitHub and NGC focus on achieving the best performance and convergence from NVIDIA Volta™ Tensor Cores by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on Volta and NVIDIA Turing™, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Known Issues

  • The data loader has a small chance of triggering a segmentation fault if the data-loading loop contains a “break” statement, for instance when running up to the N-th step and then using “break” to stop. It currently affects only NLP tasks. The root cause is under investigation and will be fixed in a future release. See issue #48964 for details.
  • Some APIs are deprecated in PaddlePaddle 2.5, which affects the BERT example. This will be fixed in a future release.
  • The parameter server has a small chance of triggering a segmentation fault.
  • The output of FuseGemmEpiloguePassRelu is slightly different from the unfused kernel.

© Copyright 2024, NVIDIA. Last updated on Jul 3, 2024.