NVIDIA Optimized Frameworks

TensorFlow Release 19.12

The NVIDIA container image of TensorFlow, release 19.12, is available on NGC.

Contents of the TensorFlow container

This container image contains the complete source of the version of NVIDIA TensorFlow in /opt/tensorflow. It is pre-built and installed as a system Python module.

To achieve optimum TensorFlow performance for image-based training, the container includes a sample script that demonstrates efficient training of convolutional neural networks (CNNs). You may need to modify the sample script to fit your application. The container also includes the following:

Driver Requirements

Release 19.12 is based on NVIDIA CUDA 10.2.89, which requires NVIDIA Driver release 440.30. However, if you are running on Tesla (for example, T4 or any other Tesla board), you may use NVIDIA driver release 396, 384.111+, 410, 418.xx or 440.30. The CUDA driver's compatibility package only supports particular drivers. For a complete list of supported drivers, see the CUDA Application Compatibility topic. For more information, see CUDA Compatibility and Upgrades.
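The driver rules above can be sketched as a small check. This is only an illustration, not an NVIDIA tool: the helper names are hypothetical, and it assumes that "396", "410", and "418.xx" mean any driver in that branch, that "384.111+" means branch 384 at minor version 111 or later, and that drivers newer than 440.30 are also acceptable.

```python
def parse_driver_version(version):
    """Turn a driver string such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_supports_19_12(version, is_tesla=False):
    """Return True if the driver matches the requirements listed above.

    Hypothetical helper; the branch interpretation is an assumption.
    """
    parsed = parse_driver_version(version)
    major = parsed[0]
    minor = parsed[1] if len(parsed) > 1 else 0
    # CUDA 10.2.89 requires driver release 440.30 (assumed: or later).
    if (major, minor) >= (440, 30):
        return True
    if is_tesla:
        # Older branches listed for Tesla boards in the text above.
        if major in (396, 410, 418):
            return True
        if major == 384 and minor >= 111:
            return True
    return False
```

For the authoritative list of supported drivers, the CUDA Application Compatibility topic takes precedence over this sketch.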

GPU Requirements

Release 19.12 supports CUDA compute capability 6.0 and higher. This corresponds to GPUs in the Pascal, Volta, and Turing families. For a list of GPUs that each compute capability corresponds to, see CUDA GPUs. For additional support details, see Deep Learning Frameworks Support Matrix.
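As a minimal sketch, the compute capability requirement is a tuple comparison. The helper name is hypothetical, and the capability string is assumed to come from the CUDA GPUs list or a similar lookup.

```python
# Minimum compute capability stated for release 19.12.
MIN_COMPUTE_CAPABILITY = (6, 0)


def gpu_is_supported(compute_capability):
    """Check a capability string such as '7.5' against the 6.0 minimum."""
    major, minor = (int(part) for part in compute_capability.split("."))
    return (major, minor) >= MIN_COMPUTE_CAPABILITY
```

Comparing tuples rather than floats avoids rounding surprises and keeps the major and minor parts separate.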

Key Features and Enhancements

This TensorFlow release includes the following key features and enhancements.

  • TensorFlow container image version 19.12 is based on TensorFlow 1.15.0 and TensorFlow 2.0.0.
  • Latest version of DALI 0.16.0 Beta
  • Latest version of DLProf 19.12
  • Latest version of Horovod 0.18.2
  • Latest version of Nsight Systems 2019.6.1
  • Latest version of TensorBoard: the 19.12-tf2-py3 container includes version 2.0.2
  • Jupyter Notebook, JupyterLab, and JupyterLab Server versions are now specific to which TensorFlow container version you choose to use.
  • Added optimized GenerateBoxProposals op for object detection models.
  • Deterministic cuDNN convolutions, enabled via TF_CUDNN_DETERMINISTIC or TF_DETERMINISTIC_OPS, are now available on a wider range of layer configurations. Prior to this version, some layer configurations would result in an exception with the message "No algorithm worked!"
  • Ubuntu 18.04 with November 2019 updates
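The deterministic cuDNN convolutions mentioned in the list above are switched on through an environment variable. The following is a sketch under two assumptions: the helper name is hypothetical, and setting the variable before TensorFlow is imported is taken as the conservative choice.

```python
import os


def enable_deterministic_ops(environ=os.environ):
    """Set TF_DETERMINISTIC_OPS in the given environment mapping."""
    environ["TF_DETERMINISTIC_OPS"] = "1"
    return environ


# Set the variable first; import TensorFlow only afterwards.
enable_deterministic_ops()
# import tensorflow as tf  # left commented so the sketch runs without TensorFlow
```

The same effect can be had in a shell with an export statement before launching the training script.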

Announcements

  • We will stop support for Python 2.7 in a future TensorFlow container release.
  • Deep learning framework containers 19.11 and later include experimental support for Singularity v3.0.

Accelerating Inference In TensorFlow With TensorRT (TF-TRT)

For step-by-step instructions on how to use TF-TRT, see Accelerating Inference In TensorFlow With TensorRT User Guide.

Key Features And Enhancements
  • Per channel and QDQ op support for Quantization API in TensorFlow 1.15 container

Known Issues
  • We have seen a performance regression in SSD Mobilenet V1 in 19.12 with both native TensorFlow and TF-TRT, mostly with batch size 8 but also 1 and 2, and with all types of GPUs. This could be due to a change in the SSD graph. We are still investigating this issue.

  • We have observed a regression in the performance of certain TF-TRT benchmarks in TensorFlow 1.15 including image classification models with precision INT8. We are still investigating this issue. Since 19.11 comes with a new version of TensorFlow (1.15), which includes a lot of changes in the TensorFlow backend, it’s very possible that the regression is caused by a change in the TensorFlow backend.

  • CUDA 10.2 and NCCL 2.5.x libraries require slightly more device memory than previous releases. As a result, some models that ran previously may exhaust device memory.

  • The accuracy of Faster RCNN with the ResNet-50 backbone using TensorRT 6.0 INT8 calibration is lower than expected. This will be fixed in a future release of TensorRT.

  • The following message that appears in the log of TensorRT 6.0 can be safely ignored. It will be removed in a future release of TensorRT.

    Calling isShapeTensor before the entire network is constructed may result in an inaccurate result.

  • The following warning is issued when the method build() from the API is not called. This warning can be ignored.

    OP_REQUIRES failed at trt_engine_resource_ops.cc:183 : Not found: Container TF-TRT does not exist. (Could not find resource: TF-TRT/TRTEngineOp_...

  • The following warning is issued because internally TensorFlow calls the TensorRT optimizer for certain objects unnecessarily. This warning can be ignored.

    OP_REQUIRES failed at trt_engine_resource_ops.cc:183 : Not found: Container TF-TRT does not exist. (Could not find resource: TF-TRT/TRTEngineOp_...

  • We have seen failures when using INT8 calibration (post-training) within the same process that does FP32/FP16 conversion. We recommend using separate processes for different precisions until this issue is resolved.

  • We have seen failures when calling the TensorRT optimizer on models that are already optimized by TensorRT. This issue will be fixed in a future release.

  • If you import nets from models/slim, you might see the following error:

    AttributeError: module 'tensorflow_core.contrib' has no attribute 'tensorrt'

    Changing the order of imports can fix the issue. Therefore, import TensorRT before importing nets as follows:

    import tensorflow.contrib.tensorrt as trt
    import nets.nets_factory
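For the known issue above about INT8 calibration sharing a process with FP32/FP16 conversion, one way to follow the recommendation is to start a fresh Python interpreter per precision. The sketch below is only an illustration: the helper name is hypothetical, and CONVERT_SNIPPET is a placeholder standing in for real TF-TRT conversion code.

```python
import subprocess
import sys

# Placeholder for the real conversion code (hypothetical); a real script
# would run the TF-TRT conversion for the precision given in sys.argv[1].
CONVERT_SNIPPET = "import sys; print('converted with ' + sys.argv[1])"


def convert_in_fresh_process(precision_mode):
    """Run one conversion in its own interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CONVERT_SNIPPET, precision_mode],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if the child process fails
    )
    return result.stdout.strip()


# Each precision gets its own process, so no state is shared between them.
outputs = [convert_in_fresh_process(mode) for mode in ("FP32", "FP16", "INT8")]
```

In practice the placeholder would be replaced by a conversion script that takes the precision mode as an argument.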

Automatic Mixed Precision (AMP)

Automatic mixed precision converts certain float32 operations to operate in float16, which can run much faster on Tensor Cores. Automatic mixed precision is built on two components:

  • a loss scaling optimizer
  • a graph rewriter

For models already using an optimizer from tf.train or tf.keras.optimizers for both compute_gradients() and apply_gradients() operations (for example, by calling optimizer.minimize() or model.fit()), automatic mixed precision can be enabled by wrapping the optimizer with tf.train.experimental.enable_mixed_precision_graph_rewrite().

For more information on this function, see the TensorFlow documentation here. For backward compatibility with previous container releases, AMP can also be enabled for tf.train optimizers by defining the following environment variable:

export TF_ENABLE_AUTO_MIXED_PRECISION=1
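The optimizer wrapping described in this section can be sketched as follows. The helper name enable_amp is hypothetical, and the optimizer is assumed to be created by your own training script; TensorFlow is imported inside the function so that the file can be loaded on a machine without it.

```python
def enable_amp(optimizer):
    """Wrap an existing optimizer so automatic mixed precision is applied."""
    # Imported lazily; inside the container TensorFlow is always available.
    import tensorflow as tf

    return tf.train.experimental.enable_mixed_precision_graph_rewrite(optimizer)


# Example usage with the TensorFlow 1.15 API (not executed here):
#   opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
#   opt = enable_amp(opt)
#   train_op = opt.minimize(loss)
```

The wrapped optimizer is then used exactly like the original one, so the rest of the training code stays unchanged.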


For more information about how to access and enable automatic mixed precision for TensorFlow, see Automatic Mixed Precision Training In TensorFlow from the TensorFlow User Guide, along with Training With Mixed Precision.

Tensor Core Examples

The Tensor Core examples provided in GitHub focus on achieving the best performance and convergence by using the latest deep learning example networks and model scripts for training. Each example model trains with mixed precision Tensor Cores on Volta, so you can get results much faster than training without Tensor Cores. Each model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time. This container includes the following Tensor Core examples.

Known Issues

  • There are known issues since the 19.11 release for NCF inference with XLA and VGG16 training without XLA; these benchmarks have performance that is lower than expected.

  • There are known TF-TRT INT8 accuracy issues. See the Accelerating Inference In TensorFlow With TensorRT (TF-TRT) section above for more information.

  • For BERT Large training with the 19.08 release on Tesla V100 boards with 16 GB memory, performance with batch size 3 per GPU is lower than expected; batch size 2 per GPU may be a better choice for this model on these GPUs with the 19.08 release. 32 GB GPUs are not affected.

  • TensorBoard has a bug in its IPv6 support which can result in the following error: Tensorboard could not bind to unsupported address family ::. To work around this error, pass the --host <IP> flag when starting TensorBoard.

  • A known issue in TensorFlow results in the error Cannot take the length of Shape with unknown rank when training variable sized images with the Keras model.fit API. Details are provided here and a fix will be available in a future release.

  • Support for CUDNN float32 Tensor Op Math mode first introduced in the 18.09 release is now deprecated in favor of Automatic Mixed Precision. It is scheduled to be removed after the 19.11 release.

  • Since the 19.10 release, when your NVIDIA driver release is older than 418.xx, the Nsight Systems profiling tool (for example, nsys) might cause a CUDA runtime API error. A fix will be included in a future release.
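The TensorBoard workaround listed above (passing --host) can be sketched as a command builder. The helper name, the log directory, and the choice of 0.0.0.0 as an IPv4 address are assumptions for illustration; substitute the address that suits your setup.

```python
def tensorboard_command(logdir, host="0.0.0.0", port=6006):
    """Build a TensorBoard command line that binds to an explicit host."""
    return [
        "tensorboard",
        "--logdir", logdir,
        "--host", host,  # explicit host sidesteps the IPv6 bind error
        "--port", str(port),
    ]


# Hypothetical log directory; pass the list to subprocess.run() to launch.
command = tensorboard_command("/tmp/logs")
```

Building the command as a list avoids shell quoting issues when the log directory contains spaces.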

© Copyright 2024, NVIDIA. Last updated on Sep 30, 2024.