The Training with Mixed-Precision User Guide describes how to train deep learning networks with mixed precision on NVIDIA's Volta architecture, and summarizes the ways a framework can be fine-tuned to gain additional speedups by leveraging Volta architectural features.

1. Introduction

Volta is NVIDIA’s latest architecture for deep learning frameworks. Volta retains and extends the same programming models provided by previous NVIDIA architectures such as Pascal. Applications that follow the best practices for those architectures should typically see speedups on the Volta architecture without any code changes. This guide summarizes the ways that a framework can be fine-tuned to gain additional speedups by leveraging the Volta architectural features.

For more information about Volta and its architecture, see Volta.

2. Mixed-Precision Training

The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and CUDA 8 in the NVIDIA Deep Learning SDK.

Mixed-precision is the combined use of different numerical precisions in a computational method.

Compared to higher-precision FP32 or FP64, half-precision (also known as FP16) data reduces the memory usage of a neural network, allowing the training and deployment of larger networks; FP16 data transfers also take less time than FP32 or FP64 transfers.

Single-precision (also known as 32-bit) is a common floating point format (float in C-derived programming languages); the 64-bit format is known as double precision (double).

2.1. Half-Precision Format

The IEEE 754 standard defines the following 16-bit half-precision floating point format: 1 sign bit, 5 exponent bits, and 10 fractional bits.

The exponent is encoded with 15 as the bias, resulting in an exponent range of [-14, 15] (two exponent encodings, 0 and 31, are reserved for special values). An implicit leading bit 1 is assumed for normalized values, just like in other IEEE floating point formats.

Half-precision format leads to the following dynamic range and precision:
Normalized values
2^-14 to 2^15, 11 bits of significand
Denormal values
2^-24 to 2^-15; significand bits decrease as the exponent gets smaller. Exponent k in the [-24, -15] range results in (25 + k) bits of significand precision.
Some example magnitudes:
Maximum normalized
65,504
Minimum normalized
2^-14 ≈ 6.10e-5
Minimum denormal
2^-24 ≈ 5.96e-8

Half-precision dynamic range, including denormals, is 40 powers of 2. For comparison, single-precision dynamic range, including denormals, is 277 powers of 2.
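These limits can be checked directly with NumPy's float16 type (a minimal illustration, not part of the original guide):

```python
import numpy as np

fp16 = np.finfo(np.float16)

# Maximum normalized value: (2 - 2**-10) * 2**15 = 65,504
max_normal = fp16.max

# Minimum normalized value: 2**-14 ~= 6.10e-5
min_normal = fp16.tiny

# Minimum denormal value 2**-24 ~= 5.96e-8 is still representable,
# but values well below it round to zero in FP16
min_denormal = np.float16(2.0 ** -24)
underflow = np.float16(2.0 ** -26)
```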

2.2. Volta Tensor Core Math

The Volta generation of GPUs introduces Tensor Cores, which provide 8x more throughput than single-precision math pipelines. Each Tensor Core performs D = A x B + C, where A, B, C, and D are matrices. A and B are half-precision matrices, whereas D and C can be either half- or single-precision matrices. In other words, Tensor Core math can accumulate half-precision products into either single- or half-precision outputs. D and C are 4x4 matrices, while A and B can be multiples of 4x4.

In practice, higher performance is achieved when A and B dimensions are multiples of 8. cuDNN v7 and cuBLAS 9 include functions that invoke Tensor Core operations; for performance reasons, these require that input and output feature map sizes are multiples of 8. For more information, see the cuDNN Developer Guide.

2.3. Training with Mixed-Precision

Given a framework that supports Volta Tensor Core math, many networks can be trained faster by simply enabling the Tensor Core path in the framework (choosing FP16 format for tensors and/or convolution/fully-connected layers; for more details see Adding a Framework) and keeping all the hyperparameters of the FP32 training session. However, some networks require their gradient values to be shifted into the FP16 representable range in order to match the accuracy of FP32 training sessions. The figure below illustrates one such case. Guidelines for gradient shifting are discussed in the next section.

Figure 1. Histogram of activation gradient magnitudes throughout FP32 training of the Multibox SSD network. The x-axis is logarithmic, except for the zero entry. For example, 66.8% of values were 0 and 4% had magnitudes in the (2^-32, 2^-30) range.
Figure 2. Histogram of activation gradient magnitudes throughout FP32 training of the Multibox SSD network. Both x- and y-axes are logarithmic.
Consider the histogram of activation gradient values (shown with linear and logarithmic y-scales above), collected across all layers during FP32 training of the Multibox SSD detector network (VGG-D backbone). When converted to FP16, 31% of these values become zeros, leaving only 5.3% as non-zeros, which for this network leads to divergence during training.
Note: Much of the FP16 representable range was left unused by the gradient values. Therefore, if we shift the gradient values to occupy more of that range, we can preserve many values that are otherwise lost to 0s.

For this particular network, shifting by 3 exponent values (multiplying by 8) was sufficient to match the accuracy achieved with FP32 training by recovering the relevant values lost to 0. Shifting by 15 exponent values (multiplying by 32K) would recover all but 0.1% of the values lost to 0 when converting to FP16, and still avoid overflow. In other words, the FP16 dynamic range is sufficient for training, but gradients may have to be scaled to move them into that range to keep them from becoming zeros in FP16.
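The effect of this shift is easy to reproduce with NumPy (a toy illustration; the gradient values below are made up, not taken from the SSD network):

```python
import numpy as np

# Simulated activation gradients; the last two are below the FP16
# minimum denormal of 2**-24 and are lost in a direct conversion
grads = np.array([2.0**-12, 2.0**-20, 2.0**-26, 2.0**-27], dtype=np.float32)

direct = grads.astype(np.float16)
surviving_direct = np.count_nonzero(direct)    # only the first two survive

# Shifting by 3 exponent values (multiplying by 8) before conversion
shifted = (grads * 8).astype(np.float16)
surviving_shifted = np.count_nonzero(shifted)  # all four survive
```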

2.3.1. Loss Scaling To Preserve Small Gradient Magnitudes

As was shown in the previous section, successfully training some networks requires gradient value scaling to keep them from becoming zeros in FP16. This can be achieved efficiently with a single multiplication: scale the loss value computed in the forward pass, prior to starting back-propagation. By the chain rule, back-propagation then ensures that all the gradient values are scaled by the same amount. This requires no extra operations during back-propagation and keeps the relevant gradient values from becoming zeros.

Weight gradients must be unscaled before the weight update to keep the magnitude of updates the same as in FP32 training. It is simplest to perform this unscaling right after the backward pass but before gradient clipping or any other gradient-related computations. This ensures that no hyperparameters (such as the gradient clipping threshold, weight decay, etc.) have to be adjusted.

While many networks match FP32 training results when all tensors are stored in FP16, some require updating an FP32 copy of weights. Furthermore, values computed by large reductions should be left in FP32. Examples of this include the statistics (mean and variance) computed by batch-normalization, and SoftMax.

Batch-normalization can still take FP16 inputs and outputs, saving half the bandwidth compared to FP32; only the statistics and value adjustment should be done in FP32. This leads to the following high-level procedure for training:
  1. Maintain a master copy of weights in FP32
  2. For each iteration:
    1. Make an FP16 copy of the weights
    2. Forward propagation (FP16 weights and activations)
    3. Multiply the resulting loss with the scaling factor S
    4. Backward propagation (FP16 weights, activations, and their gradients)
    5. Multiply the weight gradient with 1/S
    6. Complete the weight update (including gradient clipping, etc.)
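The steps above can be sketched with NumPy standing in for the framework's forward and backward passes (an illustration only; the gradient value 2**-26 and the scale factor are made-up examples, not from the guide):

```python
import numpy as np

S = 1024.0   # loss scaling factor
lr = 0.1
master_w = np.array([2.0**-10, -2.0**-10], dtype=np.float32)  # FP32 master weights

for step in range(3):
    w16 = master_w.astype(np.float16)        # 1. FP16 copy of the weights
    # 2-4. Forward and backward run in FP16 with the loss multiplied by S.
    # A true gradient of 2**-26 would flush to zero in FP16; because the
    # loss was scaled, the FP16 backward pass produces S * 2**-26 instead.
    true_grad = 2.0 ** -26
    grad16 = np.float16(S * true_grad) * np.ones_like(w16)
    grad32 = grad16.astype(np.float32) / S   # 5. unscale (multiply by 1/S)
    master_w -= lr * grad32                  # 6. update the FP32 master copy
```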

2.3.2. Choosing a Scaling Factor

The procedure described in the previous section requires you to pick a loss scaling factor to adjust the gradient magnitudes. There is no downside to choosing a large scaling factor as long as it doesn’t cause overflow during back-propagation; overflow would lead to weight gradients containing infinities or NaNs, which in turn would irreversibly damage the weights during the update. These overflows can be easily and efficiently detected by inspecting the computed weight gradients, for example, when multiplying the weight gradient by 1/S in step 5 of the previous section.

One option is to skip the weight update when an overflow is detected and simply move on to the next iteration.

There are several options for choosing the loss scaling factor. The simplest is to pick a constant scaling factor. We trained a number of feed-forward and recurrent networks with Tensor Core math for various tasks with scaling factors ranging from 8 to 32K (many networks did not require a scaling factor), matching the network accuracy achieved by training in FP32. However, since the minimum required scaling factor can depend on the network, framework, minibatch size, etc., some trial and error may be required when picking a scaling value. A constant scaling factor can be chosen more directly if gradient statistics are available: choose a value so that its product with the maximum absolute gradient value is below 65,504 (the maximum value representable in FP16).
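The gradient-statistics rule can be expressed as a small hypothetical helper (the function name and the headroom parameter are our own additions for illustration, not from the guide):

```python
import math

FP16_MAX = 65504.0  # maximum value representable in FP16

def constant_loss_scale(max_abs_grad, headroom=2.0):
    """Largest power-of-two scale whose product with the maximum absolute
    gradient value stays below FP16_MAX, with some safety headroom."""
    scale = 2.0 ** math.floor(math.log2(FP16_MAX / (max_abs_grad * headroom)))
    return max(scale, 1.0)
```

For example, if the largest gradient magnitude observed during FP32 training is 2^-6, this picks a scale of 2^20; the product 2^20 * 2^-6 = 16,384 stays safely below 65,504.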

A more robust approach is to choose the loss scaling factor dynamically. The basic idea is to start with a large scaling factor and then reconsider it in each training iteration. If no overflow occurs for a chosen number of iterations N, increase the scaling factor. If an overflow occurs, skip the weight update and decrease the scaling factor. We found that as long as updates are skipped infrequently, the training schedule does not have to be adjusted to reach the same accuracy as FP32 training. Note that N effectively limits how frequently we may overflow and skip updates. The rate of scaling factor updates can be adjusted by picking the increase/decrease multipliers as well as N, the number of non-overflow iterations before the increase. We successfully trained networks with N = 2000, increasing the scaling factor by 2x and decreasing it by 0.5x; many other settings are valid as well. The dynamic loss-scaling approach leads to the following high-level training procedure:
  1. Maintain a master copy of weights in FP32.
  2. Initialize S to a large value.
  3. For each iteration:
    1. Make an FP16 copy of the weights.
    2. Forward propagation (FP16 weights and activations).
    3. Multiply the resulting loss with the scaling factor S.
    4. Backward propagation (FP16 weights, activations, and their gradients).
    5. If there is an Inf or NaN in weight gradients:
      1. Reduce S.
      2. Skip the weight update and move to the next iteration.
    6. Multiply the weight gradient with 1/S.
    7. Complete the weight update (including gradient clipping, etc.).
    8. If there hasn’t been an Inf or NaN in the last N iterations, increase S.
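The dynamic policy above can be packaged into a small class (a sketch; the defaults follow the N = 2000, 2x increase, 0.5x decrease values mentioned in the text, and the class name is our own):

```python
class DynamicLossScaler:
    """Sketch of the dynamic loss-scaling policy described above."""

    def __init__(self, init_scale=2.0**15, growth_interval=2000,
                 growth_factor=2.0, backoff_factor=0.5):
        self.scale = init_scale
        self.growth_interval = growth_interval  # N non-overflow iterations
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self._good_steps = 0

    def update(self, grads_overflowed):
        """Call once per iteration after inspecting the weight gradients.
        Returns True if the weight update should be skipped."""
        if grads_overflowed:
            self.scale *= self.backoff_factor   # reduce S
            self._good_steps = 0
            return True                         # skip the weight update
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= self.growth_factor    # increase S after N good steps
            self._good_steps = 0
        return False
```

For instance, with a growth interval of 2 for brevity: two clean iterations double the scale, and an overflow halves it and signals a skipped update.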

3. Multi-GPU Training

For multi-GPU training, the same strategy applies for loss scaling. NCCL supports both half-precision floats and normal floats, therefore, a developer can choose which precision they want to use to aggregate gradients. Batch size considerations depend on your training framework.

4. Prerequisites

To take advantage of the Volta architecture and mixed-precision training, ensure you meet the following minimum requirements:

  1. Install NVIDIA drivers. It is recommended to install the latest 384 series of NVIDIA drivers for use with the Tesla V100 GPUs. The latest recommended version of the Linux driver for Tesla V100 is 384.66.
  2. Install the CUDA 9 Toolkit.
  3. Install cuDNN v7.

5. Adding a Framework

Most major deep learning frameworks have begun to merge support for half-precision training techniques that exploit Tensor Core calculations in Volta. Additional optimization pull requests are at various stages and are listed in their respective sections.

For NVCaffe, Caffe2, MXNet, PyTorch, TensorFlow and Theano, Tensor Core acceleration is automatically enabled if FP16 storage is enabled.

While frameworks like Microsoft Cognitive Toolkit (formerly CNTK), Chainer, and Torch will tolerate the Volta architecture, they currently do not exploit Tensor Core functionality.

5.1. NVCaffe

NVCaffe includes support for FP16 storage and Tensor Core math. To achieve optimum performance, you can train a model using Tensor Core math and FP16 mode on NVCaffe.

5.1.1. Running FP16 Training on NVCaffe

  1. Install ImageNet.
  2. Install the following packages:
    sudo apt-get install git build-essential libboost-all-dev \
        protobuf-compiler hdf5-tools hdf5-helpers libhdf5-dev \
        libgflags-dev libprotobuf-dev libgoogle-glog-dev libblas-dev \
        cmake libopencv-dev liblmdb-dev libleveldb-dev libsnappy-dev \
        libatlas-base-dev libatlas-dev python-numpy python-pip
    sudo pip install --upgrade pip
    sudo -H pip install scipy scikit-image protobuf
  3. Clone the latest code from the NVCaffe GitHub Repository.
  4. Build NVCaffe. Issue the following commands:
    caffe_dir$ mkdir build
    caffe_dir$ cd build
    caffe_dir/build$ cmake -DCMAKE_BUILD_TYPE=Release -DTEST_FP16=ON -DUSE_CUDNN=ON -DCPU_ONLY=OFF -G "Unix Makefiles" ..
    caffe_dir/build$ make -j
    caffe_dir/build$ cd ..

    Ensure the following paths link to the ImageNet installation directory:
    caffe_dir$ ls -al examples/imagenet/ilsvrc12_train_lmdb
    lrwxrwxrwx 1 user user 45 May 30 13:53 examples/imagenet/ilsvrc12_train_lmdb -> /some_path_to/ilsvrc12_train_lmdb
    caffe_dir$ ls -al examples/imagenet/ilsvrc12_val_lmdb
    lrwxrwxrwx 1 user user 43 May 30 13:53 examples/imagenet/ilsvrc12_val_lmdb -> /some_path_to/ilsvrc12_val_lmdb
    caffe_dir$ ls -al data/ilsvrc12/imagenet_mean.binaryproto
    lrwxrwxrwx 1 user user 51 May 30 13:51 data/ilsvrc12/imagenet_mean.binaryproto -> /some_path_to/imagenet_mean.binaryproto
  5. Experiment with the following training parameters:
    1. Before running the training script below, adjust the batch size for better performance. To do so, open the training settings with your choice of editor, for example, vim:
      caffe$ vim models/resnet50/train_val_fp16.prototxt

      And change the batch_size: 32 setting value to [64...128] * <Number of GPUs installed>.

    2. Experiment with pure FP16 mode by setting:
      default_forward_math:  FLOAT16
      default_backward_math: FLOAT16

      And by adding solver_data_type: FLOAT16 to the file models/resnet50/solver_fp16.prototxt.

  6. Train ResNet-50. Run:
    caffe$ ./models/resnet50/train_resnet50_fp16.sh
    When the training is finished, it should look similar to the following:
    I0806 06:54:20.037241   276 parallel.cpp:79] Overall multi-GPU performance: 5268.04 img/sec*
    Note: The performance number of 5268 img/sec was measured on an 8-GPU system. On a single-GPU system, you could expect around 660 img/sec with NVCaffe.
  7. View the output. Issue the following command:
    caffe$ python plot_top5.py -s 

    Your output should look similar to the following:

    Figure 3. ResNet-50 FP16 training log

5.1.2. NVCaffe Example

For examples on optimization, see the models/resnet50/train_val_fp16.prototxt file.

5.2. Caffe2

Caffe2 includes support for FP16 storage and Tensor Core math. To achieve optimum performance, you can train a model using Tensor Core math and FP16 mode on Caffe2.

When training a model on Caffe2 using Tensor Core math and FP16, the following actions take place:
  • Prepare your data. You can generate data in FP32 and then cast it down to FP16. The GPU transforms path of the ImageInput operation can do this casting in a fused manner.
  • Forward pass. Since data is given to the network in FP16, all of the subsequent operations will run in FP16 mode, therefore:
    • Select which operators need to have both FP16 and FP32 parameters by setting the type of Initializer used. Typically, the Conv and FC operators need to have both parameters.
    • Cast the output of the forward pass, before SoftMax, back to FP32.
    • To enable Tensor Core, pass enable_tensor_core=True to ModelHelper when representing a new model.
  • Update the master FP32 copy of the weights using the FP16 gradients you just computed. For example:
    • Cast up gradients to FP32.
    • Update the FP32 copy of parameters.
    • Cast down the FP32 copy of parameters to FP16 for the next iteration.
  • Gradient scaling.
    • To scale, multiply the loss by the scaling factor.
    • To descale, divide LR and weight_decay by the scaling factor.
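The descaling recipe in the last bullet works because folding 1/S into the learning rate is algebraically the same as unscaling the gradients; a quick NumPy check (the values are arbitrary):

```python
import numpy as np

S = 128.0          # loss scaling factor
lr = 0.1
w = np.float32(0.5)
scaled_grad = np.float32(S * 0.01)   # gradient produced by the scaled loss

# Option A: unscale the gradient, then apply the usual learning rate
w_a = w - lr * (scaled_grad / S)

# Option B (recipe above): keep the gradient scaled, divide LR by S
w_b = w - (lr / S) * scaled_grad
```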

5.2.1. Running FP16 Training on Caffe2

  1. Clone the latest code from the Caffe2 GitHub Repository and build.
  2. Define the installation path to where Python can find your modules.
    1. Set PYTHONPATH to $<INSTALL_PATH>/caffe2/python.
  3. Run the following Python script with the appropriate command line arguments. You can test using the ResNet-50 image classification training script included in Caffe2.
    python caffe2/python/examples/resnet50_trainer.py --train_data <path> \
        --test_data <path> --num-gpus <int> --batch-size <int> \
        --dtype float16 --enable-tensor-core \
        --cudnn_workspace_limit_mb 1024 --image_size 224
    For more information about the additional command-line arguments, issue the following command:
    caffe2/python/examples/resnet50_trainer.py --help
    To enhance performance, the following changes must be made:
    • The network definition in caffe2/python/models/resnet.py must be changed to reflect version 1 of the network by changing the residual block striding from the 3x3 convolution to the first 1x1 convolution operator.
    • Enable optimized communication operators and disable some communication ops by adding the use_nccl=True and broadcast_computed_params=False flags to the data_parallel_model.Parallelize call in caffe2/python/examples/resnet50_trainer.py.
    • Add decode_threads=3 and use_gpu_transform=True to the brew.image_input call. This tweaks the amount of CPU threads used for data decode and augmentation (value is per-GPU) and enables the use of the GPU for some data augmentation work.
    • Increase the number of host threads used to schedule operators on the GPUs by adding train_model.net.Proto().num_workers = 4 * len(gpus) after the call to data_parallel_model.Parallelize.

5.2.2. Caffe2 Example

For more information, you can find examples at: Caffe2 Python Examples.

5.3. MXNet

MXNet includes support for FP16 storage and Tensor Core math. To achieve optimum performance, you need to train a model using Tensor Core math and FP16 mode on MXNet.

The following procedure is typical for when you want to have your entire network in FP16. Alternatively, you can take output from any layer and cast it to FP16. Subsequent layers will be in FP16 and will use Tensor Core math if applicable.

5.3.1. Running FP16 Training on MXNet

  1. Install OpenCV 3.1.0. There are two ways to install OpenCV:
    1. Install OpenCV to user prefix. Ensure that you modify the PKG_CONFIG_PATH environment variable to include the <opencv-build-dir>/lib/pkgconfig directory.
    2. Install OpenCV for system-wide install. This method requires root privileges. The following code sample installs OpenCV for a system-wide install.
    OPENCV_VERSION=3.1.0 && \
         wget -q -O - 
    | tar -xzf - && \
         cd opencv-${OPENCV_VERSION} && \
          cmake -DWITH_CUDA=OFF -DWITH_1394=OFF \
          -DBUILD_opencv_cudalegacy=OFF -DBUILD_opencv_stitching=OFF \
          -DWITH_IPP=OFF . && \
         make -j"$(nproc)" install && \
         cp lib/cv2.so /usr/local/lib/python2.7/site-packages/ && \
         rm -rf opencv-${OPENCV_VERSION}
  2. Install libjpeg-turbo.
         apt-get install autoconf automake libtool nasm 
         JPEG_TURBO_VERSION=1.5.2 && \
         wget -q -O -
    O_VERSION}.tar.gz | tar -xzf - && \
         cd libjpeg-turbo-${JPEG_TURBO_VERSION} && \
         autoreconf -fiv && \
     ./configure --enable-shared --prefix=/usr 2>&1 >/dev/null && \
         make -j"$(nproc)" install 2>&1 >/dev/null && \
         rm -rf libjpeg-turbo-${JPEG_TURBO_VERSION}
  3. Clone the latest code from the MXNet GitHub Repository, as well as its sub-modules.
  4. Apply the following pull requests:
    1. Apply pull request 7152 for faster and improved IO pipeline.
      git fetch origin pull/7152/head:improved_io
      git checkout improved_io
    2. Cherry pick commit 2580dc063cb91fe29d17d55d84f2e39b83676ae8 that enables persistent batch normalization in cuDNN v7.
      git cherry-pick 2580dc063cb91fe29d17d55d84f2e39b83676ae8
    3. Cherry pick commit 087f96e45fb2cb2d305557ba03789ae2ee367417 that fixes usages of ModernGPU library in MXNet.
      git cherry-pick 087f96e45fb2cb2d305557ba03789ae2ee367417
    4. Cherry pick commit 842c096284623d9a5bc195259a157d7749fcd593 that fixes hangs in depthwise convolutions on Volta.
      git cherry-pick 842c096284623d9a5bc195259a157d7749fcd593
    5. Apply pull request 7654 that enables mixed precision training on all optimizers.
      git fetch origin pull/7654/head:mp_pr
      git merge mp_pr
    6. Update MXNet submodules.
      git submodule init 
      git submodule update
  5. Copy the config.mk file from make/ directory to the main MXNet directory.
    cp make/config.mk config.mk
  6. Modify the config.mk file.
    echo "USE_CUDA=1" >> config.mk && \
    echo "USE_CUDNN=1" >> config.mk && \
    echo "CUDA_ARCH :=" \
             "-gencode arch=compute_35,code=sm_35" \
             "-gencode arch=compute_52,code=sm_52" \
             "-gencode arch=compute_60,code=sm_60" \
             "-gencode arch=compute_61,code=sm_61" \
             "-gencode arch=compute_70,code=sm_70" \
                  "-gencode arch=compute_70,code=compute_70" >> config.mk && \
    echo "USE_CUDA_PATH=/usr/local/cuda" >> config.mk && \
    echo "USE_TURBO_JPEG=1" >> config.mk && \
    echo "USE_TURBO_JPEG_PATH=/usr" >> config.mk
  7. Implement the improved IO pipeline. Use the IndexedRecordIO input format, which differs from the legacy RecordIO format by including an additional index file with an .idx extension. The .idx file is generated automatically when using the im2rec.py tool. If you only have the .rec file, you can generate the index file with the tools/rec2idx.py tool:
    python tools/rec2idx.py <path to .rec file> <path to newly created .idx file>


To verify MXNet is trained with FP16:
  1. Cast the data (input to the network) to FP16.
    mxnet.sym.Cast(data=input_data, dtype=numpy.float16)
  2. Cast back to FP32 before the SoftMax layer.
  3. If you encounter precision problems, it is beneficial to scale the loss up by 128 and to scale the application of the gradients down by 128. This keeps the gradients larger during the backward pass while still updating the weights correctly. For example, if our last layer is mx.sym.SoftmaxOutput (cross-entropy loss) and the initial learning rate is 0.1, add a grad_scale parameter:
    mxnet.sym.SoftmaxOutput(other_args, grad_scale=128.0)
    When initializing the optimizer, rescale the gradient down prior to the application:
    mxnet.optimizer.SGD(other_args, rescale_grad=1.0/128)
    Tip: When training in FP16, it is best to use multi-precision optimizers that keep the weights in FP32 and perform the backward pass in FP16. For example, for SGD with momentum, you would issue the following:
    mxnet.optimizer.SGD(other_args, momentum=0.9, multi_precision=True)

    Alternatively, you can pass 'multi_precision': True to the optimizer_params option in the model.fit method.

5.3.2. MXNet Example

You can enable FP16 in these networks by passing the --dtype float16 option to the train_imagenet.py script.

In the following example, ResNet-50 v1 is trained on 8 Volta GPUs with ImageNet data. This example assumes that the data is in the /data/imagenet/ directory and is pre-resized so that the shorter side is 480px.
python example/image-classification/train_imagenet.py \
    --gpu 0,1,2,3,4,5,6,7 --batch-size 1024 --data-train /data/imagenet/train.rec \
    --data-train-idx /data/imagenet/train.idx --data-val /data/imagenet/val.rec \
    --disp-batches 100 --network resnet-v1 --num-layers 50 --data-nthreads 40 \
    --min-random-scale 0.533 --max-random-shear-ratio 0 --max-random-rotate-angle 0 \
    --max-random-h 0 --max-random-l 0 --max-random-s 0 --dtype float16

For example networks, refer to the example/image-classification/symbols/ directory.

5.4. PyTorch

PyTorch includes support for FP16 storage and Tensor Core math. To achieve optimum performance, you can train a model using Tensor Core math and FP16 mode on PyTorch.

5.4.1. Running FP16 Training on PyTorch

  1. Clone the latest code from the PyTorch GitHub Repository - Code.
  2. Download the latest examples from the PyTorch GitHub Repository - Examples.
  3. Apply the following pull requests:
    1. Apply pull request 2388 to use cuDNN batch normalization to the github.com/pytorch/pytorch directory.
    2. Apply pull request 203 for FP16 support to the github.com/pytorch/examples directory.
  4. Compile PyTorch with the following:
    • CUDA 9 RC compiler with the Toolkit
    • cuDNN v7
    • NCCL
    For compiling instructions, see GitHub PyTorch instructions.
    1. Ensure your CUDA_HOME directory points to the CUDA 9 Toolkit.
    2. Ensure CUDNN_LIB_DIR and CUDNN_INCLUDE_DIR point to the cuDNN v7 library and header.
    3. Ensure the NCCL library is installed and discoverable by PyTorch installation scripts. If system-wide NCCL is not detected, PyTorch will compile its own library from GitHub.
    4. Optional: To run examples, install the torchvision package (for example, pip install torchvision). This package contains pre-defined models for image training and utilities to load the data.
    Do not install magma. The magma conda package is compiled with CUDA 8 and running it on a multi-GPU machine with CUDA 9 leads to hangs. PyTorch will compile without magma support. Compiling magma with CUDA 9.0 in a way suitable for PyTorch is outside the scope of this document.

5.4.2. PyTorch Example

To run FP16 training jobs, you need to make modifications to the PyTorch framework on the user script level. The following steps are implemented in the ImageNet and world language model examples in the PyTorch examples repository.

To run these examples with FP16, follow the instructions for the corresponding examples and add --fp16 to command line arguments. For example, assuming your ImageNet training and validation folders are in the examples/imagenet folder, you would issue the following commands:
python imagenet/main.py -a resnet50 imagenet/ --workers 40 --batch-size 512 --fp16
python word_language_model/main.py --cuda --emsize 1536 --nhid 1536 --dropout 0.65 --epochs 40 --fp16
  1. Cast the model and inputs to FP16.
    model = model.cuda().half()
    input = input.cuda().half()
    1. Optional: For parallel training, instead of directly casting inputs to FP16 on the GPU, add a layer to the model that would convert inputs on the GPU from FP32 to FP16.
  2. Create 32-bit master copy of the parameters. Create the optimizer using the master copy of the parameters.
    param_copy = [param.clone().type(torch.cuda.FloatTensor).detach()
                  for param in model.parameters()]
    for param in param_copy:
        param.requires_grad = True
    optimizer = torch.optim.SGD(param_copy, lr, momentum=momentum)
  3. Optional: If the model uses batch normalization, replace the batch normalization layers in the model definition with a special batch normalization layer that uses cuDNN and stores its parameters and buffers in FP32.
    nn.BatchNorm2d = torch.nn.contrib.BatchNorm2dFP16
  4. Optional: Scale the loss.
    loss = loss * scale_factor
  5. At each optimization step in the training loop, perform the following:
    1. Cast gradients to FP32. If a loss was scaled, descale the gradients.
    2. Apply updates in FP32 precision and copy the updated parameters to the model, casting them to FP16.
    params = list(model.parameters())
    for i in range(len(params)):
        param_copy[i]._grad = params[i].grad.clone().type_as(param_copy[i]).detach()
    optimizer.step()
    for i in range(len(params)):
        params[i].data.copy_(param_copy[i].data)

5.5. TensorFlow

TensorFlow plans to support FP16 storage and Tensor Core math in TensorFlow 1.4; in the meantime, you can apply patches from pull requests to enable Tensor Core math by default. Any models that contain convolutions or matrix multiplies using the tf.float16 data type will then automatically take advantage of Tensor Core hardware whenever possible.

When updating a TensorFlow model to use FP16 storage and Tensor Core math, the following guidelines are recommended:
  • Keep trainable variables in float32 precision and cast them to float16 before using them in the model. For example:
    tf.cast(tf.get_variable(..., dtype=tf.float32), tf.float16)
  • Perform the SoftMax calculation in float32 precision. For example:
    tf.losses.softmax_cross_entropy(target, tf.cast(logits, tf.float32))
  • Apply loss-scaling as outlined in the Mixed-Precision Training section. Loss scaling involves multiplying the loss by a scale factor before computing gradients, and then dividing the resulting gradients by the same scale again to re-normalize them. For example, to apply a constant loss scaling factor of 128:
    loss, params = ...
    scale = 128
    grads = [grad / scale for grad in tf.gradients(loss * scale, params)]

5.5.1. Running FP16 Training on TensorFlow

  1. Apply the following pull requests to the public TensorFlow sources:
    1. Apply pull request 12504 for Nan propagation for GPU pooling ops.
    2. Apply pull request 12920 for a workaround for NVCC 9.0 bug.
    3. Apply pull request 13070 to use AllClose instead of AllEqual in layers test.
    4. Apply pull request 13252 to update Tensor Core math selection in GetConvolveAlgorithms.
    $ git clone https://github.com/tensorflow/tensorflow
    $ cd tensorflow
    $ git remote add nv-patches https://github.com/nluehr/tensorflow
    $ git fetch nv-patches
    $ git checkout -b volta-build ea94bbe9fa9f9b3d01fb057c02ef7873d76bf09c
    $ git cherry-pick nv-patches/cuda9-internal-error-fix
    $ git cherry-pick nv-patches/NaN-prop
    $ git cherry-pick nv-patches/relax-layers-test
    $ git cherry-pick nv-patches/tensor-op-search
  2. Install Bazel and TensorFlow’s Python dependencies.
    $ sudo apt-get install python-numpy python-dev python-pip python-wheel
    $ sudo apt-get install pkg-config zip g++ zlib1g-dev unzip python
    $ mkdir bazel-tmp
    $ cd bazel-tmp
    $ curl -fSsL -O https://github.com/bazelbuild/bazel/releases/download/0.5.4/bazel-0.5.4-installer-linux-x86_64.sh
    $ bash ./bazel-0.5.4-installer-linux-x86_64.sh
    $ cd ..
    $ rm -rf bazel-tmp
  3. Run the configure script to enable CUDA 9 and cuDNN 7 support.
    $ ./configure
  4. Build and install TensorFlow’s pip package.
    $ bazel build -c opt --config=cuda tensorflow/tools/pip_package:build_pip_package
    $ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/pip
    $ pip install --upgrade /tmp/pip/tensorflow-*.whl
    $ rm -rf /tmp/pip/tensorflow-*.whl
    $ bazel clean --expunge

5.5.2. TensorFlow Example

The following script demonstrates construction and training of a simple multinomial logistic regression model. The script uses the FP16-training guidelines described in the previous section.

import tensorflow as tf
import numpy as np

def float32_variable_storage_getter(getter, name, shape=None, dtype=None,
                                    initializer=None, regularizer=None,
                                    trainable=True, *args, **kwargs):
    """Custom variable getter that forces trainable variables to be stored in
    float32 precision and then casts them to the training precision."""
    storage_dtype = tf.float32 if trainable else dtype
    variable = getter(name, shape, dtype=storage_dtype,
                      initializer=initializer, regularizer=regularizer,
                      trainable=trainable,
                      *args, **kwargs)
    if trainable and dtype != tf.float32:
        variable = tf.cast(variable, dtype)
    return variable

def gradients_with_loss_scaling(loss, variables, loss_scale):
    """Gradient calculation with loss scaling to improve numerical stability
    when training with float16.
    """
    return [grad / loss_scale
            for grad in tf.gradients(loss * loss_scale, variables)]

def create_simple_model(nbatch, nin, nout, dtype):
    """A simple softmax model."""
    data    = tf.placeholder(dtype, shape=(nbatch, nin))
    weights = tf.get_variable('weights', (nin, nout), dtype)
    biases  = tf.get_variable('biases', nout, dtype,
                              initializer=tf.zeros_initializer())
    logits  = tf.matmul(data, weights) + biases
    target  = tf.placeholder(tf.float32, shape=(nbatch, nout))
    # Note: The softmax should be computed in float32 precision
    loss    = tf.losses.softmax_cross_entropy(
        target, tf.cast(logits, tf.float32))
    return data, target, loss

if __name__ == '__main__':
    nbatch = 64
    nin    = 100
    nout   = 10
    learning_rate = 0.1
    momentum      = 0.9
    loss_scale    = 128
    dtype         = tf.float16

    # Create training graph
    with tf.device('/gpu:0'), \
         tf.variable_scope(
             # Note: This forces trainable variables to be stored as float32
             'fp32_storage', custom_getter=float32_variable_storage_getter):
        data, target, loss = create_simple_model(nbatch, nin, nout, dtype)
        variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
        # Note: Loss scaling can improve numerical stability for fp16 training
        grads = gradients_with_loss_scaling(loss, variables, loss_scale)
        optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
        training_step_op = optimizer.apply_gradients(zip(grads, variables))
        init_op = tf.global_variables_initializer()

    # Run training
    sess = tf.Session()
    sess.run(init_op)
    np_data   = np.random.normal(size=(nbatch, nin)).astype(np.float16)
    np_target = np.zeros((nbatch, nout), dtype=np.float32)
    np_target[:,0] = 1
    print('Step Loss')
    for step in range(30):
        np_loss, _ = sess.run([loss, training_step_op],
                              feed_dict={data: np_data, target: np_target})
        print('%4i %6f' % (step + 1, np_loss))
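To see why the loss scaling in the script above matters, the following standalone NumPy sketch (not part of the guide's script; the constants are illustrative) shows a small FP32 gradient underflowing to zero when cast directly to FP16, while a copy scaled by 128 survives and can be recovered after unscaling:

```python
import numpy as np

# A gradient value that is representable in FP32 but below the FP16
# subnormal range (the smallest FP16 subnormal is about 6e-8).
grad  = np.float32(1e-8)
scale = np.float32(128.0)   # the loss scale used in the script above

unscaled  = np.float16(grad)            # underflows to zero in FP16
scaled    = np.float16(grad * scale)    # 1.28e-6 survives as an FP16 subnormal
recovered = np.float32(scaled) / scale  # unscale back in FP32

print(unscaled)    # 0.0 -- the gradient is lost without scaling
print(recovered)   # approximately 1e-08 -- recovered with scaling
```

The recovered value is not bit-exact (FP16 subnormals are coarse), but it is close enough to preserve the gradient signal, which is the point of loss scaling.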

5.6. Theano

Theano includes support for FP16 storage and Tensor Core math. To make use of Tensor Core math, set the dnn.conv.algo_xxx configuration parameters to time_once or time_on_shape_change, which autotune the cuDNN convolution algorithm selection and thereby allow Tensor Core algorithms to be chosen.
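For instance, the selection can be made from the command line via THEANO_FLAGS (the script name below is a placeholder; verify the flag names against your Theano version's dnn.conv configuration):

    $ THEANO_FLAGS="dnn.conv.algo_fwd=time_once,dnn.conv.algo_bwd_filter=time_once,dnn.conv.algo_bwd_data=time_once" python my_script.py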

5.6.1. Running FP16 Training on Theano

Theano is fully parameterized on the floatX type; therefore, most Theano scripts can be run in FP16 simply by setting floatX to float16.
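For example, assuming a script named my_script.py (a placeholder name), FP16 can be requested from the command line without modifying the script:

    $ THEANO_FLAGS="floatX=float16" python my_script.py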

5.7. Microsoft Cognitive Toolkit

Microsoft Cognitive Toolkit runs on the Volta architecture but does not currently support FP16 storage, and therefore does not exploit the Tensor Core math operations available in Volta. Even so, our internal benchmarks have observed about a 1.5X speedup of CNN training in the Cognitive Toolkit on a single Volta GPU over a single Pascal GPU.

6. Deploying DNNs

After you have trained a neural network, you can optimize and deploy the model for GPU inferencing with TensorRT. For more information about optimizing and deploying using TensorRT, see Deep Learning SDK Documentation.





NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.


NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DIGITS, DGX, DGX-1, Jetson, Kepler, NVIDIA Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, and Tesla are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.