Profiling PyTorch with PyProf


For FLOP and bandwidth calculations, we use a relatively straightforward approach. For example, for matrices A (M x K) and B (K x N), the FLOP count for a matrix multiplication is 2 * M * N * K, and the memory traffic is M * K + K * N + M * N elements (read A, read B, write the output). Note that the numbers PyProf generates are based on the algorithm, not the measured performance of the specific kernel. For more details, see NVIDIA’s Deep Learning Performance Guide.
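As a quick illustration, here is a minimal sketch of these formulas in Python (the function name and the FP16 element size are illustrative choices, not part of PyProf):

def gemm_flops_and_bytes(M, N, K, bytes_per_element=2):
    """Algorithmic cost of a GEMM C = A (MxK) @ B (KxN).
    FLOPs count each fused multiply-add as 2 operations; bytes assume
    A and B are each read once and C is written once."""
    flops = 2 * M * N * K
    bytes_moved = (M * K + K * N + M * N) * bytes_per_element
    return flops, bytes_moved

# Example: a 1024 x 1024 x 1024 GEMM in FP16
flops, bytes_moved = gemm_flops_and_bytes(1024, 1024, 1024)
print(flops / bytes_moved)  # arithmetic intensity, in FLOPs per byte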

Using the information provided by PyProf, the user can identify various issues to help tune the network. For instance, according to the Tensor Core Performance Guide, the M, N and K dimensions must be divisible by 8 for a GEMM to use Tensor Cores. PyProf includes a flag that reports whether Tensor Cores were used by each kernel. Other useful information might include knowing that a particular kernel did not exploit much thread parallelism, as determined by the grid/block dimensions. Since many PyTorch kernels are open-source (or even custom written by the user, as with CUDA Extensions), this information helps root-cause performance issues and prioritize optimization work.
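For example, a quick divisibility check along these lines can flag GEMM shapes that prevent Tensor Core usage (a sketch; the function is illustrative, not a PyProf API):

def tensor_core_friendly(M, N, K, multiple=8):
    # FP16 Tensor Core kernels require M, N and K divisible by 8,
    # per the Tensor Core Performance Guide
    return all(dim % multiple == 0 for dim in (M, N, K))

print(tensor_core_friendly(1024, 1024, 1024))  # True
print(tensor_core_friendly(1000, 1024, 1024))  # False: M % 8 != 0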

Components and Flow

There are four steps to the PyProf profile flow:

  1. Import PyProf: importing the PyProf module is required so that it can intercept calls to all PyTorch functions, as well as custom functions and modules.

  2. Profile PyTorch Model: Profile the model with either NVProf or Nsight Systems to obtain an SQL database.

  3. Extract information from the SQL database.

  4. Use this information to calculate FLOPs and bytes.

Enable Profiler in PyTorch Network

PyProf makes use of the profiler functionality available in PyTorch. The profiler allows you to inspect the cost of different operators inside your model, on both CPU and GPU, via the “emit_nvtx()” function.

To enable the profiler, you must add the following lines to your PyTorch network:

import torch.cuda.profiler as profiler
import pyprof
pyprof.init()

Run the training/inference loop inside PyTorch’s NVTX context manager, with torch.autograd.profiler.emit_nvtx(). Optionally, you can use profiler.start() and profiler.stop() to pick an iteration (say, after warm-up) for which you would like to capture data. Here’s an example:

iters = 500
iter_to_capture = 100

# Define network, loss function, optimizer etc.

# PyTorch NVTX context manager
with torch.autograd.profiler.emit_nvtx():

    for iter in range(iters):

        # Start capturing at the chosen iteration (after warm-up)
        if iter == iter_to_capture:
            profiler.start()

        output = net(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        # Stop capturing after the chosen iteration
        if iter == iter_to_capture:
            profiler.stop()

Profile with NVIDIA Profilers

After modifying the PyTorch script to import pyprof, you will need to use either NVProf or Nsight Systems to profile the run. Both profilers output an SQLite database containing the results of the profile.

Please note that NVProf is currently being phased out, and it is recommended to use Nsight Systems to future-proof your profiling process.

Additionally, only Nsight Systems is available in the pre-built NGC container and the manually built Docker container.

Profile with NVProf

If you are not using NVProf, skip ahead to Profile with Nsight Systems.

Run NVProf to generate a SQL (NVVP) file. This file can also be opened with NVVP (the NVIDIA Visual Profiler).

$ nvprof
    -f             # Overwrite existing file
    -o net.sql     # Create net.sql
    python net.py  # Script to profile (example name)

If using profiler.start() and profiler.stop() in your script, add --profile-from-start off so that profiling is controlled by those calls:

$ nvprof
    -f                        # Overwrite existing file
    -o net.sql                # Create net.sql
    --profile-from-start off  # Profiling controlled by start/stop inside the script
    python net.py             # Script to profile (example name)

Note: if you’re experiencing issues with hardware counters and you get a message such as

ERR_NVGPUCTRPERM: The user running <tool_name/application_name> does not have permission to access NVIDIA GPU Performance Counters on the target device.

please follow the steps described in Hardware Counters below.

Profile with Nsight Systems

Run Nsight Systems to generate a SQLite file.

$ nsys profile
    -f true                  # Overwrite existing files
    -o net                   # Create net.qdrep (used by the Nsight Systems viewer)
    -c cudaProfilerApi       # Optional argument, required for profiler start/stop
    --stop-on-range-end true # Optional argument, required for profiler start/stop
    --export sqlite          # Export net.sqlite (similar to NVProf)
    python net.py            # Script to profile (example name)

If using profiler.start() and profiler.stop() in your script, the options -c cudaProfilerApi and --stop-on-range-end true are required.

Note: if you are experiencing slow profiling, nsys provides the option -s none, which disables CPU sampling and significantly speeds up profiling.

Parse the SQL file

Run the parser on the SQL file (note that Nsight Systems names its exported database net.sqlite). The output is an ASCII file in which each line is a Python dictionary containing information about the kernel name, duration, parameters, etc. This file can also be used as input to other custom scripts.

$ python -m pyprof.parse net.sqlite > net.dict
Extracted information for each GPU kernel:

  • Kernel name
  • Duration (e.g. 44736 ns)
  • Grid and block dimensions
  • Thread ID, Device ID, Stream ID (e.g. 23, 0, 7)

In addition, PyProf extracts:

  • Call stack
  • Layer name
  • Tensor shapes (e.g. [32, 64, 56, 56])
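Because each line of net.dict is a Python dictionary, post-processing with a short script is straightforward. A minimal sketch (the key name "kDuration" is an assumption to verify against your own output; the printed key list shows what is actually available):

import ast

# Read the parser output: one Python dictionary per line
with open("net.dict") as f:
    kernels = [ast.literal_eval(line) for line in f if line.strip()]

# Inspect the fields available on the first kernel record
print(sorted(kernels[0].keys()))

# Example: sort kernels by duration; check the exact key name
# (assumed here to be "kDuration") against the output above
kernels.sort(key=lambda k: k.get("kDuration", 0), reverse=True)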



Run the Prof Script

Using the Python dictionary created in step 3 as the input, PyProf can produce a CSV output, a columnated output (similar to column -t, for terminal readability), and a space-separated output (for post-processing by AWK, for instance). It produces 20 columns of information for every GPU kernel, but you can select a subset of columns using the -c flag. Note that a few columns might have the value “na”, implying either that it is a work in progress or that the tool was unable to extract that information. Here are a few examples of how to use pyprof.prof:

  • Print usage and help. Lists all available output columns:

    $ python -m pyprof.prof -h
  • Columnated output of width 150 with some default columns:

    $ python -m pyprof.prof -w 150 net.dict
  • CSV output:

    $ python -m pyprof.prof --csv net.dict
  • Space separated output:

    $ python -m pyprof.prof net.dict
  • Columnated output of width 130 with columns index,direction,kernel name,parameters,silicon time:

    $ python -m pyprof.prof -w 130 -c idx,dir,kernel,params,sil net.dict
  • CSV output with columns index,direction,kernel name,parameters,silicon time:

    $ python -m pyprof.prof --csv -c idx,dir,kernel,params,sil net.dict
  • Space separated output with columns index,direction,kernel name,parameters,silicon time:

    $ python -m pyprof.prof -c idx,dir,kernel,params,sil net.dict
  • Input redirection:

    $ python -m pyprof.prof < net.dict
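As an example of such post-processing, the following sketch totals silicon time per kernel from a CSV produced with python -m pyprof.prof --csv -c kernel,sil net.dict > net.csv (the header names "kernel" and "sil" are assumed to match the -c flags):

import csv
from collections import defaultdict

# Total silicon time (ns) per kernel name; column names "kernel"
# and "sil" are assumed to match the -c flags used to make the CSV
totals = defaultdict(int)
with open("net.csv") as f:
    for row in csv.DictReader(f):
        totals[row["kernel"]] += int(row["sil"])

# Print the five most expensive kernels
for name, ns in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{ns:>12} ns  {name}")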
Options for pyprof.prof

  net.dict  Input file for pyprof.prof; generated by pyprof.parse
  -c        Select output columns; see the column options table below
  --csv     Print a CSV output. Exclusively use --csv or -w
  -w        Width of the columnated output. Exclusively use --csv or -w

Column Options

  idx     Index
  seq     PyTorch Sequence Id
  altseq  PyTorch Alternate Sequence Id
  tid     Thread Id
  layer   User annotated NVTX string (can be nested)
  trace   Function Call Trace
  dir     Direction (fprop or bprop)
  sub     Sub Sequence Id
  mod     Module (e.g. torch.nn.functional)
  op      Operation (e.g. linear)
  kernel  Kernel Name
  params  Kernel parameters (e.g. tensor shapes)
  sil     Silicon Time (in ns)
  tc      Tensor Core Usage
  device  GPU Device Id
  stream  Stream Id
  grid    Grid Dimensions
  block   Block Dimensions
  flops   Floating point ops (FMA = 2 FLOPs)
  bytes   Number of bytes in and out of DRAM

The default options are “idx,dir,sub,mod,op,kernel,params,sil”.
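For example, to see Tensor Core usage, FLOPs, and bytes alongside each kernel, the column names from the table above can be combined freely:

$ python -m pyprof.prof -w 150 -c idx,kernel,sil,tc,flops,bytes net.dict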

Hardware Counters

Profiling GPU workloads may require access to hardware performance counters. Due to a fix in recent NVIDIA drivers addressing CVE-2018-6260, hardware counters are disabled by default and require elevated privileges to be enabled again. If you’re using a recent driver, you may see the following message when trying to run nvprof:

ERR_NVGPUCTRPERM: The user running <tool_name/application_name> does not have permission to access NVIDIA GPU Performance Counters on the target device.

For details, see NVIDIA’s documentation on ERR_NVGPUCTRPERM.

Permanent Solution

Follow the steps in NVIDIA’s ERR_NVGPUCTRPERM resolution guide. The current steps for Linux are:

# Stop the graphical session and unload the NVIDIA kernel modules
sudo systemctl isolate multi-user
sudo modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia
# Reload the driver with profiling allowed for non-admin users
sudo modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0
sudo systemctl isolate graphical

The above steps should result in a permanent change.

Temporary Solution

When running on bare metal, you can run nvprof with sudo.

If you’re running in a Docker image, you can temporarily elevate your privileges with one of the following (oldest to newest syntax):

nvidia-docker run --privileged
docker run --runtime nvidia --privileged
docker run --gpus all --privileged