NVIDIA Nsight Deep Learning Designer

Information on all views, controls, and workflows within the tool.

Overview

Nsight Deep Learning Designer is a software tool whose goal is to speed up Deep Learning developers’ workflow by providing tools to design and profile models in an interactive manner.

Using Nsight Deep Learning Designer, you can iterate faster on your model by rapidly launching inference runs and profiling the layers’ behavior using GPU performance counters.

Nsight Deep Learning Designer uses ONNX, the Open Neural Network Exchange format, to represent models. Profiling uses TensorRT and ONNX Runtime as companion inference frameworks.

Model Design

This section explains how to efficiently design an ONNX model inside Nsight Deep Learning Designer and how to leverage the various features available for doing so.

Creating a New Model

Nsight Deep Learning Designer can open an existing ONNX model or create a new one from scratch. To create a new model, use the dedicated wizard found under File > New File.

../_images/create-model-dialog.png

From this dialog, the desired ONNX Opset version for the new model can be selected. All Opsets from version 1 up to and including version 19 are currently supported. The ONNXRuntime Contrib Operator set may be imported as well.

Workspace

Each opened ONNX model is represented by a document tab in the Nsight Deep Learning Designer workspace. Multiple models can be opened simultaneously and arranged using the docking system. The central element in the Nsight Deep Learning Designer workspace is the canvas where you create/edit your model graph by dropping layer nodes and creating connections between them. The workspace can be arranged using the dockable tool windows to best fit your desired workflow. All tool windows can be found under View > Windows. Refer to the commands under the Window menu to save, apply or reset layouts.

../_images/model-design-workspace.png

The default workspace is composed of the Layer Palette, the Parameter Window, Initializer Editor, and the Type Checking window.

To make it easier to align nodes visually in the canvas, a background grid can be enabled using the View > Show Grid menu action. Nodes in the canvas can be identified by their unique name (automatically generated by Nsight Deep Learning Designer if not present in the ONNX model) and by their layer type.

Editing actions that modify the current graph or model will mark the document as modified. Changes are only reflected on disk once the model is saved using the File > Save menu action or the Ctrl + S shortcut. Unsaved editing actions can be undone and redone, using the Edit > Undo/Redo menu actions or the respective Ctrl + Z and Ctrl + Y shortcuts.

Layout

The first time a model is opened in Nsight Deep Learning Designer, a layout algorithm automatically positions the nodes on the canvas. ONNX models saved through Nsight Deep Learning Designer preserve individual node positions as model metadata in order to restore node positions when re-opened. The layout algorithm can be run explicitly on any model using the View > Arrange Nodes menu action.

Colors

Nodes in the canvas are colored according to their type. As there are too many ONNX node types for each one to receive a distinguishable unique color, related node types share the same color. Certain classes of nodes also have different shapes: input and output nodes are represented as diamonds, while composite nodes like local functions or ones containing subgraphs are represented as squared-off rectangles rather than rounded rectangles.

Nsight Deep Learning Designer provides some alternative color schemes, which may be helpful for individuals with color vision deficiency. The color scheme can be changed from the Network Canvas Preferences page of the Options (Tools > Options) dialog. The figure below shows the available color schemes.

../_images/layer-scheme-options.png

Context Menu

Some actions can be performed from the canvas context menu, accessible by right-clicking anywhere on the canvas. Selected nodes and links may be copied, cut, or deleted via the corresponding actions in the context menu. Similarly, a previously copied selection can be pasted on the canvas.

When a single node is selected, its documentation can be opened using the Go To Documentation context menu action. When a link is selected, the Go To Source and Go To Destination actions jump to the source or destination node of the link, respectively.

Finally, a selected portion of the graph from the canvas can be extracted to a standalone ONNX model using the Extract Subgraph action. All initializers in use in the subgraph will also be exported, and model inputs/outputs will be created to account for missing connections from the extracted graph.

Exporting the Canvas

The entire model canvas may be exported to a single image file using the File > Export > Export Canvas As Image menu action. The background color, grid color and presence, and image save location may be set from the dialog pictured below. The supported image formats are PNG, JPEG, and SVG.

../_images/export-canvas-dialog.png

Layers

The Layer Palette holds the list of available operators that can be added to the model from the operator sets it imports. Layers can be arranged by name, collection, or category. The Layer palette can also be sorted or filtered. To add new layer instances to the canvas, simply drag and drop from the palette.

../_images/model-design-layer-palette.png

Alternatively, place the mouse cursor anywhere in the model canvas and press ‘Control + Space’ to open a quick node add dialog, pictured below. Typing in the search box filters the list of available layers, the up and down arrow keys change the selected layer, and pressing ‘Enter’ or double-clicking a list entry adds the selected layer to the canvas under the mouse cursor.

../_images/layer-add-dialog.png

The Layer Explorer displays the list of layers currently in the model. Model layers can be organized by layer type or name, and filtered by name from the ‘Filter by name’ search box. The sort order of the layers may be toggled in the toolbar, and when organizing layers by type, all types can be expanded or collapsed from the toolbar. Layer selection is synchronized between the Layer Explorer and the canvas. Double-click on a layer in the Layer Explorer to jump to the layer in the canvas.

../_images/layer-explorer.png

The Layer Explorer’s advanced filtering options are accessible by clicking on the gear icon next to the ‘Filter by name’ text box. The advanced filtering options allow filtering by layer type and any attributes (such as TensorRT layer precision) assigned to the layer. Checking a filter name enables it. The figure below shows an example of filtering for only those layers that have a precision constraint specified. A red badge with the number of active advanced filters is displayed over the Hide/Show Advanced Filters button (represented by the gear icon).

../_images/layer-explorer-advanced-filtering.png

By default, constant nodes are hidden in the Model Canvas and Layer Explorer. To show constant nodes, go to View > Show All Constants. To no longer hide constants by default when opening a new model, go to Tools > Options and under the Network Model Canvas Preferences page, set Hide Constant Nodes to No.

../_images/hide-constants-setting.png

If a layer takes a constant node as an input, the input terminal connected to the constant node will be rendered in the style of a miniature constant node (a rounded gray rectangle). A single Constant node may be selectively hidden or shown by double-clicking on the terminal connected to it. The figure below shows two nodes with input terminals connected to constants: one hidden and one visible.

../_images/constant-terminals.png

Parameters

The parameter tool window allows the interactive modification of any node’s parameters. To do so, select a node from the canvas; available parameters for that operator will then be listed. The node name can also be edited but note that it must be unique across the whole graph.

Parameters with unspecified values receive the default value specified in the Opset and appear in the collapsible Default Value Parameters section of the Parameter Editor. When a parameter is modified, it is moved out of the default value section. To revert a parameter to its default value, click the circular arrow icon, visible on the right-hand side of the parameter input field when the mouse is over the field.

Tensor/List values are expected in the following format: [1, 2, 3, 4] and can be nested for multidimensional tensors: [[1, 2], [3, 4]]. String values must be single quoted: 'Some text'. Single quotes can be embedded within string literals using \: 'It\'s alive!'. For more advanced tensor or list editing, open the Tensor Editor by clicking the up-right arrow button at the far right of any tensor/list type parameter.

If no layer is currently selected, then the model information such as the ONNX opset version imported is displayed. Multiple layers can be selected at once to allow for batch parameter editing. Only parameters held in common by the selected nodes are shown.

../_images/model-design-parameters-window.png

Type Check List

Iteration on a model is a major part of the design workflow. To ensure fast and interactive iteration, the type check list reports any errors, warnings, or issues caused by the current model structure. You can double-click on any messages from the type checker to focus the corresponding operator in the canvas. This aids identification of latent issues within the model during the design process.

../_images/model-design-type-check.png

In Nsight Deep Learning Designer, model type checking is provided by the Polygraphy linter.

Model validation is run automatically after any editing operation that impacts the ONNX model.

The type checker takes more time for larger ONNX models; the validation process can therefore be cancelled using the Cancel Check button in the top-left of the tool window. A new model validation run can be requested explicitly by clicking the Check Model button.

Note that for very large ONNX models, automatic type checking is disabled, but explicit checking can still be performed.
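
Model checks can also be reproduced outside the tool. The sketch below uses the standard onnx Python package for a basic structural and shape check; it is only an approximation of the richer diagnostics produced by the Polygraphy linter, and the file path is illustrative.

    # Minimal offline check of an ONNX model (sketch; path is illustrative).
    import onnx

    model = onnx.load("model.onnx")
    onnx.checker.check_model(model)                    # raises onnx.checker.ValidationError on failure
    model = onnx.shape_inference.infer_shapes(model)   # propagates tensor shapes/types through the graph
    print("Checker passed;", len(model.graph.node), "nodes in the graph")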

Editing a Model

Dropping an operator into the canvas creates a new instance of that operator type with an automatically generated name. All instances must have a unique name that you can edit; names should be valid C90 identifiers. Operators are represented by a rectangular node on the canvas. This node shows the name and type of the operator, as well as an icon if the type checker has reported any issues.

../_images/model-design-glyph.png

The node glyph represents the operator’s inputs and outputs using terminals. Triangles at the top of the node indicate inputs, and circles at the bottom of the node indicate outputs. Most input terminals need to be connected for the model to be valid, but optional input terminals are not mandatory. Optional inputs are represented on the glyph with a smaller triangle. Multiple links can start from a single output terminal but only one link may be connected to a given input terminal. Unconnected terminals are green, terminals with a link are dark blue, and input terminals connected to an initializer are light blue.

Some operators accept a variable number of input tensors to a given parameter, or produce a variable number of output tensors for a given output name. Nsight Deep Learning Designer represents these by special “infinite” terminals. Upon making a connection to an infinite terminal, more gray-colored terminals will appear between each connected terminal, representing potential new connection points. The figure below shows an example of an operator with a variable number of outputs, as well as subgraphs.

../_images/model-design-complex-glyph.png

See the operator’s description in the editor’s documentation browser (Help > Layer Documentation) for details of a specific operator’s inputs and parameters.

To connect node A to node B, click and drag from any input terminal on node A to any output terminal on node B (or from any output terminal on node A to any input terminal on node B). This action creates a link. Operators and links can be removed by selecting them and using the delete key. Upon successful model validation, intermediate tensor sizes are calculated and displayed alongside the corresponding link.

../_images/model-design-linking.png

Nodes are automatically laid out when a model is loaded, and can be rearranged freely by clicking and dragging the nodes on the canvas. The background grid can be turned on with View > Show Grid to help align layers.

Double-clicking an unassigned input or output terminal on a node will create an input or output operator automatically and link it to that terminal.

Initializers

Initializers are used in ONNX to represent constant tensor values such as weights. They can be used directly as inputs without introducing an extra Constant operator. A single initializer can be used by multiple operators in the graph. Initializer values can either be embedded directly in the ONNX model or referenced from an external binary file. Each initializer is identified by a unique name within the model.
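
For reference, an initializer corresponds to a named TensorProto attached to the graph. The hedged sketch below, using the onnx and numpy Python packages with illustrative names and paths, shows how one can be created and attached programmatically:

    # Sketch: adding a named initializer to an ONNX graph (illustrative names and paths).
    import numpy as np
    import onnx
    from onnx import numpy_helper

    model = onnx.load("model.onnx")
    weights = np.random.rand(64, 3, 3, 3).astype(np.float32)
    initializer = numpy_helper.from_array(weights, name="conv1_weight")
    model.graph.initializer.append(initializer)   # any node can now reference "conv1_weight" by name
    onnx.save(model, "model_with_weights.onnx")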

In Nsight Deep Learning Designer, the Initializer Editor tool window allows the user to view and edit the initializers of an opened ONNX model.

../_images/initializer-editor.png

The initializer editor is divided into two parts: an upper section for viewing, creating, editing, and deleting any of the initializers in the model, and a lower section for viewing and connecting the initializers of the currently selected node. If no node is currently selected, the lower section will be hidden.

The network initializer list can be filtered by name. The selected initializer can be:

  • Removed from the model, which also disconnects it from any node currently using it.

  • Edited using the Tensor Editor.

An initializer can be created from scratch with the dialog opened using the Create Initializer button.

../_images/create-initializer.png

From there, the initializer information can be provided, such as the name, tensor type, and tensor value. Tensor values are expected in the following format: [1, 2, 3, 4] and can be nested for multidimensional tensors: [[1, 2], [3, 4]]. String values must be single quoted: 'Some text'. Single quotes can be embedded within string literals using \: 'It\'s alive!'. Tensor values can be loaded from a Numpy file. Note that the data type must match the one from the Numpy file before loading.

If the initializer is sparse, then in addition to the non-default value tensor, an indices tensor and dense tensor dimension must be provided. See ONNX SparseTensorProto specification for more information on the format expected by ONNX.

If the initializer is marked as external, a path to a binary file on disk must be provided. The external file must be at a location relative to where the model is stored. Offset is the byte location within the file at which the stored data begins, and length is the number of bytes containing data. External initializer tensor data are not stored directly in the ONNX model. This can reduce the size of the model file.
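
The same external-data layout can be produced programmatically when saving a model with the onnx Python package. The sketch below is illustrative; the paths and size threshold are assumptions:

    # Sketch: storing initializer data in an external binary file next to the model.
    import onnx

    model = onnx.load("model.onnx")
    onnx.save_model(
        model,
        "model_external.onnx",
        save_as_external_data=True,     # move tensor data out of the .onnx file
        all_tensors_to_one_file=True,
        location="weights.bin",         # stored relative to the model file
        size_threshold=1024,            # only externalize tensors larger than 1 KiB
    )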

When selecting a node from the canvas and if it has a free terminal, the Connect Initializer button is enabled to allow connecting a model initializer to it. Once connected, the terminal will change to a yellow color on the node glyph, indicating that it is connected to an initializer. Connecting a link to it will disconnect the initializer from the node input.

The bottom part of the initializer editor lists all terminals of the currently selected node which are connected to initializers. The initializer used by a terminal can be switched using the drop-down menu. The drop-down is filterable by name. Using the cross button, the initializer can be disconnected from the terminal. The plus button opens a preview of the tensor values and the diagonal arrow button will open the Tensor Editor.

../_images/node-initializers.png

Tensor Editor

ONNX tensor data may be modified with the Tensor Editor. Tensor values can be directly edited and the tensor dimension will be updated automatically once the edit is validated. If the tensor has more than ten thousand elements, it is not editable but the tensor value can still be updated with a Numpy file. The current tensor data can also be exported to a Numpy file for external processing.

In the case of a sparse tensor, the indices and dense tensor dimension can be modified. For external tensors, the path, offset, and length information can be edited. In all cases, only the name of the initializer cannot be modified.

It is possible to convert a tensor’s data to a different data type by using the Convert button; the only exceptions are external data and tensors of Boolean or string type. The data type conversion dialog lists all available data types to which the tensor can be converted. Note that depending on the source and target conversion data type, data precision loss and/or truncation can occur.

../_images/tensor-editor.png

Subgraphs

Some ONNX control flow operators such as Loop and If take one or more subgraphs as parameters. Subgraphs are ONNX graphs that share initializers and imported opsets with their containing model.

The scope of a subgraph is determined by its parent, whether that be an operator instance or local function definition. All subgraphs within a scope must have unique names.
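
To illustrate how a subgraph is attached to its parent operator at the ONNX level, the hedged sketch below builds an If node whose then/else branches are small graphs (onnx Python package; all names are illustrative):

    # Sketch: an If operator whose then/else branches are subgraphs.
    import numpy as np
    import onnx
    from onnx import TensorProto, helper, numpy_helper

    out_info = helper.make_tensor_value_info("branch_out", TensorProto.FLOAT, [1])

    # Each branch is a complete graph producing "branch_out" from its own constant initializer.
    then_graph = helper.make_graph(
        nodes=[helper.make_node("Identity", ["then_const"], ["branch_out"])],
        name="then_branch", inputs=[], outputs=[out_info],
        initializer=[numpy_helper.from_array(np.array([1.0], np.float32), "then_const")],
    )
    else_graph = helper.make_graph(
        nodes=[helper.make_node("Identity", ["else_const"], ["branch_out"])],
        name="else_branch", inputs=[], outputs=[out_info],
        initializer=[numpy_helper.from_array(np.array([0.0], np.float32), "else_const")],
    )

    # The subgraphs are passed as graph-typed attributes of the If node.
    if_node = helper.make_node(
        "If", inputs=["cond"], outputs=["result"],
        then_branch=then_graph, else_branch=else_graph,
    )

    graph = helper.make_graph(
        nodes=[if_node], name="main",
        inputs=[helper.make_tensor_value_info("cond", TensorProto.BOOL, [])],
        outputs=[helper.make_tensor_value_info("result", TensorProto.FLOAT, [1])],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 19)])
    onnx.checker.check_model(model)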

Subgraphs can be exported to a standalone ONNX model in Nsight Deep Learning Designer by using the Tools > Extract Model Subgraph command. Within the dialog that opens, select the subgraph by identifying the type of its parent (operator or local function), the name of its parent scope, and the subgraph name within that scope. Then choose an output path for saving the extracted subgraph.

Only operators from the current document are listed. To export subgraphs from operators inside another subgraph or local function, the Tools > Extract Model Subgraph wizard must be opened from the corresponding subgraph or local function document.

../_images/export-model-graph.png

A subgraph can be created from scratch using the + button next to an operator parameter of subgraph type. In the creation dialog, a subgraph name that is unique within its scope must be provided. Options can also be enabled to open the subgraph once created, or to duplicate an existing subgraph from an operator or local function.

../_images/create-subgraph-dialog.png

For parameters of subgraph list type, a list view shows all the graphs contained in the parameter list. The + button can be used to create and add a new subgraph to the list, while the - button removes the selected subgraph. Using the up-right arrow button, the currently selected graph can be opened in a separate document inside Nsight Deep Learning Designer.

../_images/subgraphlist-parameter.png

Subgraphs can be visualized in separate document tabs. To open a subgraph, click on the subgraph button on the operator glyph and select the desired subgraph from the menu, or click on the subgraph’s link within the Parameter Editor tool window. The subgraph button is shown boxed in red below.

../_images/subgraph-hyperlink.png

Once opened, subgraphs can be edited within Nsight Deep Learning Designer just like a normal ONNX model, though model initializers do not have an analog within subgraphs and therefore cannot be edited. Preserve changes to a subgraph in the enclosing document using the Confirm Subgraph Edits button on the main toolbar. This command updates the parent document to reflect changes to the subgraph. Parent documents are otherwise read-only while their subgraphs are open for editing.

Local Functions

Graph components are commonly repeated within a model. Local functions can be used to represent these recurring patterns. This creates a higher-level representation of the model by abstracting the patterns as single nodes.

Local functions are defined at the model level and can be instantiated like any other operator type from the Layer Palette. Local function instances can be recognized in the canvas by their black background color, square edges, and f(x) symbol. Clicking on the symbol will open the function definition in a separate document inside of Nsight Deep Learning Designer.

../_images/local-function-glyph.png

Once opened in Nsight Deep Learning Designer, a local function can be edited just like a normal model, apart from creating or removing model initializers. Changes made to a local function must be applied by using the Confirm Local Function Edits button on the main toolbar. Once applied, the local function definition will be updated in the model.
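
At the ONNX level, a local function corresponds to a FunctionProto stored in the model. The hedged sketch below (onnx Python package; names and domain are illustrative) defines a small function and instantiates it like an operator:

    # Sketch: defining a local function and instantiating it in the main graph.
    import onnx
    from onnx import TensorProto, helper

    # A reusable "scale then ReLU" pattern captured as a local function.
    scaled_relu = helper.make_function(
        domain="com.example",             # illustrative custom domain
        fname="ScaledRelu",
        inputs=["X", "scale"],
        outputs=["Y"],
        nodes=[
            helper.make_node("Mul", ["X", "scale"], ["scaled"]),
            helper.make_node("Relu", ["scaled"], ["Y"]),
        ],
        opset_imports=[helper.make_opsetid("", 19)],
    )

    # The function is instantiated like any other operator, via its domain.
    graph = helper.make_graph(
        nodes=[helper.make_node("ScaledRelu", ["input", "scale"], ["output"], domain="com.example")],
        name="main",
        inputs=[
            helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 8]),
            helper.make_tensor_value_info("scale", TensorProto.FLOAT, [1]),
        ],
        outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 8])],
    )
    model = helper.make_model(
        graph,
        functions=[scaled_relu],
        opset_imports=[helper.make_opsetid("", 19), helper.make_opsetid("com.example", 1)],
    )
    onnx.checker.check_model(model)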

Additionally, local functions can be parameterized. When editing a function in Nsight Deep Learning Designer, the Local Function Definition Settings dialog can be opened using the Tools > Local Function Settings action or using the Edit Local Function Parameters button from the Parameter Editor when no nodes are selected.

From this dialog, new parameters can be defined using the + button. A unique name must be provided, as well as the parameter type. A parameter of type undefined requires each instance of the function to provide the type information when passing a value for this parameter; for all other types, a default value must be provided. Parameters can also be removed by clicking on the respective - button.

../_images/local-function-parameters.png

Operators that are part of a local function may reference parameters provided to the enclosing function instead of providing a specific parameter value. Operators with parameter references will use the corresponding reference value provided by the local function instance.

When selecting an operator, its parameters can be assigned to a reference using the Set Reference Parameter button at the top of the Parameter Editor. A dialog opens with a drop-down list containing available parameters of this operator which do not already use references. The second drop-down list contains all the available local function parameters that can be used as a reference.

Local function parameters of an incompatible type are grayed out and cannot be selected. Reference assignments are validated when the dialog is closed.

../_images/local-function-parameter-reference.png

Reference parameters are represented with a drop-down list in the Parameter Editor. The current reference can be directly switched using the drop-down control. The reference can be removed using the X button; this reverts the parameter back to its original type and default value.

Local functions can be managed using the Model Local Functions tool window. It lists all functions currently defined in the model; the list can be filtered by name.

The selected function can be opened or extracted to a standalone ONNX model using the arrow and save buttons respectively.

The + button allows you to create a local function from scratch. You must first provide a function name and domain. Finally, local functions can be deleted from the model using the - button. All instances of that function will be transformed into custom operators.

../_images/model-local-functions.png

Batch Modifications

In certain workflows it can be necessary to modify large portions of an ONNX model or perform specific modifications upon every node. Nsight Deep Learning Designer has batch modification actions for some common use cases. They can be found under the Tools > Global Model Modification dialog.

../_images/batch-modifications.png

Convert Model to FP16

../_images/convert-model-fp16.png

A common model optimization technique is to convert model weights to a half-precision format (such as FP16). This can decrease the model size by as much as half and improve performance on some GPUs, at the potential cost of some accuracy.

Using Nsight Deep Learning Designer’s Convert Model to Float16 batch modification action under the Tools > Global Model Modification dialog, an ONNX model can be converted to use Float16. Provide an output path for the converted model and click Finish. A spinning wheel will appear while conversion is being performed. When the process is finished, a dialog box will show the status of the conversion with an expandable section containing detailed logs.

../_images/convert-model-fp16-status.png

Nsight Deep Learning Designer’s Float16 conversion is provided by the Polygraphy convert subtool, which converts initializers and tensors to Float16 when applicable and can insert Cast operators to maximize the number of operators that run with Float16 data.
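
For scripted workflows outside the tool, a comparable conversion can be performed with the onnxconverter-common Python package. This is a hedged sketch of a different utility, not the Polygraphy subtool that Nsight Deep Learning Designer itself uses; paths are illustrative:

    # Sketch: Float16 conversion outside Nsight Deep Learning Designer.
    import onnx
    from onnxconverter_common import float16

    model = onnx.load("model.onnx")
    model_fp16 = float16.convert_float_to_float16(
        model,
        keep_io_types=True,   # keep graph inputs/outputs in FP32 and insert Casts at the boundary
    )
    onnx.save(model_fp16, "model_fp16.onnx")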

Sanitize Graph

../_images/sanitize-model.png

The Sanitize Graph batch modification action, under the Tools > Global Model Modification dialog, can help reduce an ONNX model’s size by performing constant folding and removing unused nodes.

To perform graph sanitization, provide an output path for the sanitized model in the dialog and click Finish. A spinning wheel will appear while sanitization is being performed. When the process is finished, a dialog box will show the status with an expandable section containing detailed logs.

../_images/sanitize-model-status.png

The conversion is provided by the Polygraphy sanitize subtool.

Some options are available:

  • Enable Constant Folding: On by default. If off, no constant folding will be performed.

    • Fold size threshold: Sets the maximum per-tensor size threshold, in bytes, for which to apply constant folding. Any nodes generating tensors larger than this size will not be folded away.

    • Number of passes: Sets the number of constant folding passes to run. Subgraphs that compute tensor shapes may not be foldable in a single pass. If left empty, Polygraphy will determine the number of passes necessary.

Convert Tensors

../_images/batch-tensor-convert.png

Multiple tensors may be converted as a batch by using the Batch Tensor Conversion dialog. To access this dialog, open the Tools > Global Model Modifications dialog, and then click Convert Tensors. Individual tensors can be converted inside the Tensor Editor.

The batch tensor conversion dialog is separated into two panels: the top panel lists all model initializers, while the bottom panel contains all tensor- or list-based node attributes. Initializers can be filtered by name and data type; node tensors and lists can be filtered by node or tensor name and data type.

A mix of node tensors, lists, and initializers can be selected for conversion to a single target data type, using the combo box at the top of the dialog. Once all the necessary tensors have been selected, click OK to start batch conversion. A dialog will show the progress of the conversion and any errors observed during the process.

Depending on the source and target conversion data type, data precision loss and/or truncation can occur. Note that undoing the batch conversion will revert all previously converted tensors to their original data type and values.

User Tools

Nsight Deep Learning Designer supports custom user tools when user workflows require processing beyond what the Global Model Modification system provides. User tools are a way to incorporate custom processing of an ONNX model as part of the Nsight Deep Learning Designer design workflow.

../_images/custom-user-tools.png

Custom tools can be managed through a dialog accessible under Tools > Custom Tools. The dialog contains a list of user-defined custom tools. Selecting a tool from the list will show its application path and arguments at the bottom. A selected custom tool can be deleted or edited using the corresponding buttons on the right side of the dialog.

../_images/create-custom-user-tool.png

A new custom tool can be created using the Create button. This opens a new dialog window where the tool’s information must be provided:

  • A unique name used to identify the tool.

  • The application path representing the executable process launched to start the tool.

  • Optional arguments to pass to the application. Two special replacement arguments are available:

    • $Model: is replaced by the absolute path of the current ONNX document when launching the tool.

    • $Output: is replaced by a destination file the tool should use to save the modified model. The Prompt for Model Output option prompts for this path. If the option is disabled, this is a path to a temporary file.

  • Prompt for model output: if turned on, when invoking the tool a dialog box will open asking for a path that will be used to replace the $Output argument.

  • Automatically open model output: if turned on and an $Output variable was set in the argument list, Nsight Deep Learning Designer will automatically open the output document when the tool finishes successfully.

Note that to run a Python script, the application path should point to the Python interpreter and the first argument provided should be the path to the Python script.
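
As an illustration of the Python-script case, the hedged sketch below assumes the tool is configured with the arguments <script path> $Model $Output; the script name and processing step are purely illustrative:

    # Sketch of a custom user tool script. Invoked as:
    #   python my_tool.py <path substituted for $Model> <path substituted for $Output>
    # The argument order must match the tool's configured argument list.
    import sys
    import onnx

    def main() -> int:
        model_path, output_path = sys.argv[1], sys.argv[2]
        model = onnx.load(model_path)

        # Illustrative processing step: tag the graph's doc string.
        model.graph.doc_string = (model.graph.doc_string or "") + " [processed by my_tool]"

        onnx.save(model, output_path)   # the file Nsight Deep Learning Designer can reopen automatically
        return 0

    if __name__ == "__main__":
        sys.exit(main())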

Custom user tools can be found under the Tools > User Tools sub-menu; selecting a tool from that menu invokes it on the currently focused ONNX model. A dialog window will open and show the current status of the tool process, along with its standard output and error logs. A custom tool process can be canceled using the Cancel button, which kills the process. If the Automatically open model output option was set for the tool, Nsight Deep Learning Designer will open the output model, if any, when the tool exits successfully.

Activities Platform Settings

Nsight Deep Learning Designer activities run on a target machine. Activities can be run locally on Linux, Windows, or NVIDIA L4T, or remotely on Linux and NVIDIA L4T target machines (from any supported Nsight Deep Learning Designer host). The Host platform refers to the machine where Nsight Deep Learning Designer is running, and the Target platform to the machine where an activity will be run. For activities run locally, the host and target machines are the same.

Connection Management

When starting an activity in Nsight Deep Learning Designer, the top part of the activity window is used to select on which target machine the activity will be performed. Both local and remote targets are supported depending on the platform type. The platform on which the host application is running is selected by default.

../_images/activity-platform-settings.png

Currently Nsight Deep Learning Designer supports the following platforms:

  • Windows x86_64: local only.

  • Linux x86_64: local and remote.

  • NVIDIA L4T arm64: local and remote.

When using a remote target, a Connection must be selected or created from the top drop-down menu. To create a new connection, select + and enter the Remote Connection details. When using the local platform, localhost will be selected as the default and no further connection settings are required.

Remote Connections

Remote targets that support SSH can be configured as a target in the Connection Dialog. To configure a remote device, ensure an SSH-capable Target Platform is selected, then press the + button. The following configuration dialog will be presented.

../_images/remote-connection-password.png

Nsight Deep Learning Designer supports both password and private key authentication methods. In this dialog, select the authentication method and enter the following information:

  • Password

    • IP/Host Name: The IP address or host name of the target device.

    • User Name: The user name to be used for the SSH connection.

    • Password: The user password to be used for the SSH connection.

    • Port: The port to be used for the SSH connection. (The default value is 22.)

    • Deployment Directory: The directory to use on the target device to deploy supporting files. The specified user must have write permissions to this location. Relative paths are supported.

    • Connection Name: The name of the remote connection that will show up in the Connection Dialog. If not set, it will default to <User>@<Host>:<Port>.

  • Private Key

    • IP/Host Name: The IP address or host name of the target device.

    • User Name: The user name to be used for the SSH connection.

    • SSH Private Key: The private key that is used to authenticate to the SSH server.

    • SSH Key Passphrase: The passphrase for your private key.

    • Port: The port to be used for the SSH connection. (The default value is 22.)

    • Deployment Directory: The directory to use on the target device to deploy supporting files. The specified user must have write permissions to this location. Relative paths are supported.

    • Connection Name: The name of the remote connection that will show up in the Connection Dialog. If not set, it will default to <User>@<Host>:<Port>.

../_images/remote-connection-private-key.png

In addition to keyfiles specified by path and plain password authentication, Nsight Deep Learning Designer supports interactive authentication and standard keyfile path searching. When all information is entered, click the Add button to make use of this new connection.

Once an activity has been launched remotely, the required binaries and libraries will be copied, if necessary, to the Deployment Directory on the remote machine.

On Linux and NVIDIA L4T host platforms, Nsight Deep Learning Designer supports SSH remote profiling on target machines that are not directly addressable from the machine the UI is running on, through the ProxyJump and ProxyCommand SSH options. These options can be used to specify intermediate hosts to connect to, or actual commands to run to obtain a socket connected to the SSH server on the target host, and can be added to your SSH configuration file.

Note that for both options, Nsight Deep Learning Designer runs external commands and does not implement any mechanism to authenticate to the intermediate hosts using the credentials entered in the Connection Dialog. These credentials will only be used to authenticate to the final target in the chain of machines.

When using the ProxyJump option, Nsight Deep Learning Designer uses the OpenSSH client to establish the connection to the intermediate hosts. This means that in order to use ProxyJump or ProxyCommand, a version of OpenSSH supporting these options must be installed on the host machine.

A common way to authenticate to the intermediate hosts in this case is to use an SSH agent and have it hold the private keys used for authentication.

Since the OpenSSH SSH client is used, you can also use the SSH askpass mechanism to handle these authentications in an interactive manner.

For more information about available options for the OpenSSH client and the ecosystem of tools it can be used with for authentication refer to the official manual pages.

Deployment Workflow

Activities in Nsight Deep Learning Designer depend on shared libraries to support inference. For example, the TensorRT profiler depends on the TensorRT libraries, CUDA toolkit, and cuDNN. Nsight Deep Learning Designer uses an on-demand deployment workflow, meaning that those dependencies are not installed alongside Nsight Deep Learning Designer but are deployed on the selected target before launching an activity.

Before starting an activity, Nsight Deep Learning Designer will check whether all the necessary dependencies are present on the target machine; for remote targets, Nsight Deep Learning Designer will look inside the provided Deployment Directory. A dialog shows the list of dependencies for the activity and the verification progress. If some dependencies are missing or not up to date, their entries in the dialog will show a warning icon and Nsight Deep Learning Designer will start their deployment on the target.

../_images/on-demand-deployment.png

To deploy dependencies, Nsight Deep Learning Designer runs a helper binary on the target machine (deployed over SSH for remote targets) which downloads the necessary packages from a storage server over HTTPS. The helper binary then extracts the new libraries from the packages. The deployment dialog shows the progress of package downloads and the extraction process. Note that the target machine must have internet access for on-demand deployment to work.

When HTTPS deployment to a remote target fails, Nsight Deep Learning Designer proposes a fallback workflow which involves first deploying the dependencies on the local machine where the host application is running, and then transferring each dependency over SSH to the target machine. Note that this fallback workflow is expected to be slower, as the files that need to be transferred are usually large (over 200 MB).

After all dependencies have been deployed on the target machine, Nsight Deep Learning Designer will proceed with the activity launch. Subsequent launches will be faster as Nsight Deep Learning Designer will not redeploy dependencies as long as they still match activity requirements.

Nsight Deep Learning Designer stores downloaded dependencies and helper binaries for the target machine on the host, and stores timing caches and some other validation caches on the target machine, in a local cache directory. By default, the local cache directory is stored in $HOME\AppData\Local on Windows, and in $HOME/.config on Linux. This directory can be changed by setting the NV_DLD_CACHE_DIR environment variable.

Working with TensorRT

Nsight Deep Learning Designer can export ONNX models to TensorRT engines and optionally profile them. The resulting engine files are fully compatible with other TensorRT 10.8 applications.

Notes:

  • TensorRT engines created with Nsight Deep Learning Designer are specific to both the TensorRT version with which they were created and the GPU on which they were created. See the TensorRT documentation for details.

  • Nsight Deep Learning Designer uses a timing cache when building TensorRT networks. Tactic timings for frequently used layers will be loaded from the cache when possible.

  • The engine build and profiling phases rely on accurate timings of inference algorithms for engine optimization and performance reporting. For best results, do not run other GPU work in parallel with TensorRT activities, as this will skew results.

  • Both export and profiling activities can be launched from the Start Activity dialog accessible from the Welcome page.

Dynamic Shapes and TensorRT

TensorRT requires an optimization profile when working with dynamic input sizes. Statically determined input sizes do not require additional information, but each dynamic input size (such as ['batch', 3, 544, 960] or ['W', 'H']) requires optimization profile details. If an input is not fully specified in this fashion, TensorRT will fail with an error such as input_name: dynamic input is missing dimensions in profile 0.
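
For reference, such a profile corresponds to a TensorRT optimization profile supplied at engine-build time. The hedged sketch below uses the TensorRT Python API; the input name and shapes are illustrative:

    # Sketch: an optimization profile for a dynamic input declared as ['batch', 3, 544, 960].
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    profile = builder.create_optimization_profile()
    # Minimum, optimal, and maximum shapes for the input named "input".
    profile.set_shape("input", (1, 3, 544, 960), (8, 3, 544, 960), (16, 3, 544, 960))
    config.add_optimization_profile(profile)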

Nsight Deep Learning Designer provides two ways to define optimization profiles within the host GUI:

  • Dynamic inputs with a single leading wildcard (such as ['batch', 3, 544, 960], ['size'], or ['n', 204800, 4]) may be set automatically with a prefilled value called the Inferred Batch Size. Inputs with more than one wildcard dimension may not be defined using inferred batch sizes.

  • Optimization profiles for individual layers can be set via the TensorRT Optimization Profile attribute. Right click on an input layer in the canvas or Layer Explorer and select the Set TensorRT Optimization Profile context menu item. This will open a dialog where you can define the minimum, maximum, and optimal sizes for the input. The optional size-min and size-max fields are used to define the minimum and maximum sizes for the input. This option is recommended for detailed exploration of models with multiple wildcards. The figure below shows the optimization profile editor. Note that optimization profiles can only be applied to inputs to the top-level graph, and not to subgraphs or local functions.

../_images/trt-opt-profile-editor.png

Specifying Layer Precisions to TensorRT

When run with the default settings, TensorRT will use autotuning to select datatype precision for each layer from the enabled tactics to maximize performance. However, precision constraints may be enforced on a per-layer basis by use of the TensorRT Layer Precision attribute in Nsight Deep Learning Designer. The available floating-point precision options are: fp32, fp16, bf16, and fp8. The integral options are int64, int32, int8, int4, uint8, and bool. If a layer is not assigned a precision, TensorRT will use autotuning to select the best precision for that layer. Note that in addition to specifying precision constraints, you must also set the Typing Mode to Obey Precision Constraints or Prefer Precision Constraints in the activity settings. See below for more details.
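
For reference, this attribute corresponds to per-layer precision in the TensorRT network definition. The hedged sketch below shows the equivalent TensorRT Python API calls on a toy network (illustrative only):

    # Sketch: constraining one layer to FP16 and asking TensorRT to honor the constraint.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)
    config = builder.create_builder_config()

    # Toy network: a single ReLU layer.
    inp = network.add_input("input", trt.float32, (1, 8))
    relu = network.add_activation(inp, trt.ActivationType.RELU)
    relu.precision = trt.DataType.HALF     # per-layer precision constraint
    network.mark_output(relu.get_output(0))

    # Constraints are only honored when one of these typing-mode flags is set.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)   # or PREFER_PRECISION_CONSTRAINTS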

../_images/layer-precision-context-menu.png

Layer precision constraints may be set in Nsight Deep Learning Designer by right-clicking selected rows in the Layer Explorer or nodes in the model canvas, selecting the Set Layer Precision context menu item, and then choosing a precision option in the dialog that appears. To clear a precision constraint, select the No Constraint option for the selected layers.

../_images/layer-precision-dialog.png

Layers with precision constraints will have the constraint value shown in the TRT-Prec column in the Layer Explorer, and a small badge with selected precision will appear on the layer glyph. Clicking this badge shows a quick menu of options to change or remove the constraint.

../_images/layer-precision-glyph-explorer-fused.png

Exporting a TensorRT Engine

To export a TensorRT engine, open the ONNX model you wish to export and use the File > Export > TensorRT Engine menu item. ONNX models can also be exported from the Start Activity dialog without opening them beforehand.

Selecting a Version

../_images/tensorrt-version-select.png

Nsight Deep Learning Designer supports multiple minor versions of TensorRT 10. Use the dropdown selector to choose the version of TensorRT you wish to use. There are selected versions of TensorRT available for automatic download and deployment (along with a recommended version). You may also elect to use the system-installed TensorRT on the target system, or specify a custom path. If you specify a custom path, it must be a relative path from the deployment directory on the target machine to the location of the TensorRT shared libraries.

Common Settings

../_images/export-tensorrt-common.png
  • The ONNX Model parameter is the local path to the model you wish to export. It will be copied to the target system if necessary.

  • The Engine Output parameter is the local destination where you wish to save the exported TensorRT engine. It will be copied from the target system if necessary. The activity suggests a default name for this parameter based on the ONNX model filename.

  • The Save Engine Metadata parameter controls the amount of metadata stored in the TensorRT engine. When set to Yes, DETAILED-level metadata (full information) will be stored in the engine. Setting this option to No removes all layer information.

  • The Metadata Output parameter is optional. If a local path is provided for this parameter, Nsight Deep Learning Designer will create an instance of the TensorRT IEngineInspector class after export and copy its output from the target system. If this parameter is left blank, no metadata file will be created. The activity suggests a default name for this parameter based on the ONNX model filename.

  • The Open Metadata in DLD parameter controls whether the metadata file is opened for visualization as a model in Nsight Deep Learning Designer after export. This option is only available if a metadata file is generated.

  • The Device Index option controls which CUDA device to use on multi-GPU systems. Device zero represents the default CUDA device, and devices are ordered as in the cudaSetDevice call. If this setting is left blank, Nsight Deep Learning Designer will use the first CUDA device.

  • The Custom Plugin parameter allows passing paths to optional custom TensorRT plugins to load during engine building. Provided paths must be relative to the selected target system. Plugins must be compatible with TensorRT 10.8. Refer to the TensorRT documentation for more details on custom plugins.

Tactics Settings

Most settings in this section map closely to TensorRT’s BuilderFlags enumeration.

FP32 tensor formats and tactics are always available to TensorRT. TensorRT may still choose a higher-precision layer format if it results in overall lower runtime or if no lower-precision implementations exist.

../_images/export-tensorrt-tactics.png
  • The Typing Mode setting controls TensorRT’s type system:

    • The TensorRT Defaults option instructs TensorRT’s optimizer to use autotuning to determine tensor types. This option generates the fastest engine but can result in accuracy loss when model accuracy requires a layer to run with higher precision than what TensorRT chooses. Layer precision constraints are ignored in this mode.

    • The Strongly Typed option instructs TensorRT’s optimizer to determine tensor types using the rules in the ONNX operator type specification. Types are not autotuned and may result in a slower engine than one where TensorRT chooses tensor types, but the smaller set of kernel alternatives can improve the engine build time. Layer precision constraints and the FP16, BF16, INT8, and FP8 tactics settings are ignored in this mode.

    • The Obey Precision Constraints option uses TensorRT autotuning where layer precision constraints have not already been set using Nsight Deep Learning Designer. If no layer implementation exists for a particular precision constraint, the engine build will fail.

    • The Prefer Precision Constraints option is similar to Obey Precision Constraints, but TensorRT will issue warning messages instead of failing to build an engine if layer precision constraints cannot be observed or result in a slower network.

  • The Allow TF32 Tactics setting allows TensorRT’s optimizer to select TensorFloat-32 precision. This format requires an NVIDIA Ampere GPU architecture or newer.

  • The Allow FP16 Tactics setting allows TensorRT’s optimizer to select IEEE 754 half precision.

  • The Allow BF16 Tactics setting allows TensorRT’s optimizer to select Bfloat16 precision. This format requires an NVIDIA Ampere GPU architecture or newer.

  • The Allow INT8 Tactics setting allows TensorRT’s optimizer to use quantized eight-bit integer precision. Explicitly quantized networks are recommended, but Nsight Deep Learning Designer will assign placeholder dynamic ranges (similar to trtexec) if the network is implicitly quantized and no calibration cache is provided.

  • The Allow FP8 Tactics setting allows TensorRT’s optimizer to use quantized eight-bit floating-point precision. This setting is mutually exclusive with the INT8 setting and is typically needed only for networks with optional FP8 tensors generated by plugins.

  • The Examine Weights for Sparsity setting instructs TensorRT’s optimizer to examine weights and use optimized functions when weights have suitable sparsity.

  • The Allow cuDNN and cuBLAS Tactics setting allows TensorRT to use the cuDNN and cuBLAS libraries for layer implementations. When this setting is disabled, only internal TensorRT kernels will be considered. Enabling this setting will cause cuDNN to be downloaded to the target.

  • The Native Instance Norm setting instructs TensorRT to use its own instance normalization implementation instead of a plugin-based implementation that uses cuDNN. Disabling this setting will cause cuDNN to be downloaded to the target.

Optimizer Settings

Settings in this page primarily control the TensorRT IBuilderConfig interface; a sketch of the corresponding API calls follows the option list below.

../_images/export-tensorrt-optimizer.png
  • The Builder Optimization Level option controls the tradeoffs made between engine build time and inference time. Higher optimization levels allow the optimizer to spend more time searching for optimization opportunities, which may result in better performance at runtime. See the TensorRT setBuilderOptimizationLevel function documentation for more details.

  • The Maximum Worker Streams option controls multi-stream inference. If the model contains operators that can run in parallel, TensorRT can execute them on auxiliary streams. The value of this setting defines the maximum number of streams to provide to TensorRT at build time. If this setting is left blank, TensorRT will use internal heuristics to choose an appropriate number. Set this value to zero to disable stream parallelism.

  • The Inferred Batch Size option allows implicit specification of TensorRT optimization profiles for dynamic inputs of the form ['N', sizes...]. See Dynamic Shapes and TensorRT for more details on the inferred batch feature.

  • The Hardware Compatible Engine option creates a TensorRT engine that works on all TensorRT-supported discrete GPUs with an Ampere architecture or newer. Use of this feature may have a performance impact as it precludes optimizations for later GPU architectures.

  • The Workspace Pool Limit (MiB) option controls the size of the workspace memory pool used by TensorRT. The value should be specified in mebibytes; one MiB is 2^20 bytes. Setting this value too small may prevent TensorRT from finding a valid implementation for a layer. Leaving this value blank (the default) removes the limit and allows TensorRT to use all available global memory on the GPU.

  • The INT8 Calibration Cache option allows you to specify a calibration cache file for implicitly quantized INT8 networks. Leaving this value blank (the default) will disable calibration. Calibration caches are neither required nor used for explicitly quantized networks or networks not using INT8 tactics.

  • The Weights Refitting option controls whether weights are stored in the generated TensorRT model and whether they can be altered at inference time. The Not refittable option is the TensorRT default. It embeds weights which may not be refitted. The Refittable (Weights included) option corresponds to TensorRT’s kREFIT flag; it embeds weights and permits all weights to be refitted. The Refittable (Weights stripped) option corresponds to TensorRT’s kSTRIP_PLAN flag. This option embeds only those weights with performance-sensitive optimizations; all other weights are omitted and refittable. Applications are expected to refit the original weights into the engine at inference time.

  • The Version Compatible Engine option creates a TensorRT engine that can be used for inference with later versions of TensorRT. See the TensorRT documentation for details.
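
As noted above, the Tactics and Optimizer settings map onto TensorRT’s IBuilderConfig. The hedged sketch below shows a few corresponding Python API calls; the values are illustrative, not recommendations:

    # Sketch: IBuilderConfig settings roughly matching the Tactics and Optimizer pages.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    # Tactics page equivalents (BuilderFlags).
    config.set_flag(trt.BuilderFlag.FP16)             # Allow FP16 Tactics
    config.set_flag(trt.BuilderFlag.TF32)             # Allow TF32 Tactics
    config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)   # Examine Weights for Sparsity

    # Optimizer page equivalents.
    config.builder_optimization_level = 3             # Builder Optimization Level
    config.max_aux_streams = 2                        # Maximum Worker Streams
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 512 * 1024 * 1024)   # 512 MiB pool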

Visualizing a TensorRT Engine

The layers of a TensorRT engine can be visualized in Nsight Deep Learning Designer via the metadata file optionally generated during TensorRT engine export activities or on the command line with the trtexec tool. If generating a metadata file outside of Nsight Deep Learning Designer, the --profilingVerbosity flag must be set to detailed.

An existing TensorRT engine metadata file can be opened in Nsight Deep Learning Designer by using the File > Open menu item and selecting the metadata file. The engine will be visualized as a model in the workspace. As this is only a metadata file describing an engine, no edits to parameters or layers can be made. Note that Nsight Deep Learning Designer expects a .trt.json file extension.
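
The metadata corresponds to the output of the TensorRT engine inspector. The hedged sketch below retrieves it from an existing engine with the TensorRT Python API; paths are illustrative, and the engine must have been built with detailed profiling verbosity:

    # Sketch: dumping detailed layer metadata from an existing TensorRT engine.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)

    with open("model.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())

    inspector = engine.create_engine_inspector()
    metadata_json = inspector.get_engine_information(trt.LayerInformationFormat.JSON)

    with open("model.trt.json", "w") as f:   # extension expected by Nsight Deep Learning Designer
        f.write(metadata_json)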

../_images/tensorrt-metadata-visualization.png

Profiling

Nsight Deep Learning Designer supports profiling networks using either TensorRT or ONNX Runtime as the inference framework. GPU performance metrics are available only when profiling with TensorRT.

To profile, open the ONNX model you wish to profile and use the Profile Model toolbar button or Tools > Profile Model menu item. ONNX models can also be profiled from the Start Activity dialog without opening them beforehand.

Note: When targeting the NVIDIA L4T platform, the user (local or remote) needs to be a member of the debug group in order to profile.

Profiling with ONNX Runtime

To profile a model using ONNX Runtime, open the ONNX model you wish to profile and use the Tools > Profile Model menu item. ONNX models can also be profiled from the Start Activity dialog (accessible through the Welcome page) without opening them beforehand.

../_images/profile-ort-common.png

Nsight Deep Learning Designer’s ONNX Runtime profiler is based on the ONNXRuntime Performance Test binary. The options for ONNX Runtime profiling are below:

  • The ONNX Model parameter is the local path to the model you wish to profile. The model file will be copied to the target system if necessary.

  • The Iterations option controls how many inference iterations are performed when gathering data. Increasing this value reduces noise when computing the median inference pass as more data points are sampled, but correspondingly increases the time taken to profile the model.

  • The Execution Provider option defines which backend the ONNX Runtime profiler will use during inference. The CPU and CUDA providers are supported on all target platforms. Windows targets also support the DirectML provider.

  • The Enable Model Optimization option controls whether the profiler should first apply graph-level transformations to optimize the model before running inference. If turned on, the profiler applies the highest level of optimization as described in Graph Optimizations in ONNX Runtime.

  • The Generate Random Input(s) Data option controls whether the profiler should generate random data for model inputs that have no data embedded in the model or referenced from an external file. Free dimensions are treated as 1. If turned off, an Input Data Folder must be provided.

  • The Input Data Folder option controls where the profiler should find the data for the model’s inputs. Generate Random Input(s) Data must be turned off to define a data folder. The folder should contain one file with an ONNX TensorProto per model input, with each Protobuf file named after its corresponding model input, for example input_0.pb (a sketch of producing such a file follows this list). The input data will be copied to the target system if necessary.

  • The Output Profile parameter is the local destination where you wish to save the profiler report. It will be copied from the target system if necessary. The activity suggests a default name for this parameter based on the ONNX model filename.
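
As mentioned for the Input Data Folder option, each input file is a serialized ONNX TensorProto. The hedged sketch below produces one such file; the input name, shape, and paths are illustrative:

    # Sketch: writing an ONNX TensorProto file for a model input named "input_0".
    import numpy as np
    from onnx import numpy_helper

    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    tensor = numpy_helper.from_array(data, name="input_0")   # name matches the model input

    with open("input_data/input_0.pb", "wb") as f:           # one .pb file per model input
        f.write(tensor.SerializeToString())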

Profiling with TensorRT

To profile a model using TensorRT, open the ONNX model you wish to profile and use the Tools > Profile Model menu item. ONNX models can also be profiled from the Start Activity dialog (accessible through the Welcome page) without opening them beforehand.

Profiling ONNX models with dynamic input sizes requires a TensorRT optimization profile. Input optimal sizes are used by the profiler when generating random input data.

Selecting a Version

../_images/tensorrt-version-select.png

Nsight Deep Learning Designer supports multiple minor versions of TensorRT 10 for profiling. Use the dropdown selector to choose the version of TensorRT you wish to use. There are selected versions of TensorRT available for automatic download and deployment (along with a recommended version). You may also elect to use the system-installed TensorRT on the target system, or specify a custom path. If you specify a custom path, it must be a relative path from the deployment directory on the target machine to the location of the TensorRT shared libraries.

Common Settings

../_images/profile-tensorrt-common.png
  • The ONNX Model parameter is the local path to the model you wish to profile. It will be copied to the target system if necessary.

  • The Output Profile parameter is the local destination where you wish to save the profiler report. It will be copied from the target system if necessary. The activity suggests a default name for this parameter based on the ONNX model filename.

  • The Device Index option controls which CUDA device to use on multi-GPU systems. Device 0 represents the default CUDA device, and devices are ordered as in the cudaSetDevice call. If this setting is left blank, Nsight Deep Learning Designer will use the first CUDA device.

  • The Use Prebuilt Engine option allows you to profile a pre-existing TensorRT engine from the Export TensorRT Engine activity or other workflows such as trtexec instead of building a new one. The engine file must have been built from the ONNX model being profiled, must have kDETAILED profiling verbosity, and will automatically be refitted before inference if possible. Settings in the Tactics and Optimizer pages are ignored when profiling a prebuilt engine. The engine is considered trusted, and any embedded host code (as by the TensorRT version compatibility or plugin embedding options) will be deserialized and executed as necessary.

  • The Custom Plugin parameter allows passing paths to optional custom TensorRT plugins to load during engine building. Provided paths must be relative to the selected target system. Plugins must be compatible with TensorRT 10.8. Refer to the TensorRT documentation for more details on custom plugins.
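
As a rough illustration of how a compatible engine could be produced outside the tool, the sketch below builds an engine with kDETAILED profiling verbosity using the TensorRT Python API; file paths are placeholders, and models with dynamic inputs would additionally need an optimization profile. With trtexec, the corresponding option is --profilingVerbosity=detailed.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.OnnxParser(network, logger)

    # Parse the same ONNX model that will later be profiled.
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(str(parser.get_error(0)))

    config = builder.create_builder_config()
    # Detailed per-layer information is required for profiling a prebuilt engine.
    config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

    serialized_engine = builder.build_serialized_network(network, config)
    with open("model.engine", "wb") as f:
        f.write(serialized_engine)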

Tactics Settings

Most settings in this section map closely to TensorRT’s BuilderFlags enumeration.

FP32 tensor formats and tactics are always available to TensorRT. TensorRT may still choose a higher-precision layer format if it results in overall lower runtime or if no lower-precision implementations exist.

../_images/profile-tensorrt-tactics.png
  • The Typing Mode setting controls TensorRT’s type system:

    • The TensorRT Defaults option instructs TensorRT’s optimizer to use autotuning to determine tensor types. This option generates the fastest engine but can result in accuracy loss when model accuracy requires a layer to run with higher precision than what TensorRT chooses. Layer precision constraints are ignored in this mode.

    • The Strongly Typed option instructs TensorRT’s optimizer to determine tensor types using the rules in the ONNX operator type specification. Types are not autotuned and may result in a slower engine than one where TensorRT chooses tensor types, but the smaller set of kernel alternatives can improve the engine build time. Layer precision constraints and the FP16, BF16, INT8, and FP8 tactics settings are ignored in this mode.

    • The Obey Precision Constraints option uses TensorRT autotuning where layer precision constraints have not already been set using Nsight Deep Learning Designer. If no layer implementation exists for a particular precision constraint, the engine build will fail.

    • The Prefer Precision Constraints option is similar to Obey Precision Constraints, but TensorRT will issue warning messages instead of failing to build an engine if layer precision constraints cannot be observed or result in a slower network.

  • The Allow TF32 Tactics setting allows TensorRT’s optimizer to select TensorFloat-32 precision. This format requires an NVIDIA Ampere GPU architecture or newer.

  • The Allow FP16 Tactics setting allows TensorRT’s optimizer to select IEEE 754 half precision.

  • The Allow BF16 Tactics setting allows TensorRT’s optimizer to select Bfloat16 precision. This format requires an NVIDIA Ampere GPU architecture or newer.

  • The Allow INT8 Tactics setting allows TensorRT’s optimizer to use quantized eight-bit integer precision. Explicitly quantized networks are recommended, but Nsight Deep Learning Designer will assign placeholder dynamic ranges (similar to trtexec) if the network is implicitly quantized and no calibration cache is provided.

  • The Allow FP8 Tactics setting allows TensorRT’s optimizer to use quantized eight-bit floating-point precision. This setting is mutually exclusive with the INT8 setting and is typically needed only for networks with optional FP8 tensors generated by plugins.

  • The Examine Weights for Sparsity setting instructs TensorRT’s optimizer to examine weights and use optimized functions when weights have suitable sparsity.

  • The Allow cuDNN and cuBLAS Tactics setting allows TensorRT to use the cuDNN and cuBLAS libraries for layer implementations. When this setting is disabled, only internal TensorRT kernels will be considered. Enabling this setting will cause cuDNN to be downloaded to the target.

  • The Native Instance Norm setting instructs TensorRT to use its own instance normalization implementation instead of a plugin-based implementation that uses cuDNN. Disabling this setting will cause cuDNN to be downloaded to the target.
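
For illustration, several of the tactics settings above correspond roughly to the following TensorRT Python API calls. This is a hedged sketch of the underlying BuilderFlag values, not the tool’s exact implementation.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    config.set_flag(trt.BuilderFlag.TF32)            # Allow TF32 Tactics
    config.set_flag(trt.BuilderFlag.FP16)            # Allow FP16 Tactics
    config.set_flag(trt.BuilderFlag.BF16)            # Allow BF16 Tactics
    config.set_flag(trt.BuilderFlag.INT8)            # Allow INT8 Tactics
    # config.set_flag(trt.BuilderFlag.FP8)           # Allow FP8 Tactics (exclusive with INT8)
    config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)  # Examine Weights for Sparsity

    # Typing Mode: Obey or Prefer Precision Constraints.
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    # config.set_flag(trt.BuilderFlag.PREFER_PRECISION_CONSTRAINTS)

    # Typing Mode: Strongly Typed is expressed at network-creation time instead:
    # network = builder.create_network(
    #     1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))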

Optimizer Settings

Settings in this page primarily control the TensorRT IBuilderConfig interface.

../_images/profile-tensorrt-optimizer.png
  • The Builder Optimization Level option controls the tradeoffs made between engine build time and inference time. Higher optimization levels allow the optimizer to spend more time searching for optimization opportunities, which may result in better performance at runtime. See the TensorRT setBuilderOptimizationLevel function documentation for more details.

  • The Maximum Worker Streams option controls multi-stream inference. If the model contains operators that can run in parallel, TensorRT can execute them on auxiliary streams. The value of this setting defines the maximum number of streams to provide to TensorRT at build time. If this setting is left blank, TensorRT will use internal heuristics to choose an appropriate number. Set this value to zero to disable stream parallelism.

  • The Inferred Batch Size option allows implicit specification of TensorRT optimization profiles for dynamic inputs of the form ['N', sizes...].

  • The Hardware Compatible Engine option creates a TensorRT engine that works on all TensorRT-supported discrete GPUs with an Ampere architecture or newer. Use of this feature may have a performance impact as it precludes optimizations for later GPU architectures.

  • The Workspace Pool Limit (MiB) option controls the size of the workspace memory pool used by TensorRT. The value should be specified in mebibytes; one MiB is 2^20 bytes. Setting this value too small may prevent TensorRT from finding a valid implementation for a layer. Leaving this value blank (the default) removes the limit and allows TensorRT to use all available global memory on the GPU.

  • The INT8 Calibration Cache option allows you to specify a calibration cache file for implicitly quantized INT8 networks. Leaving this value blank (the default) will disable calibration. Calibration caches are neither required nor used when the network is explicitly quantized or does not enable INT8 tactics.

  • The Weights Refitting option controls whether weights are stored in the generated TensorRT model and whether they can be altered at inference time. The Not refittable option is the TensorRT default. It embeds weights which may not be refitted. The Refittable (Weights included) option corresponds to TensorRT’s kREFIT flag; it embeds weights and permits all weights to be refitted. The Refittable (Weights stripped) option corresponds to TensorRT’s kSTRIP_PLAN flag. This option embeds only those weights with performance-sensitive optimizations; all other weights are omitted and refittable. Applications are expected to refit the original weights into the engine at inference time.

  • The Version Compatible Engine option creates a TensorRT engine that can be used for inference with later versions of TensorRT. See the TensorRT documentation for details.
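
Similarly, the optimizer settings above can be sketched against the TensorRT IBuilderConfig Python interface; the values shown are illustrative, not recommendations.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    config = builder.create_builder_config()

    config.builder_optimization_level = 3  # Builder Optimization Level
    config.max_aux_streams = 2             # Maximum Worker Streams

    # Workspace Pool Limit (MiB): 1024 MiB expressed in bytes (1024 * 2^20).
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1024 << 20)

    # Hardware Compatible Engine (Ampere or newer discrete GPUs).
    config.hardware_compatibility_level = trt.HardwareCompatibilityLevel.AMPERE_PLUS

    # Weights Refitting and Version Compatible Engine.
    config.set_flag(trt.BuilderFlag.REFIT)               # Refittable (Weights included)
    # config.set_flag(trt.BuilderFlag.STRIP_PLAN)        # Refittable (Weights stripped)
    config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)  # Version Compatible Engine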

Profiler Settings

Settings in this page control the behavior of the Nsight Deep Learning Designer profiler rather than TensorRT.

../_images/profile-tensorrt-profiler.png
  • The Measurement Passes option controls how many inference iterations are performed when gathering data. Increasing this value reduces noise when computing the median inference pass as more data points are sampled, but correspondingly increases the time taken to profile the model.

  • The Sampling Rate option controls the frequency of GPU performance counter collection. Increasing this value collects more data for the profiling report but may overflow the collection buffer on large models.

  • The Lock Clocks to Base option controls whether GPU clocks are locked to their base values, disabling clock boosting during profiling. Locking clocks improves measurement consistency at the cost of decreased inference performance.

  • The Hide Non-Compute Operations option controls whether non-compute operations such as host/device memory copies are excluded from the measurement loop and the resulting profiler report.

Clocks should be locked to base values when making incremental changes to a model design: individual layer timing values will then reflect consistent GPU performance states and be meaningfully comparable across versions of the model. When measuring end-to-end performance in real-world configurations, clocks should be unlocked, non-compute operations hidden, and the Measurement Passes value set to a large number (two hundred is usually enough). Increasing the pass count ensures the GPU stays active long enough to reach its maximum clock rate, and omitting non-compute operations further saturates the GPU because SM operations are not interleaved with memory copies.

Profiling from the Command Line

Nsight Deep Learning Designer includes a lightweight command-line TensorRT profiler for non-interactive use cases. The command-line profiler is called ndld-prof. It can be copied to a remote target’s deployment directory after interactively profiling from the Nsight Deep Learning Designer GUI.

Full details on the command-line arguments accepted by the profiler can be viewed using the --help option. All options from the Profile TensorRT Model activity interface are supported when using the command-line profiler.

It is not necessary to save a full profiling report when using the command-line profiler. The profiler will display performance triage information on stdout.

../_images/profile-tensorrt-cli.png

Example output from the command-line profiler

Profiling Reports

Nsight Deep Learning Designer uses a common report format to store profiling data from ONNX Runtime and TensorRT. Existing profiling reports can be reopened using the File > Open File command. The Profile TensorRT Model and Profile ONNX Model activities will automatically open the new profiler report upon a successful profiling run.

The profiling report describes the execution of the ONNX model as performed by the selected inference framework, which typically refers to a runtime-optimized version of the network. Groups of nodes from the ONNX model may be fused together into single optimized layers, and other nodes may be removed entirely during an optimization pass.

Profiling reports have four major sections. Each section is described below, including any differences between the ONNX Runtime and TensorRT profilers.

Network Summary

../_images/report-summary.png

The summary section shows high-level details about the profiling run.

  • Network Name: The name of the ONNX model that was profiled.

  • Backend: The inference framework used for profiling, whether ONNX Runtime or TensorRT.

  • Execution Provider (ONNX Runtime only): The preferred ONNX Runtime execution provider used for profiling.

  • Inference Device (TensorRT only): The name of the GPU used for inference, as returned by the CUDA driver.

  • Median Network Inference Time: The elapsed wall clock time taken by the median inference pass.

  • # Inference Passes: The number of inference passes performed during profiling.

  • Device Memory Required (TensorRT only): The maximum device memory required by the compiled TensorRT engine during inference, as returned by ICudaEngine::getDeviceMemorySize.

Inference Timeline

../_images/report-timeline.png

The inference timeline shows the execution of the individual layers within the median inference pass. Each concurrent stream of execution (as in TensorRT’s multi-stream inference feature) is shown in a separate row. The view can be scrolled using the scrollbar and zoomed by holding Ctrl while moving the mouse wheel. Specific time durations may be measured by clicking the left mouse button and dragging. Additional zoom options are available through a context menu, accessed by clicking the right mouse button within the timeline.

TensorRT only: Overhead from host-to-device (H2D) input copies and device-to-host (D2H) output copies is included in the timeline.

Layers identifiable as “overhead,” such as reformat operations, memory copies, and TensorRT NoOp layers, are depicted in the timeline using a different color from normal computation layers.

GPU Metrics (TensorRT only)

The Nsight Deep Learning Designer TensorRT profiler collects GPU-level performance metrics during inference. These metrics are displayed as extra rows within the timeline. Values are normalized for rendering in order to fill the available vertical space; hover the mouse cursor over a data point to see the actual value collected.

  • SM Utilization: Fraction of time where at least one warp was executing on the streaming multiprocessor (SM). Heavy utilization of the SM is associated with compute-bound operations.

  • VRAM Utilization: Fraction of peak throughput of the GPU DRAM controller. Heavy utilization of the DRAM controller is associated with memory-bound operations on the GPU.

  • PCIe Utilization: Fraction of peak throughput of the PCI Express bus. Heavy utilization of the PCIe bus is associated with memory copies between the CPU and GPU.

  • Tensor Core Utilization: Fraction of time where the SM’s tensor cores were active. Heavy utilization of SM tensor cores indicates the inference framework performed significant work with reduced-precision tensor formats.

Network Metrics and Layer Table

The Network Metrics section provides a detailed view of individual layers’ execution behavior within the inference framework.

../_images/report-layers-summary.png

The summary view shows the most common values for each column of the table. When multiple values appear for a discrete variable such as Precision or Input Dimensions, the full list of values remains accessible. Hovering the mouse cursor over the list will show a tooltip with every observed value in descending order of frequency. Continuous variables such as inference time are presented as minimum, maximum, total, and average (arithmetic mean) values.

When selecting multiple layer rows, the summary view changes to reflect the selected layers. If only a single layer is selected, or the selection is cleared, the summary view returns to showing the entire inference pass.

../_images/report-layers-table.png

The layer table displays information on each layer of the executed network. This table can be exported to CSV format using the Export to CSV button; a sketch of post-processing the exported file follows the column list below. Typing into the Filter by layer name text box filters the table to show only those layers whose names contain the filter text as a substring. The filtering check is case-insensitive.

The Settings button (displayed as a gear) displays a popup menu with options for the layer table:

  • Show Inference Time Heat Colors renders inference time columns using a color scale such that longer-executing layers appear in a more intense color. Multiple color schemes are available. The color scale can be changed from the Environment tab of the Options dialog, accessible using Tools > Options.

  • Use Full Pass Time in Percentages controls how inference time percentages are calculated. By default (the unchecked state), layer inference time percentages are relative to the sum of individual layer inference times, which hides framework overhead such as input buffer copies. When this setting is checked, layer inference time percentages are instead calculated relative to the duration of the entire inference pass.

Each layer of the executed network is displayed as a separate row in the table. Double-clicking an entry in the table zooms the timeline to fit the layer within view. This layer table contains the following information columns:

  • #: The inference order of each layer. For multi-stream inference, the order reflects when a layer is enqueued to a stream, not necessarily when it begins executing.

  • Name: The name of the layer. Many optimized layer names contain information about their originating ONNX nodes. Clicking a hyperlink within the layer name opens the original ONNX model in the Nsight Deep Learning Designer editor and selects the corresponding node. Clicking the magnifying glass button for a selected layer pops up a menu of associated ONNX source nodes: nodes that contributed directly to the optimized layer, nodes that generated the input tensors it consumes, and nodes that consume its outputs. Selecting any of these options opens the source ONNX model and selects the node, as though a hyperlink had been clicked.

  • Type: The type of the layer as reported by the inference framework.

  • Input Dimensions: The dimensions of the layer’s input tensors.

  • Output Dimensions: The dimensions of the layer’s output tensors.

  • Execution Provider (ONNX Runtime only): The ONNX Runtime execution provider used for this layer.

  • Precision: The input precision of this layer. For layers with multiple inputs, this is the precision of the first input tensor. Layers with no direct input tensors, such as Constant layers, are shown here as N/A.

  • Inference Time (%): The inference time of this layer represented as a percentage. See Use Full Pass Time in Percentages for a description of how this percentage is calculated.

  • Inference Time (ms): The inference time of this layer in milliseconds.
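
The exported CSV can then be analyzed with ordinary tooling. The sketch below assumes the exported headers match the column names listed above and that the file was saved as layers.csv; both are assumptions and may need adjusting to match the actual export.

    import pandas as pd

    # Assumed file name and column headers ("Type", "Inference Time (ms)").
    df = pd.read_csv("layers.csv")

    # Total inference time attributed to each layer type, slowest first.
    by_type = (df.groupby("Type")["Inference Time (ms)"]
                 .sum()
                 .sort_values(ascending=False))
    print(by_type.head(10))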

Network Graphs

The profiler report includes two graphs summarizing layer behavior by type. These graphs can be resized by dragging the splitter that separates them from the layer table.

../_images/report-graphs-latency-by-type.png

The Average Latency per Layer Type graph displays the average inference time for each layer type in the network. Longer bars represent longer inference times. The horizontal axis scale is in milliseconds by default. Checking the View as Percentage checkbox will rescale the X axis to represent percentages of the summed layer inference times. The Use Full Pass Time in Percentages option described earlier does not apply to this graph.

The Show Grid Lines checkbox controls whether grid lines are displayed within the graph.

../_images/report-graphs-latency-by-type-percent.png

The previous graph, but with View as Percentage checked and Show Grid Lines unchecked.

../_images/report-graphs-precision-by-type.png

The Precision per Layer Type graph displays the distribution of input tensor precisions used for each layer type. The inner ring of the pie chart contains the various layer types executed by the network. Each layer type is subdivided in the outer ring by its instance precisions as defined in the layer table. In the example above, all CaskConvolution layers executed in FP16 precision but the PointWiseV2 layers executed in an equal mix of FP16 and FP32 precision. Layers executing in FP32 represent an opportunity to reduce precision for improvements in memory footprint and execution performance.

Networks with many operator types may result in small individual segments within the chart. Hover the mouse over a graph segment to explode it visually and display a tooltip with the segment’s text and value. Layer type percentages (the inner ring of the graph) are relative to the total number of layers within the network. Layer precision percentages (the outer ring) are relative to the number of layers of that type.