Integrating TAO Models with NSight DL Designer#
NVIDIA NSight DL Designer (NDLD) is a comprehensive visualization, debugging, and profiling tool for deep learning models. TAO Toolkit models can be seamlessly integrated with NSight DL Designer for advanced model optimization, profiling, and deployment workflows.
Overview#
NSight DL Designer integration with TAO enables you to:
Visualize TAO model architectures exported as ONNX
Profile TensorRT engines generated from TAO models
Optimize layer-level precision constraints for improved performance
Debug inference performance bottlenecks
Deploy optimized TensorRT engines with fine-grained control
This integration is available through TAO Deploy workflows starting with TAO 6.25.11.
Prerequisites#
Software Requirements#
Operating System: Linux Ubuntu 22.04 LTS or newer
GPU: NVIDIA Ampere (A100) or NVIDIA Hopper™ (H100) architecture or newer
NSight DL Designer: Version 2025.3 or later
TAO Deploy Container: Version 6.25.11 or later
NVIDIA Driver: Compatible with your GPU and TAO Deploy container
Installation#
Install NSight DL Designer
Download and install NSight DL Designer from the NVIDIA Developer Portal.
# Extract and install NDLD
chmod +x NVIDIA_DeepLearning_*.run
./NVIDIA_DeepLearning_*.run
Install TAO CLI
Install the TAO CLI using Python virtualenv:
# Create and activate virtual environment
python3 -m venv tao_env
source tao_env/bin/activate

# Install TAO
pip install nvidia-tao
Configure TAO Mounts
Create or update ~/.tao_mounts.json with your directory mappings:
{
    "Mounts": [
        {
            "source": "/home/user/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/user/data",
            "destination": "/data"
        }
    ]
}
Configuration#
Configuring NSight DL Designer for TAO#
Launch NSight DL Designer
cd <NDLD_INSTALL_DIR>/host/linux-desktop-dl-x64
./nsight-dl
Configure TAO Integration
Navigate to Tools > Options > Activities > NVIDIA TAO Integration:
Python/Conda Activation Command: Enter your environment activation command
Example: . ~/tao_env/bin/activate
Keep Activity Log Window Open: Set to Yes to view TAO output after activity completion
Configure Connection
For local testing: Select localhost
For remote testing: Add an SSH connection to your target machine
See the NSight DL Designer User Guide for detailed connection configuration.
TAO Deploy Workflows with NSight DL Designer#
Deploy TAO Model#
This workflow generates a TensorRT engine from a TAO ONNX model using the gen_trt_engine command.
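For orientation, the relevant section of a TAO experiment spec looks roughly like the sketch below. The field names (onnx_file, trt_engine) follow common TAO gen_trt_engine specs but vary by network, so treat this as an illustrative assumption rather than a canonical schema:

gen_trt_engine:
  onnx_file: /workspace/tao-experiments/export/model.onnx        # TAO-exported ONNX model (assumed path)
  trt_engine: /workspace/tao-experiments/results/tao_engine.trt  # where the engine is written (assumed path)
  tensorrt:
    data_type: fp16   # fp32, fp16, or int8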
Steps:
In NSight DL Designer, click Start Activity
Select Linux (x86-64) as the target platform
Choose Deploy TAO Model activity
Configure the Common tab:
ONNX Model: Path to your TAO-exported ONNX model
Experiment Spec: Path to your TAO experiment YAML file
Results Directory: Local path for output files
Configure the TAO-Deploy tab (optional overrides; an example spec mapping follows these steps):
Data Type: FP32, FP16, or INT8 (overrides YAML setting)
Device Index: GPU device ID (leave blank for device 0)
GPU Workspace Size: TensorRT workspace size in gigabytes
Batch Size: Min/Opt/Max batch sizes for optimization profile
Click Start to launch the activity
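The TAO-Deploy tab fields override the corresponding entries in the spec's gen_trt_engine section. A hedged sketch of that mapping, with field names (gpu_id, workspace_size) assumed from typical TAO gen_trt_engine specs and values purely illustrative:

gen_trt_engine:
  gpu_id: 0                # Device Index
  tensorrt:
    data_type: fp16        # Data Type
    workspace_size: 2048   # GPU Workspace Size (spec value assumed in MB; the NDLD field is in GB)

Batch-size overrides are covered under Variable Batch Size Optimization below.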
Expected Output:
The results directory will contain:
results/
    gen_trt_engine.log
    gen_trt_engine/
        experiment.yaml
        status.json
        tao_engine.trt
Profile TAO Model#
This workflow profiles a TensorRT engine (newly generated or prebuilt) and provides detailed performance metrics.
With Newly Generated Engine:
In NSight DL Designer, click Start Activity
Select Linux (x86-64) as the target platform
Choose Profile TAO Model activity
Configure the Common tab:
ONNX Model: Path to your TAO-exported ONNX model
Experiment Spec: Path to your TAO experiment YAML file
Prebuilt TensorRT Engine: Leave empty
Results Directory: Local path for output files
Configure the TAO-Deploy tab as needed
Configure the Profiler tab:
Inference Batch Size: Batch size for profiling runs
Additional profiler settings (see NSight DL Designer User Guide)
Click Start to launch the activity
Expected Output:
Interactive profiler report opens in NSight DL Designer
Results directory contains:
results/
    gen_trt_engine.log
    gen_trt_engine/
        experiment.yaml
        status.json
        tao_engine.trt
    tao_report.nv-dld-report
With Prebuilt Engine:
To profile an existing TensorRT engine without regenerating it:
Follow steps 1-4 above, but set Prebuilt TensorRT Engine to the path of your existing .trt file
Configure profiler settings and launch
This skips the gen_trt_engine step and profiles the provided engine directly.
Advanced Features#
Setting Layer Precision Constraints#
NSight DL Designer allows you to set per-layer precision constraints for fine-grained optimization.
Steps:
Open your ONNX model in NSight DL Designer: File > Open File
Navigate the model graph and select one or more ONNX nodes
Right-click and select Set TensorRT Layer Precision
Choose a precision (FP32, FP16, or INT8) from the dropdown
Click OK and save the model
Applying Precision Constraints:
Configure how NSight DL Designer precision constraints are applied using the TAO-Deploy > Apply Layer Type Constraints setting:
From YAML only (default): Only YAML constraints are applied; DLD constraints are ignored
From DLD only: Only DLD constraints are applied; YAML constraints are ignored
Merge YAML and DLD: Both are applied; DLD constraints take precedence if there is a conflict
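To make the merge behavior concrete, consider this small hypothetical case (layer names are illustrative only):

# YAML spec:
gen_trt_engine:
  tensorrt:
    layers_precision:
      conv_block_1: fp32   # DLD also constrains this layer, to int8
      head_mlp: fp16       # YAML only

# DLD constraints: conv_block_1 -> int8, backbone_stem -> fp16
# Merged result:   conv_block_1 -> int8 (DLD wins the conflict)
#                  head_mlp -> fp16 and backbone_stem -> fp16 (no conflict)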
Example Workflow:
# In your TAO experiment YAML
gen_trt_engine:
  tensorrt:
    data_type: fp16
    layers_precision:
      layer_name_1: fp32
      layer_name_2: int8
Set precision constraints in NSight DL Designer for additional layers
Choose Merge YAML and DLD mode
Profile the model to validate layer precisions in the report
The NSight DL Designer profiler report includes a Precision column showing the actual precision used for each layer.
Variable Batch Size Optimization#
Configure dynamic batch size support for TensorRT engines:
In TAO-Deploy > TensorRT Optimization Profile:
Batch Size: Optimal batch size (used during engine generation)
Batch Size (Min): Minimum supported batch size
Batch Size (Max): Maximum supported batch size
In Profiler tab:
Inference Batch Size: Batch size for profiling (must be within [Min, Max] range)
Example Configuration:
# These settings can be overridden in NSight DL Designer
Batch Size: 2
Batch Size (Min): 1
Batch Size (Max): 4
Inference Batch Size: 3 # Must be between 1 and 4
The profiler report will show input dimensions reflecting the inference batch size.
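In the experiment spec, these profile bounds correspond to batch-size fields under tensorrt. The field names below (min_batch_size, opt_batch_size, max_batch_size) are assumed from typical TAO gen_trt_engine specs; the NDLD fields above override them:

gen_trt_engine:
  tensorrt:
    min_batch_size: 1   # Batch Size (Min)
    opt_batch_size: 2   # Batch Size; the engine is tuned for this value
    max_batch_size: 4   # Batch Size (Max)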
INT8 PTQ Calibration#
For INT8 quantized models, configure Post-Training Quantization (PTQ) calibration parameters:
In TAO-Deploy > PTQ INT8 Calibration:
Batch Size: Calibration batch size
Calibration Batches: Number of batches to use for calibration
These settings map to:
gen_trt_engine:
  tensorrt:
    calibration:
      cal_batch_size: <value>
      cal_batches: <value>
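For example, to calibrate over 100 batches of 8 samples each (values are illustrative, not a recommendation):

gen_trt_engine:
  tensorrt:
    data_type: int8
    calibration:
      cal_batch_size: 8   # samples per calibration batch
      cal_batches: 100    # number of calibration batches

With these values, calibration consumes 8 × 100 = 800 samples, so the calibration dataset must provide at least that many.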
Profiler Reports#
Understanding Profiler Output#
The NSight DL Designer profiler report provides comprehensive insights:
Network Metrics Table:
Layer Name: TensorRT layer identifier
Layer Type: Operation type (Convolution, MatMul, etc.)
Input/Output Dimensions: Tensor shapes
Precision: Actual precision used (FP32, FP16, INT8)
Execution Time: Per-layer latency
Memory Usage: GPU memory consumption
Performance Metrics:
Total inference time
Per-layer execution breakdown
Memory bandwidth utilization
GPU utilization
Use these metrics to:
Identify performance bottlenecks
Validate precision constraints
Optimize batch sizes
Guide model architecture improvements
Troubleshooting#
Common Issues#
TAO CLI Not Found:
Ensure TAO CLI is installed and in your $PATH
Verify the activation command in Tools > Options > Activities > NVIDIA TAO Integration
Check that NSight DL Designer can execute the activation command
Directory Mapping Errors:
Verify ~/.tao_mounts.json is properly configured
Ensure source directories exist and are accessible
Check permissions on mounted directories
TensorRT Engine Generation Fails:
Review gen_trt_engine.log in the results directory
Check YAML configuration for errors
Verify ONNX model compatibility
Ensure sufficient GPU memory (especially for large batch sizes)
Insufficient GPU Memory:
Reduce batch sizes (Min/Opt/Max)
Reduce GPU workspace size
Use FP16 or INT8 instead of FP32
Close other GPU-intensive applications
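Applied to the spec, the first three remedies might look like the following sketch (field names assumed, as in the earlier examples):

gen_trt_engine:
  tensorrt:
    data_type: fp16        # drop from fp32 to reduce memory pressure
    workspace_size: 1024   # reduced workspace (assumed MB)
    min_batch_size: 1
    opt_batch_size: 1      # smaller batches lower peak memory
    max_batch_size: 2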
Profiling Errors:
Ensure inference batch size is within [Min, Max] range
Verify TensorRT engine is valid
Check GPU device availability
Best Practices#
Model Optimization Workflow#
Initial Profiling: Profile with default settings to establish baseline performance
Identify Hotspots: Use profiler reports to find time-consuming layers
Apply Precision Constraints: Selectively reduce precision for compute-heavy layers
Re-profile: Measure performance impact and accuracy trade-offs
Iterate: Refine constraints based on performance and accuracy requirements
Deploy: Export final optimized engine for production use
Performance Tuning Tips#
Mixed Precision: Use FP16 for most layers, FP32 only where numerical stability requires it (see the sketch after this list)
Batch Size: Profile multiple batch sizes to find optimal throughput/latency balance
Workspace Size: Increase if seeing TensorRT optimization warnings
Layer Fusion: Let TensorRT automatically fuse layers; avoid unnecessary constraints
Calibration: For INT8, use representative calibration data covering expected inputs
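As a sketch of the mixed-precision tip, using the layers_precision mechanism shown earlier (the layer name is hypothetical):

gen_trt_engine:
  tensorrt:
    data_type: fp16              # FP16 as the network-wide default
    layers_precision:
      final_logits_layer: fp32   # pin a numerically sensitive layer to FP32

After applying such a constraint, re-profile and check the report's Precision column to confirm the layer actually ran in FP32.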
Remote Testing#
For testing on remote machines:
Configure SSH connection in NSight DL Designer
Install TAO CLI on the remote machine
Ensure ~/.tao_mounts.json exists on the remote machine
Configure network access and firewall rules
Use the results archive extraction feature to retrieve outputs
Additional Resources#
Supported TAO Models#
The following TAO PyTorch models are compatible with NSight DL Designer integration:
Computer Vision:
Classification (classification)
Object Detection (DINO, RT-DETR)
Open Vocabulary Object Detection (Grounding DINO)
Visual Change Detection (Visual ChangeNet)
Semantic Segmentation (Segformer, Mask2Former, OneFormer)
Requirements:
Models must be exported to ONNX format via TAO export workflow
Experiment YAML file must be available
Models must be compatible with TensorRT (see TAO export documentation)
Limitations#
Remote profiling requires SSH connectivity and proper network configuration
INT8 quantization requires calibration data accessible from the TAO container
Very large models may require significant GPU memory (16 GB or more recommended)
Some TAO models with custom operations may have limited TensorRT support
Note
NSight DL Designer integration is available starting with TAO 6.25.11. Earlier TAO versions are not supported.