Integrating TAO Models with NSight DL Designer#

NVIDIA NSight DL Designer (NDLD) is a comprehensive visualization, debugging, and profiling tool for deep learning models. TAO Toolkit models integrate with NSight DL Designer for model optimization, profiling, and deployment workflows.

Overview#

NSight DL Designer integration with TAO enables you to:

  • Visualize TAO model architectures exported as ONNX

  • Profile TensorRT engines generated from TAO models

  • Optimize layer-level precision constraints for improved performance

  • Debug inference performance bottlenecks

  • Deploy optimized TensorRT engines with fine-grained control

This integration is available through TAO Deploy workflows starting with TAO 6.25.11.

Prerequisites#

Software Requirements#

  • Operating System: Ubuntu Linux 22.04 LTS or later

  • GPU: NVIDIA Ampere (A100) or NVIDIA Hopper™ (H100) architecture or newer

  • NSight DL Designer: Version 2025.3 or later

  • TAO Deploy Container: Version 6.25.11 or later

  • NVIDIA Driver: Compatible with your GPU and TAO Deploy container

Installation#

  1. Install NSight DL Designer

    Download and install NSight DL Designer from the NVIDIA Developer Portal.

    # Extract and install NDLD
    chmod +x NVIDIA_DeepLearning_*.run
    ./NVIDIA_DeepLearning_*.run
    
  2. Install TAO CLI

    Install the TAO CLI in a Python virtual environment:

    # Create and activate a virtual environment
    python3 -m venv ~/tao_env
    source ~/tao_env/bin/activate
    
    # Install the TAO CLI
    pip install nvidia-tao
    
  3. Configure TAO Mounts

    Create or update ~/.tao_mounts.json with your directory mappings:

    {
       "Mounts": [
          {
             "source": "/home/user/tao-experiments",
             "destination": "/workspace/tao-experiments"
          },
          {
             "source": "/home/user/data",
             "destination": "/data"
          }
       ]
    }
    

Configuration#

Configuring NSight DL Designer for TAO#

  1. Launch NSight DL Designer

    cd <NDLD_INSTALL_DIR>/host/linux-desktop-dl-x64
    ./nsight-dl
    
  2. Configure TAO Integration

    Navigate to Tools > Options > Activities > NVIDIA TAO Integration:

    • Python/Conda Activation Command: Enter your environment activation command

      Example: . ~/tao_env/bin/activate

    • Keep Activity Log Window Open: Set to Yes to view TAO output after activity completion

  3. Configure Connection

    • For local testing: Select localhost

    • For remote testing: Add an SSH connection to your target machine

    See the NSight DL Designer User Guide for detailed connection configuration.

TAO Deploy Workflows with NSight DL Designer#

Deploy TAO Model#

This workflow generates a TensorRT engine from a TAO ONNX model using the gen_trt_engine command.
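
Under the hood, the activity consumes the gen_trt_engine section of the experiment spec. The following is a minimal sketch assuming the common TAO Deploy spec schema; the paths are placeholders and the exact fields vary by model, so check the gen_trt_engine documentation for your model:

# Minimal gen_trt_engine spec sketch (paths are placeholders)
gen_trt_engine:
  onnx_file: /workspace/tao-experiments/export/model.onnx
  trt_engine: /workspace/tao-experiments/results/tao_engine.trt
  tensorrt:
    data_type: fp16        # FP32, FP16, or INT8
    workspace_size: 4096   # verify the expected units for your TAO version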

Steps:

  1. In NSight DL Designer, click Start Activity

  2. Select Linux (x86-64) as the target platform

  3. Choose Deploy TAO Model activity

  4. Configure the Common tab:

    • ONNX Model: Path to your TAO-exported ONNX model

    • Experiment Spec: Path to your TAO experiment YAML file

    • Results Directory: Local path for output files

  5. Configure the TAO-Deploy tab (optional overrides):

    • Data Type: FP32, FP16, or INT8 (overrides YAML setting)

    • Device Index: GPU device ID (leave blank for device 0)

    • GPU Workspace Size: TensorRT workspace size in gigabytes

    • Batch Size: Min/Opt/Max batch sizes for optimization profile

  6. Click Start to launch the activity

Expected Output:

The results directory will contain:

results/
    gen_trt_engine.log
    gen_trt_engine/
        experiment.yaml
        status.json
        tao_engine.trt

Profile TAO Model#

This workflow profiles a TensorRT engine (newly generated or prebuilt) and provides detailed performance metrics.

With Newly Generated Engine:

  1. In NSight DL Designer, click Start Activity

  2. Select Linux (x86-64) as the target platform

  3. Choose Profile TAO Model activity

  4. Configure the Common tab:

    • ONNX Model: Path to your TAO-exported ONNX model

    • Experiment Spec: Path to your TAO experiment YAML file

    • Prebuilt TensorRT Engine: Leave empty

    • Results Directory: Local path for output files

  5. Configure the TAO-Deploy tab as needed

  6. Configure the Profiler tab:

    • Inference Batch Size: Batch size for profiling runs

    • Additional profiler settings (see NSight DL Designer User Guide)

  7. Click Start to launch the activity

Expected Output:

  • Interactive profiler report opens in NSight DL Designer

  • Results directory contains:

results/
    gen_trt_engine.log
    gen_trt_engine/
        experiment.yaml
        status.json
        tao_engine.trt
    tao_report.nv-dld-report

With Prebuilt Engine:

To profile an existing TensorRT engine without regenerating it:

  1. Follow steps 1-4 above, but set Prebuilt TensorRT Engine to the path of your existing .trt file

  2. Configure profiler settings and launch

This skips the gen_trt_engine step and profiles the provided engine directly.

Advanced Features#

Setting Layer Precision Constraints#

NSight DL Designer allows you to set per-layer precision constraints for fine-grained optimization.

Steps:

  1. Open your ONNX model in NSight DL Designer: File > Open File

  2. Navigate the model graph and select one or more ONNX nodes

  3. Right-click and select Set TensorRT Layer Precision

  4. Choose a precision (FP32, FP16, or INT8) from the dropdown

  5. Click OK and save the model

Applying Precision Constraints:

Configure how NSight DL Designer precision constraints are applied using the TAO-Deploy > Apply Layer Type Constraints setting:

  • From YAML only (default): Only YAML constraints are applied; DLD constraints are ignored

  • From DLD only: Only DLD constraints are applied; YAML constraints are ignored

  • Merge YAML and DLD: Both are applied; DLD constraints take precedence if there is a conflict

Example Workflow:

  1. Define baseline constraints in your TAO experiment YAML:

    gen_trt_engine:
      tensorrt:
        data_type: fp16
        layers_precision:
          layer_name_1: fp32
          layer_name_2: int8

  2. Set precision constraints in NSight DL Designer for additional layers

  3. Choose Merge YAML and DLD mode

  4. Profile the model to validate layer precisions in the report

The NSight DL Designer profiler report includes a Precision column showing the actual precision used for each layer.

Variable Batch Size Optimization#

Configure dynamic batch size support for TensorRT engines:

In TAO-Deploy > TensorRT Optimization Profile:

  • Batch Size: Optimal batch size (used during engine generation)

  • Batch Size (Min): Minimum supported batch size

  • Batch Size (Max): Maximum supported batch size

In Profiler tab:

  • Inference Batch Size: Batch size for profiling (must be within [Min, Max] range)

Example Configuration:

# These settings are configured in NSight DL Designer and override the YAML values
Batch Size: 2
Batch Size (Min): 1
Batch Size (Max): 4
Inference Batch Size: 3  # Must be between 1 and 4
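
Assuming the standard TAO Deploy spec schema, these settings correspond to the optimization profile fields of the experiment YAML. The values below mirror the example above and are illustrative only:

# Optimization profile sketch matching the settings above (assumed field names)
gen_trt_engine:
  tensorrt:
    min_batch_size: 1   # Batch Size (Min)
    opt_batch_size: 2   # Batch Size (optimal, used during engine generation)
    max_batch_size: 4   # Batch Size (Max)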

The profiler report will show input dimensions reflecting the inference batch size.

INT8 PTQ Calibration#

For INT8 quantized models, configure Post-Training Quantization (PTQ) calibration parameters:

In TAO-Deploy > PTQ INT8 Calibration:

  • Batch Size: Calibration batch size

  • Calibration Batches: Number of batches to use for calibration

These settings map to:

gen_trt_engine:
  tensorrt:
    calibration:
      cal_batch_size: <value>
      cal_batches: <value>
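
In practice, INT8 calibration also needs a source of representative data and usually a cache file. The sketch below extends the mapping with fields commonly found in TAO Deploy calibration configs; treat the additional field names and paths as assumptions to verify against your model's spec:

# Calibration sketch; cal_image_dir and cal_cache_file are assumed field names
gen_trt_engine:
  tensorrt:
    data_type: int8
    calibration:
      cal_batch_size: 8                  # Batch Size
      cal_batches: 100                   # Calibration Batches
      cal_image_dir: /data/calibration   # representative input data (placeholder path)
      cal_cache_file: /workspace/tao-experiments/results/cal.bin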

Profiler Reports#

Understanding Profiler Output#

The NSight DL Designer profiler report provides comprehensive insights:

Network Metrics Table:

  • Layer Name: TensorRT layer identifier

  • Layer Type: Operation type (Convolution, MatMul, etc.)

  • Input/Output Dimensions: Tensor shapes

  • Precision: Actual precision used (FP32, FP16, INT8)

  • Execution Time: Per-layer latency

  • Memory Usage: GPU memory consumption

Performance Metrics:

  • Total inference time

  • Per-layer execution breakdown

  • Memory bandwidth utilization

  • GPU utilization

Use these metrics to:

  • Identify performance bottlenecks

  • Validate precision constraints

  • Optimize batch sizes

  • Guide model architecture improvements

Troubleshooting#

Common Issues#

TAO CLI Not Found:

  • Ensure TAO CLI is installed and in your $PATH

  • Verify the activation command in Tools > Options > Activities > NVIDIA TAO Integration

  • Check that NSight DL Designer can execute the activation command

Directory Mapping Errors:

  • Verify ~/.tao_mounts.json is properly configured

  • Ensure source directories exist and are accessible

  • Check permissions on mounted directories

TensorRT Engine Generation Fails:

  • Review gen_trt_engine.log in the results directory

  • Check YAML configuration for errors

  • Verify ONNX model compatibility

  • Ensure sufficient GPU memory (especially for large batch sizes)

Insufficient GPU Memory:

  • Reduce batch sizes (Min/Opt/Max)

  • Reduce GPU workspace size

  • Use FP16 or INT8 instead of FP32

  • Close other GPU-intensive applications

Profiling Errors:

  • Ensure inference batch size is within [Min, Max] range

  • Verify TensorRT engine is valid

  • Check GPU device availability

Best Practices#

Model Optimization Workflow#

  1. Initial Profiling: Profile with default settings to establish baseline performance

  2. Identify Hotspots: Use profiler reports to find time-consuming layers

  3. Apply Precision Constraints: Selectively reduce precision for compute-heavy layers (see the YAML sketch after this list)

  4. Re-profile: Measure performance impact and accuracy trade-offs

  5. Iterate: Refine constraints based on performance and accuracy requirements

  6. Deploy: Export final optimized engine for production use
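
As an illustration of step 3, the sketch below pins one numerically sensitive layer back to FP32 while the rest of the network defaults to FP16. The layer name is a placeholder standing in for a hotspot identified in your profiler report:

# Mixed-precision sketch; the layer name is a placeholder
gen_trt_engine:
  tensorrt:
    data_type: fp16                 # network-wide default
    layers_precision:
      /backbone/stem/Conv: fp32     # hypothetical layer that loses accuracy at FP16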

Performance Tuning Tips#

  • Mixed Precision: Use FP16 for most layers, FP32 only where numerical stability requires it

  • Batch Size: Profile multiple batch sizes to find optimal throughput/latency balance

  • Workspace Size: Increase the workspace if TensorRT reports insufficient-workspace warnings during engine optimization

  • Layer Fusion: Let TensorRT automatically fuse layers; avoid unnecessary constraints

  • Calibration: For INT8, use representative calibration data covering expected inputs

Remote Testing#

For testing on remote machines:

  1. Configure SSH connection in NSight DL Designer

  2. Install TAO CLI on the remote machine

  3. Ensure ~/.tao_mounts.json exists on the remote machine

  4. Configure network access and firewall rules

  5. Use the results archive extraction feature to retrieve outputs

Additional Resources#

Supported TAO Models#

The following TAO PyTorch models are compatible with NSight DL Designer integration:

Computer Vision:

  • Classification (classification)

  • Object Detection (DINO, RT-DETR)

  • Open Vocabulary Object Detection (Grounding DINO)

  • Visual Change Detection (Visual ChangeNet)

  • Semantic Segmentation (Segformer, Mask2Former, OneFormer)

Requirements:

  • Models must be exported to ONNX format via the TAO export workflow (a spec sketch follows this list)

  • Experiment YAML file must be available

  • Models must be compatible with TensorRT (see TAO export documentation)
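
For the first requirement, a minimal export section of the experiment spec might look like the sketch below. Field names follow common TAO PyTorch export configs but vary by model, so verify them against your model's export documentation:

# Export spec sketch (paths are placeholders)
export:
  checkpoint: /workspace/tao-experiments/train/model.pth
  onnx_file: /workspace/tao-experiments/export/model.onnx
  opset_version: 17   # assumed; use the opset your model's exporter supports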

Limitations#

  • Remote profiling requires SSH connectivity and proper network configuration

  • INT8 quantization requires calibration data accessible from the TAO container

  • Very large models may require significant GPU memory (16 GB or more recommended)

  • Some TAO models with custom operations may have limited TensorRT support

Note

NSight DL Designer integration is available starting with TAO 6.25.11. Earlier TAO versions are not supported.