Integrating TAO Models with NSight DL Designer#
NVIDIA NSight DL Designer (NDLD) is a comprehensive visualization, debugging, and profiling tool for deep learning models. TAO Toolkit models can be seamlessly integrated with NSight DL Designer for advanced model optimization, profiling, and deployment workflows.
Overview#
NSight DL Designer integration with TAO enables you to:
Visualize TAO model architectures exported as ONNX
Profile TensorRT engines generated from TAO models
Optimize layer-level precision constraints for improved performance
Debug inference performance bottlenecks
Deploy optimized TensorRT engines with fine-grained control
This integration is available through TAO Deploy workflows starting with TAO 6.25.11.
Prerequisites#
Software Requirements#
Operating System: Linux Ubuntu 22.04 LTS or newer
GPU: NVIDIA Ampere (A100) or NVIDIA Hopper™ (H100) architecture or newer
NSight DL Designer: Version 2025.3 or later
TAO Deploy Container: Version 6.25.11 or later
NVIDIA Driver: Compatible with your GPU and TAO Deploy container
Installation#
Install NSight DL Designer
Download and install NSight DL Designer from the NVIDIA Developer Portal.
# Extract and install NDLD
chmod +x NVIDIA_DeepLearning_*.run
./NVIDIA_DeepLearning_*.run
Install TAO CLI
Install the TAO CLI using Python virtualenv:
# Create and activate virtual environment
python3 -m venv tao_env
source tao_env/bin/activate

# Install TAO
pip install nvidia-tao
Configure TAO Mounts
Create or update ~/.tao_mounts.json with your directory mappings:
{
    "Mounts": [
        {
            "source": "/home/user/tao-experiments",
            "destination": "/workspace/tao-experiments"
        },
        {
            "source": "/home/user/data",
            "destination": "/data"
        }
    ]
}
Configuration#
Configuring NSight DL Designer for TAO#
Launch NSight DL Designer
cd <NDLD_INSTALL_DIR>/host/linux-desktop-dl-x64
./nsight-dl
Configure TAO Integration
Navigate to Tools > Options > Activities > NVIDIA TAO Integration:
Python/Conda Activation Command: Enter your environment activation command
Example: . ~/tao_env/bin/activate
Keep Activity Log Window Open: Set to Yes to view TAO output after activity completion
Configure Connection
For local testing: Select localhost
For remote testing: Add an SSH connection to your target machine
See the NSight DL Designer User Guide for detailed connection configuration.
TAO Deploy Workflows with NSight DL Designer#
Deploy TAO Model#
This workflow generates a TensorRT engine from a TAO ONNX model using the gen_trt_engine command.
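For orientation, the relevant section of a TAO experiment spec looks roughly like the sketch below. The field names (onnx_file, trt_engine) follow common TAO gen_trt_engine specs but vary by network, so treat this as an illustrative assumption rather than a canonical schema:

gen_trt_engine:
  onnx_file: /workspace/tao-experiments/export/model.onnx        # TAO-exported ONNX model (assumed path)
  trt_engine: /workspace/tao-experiments/results/tao_engine.trt  # where the engine is written (assumed path)
  tensorrt:
    data_type: fp16   # fp32, fp16, or int8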
Steps:
In NSight DL Designer, click Start Activity
Select Linux (x86-64) as the target platform
Choose Deploy TAO Model activity
Configure the Common tab:
ONNX Model: Path to your TAO-exported ONNX model
Experiment Spec: Path to your TAO experiment YAML file
Results Directory: Local path for output files
Configure the TAO-Deploy tab (optional overrides; an example spec mapping follows these steps):
Data Type: FP32, FP16, or INT8 (overrides YAML setting)
Device Index: GPU device ID (leave blank for device 0)
GPU Workspace Size: TensorRT workspace size in gigabytes
Batch Size: Min/Opt/Max batch sizes for optimization profile
Click Start to launch the activity
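The TAO-Deploy tab fields override the corresponding entries in the spec's gen_trt_engine section. A hedged sketch of that mapping, with field names (gpu_id, workspace_size) assumed from typical TAO gen_trt_engine specs and values purely illustrative:

gen_trt_engine:
  gpu_id: 0                # Device Index
  tensorrt:
    data_type: fp16        # Data Type
    workspace_size: 2048   # GPU Workspace Size (spec value assumed in MB; the NDLD field is in GB)

Batch-size overrides are covered under Variable Batch Size Optimization below.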
Expected Output:
The results directory will contain:
results/
    gen_trt_engine.log
    gen_trt_engine/
        experiment.yaml
        status.json
        tao_engine.trt
Profile TAO Model#
This workflow profiles a TensorRT engine (newly generated or prebuilt) and provides detailed performance metrics.
With Newly Generated Engine:
In NSight DL Designer, click Start Activity
Select Linux (x86-64) as the target platform
Choose Profile TAO Model activity
Configure the Common tab:
ONNX Model: Path to your TAO-exported ONNX model
Experiment Spec: Path to your TAO experiment YAML file
Prebuilt TensorRT Engine: Leave empty
Results Directory: Local path for output files
Configure the TAO-Deploy tab as needed
Configure the Profiler tab:
Inference Batch Size: Batch size for profiling runs
Additional profiler settings (see NSight DL Designer User Guide)
Click Start to launch the activity
Expected Output:
Interactive profiler report opens in NSight DL Designer
Results directory contains:
results/
    gen_trt_engine.log
    gen_trt_engine/
        experiment.yaml
        status.json
        tao_engine.trt
    tao_report.nv-dld-report
With Prebuilt Engine:
To profile an existing TensorRT engine without regenerating it:
Follow steps 1-4 above, but set Prebuilt TensorRT Engine to the path of your existing .trt file
Configure profiler settings and launch
This skips the gen_trt_engine step and profiles the provided engine directly.
Advanced Features#
Setting Layer Precision Constraints#
NSight DL Designer allows you to set per-layer precision constraints for fine-grained optimization.
Steps:
Open your ONNX model in NSight DL Designer: File > Open File
Navigate the model graph and select one or more ONNX nodes
Right-click and select Set TensorRT Layer Precision
Choose a precision (FP32, FP16, or INT8) from the dropdown
Click OK and save the model
Applying Precision Constraints:
Configure how NSight DL Designer precision constraints are applied using the TAO-Deploy > Apply Layer Type Constraints setting:
From YAML only (default): Only YAML constraints are applied; DLD constraints are ignored
From DLD only: Only DLD constraints are applied; YAML constraints are ignored
Merge YAML and DLD: Both are applied; DLD constraints take precedence if there is a conflict
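To make the merge behavior concrete, consider this small hypothetical case (layer names are illustrative only):

# YAML spec:
gen_trt_engine:
  tensorrt:
    layers_precision:
      conv_block_1: fp32   # DLD also constrains this layer, to int8
      head_mlp: fp16       # YAML only

# DLD constraints: conv_block_1 -> int8, backbone_stem -> fp16
# Merged result:   conv_block_1 -> int8 (DLD wins the conflict)
#                  head_mlp -> fp16 and backbone_stem -> fp16 (no conflict)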
Example Workflow:
# In your TAO experiment YAML
gen_trt_engine:
  tensorrt:
    data_type: fp16
    layers_precision:
      layer_name_1: fp32
      layer_name_2: int8
Set precision constraints in NSight DL Designer for additional layers
Choose Merge YAML and DLD mode
Profile the model to validate layer precisions in the report
The NSight DL Designer profiler report includes a Precision column showing the actual precision used for each layer.
Variable Batch Size Optimization#
Configure dynamic batch size support for TensorRT engines:
In TAO-Deploy > TensorRT Optimization Profile:
Batch Size: Optimal batch size (used during engine generation)
Batch Size (Min): Minimum supported batch size
Batch Size (Max): Maximum supported batch size
In Profiler tab:
Inference Batch Size: Batch size for profiling (must be within [Min, Max] range)
Example Configuration:
# These settings can be overridden in NSight DL Designer
Batch Size: 2
Batch Size (Min): 1
Batch Size (Max): 4
Inference Batch Size: 3 # Must be between 1 and 4
The profiler report will show input dimensions reflecting the inference batch size.
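In the experiment spec, these profile bounds correspond to batch-size fields under tensorrt. The field names below (min_batch_size, opt_batch_size, max_batch_size) are assumed from typical TAO gen_trt_engine specs; the NDLD fields above override them:

gen_trt_engine:
  tensorrt:
    min_batch_size: 1   # Batch Size (Min)
    opt_batch_size: 2   # Batch Size; the engine is tuned for this value
    max_batch_size: 4   # Batch Size (Max)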
INT8 PTQ Calibration#
For INT8 quantized models, configure Post-Training Quantization (PTQ) calibration parameters:
In TAO-Deploy > PTQ INT8 Calibration:
Batch Size: Calibration batch size
Calibration Batches: Number of batches to use for calibration
These settings map to:
gen_trt_engine:
  tensorrt:
    calibration:
      cal_batch_size: <value>
      cal_batches: <value>
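For example, to calibrate over 100 batches of 8 samples each (values are illustrative, not a recommendation):

gen_trt_engine:
  tensorrt:
    data_type: int8
    calibration:
      cal_batch_size: 8   # samples per calibration batch
      cal_batches: 100    # number of calibration batches

With these values, calibration consumes 8 × 100 = 800 samples, so the calibration dataset must provide at least that many.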
Profiler Reports#
Understanding Profiler Output#
The NSight DL Designer profiler report provides comprehensive insights:
Network Metrics Table:
Layer Name: TensorRT layer identifier
Layer Type: Operation type (Convolution, MatMul, etc.)
Input/Output Dimensions: Tensor shapes
Precision: Actual precision used (FP32, FP16, INT8)
Execution Time: Per-layer latency
Memory Usage: GPU memory consumption
Performance Metrics:
Total inference time
Per-layer execution breakdown
Memory bandwidth utilization
GPU utilization
Use these metrics to:
Identify performance bottlenecks
Validate precision constraints
Optimize batch sizes
Guide model architecture improvements
Troubleshooting#
Common Issues#
TAO CLI Not Found:
Ensure TAO CLI is installed and in your $PATH
Verify the activation command in Tools > Options > Activities > NVIDIA TAO Integration
Check that NSight DL Designer can execute the activation command
Directory Mapping Errors:
Verify ~/.tao_mounts.json is properly configured
Ensure source directories exist and are accessible
Check permissions on mounted directories
TensorRT Engine Generation Fails:
Review gen_trt_engine.log in the results directory
Check YAML configuration for errors
Verify ONNX model compatibility
Ensure sufficient GPU memory (especially for large batch sizes)
Insufficient GPU Memory:
Reduce batch sizes (Min/Opt/Max)
Reduce GPU workspace size
Use FP16 or INT8 instead of FP32
Close other GPU-intensive applications
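Applied to the spec, the first three remedies might look like the following sketch (field names assumed, as in the earlier examples):

gen_trt_engine:
  tensorrt:
    data_type: fp16        # drop from fp32 to reduce memory pressure
    workspace_size: 1024   # reduced workspace (assumed MB)
    min_batch_size: 1
    opt_batch_size: 1      # smaller batches lower peak memory
    max_batch_size: 2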
Profiling Errors:
Ensure inference batch size is within [Min, Max] range
Verify TensorRT engine is valid
Check GPU device availability
Best Practices#
Model Optimization Workflow#
Initial Profiling: Profile with default settings to establish baseline performance
Identify Hotspots: Use profiler reports to find time-consuming layers
Apply Precision Constraints: Selectively reduce precision for compute-heavy layers
Re-profile: Measure performance impact and accuracy trade-offs
Iterate: Refine constraints based on performance and accuracy requirements
Deploy: Export final optimized engine for production use
Performance Tuning Tips#
Mixed Precision: Use FP16 for most layers, FP32 only where numerical stability requires it (see the sketch after this list)
Batch Size: Profile multiple batch sizes to find optimal throughput/latency balance
Workspace Size: Increase if seeing TensorRT optimization warnings
Layer Fusion: Let TensorRT automatically fuse layers; avoid unnecessary constraints
Calibration: For INT8, use representative calibration data covering expected inputs
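As a sketch of the mixed-precision tip, using the layers_precision mechanism shown earlier (the layer name is hypothetical):

gen_trt_engine:
  tensorrt:
    data_type: fp16              # FP16 as the network-wide default
    layers_precision:
      final_logits_layer: fp32   # pin a numerically sensitive layer to FP32

After applying such a constraint, re-profile and check the report's Precision column to confirm the layer actually ran in FP32.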
Remote Testing#
For testing on remote machines:
Configure SSH connection in NSight DL Designer
Install TAO CLI on the remote machine
Ensure ~/.tao_mounts.json exists on the remote machine
Configure network access and firewall rules
Use the results archive extraction feature to retrieve outputs
Additional Resources#
Supported TAO Models#
The following TAO PyTorch models are compatible with NSight DL Designer integration:
Computer Vision:
Classification (classification)
Object Detection (DINO, RT-DETR)
Open Vocabulary Object Detection (Grounding DINO)
Visual Change Detection (Visual ChangeNet)
Semantic Segmentation (Segformer, Mask2Former, OneFormer)
Requirements:
Models must be exported to ONNX format via TAO export workflow
Experiment YAML file must be available
Models must be compatible with TensorRT (see TAO export documentation)
Limitations#
Remote profiling requires SSH connectivity and proper network configuration
INT8 quantization requires calibration data accessible from the TAO container
Very large models may require significant GPU memory (16 GB or more recommended)
Some TAO models with custom operations may have limited TensorRT support
Note
NSight DL Designer integration is available starting with TAO 6.25.11. Earlier TAO versions are not supported.