Monocular Depth Estimation#
Monocular depth estimation is the task of predicting depth information from a single RGB image. TAO Toolkit provides advanced monocular depth estimation capabilities through the DepthNet model, supporting both relative and metric depth prediction using state-of-the-art transformer-based architectures.
The monocular depth estimation models in TAO support the following tasks:
- train
- evaluate
- inference
- export
- gen_trt_engine
These tasks can be invoked from the TAO Launcher using the following convention on the command-line:
SPECS=$(tao-client depth_net_mono get-spec --action <sub_task> --job_type experiment --id $EXPERIMENT_ID)
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action <sub_task> --id $EXPERIMENT_ID --specs "$SPECS")
Required Arguments
--id: The unique identifier of the experiment from which to train the model
See also
For information on how to create an experiment using the FTMS client, refer to the Creating an experiment section in the Remote Client documentation.
tao model depth_net <sub_task> <args_per_subtask>
Where args_per_subtask are the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.
Supported Model Architectures#
TAO Toolkit supports the following monocular depth estimation model types:
- MetricDepthAnything
A metric monocular depth estimation model that predicts absolute depth values in meters. This model is suitable for applications requiring precise depth measurements, such as robotics, autonomous navigation, and AR/VR. It uses a Vision Transformer (ViT) backbone based on DINOv2 and produces metric depth estimates that can be directly used for distance calculations.
- RelativeDepthAnything
A relative monocular depth estimation model that predicts depth relationships between objects in a scene. This model focuses on understanding the relative ordering of depths rather than absolute distances. It’s useful for applications where understanding spatial relationships is more important than exact measurements, such as image segmentation, scene understanding, and visual effects. This model can be fine-tuned to produce MetricDepthAnything models.
Both models support multiple Vision Transformer encoder sizes:
- vits (small): Faster inference, lower memory footprint
- vitl (large): Higher accuracy, recommended for most use cases
- vitg (giant): Best accuracy, requires more computational resources
Data Input for Monocular Depth Estimation#
Dataset Preparation#
Monocular depth estimation requires paired RGB images and depth ground truth. The dataset should be organized as follows:
Image Data: RGB images in standard formats (PNG, JPEG, etc.)
Depth Ground Truth: Depth maps in PFM (Portable Float Map) or PNG format
Data Split Files: Text files listing the paths to image and depth pairs
Data Split File Format
Each line in the data split file should contain paths to the RGB image and corresponding depth map, separated by a space:
/path/to/rgb/image_001.png /path/to/depth/image_001.pfm
/path/to/rgb/image_002.png /path/to/depth/image_002.pfm
...
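If you generate split files programmatically, a short helper such as the following can pair RGB images with depth maps by filename and write one pair per line. This is an illustrative sketch, not part of the toolkit; the write_split_file helper, directory layout, and file extensions are assumptions you should adapt to your data.
import os

def write_split_file(rgb_dir, depth_dir, output_path, depth_ext=".pfm"):
    """Pair RGB images with depth maps by filename stem and write one pair per line."""
    with open(output_path, "w") as f:
        for name in sorted(os.listdir(rgb_dir)):
            stem, ext = os.path.splitext(name)
            if ext.lower() not in (".png", ".jpg", ".jpeg"):
                continue
            depth_path = os.path.join(depth_dir, stem + depth_ext)
            if not os.path.exists(depth_path):
                continue  # skip images without ground truth
            f.write(f"{os.path.join(rgb_dir, name)} {depth_path}\n")

# Example usage (hypothetical paths):
write_split_file("/data/rgb", "/data/depth", "/data/splits/train_files_with_gt.txt")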
Supported Datasets#
TAO Toolkit supports the following monocular depth datasets:
NYUDV2: Indoor depth dataset with 1449 RGB-D images
NYUDV2Relative: NYUDV2 dataset configured for relative depth training
ThreeDVLM: 3D Vision Language Model dataset
FSD: Foundation Stereo Dataset
NvCLIP: NVIDIA CLIP-based depth dataset
IsaacRealDataset: NVIDIA Isaac real-world stereo data
Crestereo: CREStereo dataset
Middlebury: Middlebury stereo dataset
RelativeMonoDataset: Generic relative monocular depth dataset format
MetricMonoDataset: Generic metric monocular depth dataset format
For custom datasets, you can use the generic RelativeMonoDataset or MetricMonoDataset formats
by creating appropriate data split files.
Creating an Experiment Specification File#
The experiment specification file is a YAML configuration that defines all parameters for training, evaluation, and inference. Below are example configurations for both model types.
Configuration for MetricDepthAnything#
Here is an example specification file for training a MetricDepthAnything model:
Retrieve the specifications:
TRAIN_SPECS=$(tao-client depth_net_mono get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
The default specification is returned in $TRAIN_SPECS; override values as needed.
results_dir: /results/metric_depth_anything/
encryption_key: tlt_encode
dataset:
  dataset_name: MonoDataset
  min_depth: 0.001
  max_depth: 10
  train_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/train_files_with_gt.txt
    batch_size: 8
    workers: 8
    augmentation:
      crop_size: [518, 518]
      input_mean: [0.485, 0.456, 0.406]
      input_std: [0.229, 0.224, 0.225]
      min_scale: -0.2
      max_scale: 0.4
      do_flip: False
      yjitter_prob: 1.0
      color_aug_prob: 0.2
      color_aug_brightness: 0.4
      color_aug_contrast: 0.4
      color_aug_saturation: [0.0, 1.4]
      eraser_aug_prob: 0.5
      spatial_aug_prob: 1.0
      stretch_prob: 0.8
      h_flip_prob: 0.5
      v_flip_prob: 0.5
      hshift_prob: 0.5
  val_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
    workers: 4
    augmentation:
      crop_size: [518, 518]
  test_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
  infer_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 10
model:
  model_type: MetricDepthAnything
  encoder: vitl
  mono_backbone:
    pretrained_path: /models/nv_relative_depth_anything_v1.pth
    use_bn: False
    use_clstoken: False
train:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  num_epochs: 8
  seed: 1234
  checkpoint_interval: 1
  checkpoint_interval_unit: epoch
  validation_interval: 1
  resume_training_checkpoint_path: null
  pretrained_model_path: null
  clip_grad_norm: 0.1
  dataloader_visualize: True
  vis_step_interval: 100
  is_dry_run: False
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: False
  verbose: False
  log_every_n_steps: 100
  optim:
    optimizer: AdamW
    lr: 0.000005
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: LambdaLR
    lr_steps: [1000]
    lr_step_size: 1000
    lr_decay: 0.1
    min_lr: 1e-07
    warmup_steps: 20
  cudnn:
    benchmark: False
    deterministic: True
evaluate:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  checkpoint: /results/metric_depth_anything/train/dn_model_latest.pth
  batch_size: -1
  input_width: 736
  input_height: 320
inference:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  checkpoint: /results/metric_depth_anything/train/dn_model_latest.pth
  batch_size: -1
  conf_threshold: 0.5
  input_width: 736
  input_height: 320
  save_raw_pfm: False
export:
  results_dir: /results/metric_depth_anything/export
  gpu_id: 0
  checkpoint: /results/metric_depth_anything/train/dn_model_latest.pth
  onnx_file: /results/metric_depth_anything/export/dn_model_latest.onnx
  on_cpu: False
  input_channel: 3
  input_width: 924
  input_height: 518
  opset_version: 16
  batch_size: -1
  verbose: False
  format: onnx
  valid_iters: 22
gen_trt_engine:
  results_dir: /results/metric_depth_anything/trt
  gpu_id: 0
  onnx_file: /results/metric_depth_anything/export/dn_model_latest.onnx
  trt_engine: /results/metric_depth_anything/trt/dn_model.engine
  timing_cache: null
  batch_size: -1
  verbose: False
  tensorrt:
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
    data_type: FP32
Configuration for RelativeDepthAnything#
Here is an example specification file for training a RelativeDepthAnything model:
Retrieve the specifications:
TRAIN_SPECS=$(tao-client depth_net_mono get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
The default specification is returned in $TRAIN_SPECS; override values as needed.
results_dir: /results/relative_depth_anything/
encryption_key: tlt_encode
dataset:
  dataset_name: MonoDataset
  train_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/train_files_with_gt.txt
    batch_size: 4
    workers: 8
    augmentation:
      crop_size: [518, 518]
  val_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
  test_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
  infer_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 10
model:
  model_type: RelativeDepthAnything
  encoder: vitl
  mono_backbone:
    pretrained_path: /models/dinov2_vitl14_pretrain.pth
    use_bn: False
    use_clstoken: False
train:
  num_gpus: 1
  num_nodes: 1
  num_epochs: 10
  activation_checkpoint: False
  optim:
    lr: 0.000006
    lr_scheduler: LambdaLR
  log_every_n_steps: 500
  vis_step_interval: 500
  dataloader_visualize: True
evaluate:
  num_gpus: 1
  checkpoint: /models/nv_relative_depth_anything_v1.pth
inference:
  num_gpus: 1
  checkpoint: /models/nv_relative_depth_anything_v1.pth
  save_raw_pfm: False
export:
  gpu_id: 0
  checkpoint: /models/nv_relative_depth_anything_v1.pth
  onnx_file: /results/relative_depth_anything/export/nv_relative_depth_anything_v1.onnx
  input_channel: 3
  input_width: 924
  input_height: 518
  opset_version: 16
  on_cpu: False
Key Configuration Parameters#
The following sections provide detailed configuration tables for all parameters.
Dataset Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| dataset_name | categorical | Dataset name | StereoDataset | | | MonoDataset,StereoDataset | |
| | bool | Whether to normalize depth | FALSE | | | | |
| max_depth | float | Maximum depth in meters in MetricDepthAnythingV2 | 1.0 | | inf | | |
| min_depth | float | Minimum depth in meters in MetricDepthAnythingV2 | 0.0 | | inf | | |
| | int | Maximum allowed disparity for which losses are computed during training | 416 | 1 | 416 | | |
| | float | Baseline for stereo datasets | 0.193001 | 0.0 | inf | | |
| | float | Focal length along the x-axis | 1998.842 | 0.0 | inf | | |
| train_dataset | collection | Configurable parameters to construct the train dataset for a DepthNet experiment | | | | | FALSE |
| val_dataset | collection | Configurable parameters to construct the val dataset for a DepthNet experiment | | | | | FALSE |
| test_dataset | collection | Configurable parameters to construct the test dataset for a DepthNet experiment | | | | | FALSE |
| infer_dataset | collection | Configurable parameters to construct the infer dataset for a DepthNet experiment | | | | | FALSE |
Model Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| model_type | categorical | Network name | MetricDepthAnythingV2 | | | FoundationStereo,MetricDepthAnything,RelativeDepthAnything | |
| mono_backbone | collection | Network-defined paths for the Monocular DepthNet backbone | | | | | FALSE |
| | collection | Network-defined paths for EdgeNeXt and DepthAnythingV2 | | | | | FALSE |
| | list | Hidden dimensions | [128, 128, 128] | | | | FALSE |
| | int | Width of the correlation pyramid | 4 | 1 | | | TRUE |
| | int | Correlation volume groups | 8 | 1 | | | TRUE |
| | int | Train iterations | 22 | 1 | | | TRUE |
| | int | Validation iterations | 22 | 1 | | | |
| | int | Volume dimension | 32 | 1 | | | TRUE |
| | int | Reduce memory usage | 0 | 0 | 4 | | |
| | bool | Whether to use mixed precision training | FALSE | | | | |
| | int | Number of hidden GRU levels | 3 | 1 | 3 | | |
| | int | Number of levels in the correlation pyramid | 2 | 1 | 2 | | |
| | int | Resolution of the disparity field (1/2^K) | 2 | 1 | 2 | | |
| encoder | categorical | DepthAnythingV2 encoder options | vitl | | | vits,vitl | |
| | int | Maximum disparity of the model used when training a stereo model | 416 | | | | |
Monocular Backbone Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| pretrained_path | string | Path to load DepthAnythingV2 as an encoder for Monocular DepthNet | | | | | |
| use_bn | bool | Whether to use batch normalization in Monocular DepthNet | FALSE | | | | |
| use_clstoken | bool | Whether to use the class token | FALSE | | | | |
Training Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the train job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must be equal to the number of GPUs in train.num_gpus | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the training on. If > 1, multi-node is enabled | 1 | 1 | | | |
| seed | int | Seed for the initializer in PyTorch. If < 0, the fixed seed is disabled | 1234 | -1 | inf | | |
| cudnn | collection | | | | | | FALSE |
| num_epochs | int | Number of epochs to run the training | 10 | 1 | inf | | |
| checkpoint_interval | int | Interval (in epochs) at which a checkpoint is saved; helps resume training | 1 | 1 | | | |
| checkpoint_interval_unit | categorical | Unit of the checkpoint interval | epoch | | | epoch,step | |
| validation_interval | int | Interval (in epochs) at which evaluation is triggered on the validation dataset | 1 | 1 | | | |
| resume_training_checkpoint_path | string | Path to the checkpoint from which to resume training | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| | int | Number of steps between checkpoint saves | | | | | |
| pretrained_model_path | string | Path to a pretrained DepthNet model from which to initialize the current training | | | | | |
| clip_grad_norm | float | Amount to clip the gradient by L2 norm. A value of 0.0 specifies no clipping | 0.1 | | | | |
| dataloader_visualize | bool | Whether to visualize the dataloader | FALSE | | | | TRUE |
| vis_step_interval | int | Visualization interval in steps | 10 | | | | TRUE |
| is_dry_run | bool | Whether to run the trainer in dry-run mode. This is a good way to validate the specification file and run a sanity check on the trainer without actually initializing and running it | FALSE | | | | |
| optim | collection | Hyperparameters to configure the optimizer | | | | | FALSE |
| precision | categorical | Precision on which to run the training | fp32 | | | bf16,fp32,fp16 | |
| distributed_strategy | categorical | Multi-GPU training strategy. DDP (Distributed Data Parallel) and Fully Sharded DDP are supported | ddp | | | ddp,fsdp | |
| activation_checkpoint | bool | Whether training recomputes activations in the backward pass to save GPU memory (TRUE) or stores them (FALSE) | TRUE | | | | |
| verbose | bool | Whether to display verbose logs on the console | FALSE | | | | |
| | bool | Whether to use tiled inference, particularly for transformers that expect a fixed sequence size | FALSE | | | | |
| | string | Tiled inference weight type | gaussian | | | | |
| | list | Minimum overlap for a tile | [16, 16] | | | | FALSE |
| log_every_n_steps | int | Interval, in steps, for logging training results and running validation within one epoch | 500 | | | | |
Optimizer Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| optimizer | categorical | Type of optimizer used to train the network | AdamW | | | AdamW,SGD | |
| | categorical | Metric value to be monitored | val_loss | | | val_loss,train_loss | |
| lr | float | Initial learning rate for training the model, excluding the backbone | 0.0001 | | | | TRUE |
| momentum | float | Momentum for the AdamW optimizer | 0.9 | | | | TRUE |
| weight_decay | float | Weight decay coefficient | 0.0001 | | | | TRUE |
| lr_scheduler | categorical | Learning rate scheduler | MultiStepLR | | | MultiStep,StepLR,CustomMultiStepLRScheduler,LambdaLR,PolynomialLR,OneCycleLR,CosineAnnealingLR | |
| lr_steps | list | Steps at which the learning rate must be decreased. Applicable only with the MultiStep LR scheduler | [1000] | | | | FALSE |
| lr_step_size | int | Number of steps after which to decrease the learning rate in StepLR | 1000 | | | | TRUE |
| lr_decay | float | Decay factor for the learning rate scheduler | 0.1 | | | | TRUE |
| min_lr | float | Minimum learning rate for the learning rate scheduler | 1e-07 | | | | TRUE |
| warmup_steps | int | Number of steps of linear learning rate warm-up before engaging the learning rate scheduler | 20 | 0 | inf | | |
Evaluation Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the evaluation job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the evaluation on. The length of this list must be equal to the number of GPUs in evaluate.num_gpus | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the evaluation on. If > 1, multi-node is enabled | 1 | 1 | | | |
| checkpoint | string | Path to the checkpoint used for evaluation | ??? | | | | |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This only works with TAO Deploy | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| batch_size | int | Batch size of the input tensor. This is important if batch_size > 1 for a large dataset | -1 | -1 | | | |
| input_width | int | Width of the input image tensor | 736 | 1 | | | |
| input_height | int | Height of the input image tensor | 320 | 1 | | | |
Inference Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the inference job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the inference on. The length of this list must be equal to the number of GPUs in inference.num_gpus | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the inference on. If > 1, multi-node is enabled | 1 | 1 | | | |
| checkpoint | string | Path to the checkpoint used for inference | ??? | | | | |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This only works with TAO Deploy | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| batch_size | int | Batch size of the input tensor. This is important if batch_size > 1 for a large dataset | -1 | -1 | | | |
| conf_threshold | float | Confidence threshold used when filtering the final list of boxes | 0.5 | | | | |
| input_width | int | Width of the input image tensor | | 1 | | | |
| input_height | int | Height of the input image tensor | | 1 | | | |
| save_raw_pfm | bool | Whether to save the raw PFM output during inference | FALSE | | | | |
Export Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| gpu_id | int | Index of the GPU to build the TensorRT engine | 0 | | | | |
| checkpoint | string | Path to the checkpoint file to run export | ??? | | | | |
| onnx_file | string | Path to the ONNX model file | ??? | | | | |
| on_cpu | bool | Whether to export a CPU-compatible model | FALSE | | | | |
| input_channel | ordered_int | Number of channels in the input tensor | 3 | 1 | | 1,3 | |
| input_width | int | Width of the input image tensor | 960 | 32 | | | |
| input_height | int | Height of the input image tensor | 544 | 32 | | | |
| opset_version | int | Operator set version of the ONNX model used to generate the TensorRT engine | 17 | 1 | | | |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies a dynamic batch size | -1 | -1 | | | |
| verbose | bool | Whether to enable verbose TensorRT logging | FALSE | | | | |
| format | categorical | File format to export to | onnx | | | onnx,xdl | |
| valid_iters | int | Number of GRU iterations to export the model with | 22 | 1 | | | |
TensorRT Engine Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| gpu_id | int | Index of the GPU to build the TensorRT engine | 0 | 0 | | | |
| onnx_file | string | Path to the ONNX model file | ??? | | | | |
| trt_engine | string | Path where the generated TensorRT engine is stored. This only works with TAO Deploy | ??? | | | | |
| timing_cache | string | Path to a TensorRT timing cache that speeds up engine generation. The cache is created, read, and updated as needed | | | | | |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies a dynamic batch size | -1 | -1 | | | |
| verbose | bool | Whether to enable verbose TensorRT logging | FALSE | | | | |
| tensorrt | collection | Hyperparameters to configure the TensorRT engine builder | | | | | FALSE |
Augmentation Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| input_mean | list | Input mean for RGB frames | [0.485, 0.456, 0.406] | | | | FALSE |
| input_std | list | Input standard deviation per pixel for RGB frames | [0.229, 0.224, 0.225] | | | | FALSE |
| crop_size | list | Crop size for input RGB images [height, width] | [518, 518] | | | | FALSE |
| min_scale | float | Minimum scale in data augmentation | -0.2 | 0.2 | 1 | | |
| max_scale | float | Maximum scale in data augmentation | 0.4 | -0.2 | 1 | | |
| do_flip | bool | Whether to perform flips in data augmentation | FALSE | | | | |
| yjitter_prob | float | Probability for y jitter | 1.0 | 0.0 | 1.0 | | TRUE |
| | list | Gamma range in data augmentation | [1, 1, 1, 1] | | | | FALSE |
| color_aug_prob | float | Probability for asymmetric color augmentation | 0.2 | 0.0 | 1.0 | | TRUE |
| color_aug_brightness | float | Color jitter brightness | 0.4 | 0.0 | 1.0 | | |
| color_aug_contrast | float | Color jitter contrast | 0.4 | 0.0 | 1.0 | | |
| color_aug_saturation | list | Color jitter saturation | [0.0, 1.4] | | | | FALSE |
| | list | Hue range in data augmentation | [-0.027777777777777776, 0.027777777777777776] | | | | FALSE |
| eraser_aug_prob | float | Probability for eraser augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| spatial_aug_prob | float | Probability for spatial augmentation | 1.0 | 0.0 | 1.0 | | TRUE |
| stretch_prob | float | Probability for stretch augmentation | 0.8 | 0.0 | 1.0 | | TRUE |
| | float | Maximum stretch augmentation | 0.2 | 0.0 | 1.0 | | |
| h_flip_prob | float | Probability for horizontal flip augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| v_flip_prob | float | Probability for vertical flip augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| hshift_prob | float | Probability for horizontal shift augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| | float | Probability for minimum crop valid disparity ratio | 0.0 | 0.0 | 1.0 | | TRUE |
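For reference, the input_mean and input_std values above are the standard ImageNet statistics used by DINOv2-based encoders. The sketch below shows how an image could be normalized with the same statistics outside the toolkit (for example, when feeding an exported model); the preprocess helper is illustrative only, and the exact resize and crop behavior of the TAO dataloader may differ.
import numpy as np
from PIL import Image

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_path, size=(518, 518)):
    """Resize an RGB image and normalize it with the mean/std from the augmentation config."""
    img = Image.open(image_path).convert("RGB").resize((size[1], size[0]))  # PIL expects (W, H)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    arr = (arr - IMAGENET_MEAN) / IMAGENET_STD
    return arr.transpose(2, 0, 1)[np.newaxis]  # NCHW layout, batch of 1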
Training the Model#
To train a monocular depth estimation model:
# Get the training spec
TRAIN_SPECS=$(tao-client depth_net_mono get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
# Modify TRAIN_SPECS as needed, then run training
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action train --id $EXPERIMENT_ID --specs "$TRAIN_SPECS")
tao model depth_net train \
-e /path/to/experiment_spec.yaml \
-k $KEY \
results_dir=/path/to/results
Required arguments:
- -e: Path to the experiment specification file
- -k: Encryption key for model checkpoints
Optional arguments:
- results_dir: Overrides the results directory from the specification file
- train.num_gpus: Overrides the number of GPUs
- train.num_epochs: Overrides the number of training epochs
- dataset.train_dataset.batch_size: Overrides the batch size
Training Output#
The training process generates the following outputs in the results directory:
- train/dn_model_latest.pth: Latest model checkpoint
- train/dn_model_epoch_XXX.pth: Periodic checkpoints
- train/events.out.tfevents.*: TensorBoard log files
- train/status.json: Training status and metrics
You can monitor training progress using TensorBoard:
tensorboard --logdir=/path/to/results/train
Evaluating the Model#
To evaluate a trained monocular depth estimation model:
EVAL_SPECS=$(tao-client depth_net_mono get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action evaluate --id $EXPERIMENT_ID --specs "$EVAL_SPECS")
tao model depth_net evaluate \
-e /path/to/experiment_spec.yaml \
-k $KEY \
evaluate.checkpoint=/path/to/checkpoint.pth
Required arguments:
- -e: Path to the experiment specification file
- -k: Encryption key
Optional arguments:
- evaluate.checkpoint: Path to the model checkpoint to evaluate
- evaluate.batch_size: Batch size for evaluation
- dataset.test_dataset.data_sources: Overrides the test dataset
Evaluation Metrics#
For monocular depth estimation, TAO computes the following metrics:
- Absolute relative error (abs_rel):
Mean of |predicted - ground_truth| / ground_truth. Lower is better.
- Delta accuracy (d1):
Percentage of pixels where max(predicted/ground_truth, ground_truth/predicted) < 1.25. Higher is better.
These metrics are saved to a JSON file in the results directory and displayed in the console output.
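For reference, both metrics can be computed from predicted and ground-truth depth arrays in a few lines of NumPy. The depth_metrics helper below is a sketch of the standard definitions, not the toolkit's exact implementation; the validity mask based on min_depth/max_depth is an assumption.
import numpy as np

def depth_metrics(pred, gt, min_depth=0.001, max_depth=10.0):
    """Compute abs_rel and delta-1 accuracy over valid ground-truth pixels."""
    valid = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)   # lower is better
    ratio = np.maximum(pred / gt, gt / pred)
    d1 = np.mean(ratio < 1.25)                  # higher is better
    return {"abs_rel": float(abs_rel), "d1": float(d1)}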
Running Inference#
To run inference on images using a trained model:
INFER_SPECS=$(tao-client depth_net_mono get-spec --action inference --job_type experiment --id $EXPERIMENT_ID)
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action inference --id $EXPERIMENT_ID --specs "$INFER_SPECS")
tao model depth_net inference \
-e /path/to/experiment_spec.yaml \
-k $KEY \
inference.checkpoint=/path/to/checkpoint.pth
Required arguments:
- -e: Path to the experiment specification file
- -k: Encryption key
Optional arguments:
- inference.checkpoint: Path to the model checkpoint
- inference.save_raw_pfm: Saves depth maps in PFM format (default: False)
- inference.batch_size: Batch size for inference
- dataset.infer_dataset.data_sources: Overrides the inference dataset
Inference Output#
The inference process generates:
- Depth map visualizations (colored depth images) in PNG format
- Raw depth values in PFM format (if save_raw_pfm is True)
- Inference results saved in results_dir/inference/
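PFM files consist of a short text header ("Pf" for single-channel data, the image dimensions, and a scale whose sign encodes endianness) followed by raw 32-bit floats stored bottom-to-top. If you need to read the raw output outside TAO, a minimal reader sketch (assuming single-channel "Pf" files) looks like this:
import numpy as np

def read_pfm(path):
    """Read a single-channel PFM depth map into a NumPy array."""
    with open(path, "rb") as f:
        assert f.readline().strip() == b"Pf", "expected a grayscale PFM file"
        width, height = map(int, f.readline().split())
        scale = float(f.readline())
        dtype = "<f4" if scale < 0 else ">f4"   # negative scale means little-endian
        data = np.fromfile(f, dtype, width * height)
    # PFM stores rows bottom-to-top, so flip vertically
    return np.flipud(data.reshape(height, width))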
Exporting the Model#
To export a trained model to ONNX format:
EXPORT_SPECS=$(tao-client depth_net_mono get-spec --action export --job_type experiment --id $EXPERIMENT_ID)
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action export --id $EXPERIMENT_ID --specs "$EXPORT_SPECS")
tao model depth_net export \
-e /path/to/experiment_spec.yaml \
-k $KEY \
export.checkpoint=/path/to/checkpoint.pth \
export.onnx_file=/path/to/output.onnx
Required arguments:
- -e: Path to the experiment specification file
- -k: Encryption key
- export.checkpoint: Path to the trained model checkpoint
- export.onnx_file: Output path for the ONNX model
Optional arguments:
- export.input_channel: Number of input channels (default: 3)
- export.input_width: Input image width (default: 924)
- export.input_height: Input image height (default: 518)
- export.opset_version: ONNX opset version (default: 16)
- export.batch_size: Batch size, -1 for dynamic (default: -1)
- export.on_cpu: Export a CPU-compatible model (default: False)
- export.format: Export format, onnx or xdl (default: onnx)
- export.valid_iters: Number of iterations for export (default: 22)
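After export, you can sanity-check the ONNX model with onnxruntime before building a TensorRT engine. The snippet below is a sketch: the model path follows the example spec above, but the input name, layout, and expected resolution depend on your export settings, so query them from the session rather than assuming them.
import numpy as np
import onnxruntime as ort

onnx_path = "/results/metric_depth_anything/export/dn_model_latest.onnx"
session = ort.InferenceSession(onnx_path, providers=ort.get_available_providers())

inp = session.get_inputs()[0]                 # query name and shape instead of hard-coding
print("input:", inp.name, inp.shape)

# Dummy input matching the example export.input_height/input_width (518 x 924); adjust to your export.
dummy = np.random.rand(1, 3, 518, 924).astype(np.float32)
outputs = session.run(None, {inp.name: dummy})
print("output shape:", outputs[0].shape)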
Generating TensorRT Engine#
To generate an NVIDIA® TensorRT™ engine from the exported ONNX model for optimized inference:
GEN_TRT_SPECS=$(tao-client depth_net_mono get-spec --action gen_trt_engine --job_type experiment --id $EXPERIMENT_ID)
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action gen_trt_engine --id $EXPERIMENT_ID --specs "$GEN_TRT_SPECS")
tao deploy depth_net gen_trt_engine \
-e /path/to/experiment_spec.yaml \
gen_trt_engine.onnx_file=/path/to/model.onnx \
gen_trt_engine.trt_engine=/path/to/output.engine
Required arguments:
- -e: Path to the experiment specification file
- gen_trt_engine.onnx_file: Path to the ONNX model
- gen_trt_engine.trt_engine: Output path for the TensorRT engine
Optional arguments:
- gen_trt_engine.gpu_id: GPU index for engine generation (default: 0)
- gen_trt_engine.batch_size: Batch size, -1 for dynamic (default: -1)
- gen_trt_engine.verbose: Enables verbose logging (default: False)
- gen_trt_engine.timing_cache: Path to a timing cache file
- gen_trt_engine.tensorrt.workspace_size: TensorRT workspace size in MB (default: 1024)
- gen_trt_engine.tensorrt.data_type: Precision, FP32 or FP16 (default: FP32)
- gen_trt_engine.tensorrt.min_batch_size: Minimum batch size (default: 1)
- gen_trt_engine.tensorrt.opt_batch_size: Optimal batch size (default: 1)
- gen_trt_engine.tensorrt.max_batch_size: Maximum batch size (default: 1)
TensorRT Engine Benefits#
Performance: 2-5x faster inference compared to PyTorch
Memory efficiency: Reduced memory footprint
Optimization: Layer fusion and kernel auto-tuning
Deployment: Production-ready inference engine
Model Configuration Reference#
For a complete reference to all configuration parameters, refer to the configuration tables in the TAO Toolkit documentation or the experiment specification files provided with the toolkit.
Best Practices#
Training Recommendations#
- Start with RelativeDepthAnything: Train a relative depth model first, then fine-tune it to metric depth
- Use pretrained weights: Initialize from DINOv2 or existing depth models for better convergence
- Encoder selection:
  - Use vitl for most applications (best balance)
  - Use vits for edge deployment with limited resources
  - Use vitg when maximum accuracy is critical
- Batch size: Start with a batch size of 4-8 for the vitl encoder on a single GPU
- Learning rate: Use small learning rates (1e-5 to 1e-6) when fine-tuning from pretrained models
- Activation checkpointing: Enable for large models (vitl, vitg) to reduce memory usage
- Augmentation: Use moderate augmentation for indoor scenes and stronger augmentation for outdoor datasets
Data Preparation#
- Dataset quality: Ensure depth ground truth is accurate and aligned with the RGB images
- Depth range: Set appropriate min_depth and max_depth for your use case
- Mixed datasets: Combine multiple datasets for better generalization
- Train/val split: Use 80-90% of the data for training and 10-20% for validation
Performance Optimization#
- Multi-GPU training: Use the ddp strategy for 2-8 GPUs and fsdp for larger clusters
- Mixed precision: Use fp16 for 2x faster training on modern GPUs
- Data loading: Increase workers (4-8) to prevent data loading bottlenecks
- TensorRT deployment: Always use TensorRT engines for production inference
Troubleshooting#
Common Issues#
Out of memory (OOM):
- Reduce the batch size
- Enable activation_checkpoint: True
Poor depth quality:
- Check data alignment between RGB and depth
- Verify that the depth ground truth is in the correct format (PFM or PNG)
- Ensure min_depth and max_depth match your data range
- Increase the number of training epochs
- Try different augmentation settings
Training instability:
- Reduce the learning rate
- Enable gradient clipping (clip_grad_norm: 0.1)
- Check for NaN values in the depth ground truth (a quick check is sketched below)
- Use cudnn.deterministic: True for reproducible training
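As a quick screen for the NaN issue above, you can scan ground-truth depth arrays before training. The check_depth helper below is an illustrative sketch; load each depth map with your own reader (for example, the PFM sketch in the inference section) and adjust min_depth/max_depth to match your dataset.
import numpy as np

def check_depth(depth, min_depth=0.001, max_depth=10.0):
    """Report NaN/Inf pixels and the fraction of values outside the expected metric range."""
    depth = np.asarray(depth, dtype=np.float32)
    n_bad = int((~np.isfinite(depth)).sum())
    frac_out_of_range = float(((depth < min_depth) | (depth > max_depth)).mean())
    return {"nan_or_inf_pixels": n_bad, "out_of_range_fraction": frac_out_of_range}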
Additional Resources#
TAO Toolkit documentation: https://docs.nvidia.com/tao/
Sample notebooks: NVIDIA/tao_tutorials
NGC pretrained models: https://catalog.ngc.nvidia.com/
For more information about stereo depth estimation, go to Stereo Depth Estimation.