Monocular Depth Estimation#

Monocular depth estimation is the task of predicting depth information from a single RGB image. TAO Toolkit provides advanced monocular depth estimation capabilities through the DepthNet model, supporting both relative and metric depth prediction using state-of-the-art transformer-based architectures.

The monocular depth estimation models in TAO support the following tasks:

train
evaluate
inference
export
gen_trt_engine

These tasks can be invoked from the TAO Launcher using the following convention on the command-line:

SPECS=$(tao-client depth_net_mono get-spec --action <sub_task> --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_mono experiment-run-action --action <sub_task> --id $EXPERIMENT_ID --specs "$SPECS")

Required Arguments

--id: The unique identifier of the experiment from which to train the model

Supported Model Architectures#

TAO Toolkit supports the following monocular depth estimation model types:

MetricDepthAnything: A metric monocular depth estimation model that predicts absolute depth values in meters. This model is suitable for applications requiring precise depth measurements, such as robotics, autonomous navigation, and AR/VR. It uses a Vision Transformer (ViT) backbone based on DINOv2 and produces metric depth estimates that can be directly used for distance calculations.
RelativeDepthAnything: A relative monocular depth estimation model that predicts depth relationships between objects in a scene. This model focuses on understanding the relative ordering of depths rather than absolute distances. It’s useful for applications where understanding spatial relationships is more important than exact measurements, such as image segmentation, scene understanding, and visual effects. This model can be fine-tuned to produce MetricDepthAnything models.

Both models support multiple Vision Transformer encoder sizes:

vits (small): Faster inference, lower memory footprint
vitl (large): Higher accuracy, recommended for most use cases
vitg (giant): Best accuracy, requires more computational resources

Data Input for Monocular Depth Estimation#

Dataset Preparation#

Monocular depth estimation requires paired RGB images and depth ground truth. The dataset should be organized as follows:

Image Data: RGB images in standard formats (PNG, JPEG, etc.)
Depth Ground Truth: Depth maps in PFM (Portable Float Map) or PNG format
Data Split Files: Text files listing the paths to image and depth pairs

Data Split File Format

Each line in the data split file should contain paths to the RGB image and corresponding depth map, separated by a space:

/path/to/rgb/image_001.png /path/to/depth/image_001.pfm
/path/to/rgb/image_002.png /path/to/depth/image_002.pfm
...

Supported Datasets#

TAO Toolkit supports the following monocular depth datasets:

NYUDV2: Indoor depth dataset with 1449 RGB-D images
NYUDV2Relative: NYUDV2 dataset configured for relative depth training
ThreeDVLM: 3D Vision Language Model dataset
FSD: Foundation Stereo Dataset dataset
NvCLIP: NVIDIA CLIP-based depth dataset
IsaacRealDataset: NVIDIA Isaac real-world stereo data
Crestereo: CREStereo dataset
Middlebury: Middlebury stereo dataset
RelativeMonoDataset: Generic relative monocular depth dataset format
MetricMonoDataset: Generic metric monocular depth dataset format

For custom datasets, you can use the generic RelativeMonoDataset or MetricMonoDataset formats by creating appropriate data split files.

Creating an Experiment Specification File#

The experiment specification file is a YAML configuration that defines all parameters for training, evaluation, and inference. Below are example configurations for both model types.

Configuration for MetricDepthAnything#

Here is an example specification file for training a MetricDepthAnything model:

Retrieve the specifications:

TRAIN_SPECS=$(tao-client depth_net_mono get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

Get specifications from $TRAIN_SPECS. You can override values as needed.

results_dir: /results/metric_depth_anything/
encryption_key: tlt_encode

dataset:
  dataset_name: MonoDataset
  min_depth: 0.001
  max_depth: 10
  train_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/train_files_with_gt.txt
    batch_size: 8
    workers: 8
    augmentation:
      crop_size: [518, 518]
      input_mean: [0.485, 0.456, 0.406]
      input_std: [0.229, 0.224, 0.225]
      min_scale: -0.2
      max_scale: 0.4
      do_flip: False
      yjitter_prob: 1.0
      color_aug_prob: 0.2
      color_aug_brightness: 0.4
      color_aug_contrast: 0.4
      color_aug_saturation: [0.0, 1.4]
      eraser_aug_prob: 0.5
      spatial_aug_prob: 1.0
      stretch_prob: 0.8
      h_flip_prob: 0.5
      v_flip_prob: 0.5
      hshift_prob: 0.5
  val_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
    workers: 4
    augmentation:
      crop_size: [518, 518]
  test_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
  infer_dataset:
    data_sources:
      - dataset_name: NYUDV2
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 10

model:
  model_type: MetricDepthAnything
  encoder: vitl
  mono_backbone:
    pretrained_path: /models/nv_relative_depth_anything_v1.pth
    use_bn: False
    use_clstoken: False

train:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  num_epochs: 8
  seed: 1234
  checkpoint_interval: 1
  checkpoint_interval_unit: epoch
  validation_interval: 1
  resume_training_checkpoint_path: null
  pretrained_model_path: null
  clip_grad_norm: 0.1
  dataloader_visualize: True
  vis_step_interval: 100
  is_dry_run: False
  precision: fp32
  distributed_strategy: ddp
  activation_checkpoint: False
  verbose: False
  log_every_n_steps: 100
  optim:
    optimizer: AdamW
    lr: 0.000005
    momentum: 0.9
    weight_decay: 0.0001
    lr_scheduler: LambdaLR
    lr_steps: [1000]
    lr_step_size: 1000
    lr_decay: 0.1
    min_lr: 1e-07
    warmup_steps: 20
  cudnn:
    benchmark: False
    deterministic: True

evaluate:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  checkpoint: /results/metric_depth_anything/train/dn_model_latest.pth
  batch_size: -1
  input_width: 736
  input_height: 320

inference:
  num_gpus: 1
  gpu_ids: [0]
  num_nodes: 1
  checkpoint: /results/metric_depth_anything/train/dn_model_latest.pth
  batch_size: -1
  conf_threshold: 0.5
  input_width: 736
  input_height: 320
  save_raw_pfm: False

export:
  results_dir: /results/metric_depth_anything/export
  gpu_id: 0
  checkpoint: /results/metric_depth_anything/train/dn_model_latest.pth
  onnx_file: /results/metric_depth_anything/export/dn_model_latest.onnx
  on_cpu: False
  input_channel: 3
  input_width: 924
  input_height: 518
  opset_version: 16
  batch_size: -1
  verbose: False
  format: onnx
  valid_iters: 22

gen_trt_engine:
  results_dir: /results/metric_depth_anything/trt
  gpu_id: 0
  onnx_file: /results/metric_depth_anything/export/dn_model_latest.onnx
  trt_engine: /results/metric_depth_anything/trt/dn_model.engine
  timing_cache: null
  batch_size: -1
  verbose: False
  tensorrt:
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
    data_type: FP32

Configuration for RelativeDepthAnything#

Here is an example specification file for training a RelativeDepthAnything model:

Retrieve the specifications:

TRAIN_SPECS=$(tao-client depth_net_mono get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

Get specifications from $TRAIN_SPECS. You can override values as needed.

results_dir: /results/relative_depth_anything/
encryption_key: tlt_encode

dataset:
  dataset_name: MonoDataset
  train_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/train_files_with_gt.txt
    batch_size: 4
    workers: 8
    augmentation:
      crop_size: [518, 518]
  val_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
  test_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 1
  infer_dataset:
    data_sources:
      - dataset_name: NYUDV2Relative
        data_file: /data/splits/nyu_depth_v2_splits/test_files_with_gt.txt
    batch_size: 10

model:
  model_type: RelativeDepthAnything
  encoder: vitl
  mono_backbone:
    pretrained_path: /models/dinov2_vitl14_pretrain.pth
    use_bn: False
    use_clstoken: False

train:
  num_gpus: 1
  num_nodes: 1
  num_epochs: 10
  activation_checkpoint: False
  optim:
    lr: 0.000006
    lr_scheduler: LambdaLR
  log_every_n_steps: 500
  vis_step_interval: 500
  dataloader_visualize: True

evaluate:
  num_gpus: 1
  checkpoint: /models/nv_relative_depth_anything_v1.pth

inference:
  num_gpus: 1
  checkpoint: /models/nv_relative_depth_anything_v1.pth
  save_raw_pfm: False

export:
  gpu_id: 0
  checkpoint: /models/nv_relative_depth_anything_v1.pth
  onnx_file: /results/relative_depth_anything/export/nv_relative_depth_anything_v1.onnx
  input_channel: 3
  input_width: 924
  input_height: 518
  opset_version: 16
  on_cpu: False

Key Configuration Parameters#

The following sections provide detailed configuration tables for all parameters.

Dataset Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`dataset_name`	categorical	Dataset name	StereoDataset			MonoDataset,StereoDataset
`normalize_depth`	bool	Whether to normalize depth	FALSE
`max_depth`	float	Maximum depth in meters in MetricDepthAnythingV2		1.0	inf
`min_depth`	float	Minimum depth in meters in MetricDepthAnythingV2		0.0	inf
`max_disparity`	int	Maximum allowed disparity for which we compute losses during training	416	1	416
`baseline`	float	Baseline for stereo datasets	0.193001	0.0	inf
`focal_x`	float	Focal length along x-axis	1998.842	0.0	inf
`train_dataset`	collection	Configurable parameters to construct the train dataset for a DepthNet experiment					FALSE
`val_dataset`	collection	Configurable parameters to construct the val dataset for a DepthNet experiment					FALSE
`test_dataset`	collection	Configurable parameters to construct the test dataset for a DepthNet experiment					FALSE
`infer_dataset`	collection	Configurable parameters to construct the infer dataset for a DepthNet experiment					FALSE

Model Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`model_type`	categorical	Network name	MetricDepthAnythingV2			FoundationStereo,MetricDepthAnything,RelativeDepthAnything
`mono_backbone`	collection	Network defined paths for Monocular DepthNet Backbone					FALSE
`stereo_backbone`	collection	Network defined paths for Edgenext and Depthanythingv2					FALSE
`hidden_dims`	list	Hidden dimensions	[128, 128, 128]				FALSE
`corr_radius`	int	Width of the correlation pyramid	4	1			TRUE
`cv_group`	int	cv group	8	1			TRUE
`train_iters`	int	Train iteration	22	1			TRUE
`valid_iters`	int	Validation iteration	22	1
`volume_dim`	int	Volume dimension	32	1			TRUE
`low_memory`	int	reduce memory usage	0	0	4
`mixed_precision`	bool	Whether to use mixed precision training	FALSE
`n_gru_layers`	int	Number of hidden GRU levels	3	1	3
`corr_levels`	int	Number of levels in the correlation pyramid	2	1	2
`n_downsample`	int	Resolution of the disparity field (1/2^K)	2	1	2
`encoder`	categorical	DepthAnythingV2 Encoder options	vitl			vits,vitl
`max_disparity`	int	Maximum disparity of the model used in the training of a stereo model	416

Monocular Backbone Configuration#

Field	value_type	description	default_value
`pretrained_path`	string	Path to load DepthAnythingv2 as an encoder for Monocular DepthNet
`use_bn`	bool	Whether to use batch normalization in Monocular DepthNet	FALSE
`use_clstoken`	bool	Whether to use class token	FALSE

Training Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`num_gpus`	int	Number of GPUs to run the train job.	1	1
`gpu_ids`	list	List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus.	[0]				FALSE
`num_nodes`	int	Number of nodes to run the training on. If > 1, then multi-node is enabled.	1	1
`seed`	int	Seed for the initializer in PyTorch. If < 0, disable fixed seed.	1234	-1	inf
`cudnn`	collection						FALSE
`num_epochs`	int	Number of epochs to run the training.	10	1	inf
`checkpoint_interval`	int	Interval (in epochs) at which a checkpoint is to be saved; helps resume training.	1	1
`checkpoint_interval_unit`	categorical	Unit of the checkpoint interval.	epoch			epoch,step
`validation_interval`	int	Interval (in epochs) at which a evaluation will be triggered on the validation dataset.	1	1
`resume_training_checkpoint_path`	string	Path to the checkpoint from which to resume training.
`results_dir`	string	Path to where all the assets generated from a task are stored.
`checkpoint_interval_steps`	int	Number of steps to save the checkpoint.
`pretrained_model_path`	string	Path to a pretrained DepthNet model from which to initialize the current training.
`clip_grad_norm`	float	Amount to clip the gradient by L2 Norm. A value of 0.0 specifies no clipping.	0.1
`dataloader_visualize`	bool	Whether to visualize the dataloader.	FALSE				TRUE
`vis_step_interval`	int	Visualization interval in step.	10				TRUE
`is_dry_run`	bool	Whether to run the trainer in Dry Run mode. This serves as a good means to validate the specification file and run a sanity check on the trainer without actually initializing and running the trainer.	FALSE
`optim`	collection	Hyperparameters to configure the optimizer.					FALSE
`precision`	categorical	Precision on which to run the training.	fp32			bf16,fp32,fp16
`distributed_strategy`	categorical	Multi-GPU training strategy. DDP (Distributed Data Parallel) and Fully Sharded DDP are supported.	ddp			ddp,fsdp
`activation_checkpoint`	bool	Whether train is to recompute in backward pass to save GPU memory (TRUE) or store activations (FALSE).	TRUE
`verbose`	bool	Whether to display verbose logs to console.	FALSE
`inference_tile`	bool	Whether to use tiled inference, particularly for transformers which expect fixed size of sequences.	FALSE
`tile_wtype`	string	Use tiled inference weight type.	gaussian
`tile_min_overlap`	list	Minimum overlap for tile.	[16, 16]				FALSE
`log_every_n_steps`	int	Interval steps of logging training results and running validation numbers within one epoch.	500

Optimizer Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`optimizer`	categorical	Type of optimizer used to train the network	AdamW			AdamW,SGD
`monitor_name`	categorical	Metric value to be monitored for the `AutoReduce` Scheduler	val_loss			val_loss,train_loss
`lr`	float	Initial learning rate for training the model, excluding the backbone	0.0001				TRUE
`momentum`	float	Momentum for the AdamW optimizer	0.9				TRUE
`weight_decay`	float	Weight decay coefficient	0.0001				TRUE
`lr_scheduler`	categorical	Learning scheduler: MultiStepLR : Decrease the lr by lr_decay from lr_steps StepLR : Decrease the lr by lr_decay at every lr_step_size	MultiStepLR			MultiStep,StepLR,CustomMultiStepLRScheduler,LambdaLR,PolynomialLR,OneCycleLR,CosineAnnealingLR
`lr_steps`	list	Steps at which the learning rate must be decreased This is applicable only with the MultiStep LR	[1000]				FALSE
`lr_step_size`	int	Number of steps to decrease the learning rate in the StepLR	1000				TRUE
`lr_decay`	float	Decreasing factor for the learning rate scheduler	0.1				TRUE
`min_lr`	float	Minimum learning rate value for the learning rate scheduler	1e-07				TRUE
`warmup_steps`	int	Number of steps to perform linear learning rate” warm-up before engaging a learning rate scheduler	20	0	inf

Evaluation Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`num_gpus`	int	Number of GPUs to run the evaluation job.	1	1
`gpu_ids`	list	List of GPU IDs to run the evaluation on. The length of this list must be equal to the number of `gpus in evaluate.num_gpus`.	[0]		FALSE
`num_nodes`	int	Number of nodes to run the evaluation on. If > 1, then multi-node is enabled.	1	1
`checkpoint`	string	Path to the checkpoint used for evaluation.	???
`trt_engine`	string	Path to the TensorRT engine to be used for evaluation. This only works with `tao-deploy`.
`results_dir`	string	Path to where all the assets generated from a task are stored.
`batch_size`	int	Batch size of the input Tensor. This is important if `batch_size` > 1 for large dataset.	-1	-1
`input_width`	int	Width of the input image tensor.	736	1
`input_height`	int	Height of the input image tensor.	320	1

Inference Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`num_gpus`	int	Number of GPUs to run the inference job.	1	1
`gpu_ids`	list	List of GPU IDs to run the inference on. The length of this list must be equal to the number of gpus in `inference.num_gpus`.	[0]		FALSE
`num_nodes`	int	Number of nodes to run the inference on. If > 1, then multi-node is enabled.	1	1
`checkpoint`	string	Path to the checkpoint used for inference.	???
`trt_engine`	string	Path to the TensorRT engine to be used for inference. This only works with `tao-deploy`.
`results_dir`	string	Path to where all the assets generated from a task are stored.
`batch_size`	int	Batch size of the input Tensor. This is important if batch_size > 1 for a large dataset.	-1	-1
`conf_threshold`	float	Value of the confidence threshold to be used when filtering out the final list of boxes.	0.5
`input_width`	int	Width of the input image tensor.		1
`input_height`	int	Height of the input image tensor.		1
`save_raw_pfm`	bool	Whether to save the raw pfm output during inference.	FALSE

Export Configuration#

Field	value_type	description	default_value	valid_min	valid_options
`results_dir`	string	Path to where all the assets generated from a task are stored.
`gpu_id`	int	Index of the GPU to build the TensorRT engine.	0
`checkpoint`	string	Path to the checkpoint file to run export.	???
`onnx_file`	string	Path to the onnx model file.	???
`on_cpu`	bool	Whether to export CPU compatible model.	FALSE
`input_channel`	ordered_int	Number of channels in the input Tensor.	3	1	1,3
`input_width`	int	Width of the input image tensor.	960	32
`input_height`	int	Height of the input image tensor.	544	32
`opset_version`	int	Operator set version of the ONNX model used to generate TensorRT engine.	17	1
`batch_size`	int	Batch size of the input Tensor for the engine. A value of `-1` implies dynamic tensor shapes.	-1	-1
`verbose`	bool	Whether to enable verbose TensorRT logging.	FALSE
`format`	categorical	File format to export to.	onnx		onnx,xdl
`valid_iters`	int	Number of GRU iterations to export the model.	22	1

TensorRT Engine Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`results_dir`	string	Path to where all the assets generated from a task are stored.
`gpu_id`	int	Index of the GPU to build the TensorRT engine.	0	0
`onnx_file`	string	Path to the ONNX model file.	???
`trt_engine`	string	Path to the TensorRT engine generated should be stored. This only works with `tao-deploy`.	???
`timing_cache`	string	Path to a TensorRT timing cache that speeds up engine generation. This will be created/read/updated.
`batch_size`	int	Batch size of the input tensor for the engine. A value of `-1` implies dynamic tensor shapes.	-1	-1
`verbose`	bool	Whether to enable verbose TensorRT logging.	FALSE
`tensorrt`	collection	Hyperparameters to configure the TensorRT Engine builder.			FALSE

Augmentation Configuration#

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`input_mean`	list	Input mean for RGB frames	[0.485, 0.456, 0.406]			FALSE
`input_std`	list	Input standard deviation per pixel for RGB frames	[0.229, 0.224, 0.225]			FALSE
`crop_size`	list	Crop size for input RGB images [height, width]	[518, 518]			FALSE
`min_scale`	float	Minimum scale in data augmentation	-0.2	0.2	1
`max_scale`	float	Maximum scale in data augmentation	0.4	-0.2	1
`do_flip`	bool	Whether to perform flip in data augmentation	FALSE
`yjitter_prob`	float	Probability for y jitter	1.0	0.0	1.0	TRUE
`gamma`	list	Gamma range in data augmentation	[1, 1, 1, 1]			FALSE
`color_aug_prob`	float	Probability for asymmetric color augmentation	0.2	0.0	1.0	TRUE
`color_aug_brightness`	float	Color jitter brightness	0.4	0.0	1.0
`color_aug_contrast`	float	Color jitter contrast	0.4	0.0	1.0
`color_aug_saturation`	list	Color jitter saturation	[0.0, 1.4]			FALSE
`color_aug_hue_range`	list	Hue range in data augmentation	[-0.027777777777777776, 0.027777777777777776]			FALSE
`eraser_aug_prob`	float	Probability for eraser augmentation	0.5	0.0	1.0	TRUE
`spatial_aug_prob`	float	Probability for spatial augmentation	1.0	0.0	1.0	TRUE
`stretch_prob`	float	Probability for stretch augmentation	0.8	0.0	1.0	TRUE
`max_stretch`	float	Maximum stretch augmentation	0.2	0.0	1.0
`h_flip_prob`	float	Probability for horizontal flip augmentation	0.5	0.0	1.0	TRUE
`v_flip_prob`	float	Probability for vertical flip augmentation	0.5	0.0	1.0	TRUE
`hshift_prob`	float	Probability for horizontal shift augmentation	0.5	0.0	1.0	TRUE
`crop_min_valid_disp_ratio`	float	Probability for minimum crop valid disparity ratio	0.0	0.0	1.0	TRUE

Training the Model#

To train a monocular depth estimation model:

# Get the training spec
TRAIN_SPECS=$(tao-client depth_net_mono get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

# Modify TRAIN_SPECS as needed, then run training
JOB_ID=$(tao-client depth_net_mono experiment-run-action --action train --id $EXPERIMENT_ID --specs "$TRAIN_SPECS")

tao model depth_net train \
          -e /path/to/experiment_spec.yaml \
          -k $KEY \
          results_dir=/path/to/results

Required arguments:

-e: Path to the experiment specification file
-k: Encryption key for model checkpoints

Optional arguments:

results_dir: Overrides the results directory from the specification file
train.num_gpus: Overrides number of GPUs
train.num_epochs: Overrides number of training epochs
dataset.train_dataset.batch_size: Overrides batch size

Training Output#

The training process generates the following outputs in the results directory:

train/dn_model_latest.pth: Latest model checkpoint
train/dn_model_epoch_XXX.pth: Periodic checkpoints
train/events.out.tfevents.*: TensorBoard log files
train/status.json: Training status and metrics

You can monitor training progress using TensorBoard:

tensorboard --logdir=/path/to/results/train

Evaluating the Model#

To evaluate a trained monocular depth estimation model:

EVAL_SPECS=$(tao-client depth_net_mono get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_mono experiment-run-action --action evaluate --id $EXPERIMENT_ID --specs "$EVAL_SPECS")

tao model depth_net evaluate \
          -e /path/to/experiment_spec.yaml \
          -k $KEY \
          evaluate.checkpoint=/path/to/checkpoint.pth

Required arguments:

-e: Path to the experiment specification file
-k: Encryption key

Optional arguments:

evaluate.checkpoint: Path to model checkpoint to evaluate
evaluate.batch_size: Batch size for evaluation
dataset.test_dataset.data_sources: Override test dataset

Evaluation Metrics#

For monocular depth estimation, TAO computes the following metrics:

Absolute relative error (abs_rel):: Mean of |predicted - ground_truth| / ground_truth. Lower is better.
Delta accuracy (d1):: Percentage of pixels where max(predicted/ground_truth, ground_truth/predicted) < 1.25. Higher is better.

These metrics are saved to a JSON file in the results directory and displayed in the console output.

Running Inference#

To run inference on images using a trained model:

INFER_SPECS=$(tao-client depth_net_mono get-spec --action inference --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_mono experiment-run-action --action inference --id $EXPERIMENT_ID --specs "$INFER_SPECS")

tao model depth_net inference \
          -e /path/to/experiment_spec.yaml \
          -k $KEY \
          inference.checkpoint=/path/to/checkpoint.pth

Required arguments:

-e: Path to the experiment specification file
-k: Encryption key

Optional arguments:

inference.checkpoint: Path to model checkpoint
inference.save_raw_pfm: Saves depth maps in PFM format (default: False)
inference.batch_size: Batch size for inference
dataset.infer_dataset.data_sources: Overrides inference dataset

Inference Output#

The inference process generates:

Depth map visualizations (colored depth images) in PNG format
Raw depth values in PFM format (if save_raw_pfm is True)
Inference results, saved in results_dir/inference/

Exporting the Model#

To export a trained model to ONNX format:

EXPORT_SPECS=$(tao-client depth_net_mono get-spec --action export --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_mono experiment-run-action --action export --id $EXPERIMENT_ID --specs "$EXPORT_SPECS")

tao model depth_net export \
          -e /path/to/experiment_spec.yaml \
          -k $KEY \
          export.checkpoint=/path/to/checkpoint.pth \
          export.onnx_file=/path/to/output.onnx

Required arguments:

-e: Path to the experiment specification file
-k: Encryption key
export.checkpoint: Path to trained model checkpoint
export.onnx_file: Output path for ONNX model

Optional arguments:

export.input_channel: Number of input channels (default: 3)
export.input_width: Input image width (default: 924)
export.input_height: Input image height (default: 518)
export.opset_version: ONNX opset version (default: 16)
export.batch_size: Batch size, -1 for dynamic (default: -1)
export.on_cpu: Export CPU-compatible model (default: False)
export.format: Export format - onnx or xdl (default: onnx)
export.valid_iters: Number of iterations for export (default: 22)

Generating TensorRT Engine#

To generate an NVIDIA^® TensorRT™ engine from the exported ONNX model for optimized inference:

GEN_TRT_SPECS=$(tao-client depth_net_mono get-spec --action gen_trt_engine --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_mono experiment-run-action --action gen_trt_engine --id $EXPERIMENT_ID --specs "$GEN_TRT_SPECS")

tao deploy depth_net gen_trt_engine \
           -e /path/to/experiment_spec.yaml \
           gen_trt_engine.onnx_file=/path/to/model.onnx \
           gen_trt_engine.trt_engine=/path/to/output.engine

Required arguments:

-e: Path to the experiment specification file
gen_trt_engine.onnx_file: Path to ONNX model
gen_trt_engine.trt_engine: Output path for TensorRT engine

Optional arguments:

gen_trt_engine.gpu_id: GPU index for engine generation (default: 0)
gen_trt_engine.batch_size: Batch size, -1 for dynamic (default: -1)
gen_trt_engine.verbose: Enables verbose logging (default: False)
gen_trt_engine.timing_cache: Path to timing cache file
gen_trt_engine.tensorrt.workspace_size: TensorRT workspace size in MB (default: 1024)
gen_trt_engine.tensorrt.data_type: Precision - FP32 or FP16 (default: FP32)
gen_trt_engine.tensorrt.min_batch_size: Minimum batch size (default: 1)
gen_trt_engine.tensorrt.opt_batch_size: Optimal batch size (default: 1)
gen_trt_engine.tensorrt.max_batch_size: Maximum batch size (default: 1)

TensorRT Engine Benefits#

Performance:: 2-5x faster inference compared to PyTorch
Memory efficiency: Reduced memory footprint
Optimization: Layer fusion and kernel auto-tuning
Deployment: Production-ready inference engine

Model Configuration Reference#

For a complete reference to all configuration parameters, refer to the configuration tables in the TAO Toolkit documentation or the experiment specification files provided with the toolkit.

Best Practices#

Training Recommendations#

Start with RelativeDepthAnything: Train a relative depth model first, then fine-tune to metric depth
Use pretrained weights: Initialize from DINOv2 or existing depth models for better convergence
Encoder selection:
- Use vitl for most applications (best balance)
- Use vits for edge deployment with limited resources
- Use vitg when maximum accuracy is critical
Batch size: Start with batch size 4-8 for vitl encoder on a single GPU
Learning rate: Use small learning rates (1e-5 to 1e-6) when fine-tuning from pretrained models
Activation checkpointing: Enable for large models (vitl, vitg) to reduce memory usage
Augmentation: Use moderate augmentation for indoor scenes, stronger for outdoor datasets

Data Preparation#

Dataset quality: Ensure depth ground truth is accurate and aligned with RGB images
Depth range: Set appropriate min_depth and max_depth for your use case
Mixed datasets: Combine multiple datasets for better generalization
Train/val split: Use 80-90% for training, 10-20% for validation

Performance Optimization#

Multi-GPU training: Use ddp strategy for 2-8 GPUs, fsdp for larger clusters
Mixed precision: Use fp16 for 2x faster training on modern GPUs
Data loading: Increase workers (4-8) to prevent data loading bottlenecks
TensorRT deployment: Always use TensorRT engines for production inference

Troubleshooting#

Common Issues#

Out of memory (OOM):

Reduce batch size
Enable activation_checkpoint: True

Poor depth quality:

Check data alignment between RGB and depth
Verify depth ground truth is in correct format (PFM or PNG)
Ensure min_depth and max_depth match your data range
Increase training epochs
Try different augmentation settings

Training instability:

Reduce learning rate
Enable gradient clipping (clip_grad_norm: 0.1)
Check for NaN values in depth ground truth
Use cudnn.deterministic: True for reproducible training

Additional Resources#

TAO Toolkit documentation: https://docs.nvidia.com/tao/
Sample notebooks: NVIDIA/tao_tutorials
NGC pretrained models: https://catalog.ngc.nvidia.com/

For more information about stereo depth estimation, go to Stereo Depth Estimation.