Stereo Depth Estimation#

Stereo depth estimation is the task of predicting depth information from a pair of calibrated stereo images. TAO Toolkit provides advanced stereo depth estimation capabilities through the DepthNet model using the FoundationStereo architecture, which combines transformer and CNN architectures for high-accuracy disparity prediction in industrial and robotic applications.

Note

For the real-time variant optimized for edge deployment, see Fast Foundation Stereo.

The stereo depth estimation models in TAO support the following tasks:

train
evaluate
inference
export
gen_trt_engine

Supported Model Architecture#

TAO Toolkit supports the FoundationStereo model for stereo depth estimation:

FoundationStereo

A hybrid transformer-CNN architecture designed for stereo depth estimation. This model takes a pair of rectified stereo images (left and right) as input and produces a disparity map. The architecture combines:

Vision Transformer Encoder: Based on DepthAnythingV2 for rich feature extraction
EdgeNext CNN Encoder: Efficient convolutional feature extractor
Iterative Refinement Module: GRU-based refinement for accurate disparity prediction
Correlation Volume: Computes feature similarities between left and right images
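
To make the correlation volume idea concrete, the sketch below computes a plain dot-product correlation along a single rectified scanline. It illustrates the general technique only and is not FoundationStereo's implementation; the function name `correlation_row` and the toy feature vectors are hypothetical.

```python
from typing import List


def correlation_row(left: List[List[float]], right: List[List[float]],
                    max_disp: int) -> List[List[float]]:
    """Correlate per-pixel feature vectors along one rectified scanline.

    left[x] and right[x] are feature vectors of equal length C.
    Returns corr[x][d]: the mean dot product between left[x] and
    right[x - d], or 0.0 where x - d falls outside the image.
    """
    corr = []
    for x in range(len(left)):
        scores = []
        for d in range(max_disp + 1):
            xr = x - d  # the match sits further left in the right image
            if xr < 0:
                scores.append(0.0)
            else:
                dot = sum(a * b for a, b in zip(left[x], right[xr]))
                scores.append(dot / len(left[x]))
        corr.append(scores)
    return corr


# Toy features: the right scanline is the left one shifted by 2 pixels,
# padded with zero vectors at the end.
left_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.6, 0.8]]
right_feats = left_feats[2:] + [[0.0, 0.0], [0.0, 0.0]]

volume = correlation_row(left_feats, right_feats, max_disp=3)
# Pick the disparity with the highest correlation for every left pixel.
best_disp = [max(range(len(s)), key=s.__getitem__) for s in volume]
print(best_disp[2:])  # pixels whose true match is inside the image
```

In a real network the features come from the encoders and the raw scores are refined iteratively rather than read off with an argmax.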

FoundationStereo is optimized for:

High zero-shot accuracy on unseen domains
Real-time performance with NVIDIA® TensorRT™ optimization
Industrial and robotic 3D perception tasks
Autonomous navigation and obstacle detection

Encoder Options#

The FoundationStereo model supports multiple Vision Transformer encoder sizes:

vits (small): 22M parameters, fastest inference, suitable for edge deployment
vitl (large): 304M parameters, higher accuracy for challenging scenes

Data Input for Stereo Depth Estimation#

Dataset Preparation#

Stereo depth estimation requires stereo image pairs with disparity ground truth. The dataset should be organized as follows:

Left images: Rectified left stereo images in standard formats (PNG, JPEG, etc.)
Right images: Rectified right stereo images aligned with left images
Disparity ground truth: Disparity maps in PFM or PNG format
Data split files: Text files listing the paths to stereo pairs and disparity
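
For readers unfamiliar with PFM, the following sketch writes and reads a single-channel PFM file using only the Python standard library. It assumes the common PFM layout (a `Pf` header, a `width height` line, a scale whose sign encodes endianness, and rows stored bottom to top); the helper names `write_pfm` and `read_pfm` are hypothetical and are not part of TAO.

```python
import os
import struct
import tempfile
from typing import List


def write_pfm(path: str, rows: List[List[float]]) -> None:
    """Write a single-channel PFM file as little-endian float32."""
    height, width = len(rows), len(rows[0])
    with open(path, "wb") as f:
        # "Pf" marks one channel; a negative scale marks little-endian data.
        f.write(f"Pf\n{width} {height}\n-1.0\n".encode("ascii"))
        for row in reversed(rows):  # PFM stores the bottom row first
            f.write(struct.pack(f"<{width}f", *row))


def read_pfm(path: str) -> List[List[float]]:
    """Read a single-channel PFM file into a top-to-bottom list of rows."""
    with open(path, "rb") as f:
        header = f.readline().decode("ascii").strip()
        if header != "Pf":
            raise ValueError(f"expected single-channel PFM, got {header!r}")
        width, height = map(int, f.readline().decode("ascii").split())
        scale = float(f.readline().decode("ascii").strip())
        endian = "<" if scale < 0 else ">"
        rows = [list(struct.unpack(f"{endian}{width}f", f.read(4 * width)))
                for _ in range(height)]
    rows.reverse()  # restore top-to-bottom order
    return rows


# Round trip on a tiny 2x2 disparity map.
disparity = [[1.5, 2.25], [0.0, 64.0]]
with tempfile.TemporaryDirectory() as tmp:
    pfm_path = os.path.join(tmp, "disp.pfm")
    write_pfm(pfm_path, disparity)
    restored = read_pfm(pfm_path)
print(restored)
```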

Data split file format:

Each line in the data split file should contain paths to the left image, right image, and disparity map, separated by spaces:

/path/to/left/image_001.png /path/to/right/image_001.png /path/to/disp/image_001.pfm
/path/to/left/image_002.png /path/to/right/image_002.png /path/to/disp/image_002.pfm
...

For inference without ground truth:

/path/to/left/image_001.png /path/to/right/image_001.png
/path/to/left/image_002.png /path/to/right/image_002.png
...
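
As a sanity check before training, a split file in either layout can be parsed with a few lines of Python. This is a minimal sketch, assuming whitespace-separated paths that contain no spaces; `StereoSample` and `parse_split_lines` are hypothetical names, not TAO APIs.

```python
from typing import Iterable, List, NamedTuple, Optional


class StereoSample(NamedTuple):
    left: str
    right: str
    disparity: Optional[str]  # None for inference-only entries


def parse_split_lines(lines: Iterable[str]) -> List[StereoSample]:
    """Parse split-file lines with two paths (inference) or three (training)."""
    samples = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 2:
            samples.append(StereoSample(fields[0], fields[1], None))
        elif len(fields) == 3:
            samples.append(StereoSample(*fields))
        else:
            raise ValueError(
                f"line {number}: expected 2 or 3 paths, got {len(fields)}")
    return samples


example = [
    "/path/to/left/image_001.png /path/to/right/image_001.png /path/to/disp/image_001.pfm",
    "/path/to/left/image_002.png /path/to/right/image_002.png",
]
samples = parse_split_lines(example)
print(samples[0].disparity, samples[1].disparity)
```

To read an actual file, pass the open file object, for example `parse_split_lines(open(split_path))`, where `split_path` points to your own split file.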

Stereo calibration requirements:

For accurate stereo depth estimation, ensure:

Images are rectified (epipolar lines are horizontal)
Stereo baseline and focal length are known
Image pairs are temporally synchronized
Minimal lens distortion after rectification

Supported Datasets#

TAO Toolkit supports the following stereo depth datasets:

FSD (Foundation Stereo Dataset): NVIDIA’s proprietary surround-view stereo dataset
IsaacRealDataset: NVIDIA Isaac real-world stereo data
Crestereo: Large-scale stereo dataset with diverse scenes
Middlebury: Classic stereo benchmark dataset with high-quality ground truth
Eth3d: Low-resolution gray-scale outdoor stereo evaluation dataset
KITTI: Autonomous driving stereo dataset
GenericDataset: Generic format for custom stereo datasets

For custom datasets, use the GenericDataset format by creating appropriate data split files with the format shown above.

Creating an Experiment Specification File#

The experiment specification file is a YAML configuration that defines all parameters for training, evaluation, and inference.

Configuration for FoundationStereo#

Here is an example specification file for training a FoundationStereo model:
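
As a rough sketch only, a specification can also be assembled programmatically. The field names and values below are taken from the configuration tables in this section, but grouping them under the top-level keys `dataset`, `model`, and `train` is an assumption, and the result is not the official example; check it against the specification files shipped with the toolkit.

```python
import json

# Field names and defaults come from the configuration tables in this
# section. The nesting under `dataset`, `model`, and `train` is an
# assumption made for illustration.
spec = {
    "dataset": {
        "dataset_name": "StereoDataset",
        "max_disparity": 416,
        "baseline": 0.193001,
        "focal_x": 1998.842,
    },
    "model": {
        "model_type": "FoundationStereo",
        "encoder": "vitl",
        "train_iters": 22,
        "valid_iters": 22,
    },
    "train": {
        "results_dir": "/path/to/results",
        "num_gpus": 1,
        "num_epochs": 10,
        "precision": "fp32",
        "activation_checkpoint": True,
        "optim": {"optimizer": "AdamW", "lr": 1e-4, "weight_decay": 1e-4},
    },
}

# For a simple structure like this one, the JSON text is also valid YAML,
# so it can be saved with a .yaml extension.
spec_text = json.dumps(spec, indent=2)
print(spec_text)
```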

Key Configuration Parameters#

The following sections provide detailed configuration tables for all parameters.

Dataset Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`dataset_name`	categorical	Dataset name	StereoDataset			MonoDataset,StereoDataset
`normalize_depth`	bool	Whether to normalize depth	FALSE
`max_depth`	float	Maximum depth in meters in MetricDepthAnythingV2		1.0	inf
`min_depth`	float	Minimum depth in meters in MetricDepthAnythingV2		0.0	inf
`max_disparity`	int	Maximum allowed disparity for which we compute losses during training	416	1	416
`baseline`	float	Baseline for stereo datasets	0.193001	0.0	inf
`focal_x`	float	Focal length along x-axis	1998.842	0.0	inf
`train_dataset`	collection	Configurable parameters to construct the train dataset for a DepthNet experiment					FALSE
`val_dataset`	collection	Configurable parameters to construct the val dataset for a DepthNet experiment					FALSE
`test_dataset`	collection	Configurable parameters to construct the test dataset for a DepthNet experiment					FALSE
`infer_dataset`	collection	Configurable parameters to construct the infer dataset for a DepthNet experiment					FALSE

Model Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`model_type`	categorical	Network name	MetricDepthAnythingV2			FoundationStereo,MetricDepthAnything,RelativeDepthAnything
`mono_backbone`	collection	Network defined paths for Monocular DepthNet Backbone					FALSE
`stereo_backbone`	collection	Network defined paths for Edgenext and Depthanythingv2					FALSE
`hidden_dims`	list	Hidden dimensions	[128, 128, 128]				FALSE
`corr_radius`	int	Width of the correlation pyramid	4	1			TRUE
`cv_group`	int	cv group	8	1			TRUE
`train_iters`	int	Number of GRU refinement iterations used during training	22	1			TRUE
`valid_iters`	int	Number of GRU refinement iterations used during validation	22	1
`volume_dim`	int	Volume dimension	32	1			TRUE
`low_memory`	int	Memory-reduction level used to lower memory usage	0	0	4
`mixed_precision`	bool	Whether to use mixed precision training	FALSE
`n_gru_layers`	int	Number of hidden GRU levels	3	1	3
`corr_levels`	int	Number of levels in the correlation pyramid	2	1	2
`n_downsample`	int	Resolution of the disparity field (1/2^K)	2	1	2
`encoder`	categorical	DepthAnythingV2 Encoder options	vitl			vits,vitl
`max_disparity`	int	Maximum disparity of the model used in the training of a stereo model	416

Stereo Backbone Configuration#

Field	value_type	description	default_value
`depth_anything_v2_pretrained_path`	string	Path to load DepthAnythingv2 as an encoder for Stereo DepthNet (FoundationStereo)
`edgenext_pretrained_path`	string	Path to load edgenext encoder for Stereo DepthNet (FoundationStereo)
`use_bn`	bool	Whether to use batch normalization in DepthAnythingV2	FALSE
`use_clstoken`	bool	Whether to use class token	FALSE

Training Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`num_gpus`	int	Number of GPUs to run the train job.	1	1
`gpu_ids`	list	List of GPU IDs to run the training on. The length of this list must be equal to the number of GPUs in `train.num_gpus`.	[0]				FALSE
`num_nodes`	int	Number of nodes to run the training on. If > 1, then multi-node is enabled.	1	1
`seed`	int	Seed for the initializer in PyTorch. If < 0, disable fixed seed.	1234	-1	inf
`cudnn`	collection						FALSE
`num_epochs`	int	Number of epochs to run the training.	10	1	inf
`checkpoint_interval`	int	Interval (in epochs) at which a checkpoint is to be saved; helps resume training.	1	1
`checkpoint_interval_unit`	categorical	Unit of the checkpoint interval.	epoch			epoch,step
`validation_interval`	int	Interval (in epochs) at which an evaluation will be triggered on the validation dataset.	1	1
`resume_training_checkpoint_path`	string	Path to the checkpoint from which to resume training.
`results_dir`	string	Path to where all the assets generated from a task are stored.
`checkpoint_interval_steps`	int	Number of steps to save the checkpoint.
`pretrained_model_path`	string	Path to a pretrained DepthNet model from which to initialize the current training.
`clip_grad_norm`	float	Amount to clip the gradient by L2 Norm. A value of 0.0 specifies no clipping.	0.1
`dataloader_visualize`	bool	Whether to visualize the dataloader.	FALSE				TRUE
`vis_step_interval`	int	Visualization interval in step.	10				TRUE
`is_dry_run`	bool	Whether to run the trainer in Dry Run mode. This serves as a good means to validate the specification file and run a sanity check on the trainer without actually initializing and running the trainer.	FALSE
`optim`	collection	Hyperparameters to configure the optimizer.					FALSE
`precision`	categorical	Precision on which to run the training.	fp32			bf16,fp32,fp16
`distributed_strategy`	categorical	Multi-GPU training strategy. DDP (Distributed Data Parallel) and Fully Sharded DDP are supported.	ddp			ddp,fsdp
`activation_checkpoint`	bool	Whether to recompute activations in the backward pass to save GPU memory (TRUE) or store them (FALSE).	TRUE
`verbose`	bool	Whether to display verbose logs to console.	FALSE
`inference_tile`	bool	Whether to use tiled inference, particularly for transformers which expect fixed size of sequences.	FALSE
`tile_wtype`	string	Use tiled inference weight type.	gaussian
`tile_min_overlap`	list	Minimum overlap for tile.	[16, 16]				FALSE
`log_every_n_steps`	int	Interval, in steps, for logging training results and running validation metrics within one epoch.	500

Optimizer Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`optimizer`	categorical	Type of optimizer used to train the network	AdamW			AdamW,SGD
`monitor_name`	categorical	Metric value to be monitored for the `AutoReduce` Scheduler	val_loss			val_loss,train_loss
`lr`	float	Initial learning rate for training the model, excluding the backbone	0.0001				TRUE
`momentum`	float	Momentum for the AdamW optimizer	0.9				TRUE
`weight_decay`	float	Weight decay coefficient	0.0001				TRUE
`lr_scheduler`	categorical	Learning rate scheduler. MultiStepLR: decrease the lr by lr_decay at each step listed in lr_steps. StepLR: decrease the lr by lr_decay every lr_step_size steps.	MultiStepLR			MultiStep,StepLR,CustomMultiStepLRScheduler,LambdaLR,PolynomialLR,OneCycleLR,CosineAnnealingLR
`lr_steps`	list	Steps at which the learning rate must be decreased. This is applicable only with the MultiStep LR scheduler.	[1000]				FALSE
`lr_step_size`	int	Number of steps to decrease the learning rate in the StepLR	1000				TRUE
`lr_decay`	float	Decreasing factor for the learning rate scheduler	0.1				TRUE
`min_lr`	float	Minimum learning rate value for the learning rate scheduler	1e-07				TRUE
`warmup_steps`	int	Number of steps to perform linear learning rate warm-up before engaging a learning rate scheduler	20	0	inf
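
To show how `warmup_steps`, `lr_steps`, `lr_decay`, and `min_lr` can interact, here is a minimal sketch of a linear warm-up followed by a multi-step decay. It uses the default values from the table above, but the exact way TAO combines warm-up with the scheduler and whether `min_lr` is applied as a floor are assumptions; the function `lr_at_step` is hypothetical.

```python
from typing import Sequence


def lr_at_step(step: int,
               base_lr: float = 1e-4,
               warmup_steps: int = 20,
               lr_steps: Sequence[int] = (1000,),
               lr_decay: float = 0.1,
               min_lr: float = 1e-7) -> float:
    """Learning rate at a given optimizer step (0-based)."""
    if step < warmup_steps:
        # Linear warm-up: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Multi-step decay: multiply by lr_decay once per milestone passed.
    drops = sum(1 for milestone in lr_steps if step >= milestone)
    # Assumed behavior: never fall below min_lr.
    return max(base_lr * lr_decay ** drops, min_lr)


for s in (0, 19, 20, 999, 1000):
    print(s, lr_at_step(s))
```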

Evaluation Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`num_gpus`	int	Number of GPUs to run the evaluation job.	1	1
`gpu_ids`	list	List of GPU IDs to run the evaluation on. The length of this list must be equal to the number of GPUs in `evaluate.num_gpus`.	[0]		FALSE
`num_nodes`	int	Number of nodes to run the evaluation on. If > 1, then multi-node is enabled.	1	1
`checkpoint`	string	Path to the checkpoint used for evaluation.	???
`trt_engine`	string	Path to the TensorRT engine to be used for evaluation. This only works with `tao-deploy`.
`results_dir`	string	Path to where all the assets generated from a task are stored.
`batch_size`	int	Batch size of the input Tensor. This is important if `batch_size` > 1 for a large dataset.	-1	-1
`input_width`	int	Width of the input image tensor.	736	1
`input_height`	int	Height of the input image tensor.	320	1

Inference Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`num_gpus`	int	Number of GPUs to run the inference job.	1	1
`gpu_ids`	list	List of GPU IDs to run the inference on. The length of this list must be equal to the number of GPUs in `inference.num_gpus`.	[0]		FALSE
`num_nodes`	int	Number of nodes to run the inference on. If > 1, then multi-node is enabled.	1	1
`checkpoint`	string	Path to the checkpoint used for inference.	???
`trt_engine`	string	Path to the TensorRT engine to be used for inference. This only works with `tao-deploy`.
`results_dir`	string	Path to where all the assets generated from a task are stored.
`batch_size`	int	Batch size of the input Tensor. This is important if batch_size > 1 for a large dataset.	-1	-1
`conf_threshold`	float	Value of the confidence threshold to be used when filtering out the final list of boxes.	0.5
`input_width`	int	Width of the input image tensor.		1
`input_height`	int	Height of the input image tensor.		1
`save_raw_pfm`	bool	Whether to save the raw pfm output during inference.	FALSE

Export Configuration#

Field	value_type	description	default_value	valid_min	valid_options
`results_dir`	string	Path to where all the assets generated from a task are stored.
`gpu_id`	int	Index of the GPU to build the TensorRT engine.	0
`checkpoint`	string	Path to the checkpoint file to run export.	???
`onnx_file`	string	Path to the onnx model file.	???
`on_cpu`	bool	Whether to export CPU compatible model.	FALSE
`input_channel`	ordered_int	Number of channels in the input Tensor.	3	1	1,3
`input_width`	int	Width of the input image tensor.	960	32
`input_height`	int	Height of the input image tensor.	544	32
`opset_version`	int	Operator set version of the ONNX model used to generate TensorRT engine.	17	1
`batch_size`	int	Batch size of the input Tensor for the engine. A value of `-1` implies dynamic tensor shapes.	-1	-1
`verbose`	bool	Whether to enable verbose TensorRT logging.	FALSE
`format`	categorical	File format to export to.	onnx		onnx,xdl
`valid_iters`	int	Number of GRU iterations to export the model.	22	1

TensorRT Engine Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`results_dir`	string	Path to where all the assets generated from a task are stored.
`gpu_id`	int	Index of the GPU to build the TensorRT engine.	0	0
`onnx_file`	string	Path to the ONNX model file.	???
`trt_engine`	string	Path where the generated TensorRT engine should be stored. This only works with `tao-deploy`.	???
`timing_cache`	string	Path to a TensorRT timing cache that speeds up engine generation. This will be created/read/updated.
`batch_size`	int	Batch size of the input tensor for the engine. A value of `-1` implies dynamic tensor shapes.	-1	-1
`verbose`	bool	Whether to enable verbose TensorRT logging.	FALSE
`tensorrt`	collection	Hyperparameters to configure the TensorRT Engine builder.			FALSE

Augmentation Configuration#

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`input_mean`	list	Input mean for RGB frames	[0.485, 0.456, 0.406]			FALSE
`input_std`	list	Input standard deviation per pixel for RGB frames	[0.229, 0.224, 0.225]			FALSE
`crop_size`	list	Crop size for input RGB images [height, width]	[518, 518]			FALSE
`min_scale`	float	Minimum scale in data augmentation	-0.2	0.2	1
`max_scale`	float	Maximum scale in data augmentation	0.4	-0.2	1
`do_flip`	bool	Whether to perform flip in data augmentation	FALSE
`yjitter_prob`	float	Probability for y jitter	1.0	0.0	1.0	TRUE
`gamma`	list	Gamma range in data augmentation	[1, 1, 1, 1]			FALSE
`color_aug_prob`	float	Probability for asymmetric color augmentation	0.2	0.0	1.0	TRUE
`color_aug_brightness`	float	Color jitter brightness	0.4	0.0	1.0
`color_aug_contrast`	float	Color jitter contrast	0.4	0.0	1.0
`color_aug_saturation`	list	Color jitter saturation	[0.0, 1.4]			FALSE
`color_aug_hue_range`	list	Hue range in data augmentation	[-0.027777777777777776, 0.027777777777777776]			FALSE
`eraser_aug_prob`	float	Probability for eraser augmentation	0.5	0.0	1.0	TRUE
`spatial_aug_prob`	float	Probability for spatial augmentation	1.0	0.0	1.0	TRUE
`stretch_prob`	float	Probability for stretch augmentation	0.8	0.0	1.0	TRUE
`max_stretch`	float	Maximum stretch augmentation	0.2	0.0	1.0
`h_flip_prob`	float	Probability for horizontal flip augmentation	0.5	0.0	1.0	TRUE
`v_flip_prob`	float	Probability for vertical flip augmentation	0.5	0.0	1.0	TRUE
`hshift_prob`	float	Probability for horizontal shift augmentation	0.5	0.0	1.0	TRUE
`crop_min_valid_disp_ratio`	float	Probability for minimum crop valid disparity ratio	0.0	0.0	1.0	TRUE

Training the Model#

Training Output#

The training process generates the following outputs in the results directory:

train/dn_model_latest.pth: Latest model checkpoint
train/dn_model_epoch_XXX_step_YYY.pth: Periodic checkpoints
train/events.out.tfevents.*: TensorBoard log files
train/status.json: Training status and metrics
train/visualizations/: Sample disparity predictions (if enabled)

You can monitor training progress using TensorBoard:

tensorboard --logdir=/path/to/results/train

Evaluating the Model#

Evaluation Metrics#

For stereo depth estimation, TAO computes the following metrics:

End-Point-Error (EPE): Mean absolute difference between predicted and ground truth disparity. Lower is better.
D1-All Error: Percentage of pixels with disparity error > 1 pixel. Lower is better.
Bad Pixel Rates (BP1, BP2, BP3): Percentage of pixels with errors exceeding 1, 2, and 3 pixels respectively. Lower is better.
Absolute Relative Error (abs_rel): Mean of |predicted - ground_truth| / ground_truth. Lower is better.
Squared Relative Error (sq_rel): Mean of (predicted - ground_truth)² / ground_truth. Lower is better.
RMSE: Root mean square error of disparity. Lower is better.
RMSE Log: RMSE in log space. Lower is better.

These metrics are saved to a JSON file in the results directory and displayed in the console output.
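
The definitions above can be checked by hand with a small reference implementation. The sketch below follows the formulas as written in this section and assumes that only pixels with positive, finite ground truth are scored; it is not TAO's evaluation code, the function name `disparity_metrics` is hypothetical, and D1-All is left out of the sketch.

```python
import math
from typing import Dict, Sequence


def disparity_metrics(pred: Sequence[float],
                      gt: Sequence[float]) -> Dict[str, float]:
    """Compute disparity metrics over flattened prediction and ground truth."""
    # Assumption: pixels without valid ground truth (<= 0 or non-finite) are skipped.
    pairs = [(p, g) for p, g in zip(pred, gt) if math.isfinite(g) and g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    errors = [abs(p - g) for p, g in pairs]

    metrics = {
        "epe": sum(errors) / n,
        "abs_rel": sum(e / g for e, (_, g) in zip(errors, pairs)) / n,
        "sq_rel": sum(e * e / g for e, (_, g) in zip(errors, pairs)) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
    }
    # Bad pixel rates: percentage of pixels whose error exceeds the threshold.
    for threshold in (1, 2, 3):
        metrics[f"bp{threshold}"] = 100.0 * sum(e > threshold for e in errors) / n
    # RMSE in log space is only defined for positive predictions.
    log_sq = [(math.log(p) - math.log(g)) ** 2 for p, g in pairs if p > 0]
    metrics["rmse_log"] = math.sqrt(sum(log_sq) / len(log_sq)) if log_sq else float("nan")
    return metrics


# The last pixel has no ground truth (0) and is ignored.
result = disparity_metrics(pred=[10.0, 12.0, 20.5, 5.0], gt=[10.0, 10.0, 20.0, 0.0])
print(result)
```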

Running Inference#

Inference Output#

The inference process generates:

Disparity map visualizations (colored disparity images) in PNG format
Raw disparity values in PFM format (if save_raw_pfm is True)
Depth maps (if baseline and focal length are provided)
Inference results, saved in results_dir/inference/

The disparity can be converted to depth (in meters) using:

depth = (baseline * focal_x) / disparity
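
A minimal sketch of this conversion in plain Python follows. It assumes the baseline is given in meters and both `focal_x` and the disparity are in pixels; mapping near-zero disparities to infinity is a choice made here for illustration, and the function name `disparity_to_depth` is hypothetical.

```python
from typing import List


def disparity_to_depth(disparity: List[List[float]],
                       baseline: float,
                       focal_x: float,
                       min_disparity: float = 1e-6) -> List[List[float]]:
    """Convert a disparity map (pixels) to a depth map (meters).

    Pixels whose disparity is not above min_disparity are treated as
    infinitely far away instead of triggering a division by zero.
    """
    scale = baseline * focal_x
    return [
        [scale / d if d > min_disparity else float("inf") for d in row]
        for row in disparity
    ]


# Example with a 0.2 m baseline and a 1000-pixel focal length.
depth = disparity_to_depth([[100.0, 50.0], [0.0, 400.0]],
                           baseline=0.2, focal_x=1000.0)
print(depth)
```

If the images are resized before inference, `focal_x` must be scaled by the same horizontal factor for the resulting depth to stay metric.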

Exporting the Model#

The export task converts a trained PyTorch model to ONNX format. Configure parameters such as export.checkpoint, export.onnx_file, export.input_width, export.input_height, export.opset_version, export.batch_size, export.on_cpu, export.format (onnx or xdl), and export.valid_iters in your specification file.

Generating TensorRT Engine#

The gen_trt_engine task converts an exported ONNX model into a TensorRT engine for optimized inference. Configure parameters such as gen_trt_engine.onnx_file, gen_trt_engine.trt_engine, gen_trt_engine.gpu_id, gen_trt_engine.batch_size, gen_trt_engine.verbose, gen_trt_engine.timing_cache, and the gen_trt_engine.tensorrt block (workspace_size, data_type, min_batch_size, opt_batch_size, max_batch_size) in your specification file.

TensorRT Engine Benefits#

Performance: 3-10x faster inference compared to PyTorch
Memory efficiency: Reduced memory footprint
Optimization: Layer fusion, kernel auto-tuning, and precision calibration
Deployment: Production-ready inference engine for real-time applications

For stereo depth estimation, TensorRT optimization is particularly beneficial for:

Real-time robotic vision (30+ FPS on modern GPUs)
Autonomous navigation systems
Industrial inspection and quality control
AR/VR applications requiring low latency

Model Configuration Reference#

For a complete reference to all configuration parameters, refer to the configuration tables in the TAO Toolkit documentation or the experiment specification files provided with the toolkit. Many parameters are shared with monocular depth estimation.

Best Practices#

Training Recommendations#

Dataset diversity: Mix multiple datasets (FSD, Crestereo, Isaac) for better generalization
Encoder selection:
- Use vits for real-time applications (fastest, 22M parameters)
- Use vitl for maximum accuracy (304M parameters)
Batch size: Start with batch size 1-2 per GPU for FoundationStereo
Learning rate: Use small learning rates (1e-5) with PolynomialLR scheduler
Multi-GPU training: Use 2-8 GPUs with DDP strategy for faster training
Activation checkpointing: Enable for larger encoders (vitl) to reduce memory
Refinement iterations:
- Use 22 iterations during training for best accuracy
- You can reduce to 10-15 for faster inference with minimal accuracy loss
Augmentation: Use strong augmentation for robustness across domains

Data Preparation#

Stereo rectification: Ensure images are properly rectified before training
Calibration accuracy: Accurate baseline and focal length are critical for metric depth
Disparity range: Set max_disparity based on your camera setup and scene depth
Image resolution: Higher resolution (e.g., 768x1280) improves accuracy but requires more memory
Mixed datasets: Combine indoor and outdoor datasets for domain generalization
Data quality: Filter out poorly calibrated or misaligned stereo pairs
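
When choosing `max_disparity`, the depth formula can be inverted: the closest object you need to measure produces the largest disparity. The sketch below assumes a baseline in meters and a focal length in pixels at the resolution fed to the network; `required_max_disparity` is a hypothetical helper.

```python
import math


def required_max_disparity(baseline: float, focal_x: float,
                           min_depth: float) -> int:
    """Smallest integer disparity (pixels) that covers objects at min_depth (meters)."""
    if min_depth <= 0:
        raise ValueError("min_depth must be positive")
    return math.ceil(baseline * focal_x / min_depth)


# With the default baseline and focal_x from the dataset table, objects
# as close as 1.0 m stay within the maximum allowed value of 416.
needed = required_max_disparity(baseline=0.193001, focal_x=1998.842, min_depth=1.0)
print(needed, needed <= 416)
```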

Performance Optimization#

TensorRT deployment: Always use TensorRT engines for production (3-10x speedup)
FP16 precision: Use FP16 for TensorRT engines (2x faster with minimal accuracy loss)
Dynamic batching: Use dynamic batch sizes for variable workloads
Timing cache: Reuse timing cache to speed up subsequent engine builds
Input resolution: Balance resolution and speed based on application needs
Multi-stream inference: Use multiple CUDA streams for maximum throughput

Troubleshooting#

Common Issues#

Out of memory (OOM):

Reduce batch size to 1
Enable activation_checkpoint: True
Use a smaller encoder (vits instead of vitl)
Reduce crop_size or input resolution
Set low_memory: 1 or higher (0-4) in model config
Reduce train_iters to 10-15

Poor disparity quality:

Check stereo rectification - images must be properly rectified
Verify baseline and focal_x match your camera calibration
Ensure max_disparity is appropriate for your depth range
Increase training epochs (6-10 epochs recommended)
Use stronger augmentation
Mix multiple datasets for better generalization
Check for occluded regions and textureless areas in your data

Training instability:

Reduce learning rate (try 5e-6 to 1e-5)
Enable gradient clipping (clip_grad_norm: 0.1)
Use a PolynomialLR scheduler with lr_decay: 0.9
Check for NaN or inf values in disparity ground truth
Ensure disparity maps are in correct format (PFM or PNG)
Use cudnn.deterministic: True for reproducible training

Slow training:

Increase batch_size if memory allows
Use multiple GPUs (2-8) with DDP strategy
Increase log_every_n_steps and vis_step_interval so that logging and visualization run less often
Use fp16 precision (2x speedup)
Increase number of data loading workers (8-16)
Disable dataloader_visualize during long training runs
Use smaller train_iters (15 instead of 22)

Poor zero-shot performance:

Train on diverse datasets (mix FSD, Crestereo, Isaac)
Use strong augmentation (color, eraser, spatial)
Increase training epochs
Use larger encoder (vitl)
Ensure training data covers target domain characteristics
Fine-tune on a small sample of target domain data

Inference speed issues:

Use TensorRT engine instead of PyTorch model
Enable FP16 precision in TensorRT
Reduce input resolution if acceptable
Reduce valid_iters to 10-15 for faster inference
Use vits encoder for edge deployment
Optimize batch size for your GPU

Additional Resources#

TAO Toolkit documentation: https://docs.nvidia.com/tao/
Sample notebooks: NVIDIA/tao_tutorials
NGC pretrained models: https://catalog.ngc.nvidia.com/
FoundationStereo paper: NVIDIA Technical Reports

For more information about monocular depth estimation, go to Monocular Depth Estimation.