Stereo Depth Estimation#

Stereo depth estimation is the task of predicting depth information from a pair of calibrated stereo images. TAO Toolkit provides advanced stereo depth estimation capabilities through the DepthNet model using the FoundationStereo architecture, which combines transformer and CNN components for high-accuracy disparity prediction in industrial and robotic applications.

The stereo depth estimation models in TAO support the following tasks:

  • train

  • evaluate

  • inference

  • export

  • gen_trt_engine

These tasks can be invoked through the FTMS client (tao-client) using the following convention on the command line:

SPECS=$(tao-client depth_net_stereo get-spec --action <sub_task> --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_stereo experiment-run-action --action <sub_task> --id $EXPERIMENT_ID --specs "$SPECS")

Required arguments:

  • --action: The sub-task to run (train, evaluate, inference, export, or gen_trt_engine)

  • --id: The unique identifier of the experiment

  • --specs: The specifications retrieved with get-spec, with any overrides applied

See also

For information on how to create an experiment using the FTMS client, refer to the Creating an experiment section in the Remote Client documentation.

Supported Model Architecture#

TAO Toolkit supports the FoundationStereo model for stereo depth estimation:

FoundationStereo

A hybrid transformer-CNN architecture designed for stereo depth estimation. This model takes a pair of rectified stereo images (left and right) as input and produces a disparity map. The architecture combines:

  • Vision Transformer Encoder: Based on DepthAnythingV2 for rich feature extraction

  • EdgeNeXt CNN Encoder: Efficient convolutional feature extractor

  • Iterative Refinement Module: GRU-based refinement for accurate disparity prediction

  • Correlation Volume: Computes feature similarities between left and right images

FoundationStereo is optimized for:

  • High zero-shot accuracy on unseen domains

  • Real-time performance with NVIDIA® TensorRT optimization

  • Industrial and robotic 3D perception tasks

  • Autonomous navigation and obstacle detection

Encoder Options#

The FoundationStereo model supports multiple Vision Transformer encoder sizes:

  • vits (small): 22M parameters, fastest inference, suitable for edge deployment

  • vitl (large): 304M parameters, higher accuracy for challenging scenes

Data Input for Stereo Depth Estimation#

Dataset Preparation#

Stereo depth estimation requires stereo image pairs with disparity ground truth. The dataset should be organized as follows:

  1. Left images: Rectified left stereo images in standard formats (PNG, JPEG, etc.)

  2. Right images: Rectified right stereo images aligned with left images

  3. Disparity ground truth: Disparity maps in PFM or PNG format

  4. Data split files: Text files listing the paths to stereo pairs and disparity

Data split file format:

Each line in the data split file should contain paths to the left image, right image, and disparity map, separated by spaces:

/path/to/left/image_001.png /path/to/right/image_001.png /path/to/disp/image_001.pfm
/path/to/left/image_002.png /path/to/right/image_002.png /path/to/disp/image_002.pfm
...

For inference without ground truth:

/path/to/left/image_001.png /path/to/right/image_001.png
/path/to/left/image_002.png /path/to/right/image_002.png
...

Stereo calibration requirements:

For accurate stereo depth estimation, ensure the following (a quick rectification check is sketched after this list):

  • Images are rectified (epipolar lines are horizontal)

  • Stereo baseline and focal length are known

  • Image pairs are temporally synchronized

  • Minimal lens distortion after rectification
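
A rectified pair can be spot-checked programmatically: matched features in the left and right images should lie on (nearly) the same row. Below is a minimal sketch of such a check using OpenCV feature matching. It is illustrative only and not part of the TAO tooling; the file paths and the 1-pixel tolerance are placeholder assumptions to adapt to your setup.

# Sketch: check rectification by measuring the vertical offset of matched features.
# Illustrative only; not part of TAO. Paths and tolerance are placeholders.
import cv2
import numpy as np

def mean_vertical_offset(left_path, right_path, max_matches=200):
    left = cv2.imread(left_path, cv2.IMREAD_GRAYSCALE)
    right = cv2.imread(right_path, cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=2000)
    kp_l, des_l = orb.detectAndCompute(left, None)
    kp_r, des_r = orb.detectAndCompute(right, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_l, des_r), key=lambda m: m.distance)[:max_matches]

    # In a well-rectified pair, matched keypoints share the same row, so |dy| ~ 0.
    dy = [abs(kp_l[m.queryIdx].pt[1] - kp_r[m.trainIdx].pt[1]) for m in matches]
    return float(np.mean(dy))

offset = mean_vertical_offset("left/image_001.png", "right/image_001.png")
print(f"mean |dy| of matches: {offset:.2f} px")
if offset > 1.0:  # assumed tolerance; tune for your cameras
    print("Warning: this pair may not be properly rectified")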

Supported Datasets#

TAO Toolkit supports the following stereo depth datasets:

  • FSD (Foundation Stereo Dataset): NVIDIA’s proprietary surround-view stereo dataset

  • IsaacRealDataset: NVIDIA Isaac real-world stereo data

  • Crestereo: Large-scale stereo dataset with diverse scenes

  • Middlebury: Classic stereo benchmark dataset with high-quality ground truth

  • Eth3d: Low-resolution gray-scale outdoor stereo evaluation dataset

  • KITTI: Autonomous driving stereo dataset

  • GenericDataset: Generic format for custom stereo datasets

For custom datasets, use the GenericDataset format by creating appropriate data split files with the format shown above.
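
As an illustration, a split file in the format above can be produced with a short script such as the one below. This is a sketch, not part of TAO; the directory names are placeholders, it assumes that left, right, and disparity files share base filenames, and it omits the disparity column for inference-only splits.

# Sketch: generate data split files for the GenericDataset format.
# Assumes left/right/disparity files share base names; disparity is optional.
import os

def write_split(left_dir, right_dir, out_path, disp_dir=None, disp_ext=".pfm"):
    with open(out_path, "w") as out:
        for name in sorted(os.listdir(left_dir)):
            stem, _ = os.path.splitext(name)
            left = os.path.join(left_dir, name)
            right = os.path.join(right_dir, name)
            if not os.path.isfile(right):
                continue  # skip pairs with a missing right image
            fields = [left, right]
            if disp_dir is not None:
                disp = os.path.join(disp_dir, stem + disp_ext)
                if not os.path.isfile(disp):
                    continue  # skip pairs with missing ground truth
                fields.append(disp)
            out.write(" ".join(fields) + "\n")

# Training split with ground truth, and an inference-only split without it.
write_split("data/left", "data/right", "train_split.txt", disp_dir="data/disp")
write_split("data/left", "data/right", "infer_split.txt")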

Creating an Experiment Specification File#

The experiment specification file is a YAML configuration that defines all parameters for training, evaluation, and inference.

Configuration for FoundationStereo#

The training specification defines the dataset, model, and training parameters for a FoundationStereo experiment. Retrieve the default specification for training:

TRAIN_SPECS=$(tao-client depth_net_stereo get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

Inspect $TRAIN_SPECS and override values as needed; the tables below describe the available parameters.

Key Configuration Parameters#

The following sections provide detailed configuration tables for all parameters.

Dataset Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| dataset_name | categorical | Dataset name | StereoDataset | | | MonoDataset,StereoDataset | |
| normalize_depth | bool | Whether to normalize depth | FALSE | | | | |
| max_depth | float | Maximum depth in meters in MetricDepthAnythingV2 | 1.0 | | inf | | |
| min_depth | float | Minimum depth in meters in MetricDepthAnythingV2 | 0.0 | | inf | | |
| max_disparity | int | Maximum allowed disparity for which we compute losses during training | 416 | 1 | 416 | | |
| baseline | float | Baseline for stereo datasets | 0.193001 | 0.0 | inf | | |
| focal_x | float | Focal length along the x-axis | 1998.842 | 0.0 | inf | | |
| train_dataset | collection | Configurable parameters to construct the train dataset for a DepthNet experiment | | | | | FALSE |
| val_dataset | collection | Configurable parameters to construct the val dataset for a DepthNet experiment | | | | | FALSE |
| test_dataset | collection | Configurable parameters to construct the test dataset for a DepthNet experiment | | | | | FALSE |
| infer_dataset | collection | Configurable parameters to construct the infer dataset for a DepthNet experiment | | | | | FALSE |

Model Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| model_type | categorical | Network name | MetricDepthAnythingV2 | | | FoundationStereo,MetricDepthAnything,RelativeDepthAnything | |
| mono_backbone | collection | Network-defined paths for the monocular DepthNet backbone | | | | | FALSE |
| stereo_backbone | collection | Network-defined paths for EdgeNeXt and DepthAnythingV2 | | | | | FALSE |
| hidden_dims | list | Hidden dimensions | [128, 128, 128] | | | | FALSE |
| corr_radius | int | Width of the correlation pyramid | 4 | 1 | | | TRUE |
| cv_group | int | Number of correlation volume groups | 8 | 1 | | | TRUE |
| train_iters | int | Number of refinement iterations during training | 22 | 1 | | | TRUE |
| valid_iters | int | Number of refinement iterations during validation | 22 | 1 | | | |
| volume_dim | int | Volume dimension | 32 | 1 | | | TRUE |
| low_memory | int | Level of memory-usage reduction | 0 | 0 | 4 | | |
| mixed_precision | bool | Whether to use mixed-precision training | FALSE | | | | |
| n_gru_layers | int | Number of hidden GRU levels | 3 | 1 | 3 | | |
| corr_levels | int | Number of levels in the correlation pyramid | 2 | 1 | 2 | | |
| n_downsample | int | Resolution of the disparity field (1/2^K) | 2 | 1 | 2 | | |
| encoder | categorical | DepthAnythingV2 encoder option | vitl | | | vits,vitl | |
| max_disparity | int | Maximum disparity of the model used when training a stereo model | 416 | | | | |

Stereo Backbone Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| depth_anything_v2_pretrained_path | string | Path to DepthAnythingV2 weights to load as an encoder for Stereo DepthNet (FoundationStereo) | | | | | |
| edgenext_pretrained_path | string | Path to EdgeNeXt encoder weights to load for Stereo DepthNet (FoundationStereo) | | | | | |
| use_bn | bool | Whether to use batch normalization in DepthAnythingV2 | FALSE | | | | |
| use_clstoken | bool | Whether to use the class token | FALSE | | | | |

Training Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the train job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must equal the number of GPUs in train.num_gpus. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the training on. If > 1, multi-node is enabled. | 1 | 1 | | | |
| seed | int | Seed for the initializer in PyTorch. If < 0, the fixed seed is disabled. | 1234 | -1 | inf | | |
| cudnn | collection | cuDNN-related settings | | | | | FALSE |
| num_epochs | int | Number of epochs to run the training | 10 | 1 | inf | | |
| checkpoint_interval | int | Interval (in epochs) at which a checkpoint is saved; helps resume training | 1 | 1 | | | |
| checkpoint_interval_unit | categorical | Unit of the checkpoint interval | epoch | | | epoch,step | |
| validation_interval | int | Interval (in epochs) at which an evaluation is triggered on the validation dataset | 1 | 1 | | | |
| resume_training_checkpoint_path | string | Path to the checkpoint from which to resume training | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| checkpoint_interval_steps | int | Number of steps at which to save a checkpoint | | | | | |
| pretrained_model_path | string | Path to a pretrained DepthNet model from which to initialize the current training | | | | | |
| clip_grad_norm | float | Amount to clip the gradient by L2 norm. A value of 0.0 specifies no clipping. | 0.1 | | | | |
| dataloader_visualize | bool | Whether to visualize the dataloader | FALSE | | | | TRUE |
| vis_step_interval | int | Visualization interval in steps | 10 | | | | TRUE |
| is_dry_run | bool | Whether to run the trainer in dry-run mode. This is a good way to validate the specification file and sanity-check the trainer without actually initializing and running it. | FALSE | | | | |
| optim | collection | Hyperparameters to configure the optimizer | | | | | FALSE |
| precision | categorical | Precision at which to run the training | fp32 | | | bf16,fp32,fp16 | |
| distributed_strategy | categorical | Multi-GPU training strategy. DDP (Distributed Data Parallel) and Fully Sharded DDP are supported. | ddp | | | ddp,fsdp | |
| activation_checkpoint | bool | Whether training recomputes activations in the backward pass to save GPU memory (TRUE) or stores them (FALSE) | TRUE | | | | |
| verbose | bool | Whether to display verbose logs in the console | FALSE | | | | |
| inference_tile | bool | Whether to use tiled inference, particularly for transformers that expect fixed-size sequences | FALSE | | | | |
| tile_wtype | string | Weight type used for tiled inference | gaussian | | | | |
| tile_min_overlap | list | Minimum overlap for tiles | [16, 16] | | | | FALSE |
| log_every_n_steps | int | Interval, in steps, at which training results and running validation numbers are logged within one epoch | 500 | | | | |

Optimizer Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| optimizer | categorical | Type of optimizer used to train the network | AdamW | | | AdamW,SGD | |
| monitor_name | categorical | Metric to be monitored for the AutoReduce scheduler | val_loss | | | val_loss,train_loss | |
| lr | float | Initial learning rate for training the model, excluding the backbone | 0.0001 | | | | TRUE |
| momentum | float | Momentum for the AdamW optimizer | 0.9 | | | | TRUE |
| weight_decay | float | Weight decay coefficient | 0.0001 | | | | TRUE |
| lr_scheduler | categorical | Learning rate scheduler. MultiStepLR decreases the lr by lr_decay at the steps in lr_steps; StepLR decreases the lr by lr_decay every lr_step_size steps. | MultiStepLR | | | MultiStep,StepLR,CustomMultiStepLRScheduler,LambdaLR,PolynomialLR,OneCycleLR,CosineAnnealingLR | |
| lr_steps | list | Steps at which the learning rate must be decreased. Applicable only with MultiStepLR. | [1000] | | | | FALSE |
| lr_step_size | int | Number of steps after which the learning rate is decreased with StepLR | 1000 | | | | TRUE |
| lr_decay | float | Decay factor for the learning rate scheduler | 0.1 | | | | TRUE |
| min_lr | float | Minimum learning rate value for the learning rate scheduler | 1e-07 | | | | TRUE |
| warmup_steps | int | Number of steps of linear learning-rate warm-up before engaging the learning rate scheduler | 20 | 0 | inf | | |

Evaluation Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the evaluation job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the evaluation on. The length of this list must equal the number of GPUs in evaluate.num_gpus. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the evaluation on. If > 1, multi-node is enabled. | 1 | 1 | | | |
| checkpoint | string | Path to the checkpoint used for evaluation | ??? | | | | |
| trt_engine | string | Path to the TensorRT engine to be used for evaluation. This only works with tao-deploy. | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| batch_size | int | Batch size of the input tensor. Setting batch_size > 1 is important for large datasets. | -1 | -1 | | | |
| input_width | int | Width of the input image tensor | 736 | 1 | | | |
| input_height | int | Height of the input image tensor | 320 | 1 | | | |

Inference Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | Number of GPUs to run the inference job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the inference on. The length of this list must equal the number of GPUs in inference.num_gpus. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the inference on. If > 1, multi-node is enabled. | 1 | 1 | | | |
| checkpoint | string | Path to the checkpoint used for inference | ??? | | | | |
| trt_engine | string | Path to the TensorRT engine to be used for inference. This only works with tao-deploy. | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| batch_size | int | Batch size of the input tensor. Setting batch_size > 1 is important for large datasets. | -1 | -1 | | | |
| conf_threshold | float | Confidence threshold used when filtering the final predictions | 0.5 | | | | |
| input_width | int | Width of the input image tensor | | 1 | | | |
| input_height | int | Height of the input image tensor | | 1 | | | |
| save_raw_pfm | bool | Whether to save the raw PFM output during inference | FALSE | | | | |

Export Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| gpu_id | int | Index of the GPU to build the TensorRT engine | 0 | | | | |
| checkpoint | string | Path to the checkpoint file to run export | ??? | | | | |
| onnx_file | string | Path to the ONNX model file | ??? | | | | |
| on_cpu | bool | Whether to export a CPU-compatible model | FALSE | | | | |
| input_channel | ordered_int | Number of channels in the input tensor | 3 | 1 | | 1,3 | |
| input_width | int | Width of the input image tensor | 960 | 32 | | | |
| input_height | int | Height of the input image tensor | 544 | 32 | | | |
| opset_version | int | Operator set version of the ONNX model used to generate the TensorRT engine | 17 | 1 | | | |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies dynamic tensor shapes. | -1 | -1 | | | |
| verbose | bool | Whether to enable verbose TensorRT logging | FALSE | | | | |
| format | categorical | File format to export to | onnx | | | onnx,xdl | |
| valid_iters | int | Number of GRU iterations to export the model with | 22 | 1 | | | |

TensorRT Engine Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| gpu_id | int | Index of the GPU to build the TensorRT engine | 0 | 0 | | | |
| onnx_file | string | Path to the ONNX model file | ??? | | | | |
| trt_engine | string | Path where the generated TensorRT engine is stored. This only works with tao-deploy. | ??? | | | | |
| timing_cache | string | Path to a TensorRT timing cache that speeds up engine generation. The cache is created, read, and updated as needed. | | | | | |
| batch_size | int | Batch size of the input tensor for the engine. A value of -1 implies dynamic tensor shapes. | -1 | -1 | | | |
| verbose | bool | Whether to enable verbose TensorRT logging | FALSE | | | | |
| tensorrt | collection | Hyperparameters to configure the TensorRT engine builder | | | | | FALSE |

Augmentation Configuration#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| input_mean | list | Input mean for RGB frames | [0.485, 0.456, 0.406] | | | | FALSE |
| input_std | list | Input standard deviation per pixel for RGB frames | [0.229, 0.224, 0.225] | | | | FALSE |
| crop_size | list | Crop size for input RGB images [height, width] | [518, 518] | | | | FALSE |
| min_scale | float | Minimum scale in data augmentation | -0.2 | 0.2 | 1 | | |
| max_scale | float | Maximum scale in data augmentation | 0.4 | -0.2 | 1 | | |
| do_flip | bool | Whether to perform flips in data augmentation | FALSE | | | | |
| yjitter_prob | float | Probability for y jitter | 1.0 | 0.0 | 1.0 | | TRUE |
| gamma | list | Gamma range in data augmentation | [1, 1, 1, 1] | | | | FALSE |
| color_aug_prob | float | Probability for asymmetric color augmentation | 0.2 | 0.0 | 1.0 | | TRUE |
| color_aug_brightness | float | Color jitter brightness | 0.4 | 0.0 | 1.0 | | |
| color_aug_contrast | float | Color jitter contrast | 0.4 | 0.0 | 1.0 | | |
| color_aug_saturation | list | Color jitter saturation | [0.0, 1.4] | | | | FALSE |
| color_aug_hue_range | list | Hue range in data augmentation | [-0.027777777777777776, 0.027777777777777776] | | | | FALSE |
| eraser_aug_prob | float | Probability for eraser augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| spatial_aug_prob | float | Probability for spatial augmentation | 1.0 | 0.0 | 1.0 | | TRUE |
| stretch_prob | float | Probability for stretch augmentation | 0.8 | 0.0 | 1.0 | | TRUE |
| max_stretch | float | Maximum stretch augmentation | 0.2 | 0.0 | 1.0 | | |
| h_flip_prob | float | Probability for horizontal flip augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| v_flip_prob | float | Probability for vertical flip augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| hshift_prob | float | Probability for horizontal shift augmentation | 0.5 | 0.0 | 1.0 | | TRUE |
| crop_min_valid_disp_ratio | float | Minimum ratio of valid disparity pixels required in a crop | 0.0 | 0.0 | 1.0 | | TRUE |

Training the Model#

To train a stereo depth estimation model:

# Get the training spec
TRAIN_SPECS=$(tao-client depth_net_stereo get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

# Modify TRAIN_SPECS as needed, then run training
JOB_ID=$(tao-client depth_net_stereo experiment-run-action --action train --id $EXPERIMENT_ID --specs "$TRAIN_SPECS")

Required arguments:

  • --id: The experiment ID

  • --specs: The training specification retrieved with get-spec, with any overrides applied

Optional spec overrides:

  • results_dir: Overrides the results directory from the specification file

  • train.num_gpus: Overrides number of GPUs

  • train.num_epochs: Overrides number of training epochs

  • dataset.train_dataset.batch_size: Overrides batch size

  • model.train_iters: Overrides number of refinement iterations

Training Output#

The training process generates the following outputs in the results directory:

  • train/dn_model_latest.pth: Latest model checkpoint

  • train/dn_model_epoch_XXX_step_YYY.pth: Periodic checkpoints

  • train/events.out.tfevents.*: TensorBoard log files

  • train/status.json: Training status and metrics

  • train/visualizations/: Sample disparity predictions (if enabled)

You can monitor training progress using TensorBoard:

tensorboard --logdir=/path/to/results/train

Evaluating the Model#

To evaluate a trained stereo depth estimation model:

EVAL_SPECS=$(tao-client depth_net_stereo get-spec --action evaluate --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_stereo experiment-run-action --action evaluate --id $EXPERIMENT_ID --specs "$EVAL_SPECS")

Required arguments:

  • --id: The experiment ID

  • --specs: The evaluation specification retrieved with get-spec, with any overrides applied

Optional spec overrides:

  • evaluate.checkpoint: Path to model checkpoint to evaluate

  • evaluate.batch_size: Batch size for evaluation

  • evaluate.input_width: Input width for evaluation

  • evaluate.input_height: Input height for evaluation

  • dataset.test_dataset.data_sources: Override test dataset

Evaluation Metrics#

For stereo depth estimation, TAO computes the following metrics:

End-Point-Error (EPE)

Mean absolute difference between predicted and ground truth disparity. Lower is better.

D1-All Error

Percentage of pixels whose disparity error exceeds 3 pixels and 5% of the ground-truth disparity (the KITTI D1 criterion). Lower is better.

Bad Pixel Rates (BP1, BP2, BP3)

Percentage of pixels with errors exceeding 1, 2, and 3 pixels respectively. Lower is better.

Absolute Relative Error (abs_rel)

Mean of |predicted - ground_truth| / ground_truth. Lower is better.

Squared Relative Error (sq_rel)

Mean of (predicted - ground_truth)² / ground_truth. Lower is better.

RMSE

Root mean square error of disparity. Lower is better.

RMSE Log

RMSE in log space. Lower is better.

These metrics are saved to a JSON file in the results directory and displayed in the console output.
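
For reference, the per-pixel disparity metrics above can be reproduced with a few lines of NumPy. The sketch below follows the metric definitions given here rather than TAO's internal implementation, and it assumes dense disparity arrays in which ground-truth values greater than zero mark valid pixels.

# Sketch of the disparity metrics described above (not TAO's internal code).
import numpy as np

def disparity_metrics(pred, gt, valid=None):
    """pred, gt: arrays of disparities; valid: optional boolean mask."""
    if valid is None:
        valid = gt > 0                       # assume zero marks invalid ground truth
    pred, gt = pred[valid], gt[valid]
    pred = np.clip(pred, 1e-6, None)         # keep the log-based metric defined
    err = np.abs(pred - gt)

    return {
        "epe": err.mean(),                   # End-Point-Error
        "bp1": (err > 1.0).mean() * 100.0,   # bad-pixel rates, in percent
        "bp2": (err > 2.0).mean() * 100.0,
        "bp3": (err > 3.0).mean() * 100.0,
        "abs_rel": (err / gt).mean(),
        "sq_rel": ((pred - gt) ** 2 / gt).mean(),
        "rmse": np.sqrt(((pred - gt) ** 2).mean()),
        "rmse_log": np.sqrt(((np.log(pred) - np.log(gt)) ** 2).mean()),
    }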

Running Inference#

To run inference on stereo image pairs using a trained model:

INFER_SPECS=$(tao-client depth_net_stereo get-spec --action inference --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_stereo experiment-run-action --action inference --id $EXPERIMENT_ID --specs "$INFER_SPECS")

Required arguments:

  • --id: The experiment ID

  • --specs: The inference specification retrieved with get-spec, with any overrides applied

Optional spec overrides:

  • inference.checkpoint: Path to model checkpoint

  • inference.save_raw_pfm: Saves disparity maps in PFM format (default: False)

  • inference.batch_size: Batch size for inference

  • inference.input_width: Input width for inference

  • inference.input_height: Input height for inference

  • dataset.infer_dataset.data_sources: Overrides inference dataset

Inference Output#

The inference process generates:

  • Disparity map visualizations (colored disparity images) in PNG format

  • Raw disparity values in PFM format (if save_raw_pfm is True)

  • Depth maps (if baseline and focal length are provided)

  • Inference results, saved in results_dir/inference/

The disparity can be converted to depth (in meters) using:

depth = (baseline * focal_x) / disparity
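
As a concrete example, the sketch below loads a disparity map saved in PFM format and applies this conversion. It is illustrative only: the minimal reader assumes a standard single-channel PFM file, the file path is a placeholder, and the baseline and focal length shown are the dataset-configuration defaults, so substitute your own calibration.

# Sketch: convert a PFM disparity map to metric depth with depth = (baseline * focal_x) / disparity.
# Minimal PFM reader; not part of TAO.
import numpy as np

def read_pfm(path):
    with open(path, "rb") as f:
        header = f.readline().decode().strip()      # "Pf" = single channel, "PF" = 3 channels
        width, height = map(int, f.readline().split())
        scale = float(f.readline())                  # negative scale means little-endian data
        endian = "<" if scale < 0 else ">"
        channels = 3 if header == "PF" else 1
        data = np.fromfile(f, endian + "f").reshape(height, width, channels).squeeze()
        return np.flipud(data)                       # PFM stores rows bottom-to-top

baseline, focal_x = 0.193001, 1998.842               # defaults from the dataset config; use your calibration
disparity = read_pfm("results/inference/image_001.pfm")
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = (baseline * focal_x) / disparity[valid]
# Example: a disparity of 100 px gives 0.193001 * 1998.842 / 100 ≈ 3.86 m.
print("median depth (m):", float(np.median(depth[valid])))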

Exporting the Model#

To export a trained model to ONNX format:

EXPORT_SPECS=$(tao-client depth_net_stereo get-spec --action export --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_stereo experiment-run-action --action export --id $EXPERIMENT_ID --specs "$EXPORT_SPECS")

Required arguments:

  • --id: The experiment ID

  • --specs: The export specification retrieved with get-spec, with any overrides applied

Required spec parameters:

  • export.checkpoint: Path to the trained model checkpoint

  • export.onnx_file: Output path for the ONNX model

Optional spec overrides:

  • export.input_channel: Number of input channels (default: 3)

  • export.input_width: Input image width (default: 960)

  • export.input_height: Input image height (default: 544)

  • export.opset_version: ONNX opset version (default: 17)

  • export.batch_size: Batch size, -1 for dynamic (default: -1)

  • export.on_cpu: Export CPU-compatible model (default: False)

  • export.format: Export format - onnx or xdl (default: onnx)

  • export.valid_iters: Number of refinement iterations to export (default: 22)
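
Before building a TensorRT engine, the exported ONNX model can be sanity-checked with ONNX Runtime. The sketch below is illustrative and rests on assumptions not stated in this document: it discovers the input names from the graph instead of hard-coding them, assumes the model takes the left and right rectified images as two NCHW float inputs at the export resolution, and reuses the normalization constants from the augmentation configuration. File paths are placeholders.

# Sketch: run the exported ONNX model with ONNX Runtime (see assumptions above).
import numpy as np
import onnxruntime as ort
from PIL import Image

def preprocess(path, width=960, height=544):
    """Resize and normalize an image into a 1x3xHxW float32 tensor."""
    img = np.asarray(Image.open(path).convert("RGB").resize((width, height)), np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], np.float32)   # input_mean from the augmentation config
    std = np.array([0.229, 0.224, 0.225], np.float32)    # input_std from the augmentation config
    return np.ascontiguousarray(((img - mean) / std).transpose(2, 0, 1)[None])

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
input_names = [i.name for i in session.get_inputs()]      # assumed: one input per stereo view
feeds = {
    input_names[0]: preprocess("left/image_001.png"),
    input_names[1]: preprocess("right/image_001.png"),
}
disparity = session.run(None, feeds)[0]
print("output shape:", disparity.shape)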

Generating TensorRT Engine#

To generate a TensorRT engine from the exported ONNX model for optimized inference:

GEN_TRT_SPECS=$(tao-client depth_net_stereo get-spec --action gen_trt_engine --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client depth_net_stereo experiment-run-action --action gen_trt_engine --id $EXPERIMENT_ID --specs "$GEN_TRT_SPECS")

Required arguments:

  • --id: The experiment ID

  • --specs: The engine-generation specification retrieved with get-spec, with any overrides applied

Required spec parameters:

  • gen_trt_engine.onnx_file: Path to the ONNX model

  • gen_trt_engine.trt_engine: Output path for the TensorRT engine

Optional spec overrides:

  • gen_trt_engine.gpu_id: GPU index for engine generation (default: 0)

  • gen_trt_engine.batch_size: Batch size, -1 for dynamic (default: -1)

  • gen_trt_engine.verbose: Enables verbose logging (default: False)

  • gen_trt_engine.timing_cache: Path to timing cache file

  • gen_trt_engine.tensorrt.workspace_size: TensorRT workspace size in MB (default: 1024)

  • gen_trt_engine.tensorrt.data_type: Precision - FP32 or FP16 (default: FP16)

  • gen_trt_engine.tensorrt.min_batch_size: Minimum batch size (default: 1)

  • gen_trt_engine.tensorrt.opt_batch_size: Optimal batch size (default: 2)

  • gen_trt_engine.tensorrt.max_batch_size: Maximum batch size (default: 4)

TensorRT Engine Benefits#

  • Performance: 3-10x faster inference compared to PyTorch

  • Memory efficiency: Reduced memory footprint

  • Optimization: Layer fusion, kernel auto-tuning, and precision calibration

  • Deployment: Production-ready inference engine for real-time applications

For stereo depth estimation, TensorRT optimization is particularly beneficial for:

  • Real-time robotic vision (30+ FPS on modern GPUs)

  • Autonomous navigation systems

  • Industrial inspection and quality control

  • AR/VR applications requiring low latency

Model Configuration Reference#

For a complete reference to all configuration parameters, refer to the configuration tables in the TAO Toolkit documentation or the experiment specification files provided with the toolkit. Many parameters are shared with monocular depth estimation.

Best Practices#

Training Recommendations#

  1. Dataset diversity: Mix multiple datasets (FSD, Crestereo, Isaac) for better generalization

  2. Encoder selection:

    • Use vits for real-time applications (fastest, 22M parameters)

    • Use vitl for maximum accuracy (304M parameters)

  3. Batch size: Start with batch size 1-2 per GPU for FoundationStereo

  4. Learning rate: Use small learning rates (1e-5) with PolynomialLR scheduler

  5. Multi-GPU training: Use 2-8 GPUs with DDP strategy for faster training

  6. Activation checkpointing: Enable for larger encoders (vitl) to reduce memory

  7. Refinement iterations:

    • Use 22 iterations during training for best accuracy

    • You can reduce to 10-15 for faster inference with minimal accuracy loss

  8. Augmentation: Use strong augmentation for robustness across domains

Data Preparation#

  1. Stereo rectification: Ensure images are properly rectified before training

  2. Calibration accuracy: Accurate baseline and focal length are critical for metric depth

  3. Disparity range: Set max_disparity based on your camera setup and scene depth

  4. Image resolution: Higher resolution (e.g., 768x1280) improves accuracy but requires more memory

  5. Mixed datasets: Combine indoor and outdoor datasets for domain generalization

  6. Data quality: Filter out poorly calibrated or misaligned stereo pairs

Performance Optimization#

  1. TensorRT deployment: Always use TensorRT engines for production (3-10x speedup)

  2. FP16 precision: Use FP16 for TensorRT engines (2x faster with minimal accuracy loss)

  3. Dynamic batching: Use dynamic batch sizes for variable workloads

  4. Timing cache: Reuse timing cache to speed up subsequent engine builds

  5. Input resolution: Balance resolution and speed based on application needs

  6. Multi-stream inference: Use multiple CUDA streams for maximum throughput

Troubleshooting#

Common Issues#

Out of memory (OOM):

  • Reduce batch size to 1

  • Enable activation_checkpoint: True

  • Use a smaller encoder (vits instead of vitl)

  • Reduce crop_size or input resolution

  • Set low_memory: 1 or higher (0-4) in model config

  • Reduce train_iters to 10-15

Poor disparity quality:

  • Check stereo rectification - images must be properly rectified

  • Verify baseline and focal_x match your camera calibration

  • Ensure max_disparity is appropriate for your depth range

  • Increase training epochs (6-10 epochs recommended)

  • Use stronger augmentation

  • Mix multiple datasets for better generalization

  • Check for occluded regions and textureless areas in your data

Training instability:

  • Reduce learning rate (try 5e-6 to 1e-5)

  • Enable gradient clipping (clip_grad_norm: 0.1)

  • Use a PolynomialLR scheduler with lr_decay: 0.9

  • Check for NaN or inf values in disparity ground truth

  • Ensure disparity maps are in correct format (PFM or PNG)

  • Use cudnn.deterministic: True for reproducible training

Slow training:

  • Increase batch_size if memory allows

  • Use multiple GPUs (2-8) with DDP strategy

  • Reduce log_every_n_steps and vis_step_interval

  • Use fp16 precision (2x speedup)

  • Increase number of data loading workers (8-16)

  • Disable dataloader_visualize during long training runs

  • Use smaller train_iters (15 instead of 22)

Poor zero-shot performance:

  • Train on diverse datasets (mix FSD, Crestereo, Isaac)

  • Use strong augmentation (color, eraser, spatial)

  • Increase training epochs

  • Use larger encoder (vitl)

  • Ensure training data covers target domain characteristics

  • Fine-tune on a small sample of target domain data

Inference speed issues:

  • Use TensorRT engine instead of PyTorch model

  • Enable FP16 precision in TensorRT

  • Reduce input resolution if acceptable

  • Reduce valid_iters to 10-15 for faster inference

  • Use vits encoder for edge deployment

  • Optimize batch size for your GPU

Additional Resources#

For more information about monocular depth estimation, go to Monocular Depth Estimation.