Fast Foundation Stereo#
Fast Foundation Stereo (FFS) is a real-time stereo depth estimation model introduced in “Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching” (Wen et al., NVIDIA, CVPR 2026). FFS distills the full FoundationStereo architecture into a compact model that delivers over 10× faster inference while closely matching the zero-shot accuracy of FoundationStereo across diverse domains including robotics, autonomous vehicles, and industrial inspection.
TAO Toolkit integrates FFS into the depth_net module; select it by setting
model.model_type: FastFoundationStereo in your experiment specification file.
The model accepts rectified stereo image pairs and produces disparity maps.
The Fast Foundation Stereo model in TAO supports the following tasks:
- train
- evaluate
- inference
- export
- gen_trt_engine
Architecture#
Fast Foundation Stereo applies a divide-and-conquer strategy to accelerate FoundationStereo across its three stages:
Feature extraction: Hybrid monocular and stereo priors from FoundationStereo are distilled into a single efficient student backbone.
Cost filtering: Blockwise neural architecture search automatically discovers the optimal cost filtering design under a latency budget.
Disparity refinement: A dependency graph models the recurrent structure of the GRU module, enabling structured pruning to eliminate redundancy.
The bp2 commercial checkpoint has a significantly smaller model configuration than FoundationStereo — refer to the configuration note in Creating a Configuration File for the exact parameter values required.
Data Input for Fast Foundation Stereo#
Annotation File Format#
Fast Foundation Stereo reads stereo data from a plain text annotation file. Each line specifies one stereo sample with fields separated by spaces:
| Columns | Format | Use |
|---|---|---|
| 2 | | Inference without ground truth |
| 3 | | Training and evaluation |
| 4 | | Evaluation with occlusion mask |
Note
The 4-column format is only supported when dataset_name is Middlebury or
Eth3d. Other dataset types (including GenericDataset) will raise an error
if given a 4-column annotation file.
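As a rough illustration of the column rules above, the following sketch counts the space-separated fields on each line and applies the 4-column restriction. The function name `check_annotation_lines` is hypothetical and not part of TAO. It only counts fields, does not check that paths exist, and treats mixed column counts within one file as an error, which is an assumption of this sketch.

```python
# Hypothetical helper (not part of TAO): validates column counts only.
FOUR_COLUMN_DATASETS = {"Middlebury", "Eth3d"}


def check_annotation_lines(lines, dataset_name):
    """Return the column count shared by all non-empty lines.

    Raises ValueError if a line has an unsupported number of fields,
    if the lines mix different column counts (assumption of this sketch),
    or if a 4-column file is used with a dataset type other than
    Middlebury or Eth3d.
    """
    counts = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) not in (2, 3, 4):
            raise ValueError(f"line {number}: expected 2-4 fields, got {len(fields)}")
        counts.add(len(fields))
    if len(counts) != 1:
        raise ValueError(f"inconsistent or empty annotation file: {sorted(counts)}")
    columns = counts.pop()
    if columns == 4 and dataset_name not in FOUR_COLUMN_DATASETS:
        raise ValueError(f"4-column files are not supported for {dataset_name}")
    return columns
```

Running a check like this before a job starts can surface a malformed line early instead of during data loading.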
Supported Datasets#
Set dataset.dataset_name: StereoDataset at the top level of your specification file.
For each entry in a data_sources list, set dataset_name to one of the following values:
| dataset_name | Description |
|---|---|
| Middlebury | Middlebury stereo benchmark |
| | KITTI autonomous driving stereo dataset |
| Eth3d | ETH3D low-resolution outdoor stereo dataset |
| | NVIDIA Foundation Stereo Dataset (synthetic) |
| | NVIDIA Isaac real-world stereo data |
| | CREStereo large-scale synthetic dataset |
| GenericDataset | Custom stereo data; required for 2-column inference |
For details on stereo rectification requirements and data preparation, refer to Stereo Depth Estimation.
Creating a Configuration File#
The experiment specification file is a YAML configuration that defines all parameters for training, evaluation, and inference. The following example shows the configuration for the bp2 commercial checkpoint:
results_dir: /data/result
dataset:
  dataset_name: StereoDataset
  max_disparity: 192
  min_depth: 0.0
  train_dataset:
    data_sources:
      - dataset_name: Middlebury
        data_file: /data/datasets/stereo/train.txt
    batch_size: 1
    workers: 4
    augmentation:
      crop_size: [320, 736]
  val_dataset:
    data_sources:
      - dataset_name: Middlebury
        data_file: /data/datasets/stereo/val.txt
    batch_size: 1
    workers: 4
    augmentation:
      crop_size: [320, 736]
model:
  model_type: FastFoundationStereo
  encoder: vitl
  hidden_dims: [128]
  n_gru_layers: 1
  corr_radius: 4
  corr_levels: 2
  n_downsample: 2
  max_disparity: 192
  valid_iters: 8
  train_iters: 22
  volume_dim: 28
  mixed_precision: false
  gwc_feature_normalize: true
  motion_encoder_widths: [56, 96, 16, 12]
  motion_encoder_final: 48
  gru_hidden: 60
  gru_gating_conv_widths: [100, 168]
  disp_head_input_dim: 60
  disp_head_intermediate: 36
  disp_head_pwconv1_widths: [212, 244]
  mask_widths: [32, 16]
  stem_2_widths: [12, 16]
  spx_2_gru_widths: [16, 12, 16, 24]
  spx_gru_out: 9
  classifier_mid: 14
  cnet_conv04_widths: [60, 48]
  cam_mid_channels: 8
  cost_agg_conv_patch_padding: [0, 0, 0]
  stereo_backbone:
    edgenext_pretrained_path: ""
    depth_anything_v2_pretrained_path: ""
    use_bn: false
    use_clstoken: false
train:
  num_gpus: 1
  num_epochs: 1
  precision: fp32
  pretrained_model_path: /data/checkpoints/model.pth
  optim:
    optimizer: AdamW
    lr: 1.0e-5
evaluate:
  num_gpus: 1
  batch_size: 1
  checkpoint: /data/checkpoints/model.pth
inference:
  save_raw_pfm: true
  num_gpus: 1
  checkpoint: /data/checkpoints/model.pth
export:
  checkpoint: /data/checkpoints/model.pth
  onnx_file: /data/checkpoints/model.onnx
  input_height: 480
  input_width: 736
  opset_version: 17
  batch_size: 1
  valid_iters: 8
  format: onnx
gen_trt_engine:
  onnx_file: /data/checkpoints/model.onnx
  trt_engine: /data/checkpoints/model.engine
  batch_size: 1
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
Note
The following model parameters must match the bp2 commercial checkpoint exactly.
TAO applies schema defaults when a field is absent. The schema defaults differ from
the bp2 training values and produce incorrect output without raising an error.
- max_disparity: 192 (schema default: 416). An incorrect value builds an oversized cost volume and shifts predictions out of the trained disparity regime.
- gwc_feature_normalize: true. The bp2 checkpoint requires normalized group-wise correlation. Setting this to false produces negative disparity values in approximately 7–8% of pixels.
- volume_dim: 28 (schema default: 32).
- hidden_dims: [128] and n_gru_layers: 1 (FoundationStereo schema defaults: [128, 128, 128] and 3, respectively).
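Because a missing field silently falls back to a schema default, it can help to check the specification before launching a job. The sketch below compares the model block of an already-parsed specification (a plain Python dict) against the values listed in the note above. The helper name `find_bp2_mismatches` and the `BP2_REQUIRED` table are illustrative and not part of TAO.

```python
# Illustrative pre-flight check; the names here are not part of TAO.
BP2_REQUIRED = {
    "max_disparity": 192,
    "gwc_feature_normalize": True,
    "volume_dim": 28,
    "hidden_dims": [128],
    "n_gru_layers": 1,
}

_MISSING = object()  # sentinel that distinguishes "absent" from any real value


def find_bp2_mismatches(spec):
    """Return {field: (found, expected)} for bp2-critical model fields.

    An absent field is reported too, because TAO would then apply the
    schema default instead of the bp2 value.
    """
    model = spec.get("model", {})
    mismatches = {}
    for field, expected in BP2_REQUIRED.items():
        found = model.get(field, _MISSING)
        if found is _MISSING:
            mismatches[field] = ("<absent>", expected)
        elif found != expected:
            mismatches[field] = (found, expected)
    return mismatches
```

An empty result means the five fields agree with the note. Any entry in the result names a field to set explicitly in the specification file.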
Key Configuration Parameters#
The following tables describe all available configuration parameters. The top-level
results_dir field sets the output directory for all tasks. WandB logging is available
and configured via the top-level wandb block; see the ExperimentConfig and WandBConfig
fields in depthnet_tables.rst for details.
Dataset Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Dataset name |
StereoDataset |
MonoDataset,StereoDataset |
|||
|
bool |
Whether to normalize depth |
FALSE |
||||
|
float |
Maximum depth in meters in MetricDepthAnythingV2 |
1.0 |
inf |
|||
|
float |
Minimum depth in meters in MetricDepthAnythingV2 |
0.0 |
inf |
|||
|
int |
Maximum allowed disparity for which we compute losses during training |
416 |
1 |
416 |
||
|
float |
Baseline for stereo datasets |
0.193001 |
0.0 |
inf |
||
|
float |
Focal length along x-axis |
1998.842 |
0.0 |
inf |
||
|
collection |
Configurable parameters to construct the train dataset for a DepthNet experiment |
FALSE |
||||
|
collection |
Configurable parameters to construct the val dataset for a DepthNet experiment |
FALSE |
||||
|
collection |
Configurable parameters to construct the test dataset for a DepthNet experiment |
FALSE |
||||
|
collection |
Configurable parameters to construct the infer dataset for a DepthNet experiment |
FALSE |
Model Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Network name |
MetricDepthAnythingV2 |
FoundationStereo,MetricDepthAnything,RelativeDepthAnything |
|||
|
collection |
Network defined paths for Monocular DepthNet Backbone |
FALSE |
||||
|
collection |
Network defined paths for Edgenext and Depthanythingv2 |
FALSE |
||||
|
list |
Hidden dimensions |
[128, 128, 128] |
FALSE |
|||
|
int |
Width of the correlation pyramid |
4 |
1 |
TRUE |
||
|
int |
cv group |
8 |
1 |
TRUE |
||
|
int |
Train iteration |
22 |
1 |
TRUE |
||
|
int |
Validation iteration |
22 |
1 |
|||
|
int |
Volume dimension |
32 |
1 |
TRUE |
||
|
int |
reduce memory usage |
0 |
0 |
4 |
||
|
bool |
Whether to use mixed precision training |
FALSE |
||||
|
int |
Number of hidden GRU levels |
3 |
1 |
3 |
||
|
int |
Number of levels in the correlation pyramid |
2 |
1 |
2 |
||
|
int |
Resolution of the disparity field (1/2^K) |
2 |
1 |
2 |
||
|
categorical |
DepthAnythingV2 Encoder options |
vitl |
vits,vitl |
|||
|
int |
Maximum disparity of the model used in the training of a stereo model |
416 |
Note
Set model_type to FastFoundationStereo for this model. The shared table above
lists only the values applicable to other model types; FastFoundationStereo is valid
for this model but is not shown in the shared table.
Stereo Backbone Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| depth_anything_v2_pretrained_path | string | Path to load DepthAnythingv2 as an encoder for Stereo DepthNet (FoundationStereo) | | | | | |
| edgenext_pretrained_path | string | Path to load edgenext encoder for Stereo DepthNet (FoundationStereo) | | | | | |
| use_bn | bool | Whether to use batch normalization in DepthAnythingV2 | FALSE | | | | |
| use_clstoken | bool | Whether to use class token | FALSE | | | | |
Training Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of GPUs to run the train job. |
1 |
1 |
|||
|
list |
List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus. |
[0] |
FALSE |
|||
|
int |
Number of nodes to run the training on. If > 1, then multi-node is enabled. |
1 |
1 |
|||
|
int |
Seed for the initializer in PyTorch. If < 0, disable fixed seed. |
1234 |
-1 |
inf |
||
|
collection |
FALSE |
|||||
|
int |
Number of epochs to run the training. |
10 |
1 |
inf |
||
|
int |
Interval (in epochs) at which a checkpoint is to be saved; helps resume training. |
1 |
1 |
|||
|
categorical |
Unit of the checkpoint interval. |
epoch |
epoch,step |
|||
|
int |
Interval (in epochs) at which an evaluation will be triggered on the validation dataset. |
1 |
1 |
|||
|
string |
Path to the checkpoint from which to resume training. |
|||||
|
string |
Path to where all the assets generated from a task are stored. |
|||||
|
int |
Number of steps to save the checkpoint. |
|||||
|
string |
Path to a pretrained DepthNet model from which to initialize the current training. |
|||||
|
float |
Amount to clip the gradient by L2 Norm. A value of 0.0 specifies no clipping. |
0.1 |
||||
|
bool |
Whether to visualize the dataloader. |
FALSE |
TRUE |
|||
|
int |
Visualization interval in step. |
10 |
TRUE |
|||
|
bool |
Whether to run the trainer in Dry Run mode. This serves as a good means to validate the specification file and run a sanity check on the trainer without actually initializing and running the trainer. |
FALSE |
||||
|
collection |
Hyperparameters to configure the optimizer. |
FALSE |
||||
|
categorical |
Precision on which to run the training. |
fp32 |
bf16,fp32,fp16 |
|||
|
categorical |
Multi-GPU training strategy. DDP (Distributed Data Parallel) and Fully Sharded DDP are supported. |
ddp |
ddp,fsdp |
|||
|
bool |
Whether train is to recompute in backward pass to save GPU memory (TRUE) or store activations (FALSE). |
TRUE |
||||
|
bool |
Whether to display verbose logs to console. |
FALSE |
||||
|
bool |
Whether to use tiled inference, particularly for transformers which expect fixed size of sequences. |
FALSE |
||||
|
string |
Use tiled inference weight type. |
gaussian |
||||
|
list |
Minimum overlap for tile. |
[16, 16] |
FALSE |
|||
|
int |
Interval steps of logging training results and running validation numbers within one epoch. |
500 |
Optimizer Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Type of optimizer used to train the network |
AdamW |
AdamW,SGD |
|||
|
categorical |
Metric value to be monitored for the |
val_loss |
val_loss,train_loss |
|||
|
float |
Initial learning rate for training the model, excluding the backbone |
0.0001 |
TRUE |
|||
|
float |
Momentum for the AdamW optimizer |
0.9 |
TRUE |
|||
|
float |
Weight decay coefficient |
0.0001 |
TRUE |
|||
|
categorical |
Learning scheduler:
|
MultiStepLR |
MultiStep,StepLR,CustomMultiStepLRScheduler,LambdaLR,PolynomialLR,OneCycleLR,CosineAnnealingLR |
|||
|
list |
Steps at which the learning rate must be decreased. This is applicable only with the MultiStep LR |
[1000] |
FALSE |
|||
|
int |
Number of steps to decrease the learning rate in the StepLR |
1000 |
TRUE |
|||
|
float |
Decreasing factor for the learning rate scheduler |
0.1 |
TRUE |
|||
|
float |
Minimum learning rate value for the learning rate scheduler |
1e-07 |
TRUE |
|||
|
int |
Number of steps to perform linear learning rate warm-up before engaging a learning rate scheduler |
20 |
0 |
inf |
Evaluation Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of GPUs to run the evaluation job. |
1 |
1 |
|||
|
list |
List of GPU IDs to run the evaluation on. The length of this list
must be equal to the number of |
[0] |
FALSE |
|||
|
int |
Number of nodes to run the evaluation on. If > 1, then multi-node is enabled. |
1 |
1 |
|||
|
string |
Path to the checkpoint used for evaluation. |
??? |
||||
|
string |
Path to the TensorRT engine to be used for evaluation.
This only works with |
|||||
|
string |
Path to where all the assets generated from a task are stored. |
|||||
|
int |
Batch size of the input Tensor. This is important if |
-1 |
-1 |
|||
|
int |
Width of the input image tensor. |
736 |
1 |
|||
|
int |
Height of the input image tensor. |
320 |
1 |
Inference Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of GPUs to run the inference job. |
1 |
1 |
|||
|
list |
List of GPU IDs to run the inference on. The length of this list
must be equal to the number of gpus in |
[0] |
FALSE |
|||
|
int |
Number of nodes to run the inference on. If > 1, then multi-node is enabled. |
1 |
1 |
|||
|
string |
Path to the checkpoint used for inference. |
??? |
||||
|
string |
Path to the TensorRT engine to be used for inference.
This only works with |
|||||
|
string |
Path to where all the assets generated from a task are stored. |
|||||
|
int |
Batch size of the input Tensor. This is important if batch_size > 1 for a large dataset. |
-1 |
-1 |
|||
|
float |
Value of the confidence threshold to be used when filtering out the final list of boxes. |
0.5 |
||||
|
int |
Width of the input image tensor. |
1 |
||||
|
int |
Height of the input image tensor. |
1 |
||||
|
bool |
Whether to save the raw pfm output during inference. |
FALSE |
Export Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to where all the assets generated from a task are stored. |
|||||
|
int |
Index of the GPU to build the TensorRT engine. |
0 |
||||
|
string |
Path to the checkpoint file to run export. |
??? |
||||
|
string |
Path to the onnx model file. |
??? |
||||
|
bool |
Whether to export CPU compatible model. |
FALSE |
||||
|
ordered_int |
Number of channels in the input Tensor. |
3 |
1 |
1,3 |
||
|
int |
Width of the input image tensor. |
960 |
32 |
|||
|
int |
Height of the input image tensor. |
544 |
32 |
|||
|
int |
Operator set version of the ONNX model used to generate TensorRT engine. |
17 |
1 |
|||
|
int |
Batch size of the input Tensor for the engine.
A value of |
-1 |
-1 |
|||
|
bool |
Whether to enable verbose TensorRT logging. |
FALSE |
||||
|
categorical |
File format to export to. |
onnx |
onnx,xdl |
|||
|
int |
Number of GRU iterations to export the model. |
22 |
1 |
TensorRT Engine Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to where all the assets generated from a task are stored. |
|||||
|
int |
Index of the GPU to build the TensorRT engine. |
0 |
0 |
|||
|
string |
Path to the ONNX model file. |
??? |
||||
|
string |
Path where the generated TensorRT engine should be stored.
This only works with |
??? |
||||
|
string |
Path to a TensorRT timing cache that speeds up engine generation. This will be created/read/updated. |
|||||
|
int |
Batch size of the input tensor for the engine.
A value of |
-1 |
-1 |
|||
|
bool |
Whether to enable verbose TensorRT logging. |
FALSE |
||||
|
collection |
Hyperparameters to configure the TensorRT Engine builder. |
FALSE |
Augmentation Configuration#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
list |
Input mean for RGB frames |
[0.485, 0.456, 0.406] |
FALSE |
|||
|
list |
Input standard deviation per pixel for RGB frames |
[0.229, 0.224, 0.225] |
FALSE |
|||
|
list |
Crop size for input RGB images [height, width] |
[518, 518] |
FALSE |
|||
|
float |
Minimum scale in data augmentation |
-0.2 |
0.2 |
1 |
||
|
float |
Maximum scale in data augmentation |
0.4 |
-0.2 |
1 |
||
|
bool |
Whether to perform flip in data augmentation |
FALSE |
||||
|
float |
Probability for y jitter |
1.0 |
0.0 |
1.0 |
TRUE |
|
|
list |
Gamma range in data augmentation |
[1, 1, 1, 1] |
FALSE |
|||
|
float |
Probability for asymmetric color augmentation |
0.2 |
0.0 |
1.0 |
TRUE |
|
|
float |
Color jitter brightness |
0.4 |
0.0 |
1.0 |
||
|
float |
Color jitter contrast |
0.4 |
0.0 |
1.0 |
||
|
list |
Color jitter saturation |
[0.0, 1.4] |
FALSE |
|||
|
list |
Hue range in data augmentation |
[-0.027777777777777776, 0.027777777777777776] |
FALSE |
|||
|
float |
Probability for eraser augmentation |
0.5 |
0.0 |
1.0 |
TRUE |
|
|
float |
Probability for spatial augmentation |
1.0 |
0.0 |
1.0 |
TRUE |
|
|
float |
Probability for stretch augmentation |
0.8 |
0.0 |
1.0 |
TRUE |
|
|
float |
Maximum stretch augmentation |
0.2 |
0.0 |
1.0 |
||
|
float |
Probability for horizontal flip augmentation |
0.5 |
0.0 |
1.0 |
TRUE |
|
|
float |
Probability for vertical flip augmentation |
0.5 |
0.0 |
1.0 |
TRUE |
|
|
float |
Probability for horizontal shift augmentation |
0.5 |
0.0 |
1.0 |
TRUE |
|
|
float |
Probability for minimum crop valid disparity ratio |
0.0 |
0.0 |
1.0 |
TRUE |
Training the Model#
To start training, run the train task using your experiment specification file. TAO
initializes the model from train.pretrained_model_path if provided. To resume an
interrupted run, set train.resume_training_checkpoint_path to the checkpoint path;
if left empty, TAO automatically resumes from the latest checkpoint in results_dir.
tao depth_net train -e /path/to/spec.yaml
Training Output#
The training process generates the following outputs under the results_dir directory:
- train/dn_model_latest.pth: Latest model checkpoint
- train/model_XXX_YYYYY.pth: Periodic checkpoints (zero-padded epoch and step)
- train/events.out.tfevents.*: TensorBoard log files
- train/status.json: Training status and metrics
You can monitor training progress using TensorBoard:
tensorboard --logdir=/path/to/results/train
Note
The checkpoint path for subsequent actions follows the pattern
<results_dir>/train/dn_model_latest.pth.
Evaluating the Model#
To evaluate a PyTorch checkpoint, set evaluate.checkpoint in your specification file.
To evaluate a TensorRT engine, set evaluate.trt_engine instead (requires tao-deploy).
tao depth_net evaluate -e /path/to/spec.yaml
Evaluation Metrics#
For stereo depth estimation, TAO computes the following metrics. Lower is better for all metrics.
- End-point error (epe)
Mean absolute difference between predicted and ground-truth disparity in pixels.
- Bad pixel rates (bp1, bp2, bp3)
Percentage of pixels with disparity error exceeding 1, 2, and 3 pixels respectively.
- D1 outlier rate (d1)
Percentage of pixels where the disparity error exceeds both 3 pixels and 5% of the ground-truth disparity.
- Absolute relative error (abs_rel)
Mean of |predicted - ground_truth| / ground_truth.
- Squared relative error (sq_rel)
Sum of squared disparity errors divided by the sum of absolute deviations of ground-truth disparity from its mean.
- Root mean square error (rmse)
Root mean square error of the disparity values.
- RMSE log (rmse_log)
Root mean square error computed in log space.
TAO displays these metrics in the console output. For PyTorch evaluation, metrics are also
logged to results_dir/evaluate/status.json. For TRT evaluation (tao-deploy), metrics
are saved to results_dir/trt_evaluate/results.json.
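To make the definitions of epe, the bad pixel rates, and d1 concrete, here is a minimal sketch that follows the wording above. It is not TAO's implementation. It assumes the inputs are flat sequences of disparities that have already been restricted to pixels with valid ground truth, and the function name `stereo_metrics` is hypothetical.

```python
def stereo_metrics(pred, gt):
    """Compute epe, bp1/bp2/bp3 and d1 over paired disparity values.

    pred and gt are equal-length sequences of disparities in pixels,
    already restricted to valid ground-truth pixels (assumed gt > 0).
    Rates are returned as percentages.
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    errors = [abs(p - g) for p, g in zip(pred, gt)]
    n = len(errors)
    metrics = {"epe": sum(errors) / n}
    for threshold in (1, 2, 3):
        bad = sum(1 for e in errors if e > threshold)
        metrics[f"bp{threshold}"] = 100.0 * bad / n
    # d1: the error must exceed both 3 px and 5% of the ground-truth value.
    outliers = sum(1 for e, g in zip(errors, gt) if e > 3.0 and e > 0.05 * g)
    metrics["d1"] = 100.0 * outliers / n
    return metrics
```

Note how d1 is stricter than bp3 for large disparities: a 4-pixel error on a ground-truth disparity of 100 counts toward bp3 but not toward d1, because it is below 5% of the ground truth.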
Running Inference#
To run PyTorch inference, set inference.checkpoint in your specification file. To run
TensorRT inference, set inference.trt_engine instead (requires tao-deploy).
tao depth_net inference -e /path/to/spec.yaml
Inference Output#
The inference process generates:
- Colorized disparity visualizations in PNG format (PyTorch): results_dir/inference/ with dataset-relative paths
- Colorized disparity visualizations in PNG format (TRT, tao-deploy): results_dir/trt_inference/predicted_depth/
- Raw disparity values in PFM format (PyTorch path only; enable with inference.save_raw_pfm: true)
You can convert disparity to metric depth using the following formula:
depth = (baseline * focal_x) / disparity
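The formula above can be sketched in a few lines of Python. This is an illustrative helper, not a TAO API. It assumes focal_x is given in pixels, so depth comes out in the same unit as baseline, and it maps non-positive or non-finite disparities to infinity, which is a choice made for this sketch only.

```python
import math


def disparity_to_depth(disparity, baseline, focal_x):
    """Convert one disparity value (pixels) to depth.

    baseline is the stereo baseline and focal_x the focal length in
    pixels; depth is returned in the same unit as baseline. Non-positive
    or non-finite disparities have no valid depth, so this sketch maps
    them to infinity (an assumption; pick what suits your pipeline).
    """
    if not math.isfinite(disparity) or disparity <= 0.0:
        return math.inf
    return (baseline * focal_x) / disparity


def disparity_map_to_depth(rows, baseline, focal_x):
    """Apply disparity_to_depth to a 2-D list of disparity values."""
    return [[disparity_to_depth(d, baseline, focal_x) for d in row] for row in rows]
```

For example, with a baseline of 0.1 m and a focal length of 1000 pixels, a disparity of 50 pixels corresponds to a depth of 2 m. Guarding against non-positive values matters here because a misconfigured model can emit negative disparities.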
Exporting the Model#
The export task converts a trained PyTorch checkpoint to ONNX format. TAO always exports
an fp32 ONNX file regardless of the model.mixed_precision setting; precision selection
happens at the gen_trt_engine step.
Configure export.checkpoint, export.onnx_file, export.input_height,
export.input_width, export.opset_version, export.batch_size, and
export.valid_iters in your specification file.
To export a dynamic-shape ONNX that accepts variable input resolutions, set
export.dynamic_hw: true (FFS only; not supported by other depth models). Input height
and width must each be divisible by 32.
tao depth_net export -e /path/to/spec.yaml
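The divisibility rule can be checked before export. The sketch below is a hypothetical helper (not part of TAO) that rounds a dimension up to the next multiple of 32. Whether you then pad or resize images to that size is a choice this document does not prescribe.

```python
def round_up_to_multiple(value, multiple=32):
    """Return the smallest multiple of `multiple` that is >= value."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    return ((value + multiple - 1) // multiple) * multiple


def export_shape(height, width):
    """Return (height, width) adjusted so both are divisible by 32."""
    return round_up_to_multiple(height), round_up_to_multiple(width)
```

The sample configuration's 480 by 736 input already satisfies the rule and passes through unchanged, while a 375 by 1242 image would map to 384 by 1248.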
Generating a TensorRT Engine#
The gen_trt_engine task converts an ONNX model into an NVIDIA® TensorRT™ engine
for optimized inference. Configure gen_trt_engine.onnx_file, gen_trt_engine.trt_engine,
gen_trt_engine.gpu_id, gen_trt_engine.batch_size, and the gen_trt_engine.tensorrt
block (data_type, workspace_size, min_batch_size, opt_batch_size,
max_batch_size) in your specification file.
Note
Set gen_trt_engine.tensorrt.workspace_size to 4096 MB. Fast Foundation
Stereo requires more workspace memory than the default value of 1024 MB.
tao depth_net gen_trt_engine -e /path/to/spec.yaml
For production deployment, use a static-shape fp16 engine. For variable input resolutions,
build a dynamic-shape engine from an ONNX exported with dynamic_hw: true and supply an optimization profile
with min_height, opt_height, max_height, min_width, opt_width,
max_width (all divisible by 32). Dynamic-shape engines may show slightly larger
disparity drift than static-shape engines.
Best Practices#
Training Recommendations#
- Precision: Use train.precision: fp32 for fine-tuning. fp16 and bf16 are supported but may degrade accuracy.
- Pretrained checkpoint: Initialize from the bp2 checkpoint via train.pretrained_model_path. Use a learning rate of 1e-5 with an AdamW optimizer.
- Batch size: Use dataset.train_dataset.batch_size: 1 for variable-aspect datasets such as Middlebury, KITTI, and ETH3D.
- Crop size: Match dataset.train_dataset.augmentation.crop_size to dataset.val_dataset.augmentation.crop_size for consistent evaluation during training.
Performance Optimization#
- TensorRT: Use a static-shape fp16 TensorRT engine for production inference. It gives the lowest latency and the smallest disparity drift versus the PyTorch baseline.
- Inference iterations: The bp2 checkpoint was distilled for valid_iters: 8; increasing valid_iters beyond 8 does not improve quality. train_iters: 22 is separate and controls training-time supervision only.
- Memory: Set model.low_memory: 1 to reduce peak GPU memory at a small throughput cost. Values above 1 have no additional effect.
Troubleshooting#
Common Issues#
- Large disparity drift
Verify that model.max_disparity: 192 is set explicitly. Refer to the configuration note above for the full list of bp2-critical parameters.
- Negative disparity values
Verify that model.gwc_feature_normalize: true is set. Refer to the configuration note above for details.
- Model load error or shape mismatch
Verify that all model configuration values match the sample configuration in this document. Mismatched values cause the model to build with an architecture that differs from what the bp2 checkpoint expects.
Additional Resources#
For more information about stereo depth estimation with FoundationStereo, refer to Stereo Depth Estimation.