NvPanoptix3D#
NvPanoptix3D is a 3D panoptic scene reconstruction network that takes a single RGB image as input and produces a complete 3D reconstruction of the scene, including depth estimation, 2D panoptic segmentation, 3D geometry, and 3D panoptic segmentation. The network is built on a VGGT (Visual Geometry Grounded Transformer) backbone combined with a Mask2Former-style decoder for the 2D stage and a sparse 3D convolutional frustum decoder for the 3D stage. The total model size is approximately 1.4 billion parameters.
NvPanoptix3D supports the following tasks:
train
evaluate
inference
export
The tasks are explained in detail in the following sections.
Note
Throughout this documentation are references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.
For instructions on creating a dataset using the remote client, refer to the Creating a dataset section in the Remote Client documentation.
For instructions on creating an experiment using the remote client, refer to the Creating an experiment section in the Remote Client documentation.
The spec format is YAML for TAO Launcher, and JSON for FTMS Client.
File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.
Pipeline Overview#
NvPanoptix3D uses a two-stage training pipeline. You must train Stage 1 before Stage 2.
Stage 1 — 2D Stage#
This stage trains joint 2D panoptic segmentation and depth estimation. It takes a single RGB image as input and produces:
A depth map
2D panoptic segmentation masks
Object queries for the 3D stage
Camera intrinsic matrix
Stage 2 — 3D Stage#
This stage freezes the Stage 1 model weights and trains the 3D U-Net frustum completion module. It takes the Stage 1 outputs as input and produces:
3D scene geometry as a truncated signed distance field (TSDF) at 3 cm voxel resolution
3D panoptic segmentation on a 256 × 256 × 256 voxel grid
The dataset.enable_3d parameter in the configuration file controls which stage is active.
Set it to False for Stage 1 and True for Stage 2.
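As a quick sanity check on these numbers: at a 3 cm voxel size, a 256-voxel axis covers 256 × 0.03 = 7.68 m. The sketch below shows the voxel-to-metric arithmetic; the helper names are illustrative and not part of the NvPanoptix3D API.

```python
# Relationship between the 256^3 voxel grid and metric space, as described
# above. Helper names are illustrative, not part of the NvPanoptix3D API.

VOXEL_SIZE_M = 0.03   # model.projection.voxel_size in the sample spec
GRID_DIM = 256        # model.frustum3d.grid_dimensions in the sample spec

def grid_extent_m(grid_dim: int = GRID_DIM, voxel_size: float = VOXEL_SIZE_M) -> float:
    """Metric extent covered by one axis of the voxel grid (meters)."""
    return grid_dim * voxel_size

def voxel_to_metric(index: int, voxel_size: float = VOXEL_SIZE_M) -> float:
    """Center coordinate (in meters) of a voxel index along one axis."""
    return (index + 0.5) * voxel_size
```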
Dataset Format#
NvPanoptix3D supports two datasets: 3D-Front and Matterport3D.
3D-Front Dataset#
3D-Front is a synthetic indoor scene dataset. The annotation JSON file for each split specifies the scene and image IDs to use. Organize the data in the following directory structure:
<base_dir>/
data/
<scene_id>/
rgb_<img_id>.png
depth_<img_id>.exr
segmap_<img_id>.mapped.npz
geometry_<img_id>.npz
segmentation_<img_id>.mapped.npz
weighting_<img_id>.npz
The files in each scene directory contain the following:
| File | Description |
|---|---|
| `rgb_<img_id>.png` | RGB input image |
| `depth_<img_id>.exr` | Depth map in OpenEXR format |
| `segmap_<img_id>.mapped.npz` | 2D panoptic segmentation labels with mapped category IDs |
| `geometry_<img_id>.npz` | 3D geometry encoded as a truncated signed distance field |
| `segmentation_<img_id>.mapped.npz` | 3D panoptic segmentation volumes |
| `weighting_<img_id>.npz` | Spatial weighting volumes used during 3D training |
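A small helper like the following (illustrative only, not part of the TAO CLI or the NvPanoptix3D source) makes the expected layout concrete by assembling the file paths for one sample:

```python
from pathlib import Path

# Illustrative helper: assembles the expected file paths for one 3D-Front
# sample from the directory layout shown above.
def front3d_sample_paths(base_dir: str, scene_id: str, img_id: str) -> dict:
    scene = Path(base_dir) / "data" / scene_id
    return {
        "rgb": scene / f"rgb_{img_id}.png",                              # RGB input image
        "depth": scene / f"depth_{img_id}.exr",                          # OpenEXR depth map
        "segmap_2d": scene / f"segmap_{img_id}.mapped.npz",              # 2D panoptic labels
        "geometry_3d": scene / f"geometry_{img_id}.npz",                 # TSDF geometry
        "segmentation_3d": scene / f"segmentation_{img_id}.mapped.npz",  # 3D panoptic labels
        "weighting_3d": scene / f"weighting_{img_id}.npz",               # 3D spatial weights
    }

paths = front3d_sample_paths("/data/front3d", "scene_0001", "0042")
missing = [k for k, p in paths.items() if not p.exists()]  # quick completeness check
```

Running the completeness check over every scene and image ID listed in the split's annotation JSON is a cheap way to catch layout mistakes before launching training.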
Matterport3D Dataset#
Matterport3D is a real indoor scene dataset with per-image camera intrinsics. The image ID
format is <name>_<angle>_<rot>. Organize the data in the following directory structure:
<base_dir>/
data/
<scene_id>/
<name>_i<angle>_<rot>.jpg
<name>_segmap<angle>_<rot>.mapped.npz
<name>_intrinsics_<angle>.npy
depth_gen/
<scene_id>/
<name>_d<angle>_<rot>.png
room_mask/
<scene_id>/
<name>_rm<angle>_<rot>.png
The files in each scene directory contain the following:
| File | Description |
|---|---|
| `<name>_i<angle>_<rot>.jpg` | RGB input image |
| `<name>_segmap<angle>_<rot>.mapped.npz` | 2D panoptic segmentation labels with mapped category IDs |
| `<name>_intrinsics_<angle>.npy` | Per-image camera intrinsic matrix |
| `<name>_d<angle>_<rot>.png` | Depth map (under `depth_gen/`) |
| `<name>_rm<angle>_<rot>.png` | Room mask used for multiplane occupancy (under `room_mask/`) |
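As with 3D-Front, a path-assembly helper (illustrative only, not part of the TAO CLI) can make the layout concrete. Note that the intrinsics file is keyed by angle only, so it is shared across rotations, and that depth maps and room masks live in separate trees:

```python
from pathlib import Path

# Illustrative helper: assembles the expected file paths for one Matterport3D
# sample with image ID <name>_<angle>_<rot>, following the layout shown above.
def matterport_sample_paths(base_dir: str, scene_id: str,
                            name: str, angle: str, rot: str) -> dict:
    base = Path(base_dir)
    scene = base / "data" / scene_id
    return {
        "rgb": scene / f"{name}_i{angle}_{rot}.jpg",
        "segmap_2d": scene / f"{name}_segmap{angle}_{rot}.mapped.npz",
        # Intrinsics are per angle, shared across rotations of that angle.
        "intrinsics": scene / f"{name}_intrinsics_{angle}.npy",
        # Depth and room masks live in parallel trees, not under data/.
        "depth": base / "depth_gen" / scene_id / f"{name}_d{angle}_{rot}.png",
        "room_mask": base / "room_mask" / scene_id / f"{name}_rm{angle}_{rot}.png",
    }

p = matterport_sample_paths("/data/mp3d", "sceneA", "cam0", "0", "1")
```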
Note
Unlike 3D-Front, Matterport3D uses per-image intrinsic matrices. Set
dataset.downsample_factor to 2 and dataset.iso_value to 2.0 in
the configuration file when training on Matterport3D.
Creating a Configuration File#
NvPanoptix3D uses a YAML configuration file with the following top-level sections:
dataset, train, evaluate, inference, model, export, and wandb.
Because training is a two-stage process, prepare a separate configuration file for each
stage. Sample configuration files for both datasets and both stages are provided in the
experiment_specs directory of the NvPanoptix3D source:
spec_front3d_2d.yaml: Stage 1 (2D) training on 3D-Front
spec_front3d_3d.yaml: Stage 2 (3D) training on 3D-Front
spec_matterport_2d.yaml: Stage 1 (2D) training on Matterport3D
spec_matterport_3d.yaml: Stage 2 (3D) training on Matterport3D
The following example shows a Stage 2 (3D) configuration file for the 3D-Front dataset.
The Stage 1 configuration is identical except that you set dataset.enable_3d to
False, omit train.checkpoint_2d, and set the batch size for training to 16
instead of 1.
results_dir: /workspace/nvpanoptix3d/train3d_front3d
dataset:
name: front3d
contiguous_id: True
label_map: ""
downsample_factor: 1
frustum_mask_path: ""
iso_value: 1.0
ignore_label: 255
enable_3d: True # Set to False for Stage 1 (2D) training
enable_mp_occ: True
train:
json_path: /path/to/train.json
base_dir: /path/to/front3d/data
batch_size: 1
num_workers: 2
val:
json_path: /path/to/val.json
base_dir: /path/to/front3d/data
batch_size: 1
num_workers: 2
test:
json_path: /path/to/test.json
base_dir: /path/to/front3d/data
batch_size: 1
num_workers: 2
augmentation:
train_min_size: [240]
train_max_size: 960
test_min_size: 240
test_max_size: 960
size_divisibility: 32
train:
checkpoint_2d: /path/to/stage1/checkpoint.pth
checkpoint_3d: ""
freeze: []
precision: fp32
num_gpus: 1
num_nodes: 1
checkpoint_interval_unit: step
checkpoint_interval: 1000
num_epochs: 20
activation_checkpoint: False
optim:
type: AdamW
lr: 0.0001
weight_decay: 0.05
lr_scheduler: WarmupPoly
max_steps: 110000
evaluate:
checkpoint: ""
inference:
images_dir: ""
checkpoint: ""
model:
object_mask_threshold: 0.8
overlap_threshold: 0.8
test_topk_per_image: 100
mode: panoptic
backbone:
backbone_type: vggt
pretrained_model_path: /path/to/vggt_pretrained.pth
sem_seg_head:
num_classes: 13
mask_former:
dropout: 0.0
num_object_queries: 100
deep_supervision: True
no_object_weight: 0.1
class_weight: 2.0
mask_weight: 5.0
dice_weight: 5.0
depth_weight: 5.0
mp_occ_weight: 5.0
size_divisibility: 32
frustum3d:
truncation: 3.0
iso_recon_value: 1.0
panoptic_weight: 25.0
completion_weights: [50.0, 25.0, 10.0]
surface_weight: 5.0
unet_output_channels: 16
unet_features: 16
use_multi_scale: True
grid_dimensions: 256
signed_channel: 3
frustum_dims: 256
projection:
voxel_size: 0.03
sign_channel: True
export:
checkpoint: ""
onnx_file_2d: /workspace/nvpanoptix3d/model_2d.onnx
onnx_file_3d: ""
on_cpu: False
input_channel: 3
input_width: 320
input_height: 240
opset_version: 17
batch_size: 1
verbose: False
wandb:
enable: True
name: nvpanoptix3d_vggt_3d_front3d
tags: ["training", "nvpanoptix3d", "vggt", "3d_front3d"]
Configuration Parameters#
The following tables describe all available configuration parameters.
Experiment Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
|  | string | Name of the model when invoking a task |  |  |  |  |  |
| `encryption_key` | string | Key for encrypting model checkpoints |  |  |  |  |  |
| `results_dir` | string | Path to where all the assets generated from a task are stored |  |  |  |  |  |
| `wandb` | collection |  |  |  |  |  | False |
| `model` | collection | Configurable parameters to construct the model for the NVPanoptix3D experiment |  |  |  |  | False |
| `dataset` | collection | Configurable parameters to construct the dataset for the NVPanoptix3D experiment |  |  |  |  | False |
| `train` | collection | Configurable parameters to construct the trainer for the NVPanoptix3D experiment |  |  |  |  | False |
| `inference` | collection | Configurable parameters to construct the inferencer for the NVPanoptix3D experiment |  |  |  |  | False |
| `evaluate` | collection | Configurable parameters to construct the evaluator for the NVPanoptix3D experiment |  |  |  |  | False |
| `export` | collection | Configurable parameters to construct the exporter for the NVPanoptix3D experiment |  |  |  |  | False |
| `gen_trt_engine` | collection | Configurable parameters to construct the TensorRT engine builder for an NVPanoptix3D experiment |  |  |  |  | False |
WandB Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `enable` | bool |  | True |  |  |  |  |
| `project` | string |  | TAO Toolkit |  |  |  |  |
|  | string |  |  |  |  |  |  |
|  | string |  |  |  |  |  |  |
| `tags` | list |  | ['tao-toolkit'] |  |  |  | False |
|  | bool |  | False |  |  |  |  |
|  | bool |  | False |  |  |  |  |
|  | bool |  | False |  |  |  |  |
|  | string |  | TAO Toolkit Training |  |  |  |  |
|  | string |  |  |  |  |  |  |
Model Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `backbone` | collection | Configuration hyper parameters for the NVPanoptix3D Backbone |  |  |  |  | False |
| `sem_seg_head` | collection | Configuration hyper parameters for the Mask2Former Semantic Segmentation Head |  |  |  |  | False |
| `mask_former` | collection | Configuration hyper parameters for the Mask2Former model |  |  |  |  | False |
| `frustum3d` | collection | Configuration hyper parameters for the Frustum3D model |  |  |  |  | False |
| `projection` | collection | Configuration hyper parameters for the Projection model |  |  |  |  | False |
| `mode` | categorical | Segmentation mode | panoptic |  |  | panoptic,instance,semantic |  |
| `object_mask_threshold` | float | The value of the threshold to be used when filtering out the object mask | 0.4 |  |  |  |  |
| `overlap_threshold` | float | The value of the threshold to be used when evaluating overlap | 0.5 |  |  |  |  |
| `test_topk_per_image` | int | Keep topk instances per image for instance segmentation | 100 |  |  |  |  |
The following parameters configure the backbone (model.backbone):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `backbone_type` | categorical | Type of backbone to use. Available backbone: vggt | vggt |  |  | vggt |  |
| `pretrained_model_path` | string | Path to a pretrained backbone file |  |  |  |  |  |
The following parameters configure the semantic segmentation head (model.sem_seg_head):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `common_stride` | int | Common stride | 4 | 2 |  |  |  |
| `transformer_enc_layers` | int | Number of transformer encoder layers | 6 | 1 |  |  |  |
| `convs_dim` | int | Convolutional layer dimension | 256 | 1 |  |  |  |
| `mask_dim` | int | Mask head dimension | 256 | 1 |  |  |  |
|  | int | Depth head dimension | 256 | 1 |  |  |  |
| `ignore_value` | int | Ignore value | 255 | 0 | 255 |  |  |
|  | list | List of feature names for deformable transformer encoder input | ['res3', 'res4', 'res5'] |  |  |  | False |
| `num_classes` | int | Number of classes | 13 | 1 |  |  |  |
| `norm` | string | Norm layer type | GN |  |  |  |  |
| `in_features` | list | List of input feature names | ['res2', 'res3', 'res4', 'res5'] |  |  |  | False |
The following parameters configure the Mask2Former model (model.mask_former):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `dropout` | float | The probability to drop out | 0 | 0.0 | 1.0 |  |  |
| `nheads` | int | Number of heads | 8 |  |  |  |  |
| `num_object_queries` | int | The number of queries | 100 | 1 | inf |  |  |
| `hidden_dim` | int | Dimension of the hidden units | 256 |  |  |  |  |
|  | int | Dimension of the feedforward network in the transformer | 1024 | 1 |  |  |  |
|  | int | Dimension of the feedforward network | 2048 | 1 |  |  |  |
| `dec_layers` | int | Number of decoder layers in the transformer | 10 | 1 |  |  |  |
| `pre_norm` | bool | Whether to add layer norm in the encoder; 1=add layer norm, 0=do not add | 0 |  |  |  |  |
| `class_weight` | float | The relative weight of the classification error in the matching cost | 2 | 0.0 | inf |  |  |
| `mask_weight` | float | The relative weight of the focal loss of the binary mask in the matching cost | 5 | 0.0 | inf |  |  |
| `dice_weight` | float | The relative weight of the dice loss of the binary mask in the matching cost | 5 | 0.0 | inf |  |  |
| `depth_weight` | float | The relative weight of the depth loss in the matching cost | 5 | 0.0 | inf |  |  |
| `mp_occ_weight` | float | The relative weight of the mp occ loss in the matching cost | 5 | 0.0 | inf |  |  |
| `train_num_points` | int | The number of points to sample | 12544 |  |  |  |  |
| `oversample_ratio` | float | Oversampling parameter | 3 |  |  |  |  |
| `importance_sample_ratio` | float | Ratio of points that are sampled via importance sampling | 0.75 |  |  |  |  |
| `deep_supervision` | bool | Flag to enable deep supervision | 1 |  |  |  |  |
| `no_object_weight` | float | The relative classification weight applied to the no-object category | 0.1 |  |  |  |  |
| `size_divisibility` | int | Size divisibility | 32 |  |  |  |  |
The following parameters configure the Frustum3D model (model.frustum3d):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `truncation` | float | The truncation value | 3.0 |  |  |  |  |
| `iso_recon_value` | float | The iso recon value | 2.0 |  |  |  |  |
| `panoptic_weight` | float | The weight of the panoptic loss | 25.0 |  |  |  |  |
| `completion_weights` | list | The weights of the completion loss | [50.0, 25.0, 10.0] |  |  |  | False |
| `surface_weight` | float | The weight of the surface loss | 5.0 |  |  |  |  |
| `unet_output_channels` | int | The number of output channels of the UNet | 16 |  |  |  |  |
| `unet_features` | int | The number of features of the UNet | 16 |  |  |  |  |
| `use_multi_scale` | bool | Whether to use multi-scale | False |  |  |  |  |
| `grid_dimensions` | int | The number of grid dimensions | 256 |  |  |  |  |
| `frustum_dims` | int | The number of frustum dimensions | 256 |  |  |  |  |
| `signed_channel` | int | The number of signed channels | 3 |  |  |  |  |
The following parameters configure the Projection model (model.projection):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `voxel_size` | float | The size of the voxel | 0.03 |  |  |  |  |
| `sign_channel` | bool | Whether to use a signed channel | 1 |  |  |  |  |
|  | int | The dimension of the depth feature | 256 |  |  |  |  |
Dataset Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `train` | collection | Configurable parameters to construct the train dataset |  |  |  |  | False |
| `val` | collection | Configurable parameters to construct the validation dataset |  |  |  |  | False |
| `test` | collection | Configurable parameters to construct the test dataset |  |  |  |  | False |
|  | int | The number of parallel workers processing data | 8 | 1 |  |  |  |
| `pin_memory` | bool | Flag to allocate page-locked memory for faster transfer of data between the CPU and GPU | True |  |  |  |  |
| `augmentation` | collection | Configuration parameters for data augmentation |  |  |  |  | False |
| `contiguous_id` | bool | Flag to enable contiguous IDs for labels | False |  |  |  |  |
| `label_map` | string | A path to the label map file |  |  |  |  |  |
| `name` | categorical | Dataset name | front3d |  |  | front3d,matterport,synthetic_hospital,synthetic_warehouse |  |
| `downsample_factor` | int | Downsample factor (1: Synthetic & Front3D, 2: Matterport3D) | 1 |  |  |  |  |
| `iso_value` | float | ISO value to reconstruct mesh from TUDF volume | 1.0 |  |  |  |  |
| `ignore_label` | int | Ignore label value | 255 |  |  |  |  |
|  | int | Minimum number of pixels required for an instance to be considered valid | 200 |  |  |  |  |
|  | string | Image format | RGB |  |  |  |  |
|  | list | Input image size to resize | [320, 240] |  |  |  | False |
|  | list | Image size to process at 3D stage | [160, 120] |  |  |  | False |
|  | list | Input depth size to resize | [120, 160] |  |  |  | False |
|  | bool | Enable depth truncation in bounds | False |  |  |  |  |
|  | float | Min depth value | 0.4 |  |  |  |  |
|  | float | Max depth value | 6.0 |  |  |  |  |
| `frustum_mask_path` | string | Relative frustum mask path | meta/frustum_mask.npz |  |  |  |  |
|  | list | Value to create occupancy volume from TUDF volume | [8.0, 6.0] |  |  |  | False |
|  | list | Truncation range for TUDF volume | [0.0, 12.0] |  |  |  | False |
| `enable_3d` | bool | Enable 3D for training | False |  |  |  |  |
| `enable_mp_occ` | bool | Enable multi-plane occupancy | True |  |  |  |  |
|  | float | Depth scale | 25.0 |  |  |  |  |
|  | int | Number of thing classes | 9 |  |  |  |  |
The following parameters configure the train dataset source (dataset.train):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `base_dir` | string | Root directory of the dataset |  |  |  |  |  |
| `json_path` | string | JSON file for image/mask pair |  |  |  |  |  |
| `batch_size` | int | Batch size | 1 | 1 |  |  |  |
| `num_workers` | int | Number of workers in the dataloader | 1 | 0 |  |  |  |
The following parameters configure the validation dataset source (dataset.val):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `base_dir` | string | Root directory of the dataset |  |  |  |  |  |
| `json_path` | string | JSON file for image/mask pair |  |  |  |  |  |
| `batch_size` | int | Batch size | 1 | 1 |  |  |  |
| `num_workers` | int | Number of workers in the dataloader | 1 | 0 |  |  |  |
The following parameters configure the test dataset source (dataset.test):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `base_dir` | string | Root directory of the dataset |  |  |  |  |  |
| `json_path` | string | JSON file for image/mask pair |  |  |  |  |  |
| `batch_size` | int | Batch size | 1 | 1 |  |  |  |
| `num_workers` | int | Number of workers in the dataloader | 1 | 0 |  |  |  |
The following parameters configure data augmentation (dataset.augmentation):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `train_min_size` | list | A list of sizes to perform random resize | [448] |  |  |  | False |
| `train_max_size` | int | The maximum random crop size for training data | 768 | 32 | 960 |  |  |
|  | list | The random crop size for training data in [H, W] | [240, 240] |  |  |  | False |
| `test_min_size` | int | The minimum resize size for test data | 240 | 32 | 960 |  |  |
| `test_max_size` | int | The maximum resize size for test data | 960 | 32 | 960 |  |  |
|  | bool | Color augmentation | False |  |  |  |  |
|  | bool | Enable cropping for input image | False |  |  |  |  |
|  | list | Size to crop input image | [240, 240] |  |  |  | False |
|  | float | Maximum ratio of crop area that can be occupied by a single semantic category | 1.0 | 0.0 | 1.0 |  |  |
|  | string | Flip horizontal/vertical |  |  |  |  |  |
|  | float | Flip probability | 0.5 | 0.0 | 1.0 |  |  |
| `size_divisibility` | float | Size divisibility to pad | -1 |  |  |  |  |
|  | float | Weight for generated augmentation; 0.0 disables generated augmentation | 0.0 | 0.0 | 1.0 |  |  |
Training Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `num_gpus` | int | The number of GPUs to run the train job | 1 | 1 |  |  |  |
| `gpu_ids` | list | List of GPU IDs to run the training on; the length of the list must equal num_gpus | [0] |  |  |  | False |
| `num_nodes` | int | Number of nodes to run the training on; if > 1, multi-node is enabled | 1 | 1 |  |  |  |
| `seed` | int | Seed for the initializer in PyTorch; if < 0, the fixed seed is disabled | 1234 | -1 | inf |  |  |
| `cudnn` | collection |  |  |  |  |  | False |
| `num_epochs` | int | Number of epochs to run the training | 10 | 1 | inf |  |  |
| `checkpoint_interval` | int | The interval (in epochs) at which a checkpoint is saved | 1 | 1 |  |  |  |
| `checkpoint_interval_unit` | categorical | The unit of the checkpoint interval | epoch |  |  | epoch,step |  |
| `validation_interval` | int | The interval (in epochs) at which an evaluation is triggered on the validation set | 1 | 1 |  |  |  |
| `resume_training_checkpoint_path` | string | Path to the checkpoint to resume training from |  |  |  |  |  |
| `results_dir` | string | The folder in which to save the experiment |  |  |  |  |  |
| `checkpoint_2d` | string | Path to 2D stage checkpoint to initialize the 3D stage training |  |  |  |  |  |
| `checkpoint_3d` | string | Path to 3D stage checkpoint to initialize the 3D stage training |  |  |  |  |  |
|  | int | The number of iterations between validation checks | 5 |  |  |  |  |
| `freeze` | list |  | [] |  |  |  | False |
| `clip_grad_norm` | float | Amount to clip the gradient by L2 norm | 0.1 |  |  |  |  |
|  | string | Gradient clip type | full |  |  |  |  |
|  | bool | Whether to run the trainer in dry-run mode | False |  |  |  |  |
| `optim` | collection | Hyper parameters to configure the optimizer |  |  |  |  | False |
| `precision` | categorical | Precision to run the training on | fp32 |  |  | fp16,fp32 |  |
| `distributed_strategy` | categorical |  | ddp |  |  | ddp,fsdp |  |
|  | bool |  | True |  |  |  |  |
|  | bool | Flag to enable printing of detailed learning rate scaling from the optimizer | False |  |  |  |  |
|  | int | Number of iterations per epoch |  |  |  |  |  |
The following parameters configure the optimizer (train.optim):

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `type` | categorical | Type of optimizer used to train the network | AdamW |  |  | AdamW |  |
| `monitor_name` | categorical | The metric value to be monitored | val_loss |  |  | val_loss,train_loss |  |
| `lr` | float | The initial learning rate for training the model | 0.0002 | 0.0 | 1.0 |  | True |
|  | float | A multiplier for the backbone learning rate | 0.1 | 0.0 | 1.0 |  | True |
| `momentum` | float | The momentum for the AdamW optimizer | 0.9 | 0.0 | 1.0 |  | True |
| `weight_decay` | float | The weight decay coefficient | 0.05 | 0.0 | 1.0 |  | True |
| `lr_scheduler` | categorical |  | MultiStep |  |  | MultiStep,WarmupPoly |  |
|  | list | Learning rate decay epochs | [88, 96] |  |  |  | False |
|  | float | Multiplicative factor of learning rate decay | 0.1 |  |  |  |  |
| `max_steps` | int | The maximum number of steps to train the model | 160000 |  |  |  |  |
|  | float | The warmup factor for the learning rate scheduler | 1.0 |  |  |  |  |
|  | int | The number of warmup iterations | 0 |  |  |  |  |
The following parameters configure cuDNN behavior:

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `benchmark` | bool | Whether to enable cuDNN benchmark mode | False |  |  |  |  |
| `deterministic` | bool | Whether to enable cuDNN deterministic mode | True |  |  |  |  |
Inference Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `num_gpus` | int | The number of GPUs to run the inference job | 1 | 1 |  |  |  |
| `gpu_ids` | list | List of GPU IDs to run the inference on; the length of the list must equal num_gpus | [0] |  |  |  | False |
| `num_nodes` | int | Number of nodes to run the inference on; if > 1, multi-node is enabled | 1 | 1 |  |  |  |
| `checkpoint` | string | Path to the checkpoint file used for inference |  |  |  |  |  |
| `trt_engine` | string | Path to the TensorRT engine folder to be used for inference |  |  |  |  |  |
| `results_dir` | string | Path to where all the assets generated from a task are stored |  |  |  |  |  |
| `batch_size` | int | The batch size of the input tensor | -1 | -1 |  |  |  |
| `mode` | categorical | Mode to run inference | panoptic |  |  | semantic,instance,panoptic |  |
| `images_dir` | string | Path to the images directory |  |  |  |  |  |
Evaluation Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `num_gpus` | int | The number of GPUs to run the evaluation job | 1 | 1 |  |  |  |
| `gpu_ids` | list | List of GPU IDs to run the evaluation on; the length of the list must equal num_gpus | [0] |  |  |  | False |
| `num_nodes` | int | Number of nodes to run the evaluation on; if > 1, multi-node is enabled | 1 | 1 |  |  |  |
| `checkpoint` | string | Path to the checkpoint file used for evaluation |  |  |  |  |  |
| `trt_engine` | string | Path to the TensorRT engine to be used for evaluation |  |  |  |  |  |
| `results_dir` | string | Path to where all the assets generated from a task are stored |  |  |  |  |  |
| `batch_size` | int | The batch size of the input tensor | -1 | -1 |  |  |  |
Export Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `results_dir` | string | Path to where all the assets generated from a task are stored |  |  |  |  |  |
| `gpu_id` | int | The index of the GPU used to build the TensorRT engine | 0 |  |  |  |  |
| `checkpoint` | string | Path to the checkpoint file to run export | ??? |  |  |  |  |
| `onnx_file` | string | Path to the ONNX model file | ??? |  |  |  |  |
| `on_cpu` | bool | Flag to export a CPU-compatible model | False |  |  |  |  |
| `input_channel` | ordered_int | Number of channels in the input tensor | 3 | 1 |  | 1,3 |  |
| `input_width` | int | Width of the input image tensor | 960 | 32 |  |  |  |
| `input_height` | int | Height of the input image tensor | 544 | 32 |  |  |  |
| `opset_version` | int |  | 17 | 1 |  |  |  |
| `batch_size` | int |  | -1 | -1 |  |  |  |
| `verbose` | bool | Flag to enable verbose TensorRT logging | False |  |  |  |  |
|  | categorical | File format to export to | onnx |  |  | onnx,xdl |  |
| `onnx_file_2d` | string | Path to the ONNX model 2D file |  |  |  |  |  |
| `onnx_file_3d` | string | Path to the ONNX model 3D file |  |  |  |  |  |
|  | int | The maximum number of voxels in the input tensor for the engine | 700000 | 1 |  |  |  |
TensorRT Engine Configuration#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `results_dir` | string | Path to where all the assets generated from a task are stored |  |  |  |  |  |
| `gpu_id` | int | The index of the GPU used to build the TensorRT engine | 0 | 0 |  |  |  |
| `onnx_file` | string | Path to the ONNX model file | ??? |  |  |  |  |
| `trt_engine` | string | Path to the generated TensorRT engine | ??? |  |  |  |  |
|  | string |  |  |  |  |  |  |
|  | int |  | -1 | -1 |  |  |  |
| `verbose` | bool | Flag to enable verbose TensorRT logging | False |  |  |  |  |
| `tensorrt` | collection | Hyper parameters to configure the NVPanoptix3D TensorRT engine builder |  |  |  |  | False |
The following parameters configure the TensorRT engine builder settings:

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| `workspace_size` | int |  | 1024 | 0 |  |  |  |
| `min_batch_size` | int |  | 1 | 1 |  |  |  |
| `opt_batch_size` | int |  | 1 | 1 |  |  |  |
| `max_batch_size` | int |  | 1 | 1 |  |  |  |
| `layers_precision` | list | The list to specify layer precision | [] |  |  |  | False |
| `data_type` | categorical | The precision to be set for building the TensorRT engine | FP32 |  |  | FP32,FP16 |  |
Training#
NvPanoptix3D training requires two sequential stages. Complete Stage 1 before beginning Stage 2.
Stage 1: 2D Panoptic Segmentation#
Stage 1 trains the 2D panoptic segmentation and depth estimation head. Set
dataset.enable_3d to False in your configuration file.
To run Stage 1 training:
BASE_EXPERIMENT_ID=$(tao nvpanoptix3d list-base-experiments | jq -r '.[0].id')
STAGE1_SPECS=$(tao nvpanoptix3d get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
STAGE1_JOB_ID=$(tao nvpanoptix3d create-job \
--kind experiment \
--name "nvpanoptix3d_stage1_train" \
--action train \
--workspace-id $WORKSPACE_ID \
--specs @stage1_spec.yaml \
--train-dataset-uri "$DATASET_URI" \
--eval-dataset-uri "$DATASET_URI" \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
Multi-Node Training with FTMS
Distributed training is supported through FTMS. For large models, multi-node training can significantly reduce training time.
Verify that your cluster has multiple GPU-enabled nodes available for training by running this command:
kubectl get nodes -o wide
The command lists the nodes in your cluster. If it does not list multiple nodes, contact your cluster administrator to add more nodes to your cluster.
To run a multi-node training job through FTMS, modify these fields in the training job specification:
{
"train": {
"num_gpus": 8, // Number of GPUs per node
"num_nodes": 2 // Number of nodes to use for training
}
}
If these fields are not specified, FTMS uses the default values of one GPU per node and one node.
Note
The number of GPUs specified in the num_gpus field must not exceed the number of GPUs per node in the cluster.
The number of nodes specified in the num_nodes field must not exceed the number of nodes in the cluster.
tao nvpanoptix3d train \
-e /path/to/spec_2d.yaml \
dataset.train.json_path=/path/to/train.json \
dataset.train.base_dir=/path/to/data \
dataset.val.json_path=/path/to/val.json \
dataset.val.base_dir=/path/to/data \
model.backbone.pretrained_model_path=/path/to/vggt_pretrained.pth \
results_dir=/path/to/results/stage1
Required arguments:
-e: Path to the Stage 1 experiment specification file.
Optional arguments:
results_dir: Override the results directory.
train.num_gpus: Number of GPUs to use.
model.backbone.pretrained_model_path: Path to pretrained VGGT backbone weights.
Note
For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which
default to 1 and [0], respectively. If both are passed, but are inconsistent, for example num_gpus = 1,
gpu_ids = [0, 1], then they are modified to follow the setting that implies more GPUs; in the same example num_gpus is modified from 1 to 2.
In some cases, multi-GPU training may result in a segmentation fault. You can work around this by
setting the environment variable OMP_NUM_THREADS to 1. Depending on your mode of execution, you can use the following methods to set
this variable:
CLI Launcher:
You may set the environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file, as mentioned in the Running the launcher section:
{
    "Envs": [
        {
            "variable": "OMP_NUM_THREADS",
            "value": "1"
        }
    ]
}
Docker:
You may set environment variables in Docker by passing the -e flag on the Docker command line:
docker run -it --rm --gpus all \
    -e OMP_NUM_THREADS=1 \
    -v /path/to/local/mount:/path/to/docker/mount \
    nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt <model> train -e
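The num_gpus/gpu_ids reconciliation described in the note above can be sketched as follows. This is illustrative pseudologic, not the actual TAO source; in particular, how gpu_ids is widened when num_gpus is the larger setting is an assumption.

```python
# Sketch of the reconciliation rule from the note above: when num_gpus and
# gpu_ids disagree, both are adjusted toward whichever implies more GPUs.
# Illustrative only; not the actual TAO implementation.
def reconcile(num_gpus: int, gpu_ids: list) -> tuple:
    if not gpu_ids:
        gpu_ids = list(range(num_gpus))
    if num_gpus < len(gpu_ids):
        # e.g. num_gpus=1, gpu_ids=[0, 1] -> num_gpus becomes 2
        num_gpus = len(gpu_ids)
    elif num_gpus > len(gpu_ids):
        # Assumed behavior: widen gpu_ids to cover num_gpus devices
        gpu_ids = list(range(num_gpus))
    return num_gpus, gpu_ids
```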
Stage 2: 3D Volumetric Reconstruction#
Stage 2 freezes the Stage 1 model weights and trains the 3D U-Net frustum completion
module. Set dataset.enable_3d to True and provide the Stage 1 checkpoint via
train.checkpoint_2d.
To run Stage 2 training:
STAGE2_SPECS=$(tao nvpanoptix3d get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
STAGE2_JOB_ID=$(tao nvpanoptix3d create-job \
--kind experiment \
--name "nvpanoptix3d_stage2_train" \
--action train \
--workspace-id $WORKSPACE_ID \
--specs @stage2_spec.yaml \
--train-dataset-uri "$DATASET_URI" \
--eval-dataset-uri "$DATASET_URI" \
--parent-job-id $STAGE1_JOB_ID \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao nvpanoptix3d train \
-e /path/to/spec_3d.yaml \
dataset.train.json_path=/path/to/train.json \
dataset.train.base_dir=/path/to/data \
dataset.val.json_path=/path/to/val.json \
dataset.val.base_dir=/path/to/data \
train.checkpoint_2d=/path/to/results/stage1/checkpoint.pth \
results_dir=/path/to/results/stage2
To resume Stage 2 training from an existing Stage 2 checkpoint, also set
train.checkpoint_3d:
tao nvpanoptix3d train \
-e /path/to/spec_3d.yaml \
train.checkpoint_2d=/path/to/results/stage1/checkpoint.pth \
train.checkpoint_3d=/path/to/results/stage2/checkpoint.pth \
results_dir=/path/to/results/stage2_resumed
Required arguments:
-e: Path to the Stage 2 experiment specification file.
train.checkpoint_2d: Path to the Stage 1 checkpoint.
Optional arguments:
results_dir: Override the results directory.
train.checkpoint_3d: Resume Stage 2 from an existing checkpoint.
train.num_gpus: Number of GPUs. For 3D training, also set train.activation_checkpoint=True to reduce GPU memory usage.
Note
Only fp32 precision is supported. Mixed precision training is not available
for NvPanoptix3D.
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved in train.results_dir, for example:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
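Because the epoch number in the checkpoint name is zero-padded, a plain lexicographic sort recovers the most recent checkpoint. The helper below is illustrative, not part of the TAO tooling:

```python
from pathlib import Path
from typing import Optional

# Illustrative helper: pick the most recent checkpoint from a results
# directory that follows the model_epoch_<epoch_num>.pth naming shown above.
# Lexicographic sort works because the epoch number is zero-padded.
def latest_checkpoint(results_dir: str) -> Optional[Path]:
    ckpts = sorted(Path(results_dir).glob("model_epoch_*.pth"))
    return ckpts[-1] if ckpts else None
```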
Evaluation#
NvPanoptix3D reports the following metrics, which assess the quality of 3D panoptic reconstruction:
| Metric | Description |
|---|---|
| PRQ | Panoptic Reconstruction Quality. Overall 3D panoptic performance, combining geometry accuracy and semantic recognition. |
| RSQ | Reconstructed Segmentation Quality. Measures the quality of semantic segmentation in 3D. |
| RRQ | Reconstruction Recognition Quality. Measures instance recognition quality in 3D. |
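These metrics follow the panoptic quality (PQ) decomposition from 2D panoptic segmentation, lifted to 3D: PRQ factors into a segmentation term (RSQ, the mean IoU of matched segments) and a recognition term (RRQ, an F1-style counting term). The sketch below shows the decomposition for a single class; the actual NvPanoptix3D evaluator performs 3D segment matching and per-class averaging, so reported aggregates need not satisfy PRQ = RSQ × RRQ exactly.

```python
# Sketch of the PQ-style decomposition that PRQ/RSQ/RRQ follow for one class.
# Illustrative only; the real evaluator matches 3D segments per class and
# averages across classes.
def panoptic_quality(matched_ious, num_fp, num_fn):
    """matched_ious: IoUs of true-positive segment matches for one class."""
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0, 0.0, 0.0
    rsq = sum(matched_ious) / tp if tp else 0.0    # segmentation quality
    rrq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # recognition quality
    return rsq * rrq, rsq, rrq                     # prq = rsq * rrq

prq, rsq, rrq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```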
To evaluate a trained model, provide the path to the checkpoint and the test dataset:
EVALUATE_SPECS=$(tao nvpanoptix3d get-job-schema --action evaluate --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
EVAL_JOB_ID=$(tao nvpanoptix3d create-job \
--kind experiment \
--name "nvpanoptix3d_evaluate" \
--action evaluate \
--workspace-id $WORKSPACE_ID \
--parent-job-id $STAGE2_JOB_ID \
--eval-dataset-uri "$DATASET_URI" \
--specs @eval_spec.yaml \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao nvpanoptix3d evaluate \
-e /path/to/spec.yaml \
dataset.test.json_path=/path/to/test.json \
dataset.test.base_dir=/path/to/data \
evaluate.checkpoint=/path/to/checkpoint.pth
Required arguments:
-e: Path to the experiment specification file.
evaluate.checkpoint: Path to the trained checkpoint.
Optional arguments:
evaluate.num_gpus: Number of GPUs to use.
Set dataset.enable_3d to True in the specification file to evaluate the full 3D model,
or False to evaluate the Stage 1 (2D) model only. When evaluating the 2D model,
the reported metric is Panoptic Quality (PQ).
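Putting these options together, a minimal TAO Launcher evaluate spec might look like the following sketch. Only parameters mentioned in this section are shown; consult the schema returned by get-job-schema for the authoritative field list.

```yaml
dataset:
  enable_3d: True            # False to evaluate the Stage 1 (2D) model only
  test:
    json_path: /path/to/test.json
    base_dir: /path/to/data
evaluate:
  checkpoint: /path/to/checkpoint.pth
  num_gpus: 1
```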
Performance#
The following tables show NvPanoptix3D performance on the 3D-Front test set and Matterport3D validation set, compared against published baselines. Metrics are reported for all categories combined and broken down into Things (countable objects) and Stuff (background regions). Bold values indicate the best result in each column.
3D-Front test set:

| Model | PRQ (All / Things / Stuff) | RSQ (All / Things / Stuff) | RRQ (All / Things / Stuff) |
|---|---|---|---|
| BUOL | 54.01 / 49.73 / 73.30 | **63.81** / **60.57** / 78.37 | 82.99 / 80.67 / 93.42 |
| Uni3D | 52.76 / 47.29 / **77.41** | 60.98 / 56.56 / **80.87** | **84.26** / 81.81 / **95.31** |
| NvPanoptix3D | **54.32** / **49.74** / 74.90 | 62.95 / 58.98 / 80.80 | 83.94 / **82.15** / 92.00 |
Matterport3D validation set:

| Model | PRQ (All / Things / Stuff) | RSQ (All / Things / Stuff) | RRQ (All / Things / Stuff) |
|---|---|---|---|
| BUOL | 14.47 / 10.97 / 24.94 | **45.71** / 45.30 / **46.93** | 30.91 / 23.81 / 52.22 |
| Uni3D | 16.32 / 13.21 / **29.33** | 44.36 / 44.58 / 44.09 | 36.48 / 29.33 / **65.19** |
| NvPanoptix3D | **17.63** / **14.79** / 28.04 | 45.27 / **45.68** / 43.31 | **38.98** / **32.26** / 64.02 |
NvPanoptix3D achieves the highest PRQ on both datasets and the highest RRQ on Matterport3D, demonstrating strong 3D panoptic reconstruction quality across both synthetic and real indoor environments.
Inference#
NvPanoptix3D inference runs on a directory of RGB images and does not require
ground truth annotations. The network accepts .jpg and .png images as input.
To run inference:
INFERENCE_SPECS=$(tao nvpanoptix3d get-job-schema --action inference --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
INFER_JOB_ID=$(tao nvpanoptix3d create-job \
--kind experiment \
--name "nvpanoptix3d_inference" \
--action inference \
--workspace-id $WORKSPACE_ID \
--parent-job-id $STAGE2_JOB_ID \
--inference-dataset-uri "$DATASET_URI" \
--specs @inference_spec.yaml \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao nvpanoptix3d inference \
-e /path/to/spec.yaml \
inference.images_dir=/path/to/images \
inference.checkpoint=/path/to/checkpoint.pth \
results_dir=/path/to/inference_results
Required arguments:
-e: Path to the experiment specification file.
inference.checkpoint: Path to the trained checkpoint.
Optional arguments:
inference.images_dir: Override the images directory. Defaults to the value in the specification file.
inference.num_gpus: Number of GPUs to use. Defaults to 1.
Set dataset.enable_3d to True to produce 3D reconstruction outputs, or False
to produce 2D panoptic segmentation and depth outputs only.
The inference outputs saved to results_dir include the following:
| Output | Shape | Description |
|---|---|---|
| 2D panoptic segmentation | (120, 160) | Per-pixel panoptic label map combining semantic and instance information. |
| 2D depth map | (120, 160) | Per-pixel depth estimate in meters. |
| 3D geometry | (256, 256, 256) | Truncated signed distance field representing the 3D scene geometry. |
| 3D semantic segmentation | (256, 256, 256) | Per-voxel semantic class labels. |
| 3D panoptic segmentation | (256, 256, 256) | Per-voxel panoptic labels combining semantic and instance information. |
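The 3D outputs are 256 × 256 × 256 voxel grids, and the pipeline overview states a 3 cm voxel resolution for the TSDF, so the physical extent of the reconstructed volume follows directly (illustrative arithmetic only):

```python
voxel_size_m = 0.03           # 3 cm TSDF voxel resolution
grid_shape = (256, 256, 256)  # 3D output volume shape

# Each axis spans 256 * 0.03 m = 7.68 m
extent_m = tuple(round(n * voxel_size_m, 2) for n in grid_shape)
num_voxels = grid_shape[0] * grid_shape[1] * grid_shape[2]  # 16,777,216 voxels
```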
Export#
NvPanoptix3D exports the Stage 1 (2D) model to ONNX format for deployment with NVIDIA® TensorRT™.
Note
Only the 2D model supports ONNX export in this release. 3D model export is not yet available.
To export the 2D model:
EXPORT_SPECS=$(tao nvpanoptix3d get-job-schema --action export --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')
EXPORT_JOB_ID=$(tao nvpanoptix3d create-job \
--kind experiment \
--name "nvpanoptix3d_export" \
--action export \
--workspace-id $WORKSPACE_ID \
--parent-job-id $STAGE2_JOB_ID \
--specs @export_spec.yaml \
--base-experiment-id "$BASE_EXPERIMENT_ID" \
--encryption-key "nvidia_tlt" | jq -r '.id')
tao nvpanoptix3d export \
-e /path/to/spec.yaml \
export.checkpoint=/path/to/stage1_checkpoint.pth \
export.onnx_file_2d=/path/to/output/model_2d.onnx \
export.input_height=256 \
export.input_width=320 \
export.opset_version=17
Required arguments:
-e: Path to the experiment specification file.
export.checkpoint: Path to the Stage 1 checkpoint to export.
export.onnx_file_2d: Output path for the exported 2D ONNX file.
Optional arguments:
export.input_height: Input image height. Default: 256.
export.input_width: Input image width. Default: 320.
export.opset_version: ONNX opset version. Default: 17.
After export, generate a TensorRT engine from the ONNX file using the provided
gen_trt_engine.py script:
python3 gen_trt_engine.py \
--onnx_file_2d /path/to/model_2d.onnx \
--trt_engine_2d /path/to/trt_2d_engine.engine \
--batch_size 1 \
--input_height 256 \
--input_width 320 \
--workspace_gb 8
Inference with NVIDIA Triton Inference Server#
NvPanoptix3D supports deployment as a hybrid TensorRT and PyTorch ensemble model on
NVIDIA Triton Inference Server. The 2D stage runs as a TensorRT engine, and the 3D
stage runs as a PyTorch model. The Triton model repository and client scripts are
provided in the tlt-triton-apps repository.
The Triton model accepts the following inputs and produces the following outputs:
| Name | Direction | Data Type | Description |
|---|---|---|---|
|  | Input |  | RGB image. |
|  | Input |  | Frustum mask. |
|  | Input |  | Camera intrinsic matrix. |
|  | Output |  | 2D panoptic segmentation map. |
|  | Output |  | 2D depth map. |
|  | Output |  | 3D panoptic segmentation volume. |
|  | Output |  | 3D scene geometry (truncated signed distance field). |
|  | Output |  | 3D semantic segmentation volume. |
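As a sketch of client-side input preparation, the NumPy arrays below correspond to the three inputs in the table. The shapes are assumptions: the image resolution follows the 256 × 320 export defaults, the frustum mask is assumed to cover the 256³ voxel grid, and the intrinsics are a standard 3 × 3 matrix. Check the deployed model's Triton config for the actual tensor names, shapes, and data types.

```python
import numpy as np

# Assumed shapes only -- verify against the model's Triton config.
image = np.zeros((1, 3, 256, 320), dtype=np.float32)         # RGB image, NCHW
frustum_mask = np.zeros((1, 256, 256, 256), dtype=np.bool_)  # mask over the voxel grid
intrinsics = np.eye(3, dtype=np.float32)[None, ...]          # camera intrinsics (1, 3, 3)
```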
To start the Triton server:
bash scripts/nvpanoptix3d_e2e_inference/start_server.sh
Install Python client requirements:
pip install -r scripts/nvpanoptix3d_e2e_inference/client-requirements.txt
To run the Triton client against the server:
bash scripts/nvpanoptix3d_e2e_inference/start_client.sh
Refer to the tlt-triton-apps repository for complete setup instructions, including
NGC authentication, Docker configuration, and client usage.