NvPanoptix3D#

NvPanoptix3D is a 3D panoptic scene reconstruction network that takes a single RGB image as input and produces a complete 3D reconstruction of the scene, including depth estimation, 2D panoptic segmentation, 3D geometry, and 3D panoptic segmentation. The network is built on a VGGT (Visual Geometry Grounded Transformer) backbone combined with a Mask2Former-style decoder for the 2D stage and a sparse 3D convolutional frustum decoder for the 3D stage. The total model size is approximately 1.4 billion parameters.

NvPanoptix3D supports the following tasks:

train
evaluate
inference
export

The tasks are explained in detail in the following sections.

Pipeline Overview#

NvPanoptix3D uses a two-stage training pipeline. You must train Stage 1 before Stage 2.

Stage 1 — 2D Stage#

This stage trains joint 2D panoptic segmentation and depth estimation. It takes a single RGB image as input and produces:

A depth map
2D panoptic segmentation masks
Object queries for the 3D stage
Camera intrinsic matrix

Stage 2 — 3D Stage#

This stage freezes the Stage 1 model weights and trains the 3D U-Net frustum completion module. It takes the Stage 1 outputs as input and produces:

3D scene geometry as a truncated signed distance field (TSDF) at 3 cm voxel resolution
3D panoptic segmentation at a 256 × 256 × 256 voxel grid

The dataset.enable_3d parameter in the configuration file controls which stage is active. Set it to False for Stage 1 and True for Stage 2.

Dataset Format#

NvPanoptix3D supports two datasets: 3D-Front and Matterport3D.

3D-Front Dataset#

3D-Front is a synthetic indoor scene dataset. The annotation JSON file for each split specifies the scene and image IDs to use. Organize the data in the following directory structure:

<base_dir>/
    data/
        <scene_id>/
            rgb_<img_id>.png
            depth_<img_id>.exr
            segmap_<img_id>.mapped.npz
            geometry_<img_id>.npz
            segmentation_<img_id>.mapped.npz
            weighting_<img_id>.npz

The files in each scene directory contain the following:

File	Description
`rgb_<img_id>.png`	RGB input image
`depth_<img_id>.exr`	Depth map in OpenEXR format
`segmap_<img_id>.mapped.npz`	2D panoptic segmentation labels with mapped category IDs
`geometry_<img_id>.npz`	3D geometry encoded as a truncated signed distance field
`segmentation_<img_id>.mapped.npz`	3D panoptic segmentation volumes
`weighting_<img_id>.npz`	Spatial weighting volumes used during 3D training

Matterport3D Dataset#

Matterport3D is a real indoor scene dataset with per-image camera intrinsics. The image ID format is <name>_<angle>_<rot>. Organize the data in the following directory structure:

<base_dir>/
    data/
        <scene_id>/
            <name>_i<angle>_<rot>.jpg
            <name>_segmap<angle>_<rot>.mapped.npz
            <name>_intrinsics_<angle>.npy
    depth_gen/
        <scene_id>/
            <name>_d<angle>_<rot>.png
    room_mask/
        <scene_id>/
            <name>_rm<angle>_<rot>.png

The files in each scene directory contain the following:

File	Description
`<name>_i<angle>_<rot>.jpg`	RGB input image
`<name>_segmap<angle>_<rot>.mapped.npz`	2D panoptic segmentation labels with mapped category IDs
`<name>_intrinsics_<angle>.npy`	Per-image camera intrinsic matrix
`<name>_d<angle>_<rot>.png`	Depth map
`<name>_rm<angle>_<rot>.png`	Room mask used for multiplane occupancy

Note

Unlike 3D-Front, Matterport3D uses per-image intrinsic matrices. Set dataset.downsample_factor to 2 and dataset.iso_value to 2.0 in the configuration file when training on Matterport3D.

Creating a Configuration File#

NvPanoptix3D uses a YAML configuration file with the following top-level sections: dataset, train, evaluate, inference, model, export, and wandb.

Because training is a two-stage process, prepare a separate configuration file for each stage. Sample configuration files for both datasets and both stages are provided in the experiment_specs directory of the NvPanoptix3D source:

spec_front3d_2d.yaml: Stage 1 (2D) training on 3D-Front
spec_front3d_3d.yaml: Stage 2 (3D) training on 3D-Front
spec_matterport_2d.yaml: Stage 1 (2D) training on Matterport3D
spec_matterport_3d.yaml: Stage 2 (3D) training on Matterport3D

The following example shows a Stage 2 (3D) configuration file for the 3D-Front dataset. The Stage 1 configuration is identical except that you set dataset.enable_3d to False, omit train.checkpoint_2d, and set the batch size for training to 16 instead of 1.

results_dir: /workspace/nvpanoptix3d/train3d_front3d

dataset:
  name: front3d
  contiguous_id: True
  label_map: ""
  downsample_factor: 1
  frustum_mask_path: ""
  iso_value: 1.0
  ignore_label: 255
  enable_3d: True        # Set to False for Stage 1 (2D) training
  enable_mp_occ: True
  train:
    json_path: /path/to/train.json
    base_dir: /path/to/front3d/data
    batch_size: 1
    num_workers: 2
  val:
    json_path: /path/to/val.json
    base_dir: /path/to/front3d/data
    batch_size: 1
    num_workers: 2
  test:
    json_path: /path/to/test.json
    base_dir: /path/to/front3d/data
    batch_size: 1
    num_workers: 2
  augmentation:
    train_min_size: [240]
    train_max_size: 960
    test_min_size: 240
    test_max_size: 960
    size_divisibility: 32

train:
  checkpoint_2d: /path/to/stage1/checkpoint.pth
  checkpoint_3d: ""
  freeze: []
  precision: fp32
  num_gpus: 1
  num_nodes: 1
  checkpoint_interval_unit: step
  checkpoint_interval: 1000
  num_epochs: 20
  activation_checkpoint: False
  optim:
    type: AdamW
    lr: 0.0001
    weight_decay: 0.05
    lr_scheduler: WarmupPoly
    max_steps: 110000

evaluate:
  checkpoint: ""

inference:
  images_dir: ""
  checkpoint: ""

model:
  object_mask_threshold: 0.8
  overlap_threshold: 0.8
  test_topk_per_image: 100
  mode: panoptic
  backbone:
    backbone_type: vggt
    pretrained_model_path: /path/to/vggt_pretrained.pth
  sem_seg_head:
    num_classes: 13
  mask_former:
    dropout: 0.0
    num_object_queries: 100
    deep_supervision: True
    no_object_weight: 0.1
    class_weight: 2.0
    mask_weight: 5.0
    dice_weight: 5.0
    depth_weight: 5.0
    mp_occ_weight: 5.0
    size_divisibility: 32
  frustum3d:
    truncation: 3.0
    iso_recon_value: 1.0
    panoptic_weight: 25.0
    completion_weights: [50.0, 25.0, 10.0]
    surface_weight: 5.0
    unet_output_channels: 16
    unet_features: 16
    use_multi_scale: True
    grid_dimensions: 256
    signed_channel: 3
    frustum_dims: 256
  projection:
    voxel_size: 0.03
    sign_channel: True

export:
  checkpoint: ""
  onnx_file_2d: /workspace/nvpanoptix3d/model_2d.onnx
  onnx_file_3d: ""
  on_cpu: False
  input_channel: 3
  input_width: 320
  input_height: 240
  opset_version: 17
  batch_size: 1
  verbose: False

wandb:
  enable: True
  name: nvpanoptix3d_vggt_3d_front3d
  tags: ["training", "nvpanoptix3d", "vggt", "3d_front3d"]

Configuration Parameters#

The following tables describe all available configuration parameters.

Experiment Configuration#

Field	value_type	description	automl_enabled
`model_name`	string	Name of model if invoking task via `model_agnostic`
`encryption_key`	string	Key for encrypting model checkpoints
`results_dir`	string	Path to where all the assets generated from a task are stored
`wandb`	collection		False
`model`	collection	Configurable parameters to construct the model for the NVPanoptix3D experiment	False
`dataset`	collection	Configurable parameters to construct the dataset for the NVPanoptix3D experiment	False
`train`	collection	Configurable parameters to construct the trainer for the NVPanoptix3D experiment	False
`inference`	collection	Configurable parameters to construct the inferencer for the NVPanoptix3D experiment	False
`evaluate`	collection	Configurable parameters to construct the evaluator for the NVPanoptix3D experiment	False
`export`	collection	Configurable parameters to construct the exporter for the NVPanoptix3D experiment	False
`gen_trt_engine`	collection	Configurable parameters to construct the TensorRT engine builder for a NVPanoptix3D experiment	False

WandB Configuration#

Field	value_type	default_value	automl_enabled
`enable`	bool	True
`project`	string	TAO Toolkit
`entity`	string
`group`	string
`tags`	list	[‘tao-toolkit’]	False
`reinit`	bool	False
`sync_tensorboard`	bool	False
`save_code`	bool	False
`name`	string	TAO Toolkit Training
`run_id`	string

Model Configuration#

Field	value_type	description	default_value	valid_options	automl_enabled
`backbone`	collection	Configuration hyper parameters for the NVPanoptix3D Backbone			False
`sem_seg_head`	collection	Configuration hyper parameters for the Mask2Former Semantic Segmentation Head			False
`mask_former`	collection	Configuration hyper parameters for the Mask2Former model			False
`frustum3d`	collection	Configuration hyper parameters for the Frustum3D model			False
`projection`	collection	Configuration hyper parameters for the Projection model			False
`mode`	categorical	Segmentation mode	panoptic	panoptic,instance,semantic
`object_mask_threshold`	float	The value of the threshold to be used when filtering out the object mask	0.4
`overlap_threshold`	float	The value of the threshold to be used when evaluating overlap	0.5
`test_topk_per_image`	int	Keep topk instances per image for instance segmentation	100

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`backbone_type`	categorical	Type of backbone to use. Available backbone: vggt	vggt			vggt
`pretrained_model_path`	string	Path to a pretrained backbone file

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`common_stride`	int	Common stride	4	2
`transformer_enc_layers`	int	Number of transformer encoder layers	6	1
`convs_dim`	int	Convolutional layer dimension	256	1
`mask_dim`	int	Mask head dimension	256	1
`depth_dim`	int	Depth head dimension	256	1
`ignore_value`	int	Ignore value	255	0	255
`deformable_transformer_encoder_in_features`	list	List of feature names for deformable transformer encoder input	[‘res3’, ‘res4’, ‘res5’]			False
`num_classes`	int	Number of classes	13	1
`norm`	string	Norm layer type	GN
`in_features`	list	List of input feature names	[‘res2’, ‘res3’, ‘res4’, ‘res5’]			False

Field	value_type	description	default_value	valid_min	valid_max
`dropout`	float	The probability to drop out	0	0.0	1.0
`nheads`	int	Number of heads	8
`num_object_queries`	int	The number of queries	100	1	inf
`hidden_dim`	int	Dimension of the hidden units	256
`transformer_dim_feedforward`	int	Dimension of the feedforward network in the transformer	1024	1
`dim_feedforward`	int	Dimension of the feedforward network	2048	1
`dec_layers`	int	Number of decoder layers in the transformer	10	1
`pre_norm`	bool	Whether to add layer norm in the encoder; 1=add layer norm, 0=do not add	0
`class_weight`	float	The relative weight of the classification error in the matching cost	2	0.0	inf
`dice_weight`	float	The relative weight of the focal loss of the binary mask in the matching cost	5	0.0	inf
`mask_weight`	float	The relative weight of the dice loss of the binary mask in the matching cost	5	0.0	inf
`depth_weight`	float	The relative weight of the depth loss in the matching cost	5	0.0	inf
`mp_occ_weight`	float	The relative weight of the mp occ loss in the matching cost	5	0.0	inf
`train_num_points`	int	The number of points to sample	12544
`oversample_ratio`	float	Oversampling parameter	3
`importance_sample_ratio`	float	Ratio of points that are sampled via important sampling	0.75
`deep_supervision`	bool	Flag to enable deep supervision	1
`no_object_weight`	float	The relative classification weight applied to the no-object category	0.1
`size_divisibility`	int	Size divisibility	32

Field	value_type	description	default_value	automl_enabled
`truncation`	float	The truncation value	3.0
`iso_recon_value`	float	The iso recon value	2.0
`panoptic_weight`	float	The weight of the panoptic loss	25.0
`completion_weights`	list	The weights of the completion loss	[50.0, 25.0, 10.0]	False
`surface_weight`	float	The weight of the surface loss	5.0
`unet_output_channels`	int	The number of output channels of the UNet	16
`unet_features`	int	The number of features of the UNet	16
`use_multi_scale`	bool	Whether to use multi-scale	False
`grid_dimensions`	int	The number of grid dimensions	256
`frustum_dims`	int	The number of frustum dimensions	256
`signed_channel`	int	The number of signed channel	3

Field	value_type	description	default_value
`voxel_size`	float	The size of the voxel	0.03
`sign_channel`	bool	Whether to use signed channel	1
`depth_feature_dim`	int	The dimension of the depth feature	256

Dataset Configuration#

Field	value_type	description	default_value	valid_min	valid_options	automl_enabled
`train`	collection	Configurable parameters to construct the train dataset				False
`val`	collection	Configurable parameters to construct the validation dataset				False
`test`	collection	Configurable parameters to construct the test dataset				False
`workers`	int	The number of parallel workers processing data	8	1
`pin_memory`	bool	Flag to allocate pagelocked memory for faster of data between the CPU and GPU	True
`augmentation`	collection	Configuration parameters for data augmentation				False
`contiguous_id`	bool	Flag to enable contiguous IDs for labels	False
`label_map`	string	A path to label map file
`name`	categorical	Dataset name	front3d		front3d,matterport,synthetic_hospital,synthetic_warehouse
`downsample_factor`	int	Downsample factor (1: Synthetic & Front3D, 2: Matterport3D)	1
`iso_value`	float	ISO value to reconstruct mesh from TUDF volume	1.0
`ignore_label`	int	Ignore label value	255
`min_instance_pixels`	int	Minimum number of pixels required for an instance to be considered valid	200
`img_format`	string	Image format	RGB
`target_size`	list	Input image size to resize	[320, 240]			False
`reduced_target_size`	list	Image size to process at 3D stage	[160, 120]			False
`depth_size`	list	Input depth size to resize	[120, 160]			False
`depth_bound`	bool	Enable depth truncation in bounds	False
`depth_min`	float	Min depth value	0.4
`depth_max`	float	Max depth value	6.0
`frustum_mask_path`	string	Relative frustum mask path	meta/frustum_mask.npz
`occ_truncation_lvl`	list	Value to create occuppancy volume from TUDF volume	[8.0, 6.0]			False
`truncation_range`	list	truncation range for TUDF volume	[0.0, 12.0]			False
`enable_3d`	bool	Enable 3d for training	False
`enable_mp_occ`	bool	Enable multi-plane occupancy	True
`depth_scale`	float	Depth scale	25.0
`num_thing_classes`	int	Number of thing classes	9

Field	value_type	description	default_value	valid_min
`base_dir`	string	Root directory of the dataset
`json_path`	string	JSON file for image/mask pair
`batch_size`	int	Batch size	1	1
`num_workers`	int	Number of workers in the dataloader	1	0

Field	value_type	description	default_value	valid_min
`base_dir`	string	Root directory of the dataset
`json_path`	string	JSON file for image/mask pair
`batch_size`	int	Batch size	1	1
`num_workers`	int	Number of workers in the dataloader	1	0

Field	value_type	description	default_value	valid_min
`base_dir`	string	Root directory of the dataset
`json_path`	string	JSON file for image/mask pair
`batch_size`	int	Batch size	1	1
`num_workers`	int	Number of workers in the dataloader	1	0

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`train_min_size`	list	A list of sizes to perform random resize	[448]			False
`train_max_size`	int	The maximum random crop size for training data	768	32	960
`train_crop_size`	list	The random crop size for training data in [H, W]	[240, 240]			False
`test_min_size`	int	The minimum resize size for test data	240	32	960
`test_max_size`	int	The maximum resize size for test	960	32	960
`color_aug_ssd`	bool	Color augmentation	False
`enable_crop`	bool	Enable cropping for input image	False
`crop_size`	list	Size to crop input image	[240, 240]			False
`single_category_max_area`	float	Maximum ratio of crop area that can be occupied by a single semantic category	1.0	0.0	1.0
`random_flip`	string	Flip horizontal/vertical
`random_flip_prob`	float	Flip probability	0.5	0.0	1.0
`size_divisibility`	float	Size divisibility to pad	-1
`gen_aug_weight`	float	Weight for generated augmentation, 0.0 will disable generated augmentation	0.0	0.0	1.0

Training Configuration#

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`num_gpus`	int	The number of GPUs to run the train job	1	1
`gpu_ids`	list	List of GPU IDs to run the training on; length of list must equal `train.num_gpus`	[0]				False
`num_nodes`	int	Number of nodes to run the training on; if > 1, multi-node is enabled	1	1
`seed`	int	Seed for the initializer in PyTorch; if < 0, fixed seed is disabled	1234	-1	inf
`cudnn`	collection						False
`num_epochs`	int	Number of epochs to run the training	10	1	inf
`checkpoint_interval`	int	The interval (in epochs) at which a checkpoint is saved	1	1
`checkpoint_interval_unit`	categorical	The unit of the checkpoint interval	epoch			epoch,step
`validation_interval`	int	The interval (in epochs) at which an evaluation is triggered on the validation set	1	1
`resume_training_checkpoint_path`	string	Path to the checkpoint to resume training from
`results_dir`	string	The folder in which to save the experiment
`checkpoint_2d`	string	Path to 2D stage checkpoint to initialize the 3D stage training
`checkpoint_3d`	string	Path to 3D stage checkpoint to initialize the 3D stage training
`val_check_interval`	int	The number of iterations between validation checks	5
`freeze`	list	List of layer names to freeze Example: [“backbone”, “transformer.encoder”, “input_proj”]	[]				False
`clip_grad_norm`	float	Amount to clip the gradient by L2 Norm	0.1
`clip_grad_norm_type`
`clip_grad_type`	string	Gradient clip type	full
`is_dry_run`	bool	Whether to run the trainer in Dry Run mode	False
`optim`	collection	Hyper parameters to configure the optimizer					False
`precision`	categorical	Precision to run the training on	fp32			fp16,fp32
`distributed_strategy`	categorical	The multi-GPU training strategy DDP (Distributed Data Parallel) and Fully Sharded DDP are supported	ddp			ddp,fsdp
`activation_checkpoint`	bool	A True value instructs train to recompute in backward pass to save GPU memory, rather than storing activations	True
`verbose`	bool	Flag to enable printing of detailed learning rate scaling from the optimizer	False
`iters_per_epoch`	int	Number of iterations per epoch

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	categorical	Type of optimizer used to train the network	AdamW			AdamW
`monitor_name`	categorical	The metric value to be monitored for the `AutoReduce` Scheduler	val_loss			val_loss,train_loss
`lr`	float	The initial learning rate for training the model	0.0002	0.0	1.0		True
`backbone_multiplier`	float	A multiplier for backbone learning rate	0.1	0.0	1.0		True
`momentum`	float	The momentum for the AdamW optimizer	0.9	0.0	1.0		True
`weight_decay`	float	The weight decay coefficient	0.05	0.0	1.0		True
`lr_scheduler`	categorical	The learning scheduler: MultiStep: Decrease the lr by lr_decay from lr_steps Warmuppoly: Poly learning rate schedule	MultiStep			MultiStep,Warmuppoly
`milestones`	list	Learning rate decay epochs	[88, 96]				False
`gamma`	float	Multiplicative factor of learning rate decay	0.1
`max_steps`	int	The maximum number of steps to train the model	160000
`warmup_factor`	float	The warmup factor for the learning rate scheduler	1.0
`warmup_iters`	int	The number of warmup iterations	0

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`benchmark`	bool	Whether to enable cuDNN benchmark mode	False
`deterministic`	bool	Whether to enable cuDNN deterministic mode	True

Inference Configuration#

Field	value_type	description	default_value	valid_min	valid_options	automl_enabled
`num_gpus`	int	The number of GPUs to run the evaluation job	1	1
`gpu_ids`	list	List of GPU IDs to run the inference on; length of list must equal `inference.num_gpus`	[0]			False
`num_nodes`	int	Number of nodes to run the inference on; if > 1, multi-node is enabled	1	1
`checkpoint`	string	Path to the checkpoint file used for inference
`trt_engine`	string	Path to the TensorRT engine folder to be used for inference
`results_dir`	string	Path to where all the assets generated from a task are stored
`batch_size`	int	The batch size of the input tensor; important if `batch_size` > 1 for large datasets	-1	-1
`mode`	categorical	Mode to run inference	panoptic		semantic,instance,panoptic
`images_dir`	string	Path to the images directory

Evaluation Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`num_gpus`	int	The number of GPUs to run the evaluation job	1	1
`gpu_ids`	list	List of GPU IDs to run the evaluation on; length of list must equal `evaluate.num_gpus`	[0]		False
`num_nodes`	int	Number of nodes to run the evaluation on; if > 1, multi-node is enabled	1	1
`checkpoint`	string	Path to the checkpoint file used for evaluation
`trt_engine`	string	Path to the TensorRT engine to be used for evaluation; only works with `tao-deploy`
`results_dir`	string	Path to where all the assets generated from a task are stored
`batch_size`	int	The batch size of the input tensor; important if `batch_size` > 1 for large datasets	-1	-1

Export Configuration#

Field	value_type	description	default_value	valid_min	valid_options
`results_dir`	string	Path to where all the assets generated from a task are stored
`gpu_id`	int	The index of the GPU used to build the TensorRT engine	0
`checkpoint`	string	Path to the checkpoint file to run export	???
`onnx_file`	string	Path to the ONNX model file	???
`on_cpu`	bool	Flag to export CPU compatible model	False
`input_channel`	ordered_int	Number of channels in the input tensor	3	1	1,3
`input_width`	int	Width of the input image tensor	960	32
`input_height`	int	Height of the input image tensor	544	32
`opset_version`	int	Operator set version of the ONNX model used to generate the TensorRT engine	17	1
`batch_size`	int	The batch size of the input tensor for the engine A value of `-1` implies dynamic tensor shapes	-1	-1
`verbose`	bool	Flag to enable verbose TensorRT logging	False
`format`	categorical	File format to export to	onnx		onnx,xdl
`onnx_file_2d`	string	Path to the ONNX model 2D file
`onnx_file_3d`	string	Path to the ONNX model 3D file
`max_voxels`	int	The maximum number of voxels in the input tensor for the engine	700000	1

TensorRT Engine Configuration#

Field	value_type	description	default_value	valid_min	automl_enabled
`results_dir`	string	Path to where all the assets generated from a task are stored
`gpu_id`	int	The index of the GPU used to build the TensorRT engine	0	0
`onnx_file`	string	Path to the ONNX model file	???
`trt_engine`	string	Path to the generated TensorRT engine; only works with `tao-deploy`	???
`timing_cache`	string	Path to a TensorRT timing cache that speeds up engine generation; will be created, read, and updated
`batch_size`	int	The batch size of the input tensor for the engine A value of `-1` implies dynamic tensor shapes	-1	-1
`verbose`	bool	Flag to enable verbose TensorRT logging	False
`tensorrt`	collection	Hyper parameters to configure the NVPanoptix3D TensorRT Engine builder			False

Field	value_type	description	default_value	valid_min	valid_options	automl_enabled
`workspace_size`	int	The size (in megabytes) of the workspace TensorRT has to run its optimization tactics and generate the TensorRT engine	1024	0
`min_batch_size`	int	The minimum batch size in the optimization profile for the input tensor of the TensorRT engine	1	1
`opt_batch_size`	int	The optimum batch size in the optimization profile for the input tensor of the TensorRT engine	1	1
`max_batch_size`	int	The maximum batch size in the optimization profile for the input tensor of the TensorRT engine	1	1
`layers_precision`	list	The list to specify layer precision	[]			False
`data_type`	categorical	The precision to be set for building the TensorRT engine	FP32		FP32,FP16

Training#

NvPanoptix3D training requires two sequential stages. Complete Stage 1 before beginning Stage 2.

Stage 1: 2D Panoptic Segmentation#

Stage 1 trains the 2D panoptic segmentation and depth estimation head. Set dataset.enable_3d to False in your configuration file.

Stage 2: 3D Volumetric Reconstruction#

Stage 2 freezes the Stage 1 model weights and trains the 3D U-Net frustum completion module. Set dataset.enable_3d to True and provide the Stage 1 checkpoint via train.checkpoint_2d.

Note

Only fp32 precision is supported. Mixed precision training is not available for NvPanoptix3D.

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. Checkpoints are saved in train.results_dir, like this:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

Evaluation#

NvPanoptix3D reports the following metrics, which assess the quality of 3D panoptic reconstruction:

Metric	Description
PRQ	Panoptic Reconstruction Quality. Overall 3D panoptic performance, combining geometry accuracy and semantic recognition.
RSQ	Reconstructed Segmentation Quality. Measures the quality of semantic segmentation in 3D.
RRQ	Reconstruction Recognition Quality. Measures instance recognition quality in 3D.

To evaluate a trained model, provide the path to the checkpoint and the test dataset in the specification file.

Set dataset.enable_3d to True in the specification file to evaluate the full 3D model, or False to evaluate the Stage 1 (2D) model only. When evaluating the 2D model, the reported metric is Panoptic Quality (PQ).

Performance#

The following tables show NvPanoptix3D performance on the 3D-Front test set and Matterport3D validation set, compared against published baselines. Metrics are reported for all categories combined and broken down into Things (countable objects) and Stuff (background regions). Bold values indicate the best result in each column.

3D-Front Test Set#
Model	PRQ (All / Things / Stuff)	RSQ (All / Things / Stuff)	RRQ (All / Things / Stuff)
BUOL	54.01 / 49.73 / 73.30	63.81 / 60.57 / 78.37	82.99 / 80.67 / 93.42
Uni3D	52.76 / 47.29 / 77.41	60.98 / 56.56 / 80.87	84.26 / 81.81 / 95.31
NvPanoptix3D	54.32 / 49.74 / 74.90	62.95 / 58.98 / 80.80	83.94 / 82.15 / 92.00

Matterport3D Validation Set#
Model	PRQ (All / Things / Stuff)	RSQ (All / Things / Stuff)	RRQ (All / Things / Stuff)
BUOL	14.47 / 10.97 / 24.94	45.71 / 45.30 / 46.93	30.91 / 23.81 / 52.22
Uni3D	16.32 / 13.21 / 29.33	44.36 / 44.58 / 44.09	36.48 / 29.33 / 65.19
NvPanoptix3D	17.63 / 14.79 / 28.04	45.27 / 45.68 / 43.31	38.98 / 32.26 / 64.02

NvPanoptix3D achieves the highest PRQ on both datasets and the highest RRQ on 3D-Front, demonstrating strong 3D panoptic reconstruction quality across both synthetic and real indoor environments.

Inference#

NvPanoptix3D inference runs on a directory of RGB images. NvPanoptix3D does not require ground truth annotations. The network accepts .jpg and .png images as input.

Set dataset.enable_3d to True to produce 3D reconstruction outputs, or False to produce 2D panoptic segmentation and depth outputs only.

The inference outputs saved to results_dir include the following:

Output	Shape	Description
2D panoptic segmentation	(120, 160)	Per-pixel panoptic label map combining semantic and instance information.
2D depth map	(120, 160)	Per-pixel depth estimate in meters.
3D geometry	(256, 256, 256)	Truncated signed distance field representing the 3D scene geometry.
3D semantic segmentation	(256, 256, 256)	Per-voxel semantic class labels.
3D panoptic segmentation	(256, 256, 256)	Per-voxel panoptic labels combining semantic and instance information.

Export#

NvPanoptix3D exports the Stage 1 (2D) model to ONNX format for deployment with NVIDIA® TensorRT™.

Note

Only the 2D model supports ONNX export in this release. 3D model export is not yet available.

After export, generate a TensorRT engine from the ONNX file using the provided gen_trt_engine.py script:

python3 gen_trt_engine.py \
    --onnx_file_2d /path/to/model_2d.onnx \
    --trt_engine_2d /path/to/trt_2d_engine.engine \
    --batch_size 1 \
    --input_height 256 \
    --input_width 320 \
    --workspace_gb 8

Inference with NVIDIA Triton Inference Server#

NvPanoptix3D supports deployment as a hybrid TensorRT and PyTorch ensemble model on NVIDIA Triton Inference Server. The 2D stage runs as a TensorRT engine, and the 3D stage runs as a PyTorch model. The Triton model repository and client scripts are provided in the tlt-triton-apps repository.

The Triton model accepts the following inputs and produces the following outputs:

Name	Direction	Data Type	Description
`images`	Input	`UINT8`	RGB image with shape `[3, H, W]`.
`frustum_mask`	Input	`UINT8`	Frustum mask with shape `[D, H, W]`.
`intrinsic`	Input	`FP32`	Camera intrinsic matrix with shape `[4, 4]`.
`panoptic_seg_2d`	Output	`INT32`	2D panoptic segmentation map.
`depth_2d`	Output	`FP32`	2D depth map.
`panoptic_seg_3d`	Output	`INT32`	3D panoptic segmentation volume.
`geometry_3d`	Output	`FP32`	3D scene geometry (truncated signed distance field).
`semantic_seg_3d`	Output	`INT32`	3D semantic segmentation volume.

To start the Triton server:

bash scripts/nvpanoptix3d_e2e_inference/start_server.sh

Install Python client requirements:

pip install -r scripts/nvpanoptix3d_e2e_inference/client-requirements.txt

To run the Triton client against the server:

bash scripts/nvpanoptix3d_e2e_inference/start_client.sh

Refer to the tlt-triton-apps repository for complete setup instructions, including NGC authentication, Docker configuration, and client usage.