Sparse4D#

Sparse4D is a multi-camera 3D detection and tracking model with 4D (spatial-temporal) capabilities. It takes synchronized images from multiple cameras, together with their calibration matrices, and outputs 3D bounding boxes and temporally consistent tracking IDs. The model uses ResNet-101, a general-purpose computer vision backbone.

Each batch in Sparse4D is trained on a group of cameras. Each group is called a bird’s-eye view (BEV) group. A BEV group is a collection of multiple overlapping cameras.

The expected data requirements and time to fine-tune the Sparse4D model on a single scene of the MTMC Tracking 2025 dataset are as follows:

Estimated time for fine-tuning Sparse4D on a single scene of the MTMC Tracking 2025 dataset#
Backbone type	GPU type	Image size	No. of BEV groups	No. of cameras in each BEV group	No. of frames in each camera	Total no. of epochs	Total training time
Resnet101	8 x Nvidia H100 - 80GB SXM	3x512x1408	3 (Minimum BEV groups)	4-12	9000 (5 mins @ 30 FPS)	5	10 hours
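
The frame count in the table follows from the clip length and frame rate. The quick check below also derives the number of images per BEV group, under the assumption (not stated in the table) that every camera in a group contributes all of its frames:

```python
# Frames per camera for a 5-minute clip recorded at 30 FPS.
clip_minutes = 5
fps = 30
frames_per_camera = clip_minutes * 60 * fps

# Images per BEV group for the smallest and largest group sizes in the table
# (assumes every camera in the group contributes all of its frames).
min_cameras, max_cameras = 4, 12
images_per_group = (min_cameras * frames_per_camera,
                    max_cameras * frames_per_camera)

print(frames_per_camera)   # 9000
print(images_per_group)    # (36000, 108000)
```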

Sparse4D supports the following tasks:

train
evaluate
inference
export
quantize

Data Input for Sparse4D#

The Sparse4D apps in TAO use the MTMC Tracking 2025 dataset for training, validation, and testing.

Refer to the MTMC Tracking 2025 dataset page in the PhysicalAI-SmartSpaces HuggingFace repository for more information about the raw dataset format. The dataset is converted into pickle format and stored in the data/sparse4d/ directory.

Creating an Experiment Specification File#

The specification file for Sparse4D includes model, dataset, train parameters, visualize parameters, evaluate parameters and inference parameters. The following is an example specification file for training a Sparse4D model on one scene of the MTMC Tracking 2025 dataset. We will utilize the Warehouse_014 scene from the MTMC Tracking 2025 dataset for training.

The experiment specification consists of several main components:

dataset
model
train
evaluate
inference
export
visualize

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation. An example dataset configuration is provided below, followed by a description of the main parameters of the Omniverse3DDetTrackDatasetConfig.

dataset:
  use_h5_file_for_rgb: false
  use_h5_file_for_depth: true
  num_frames: 9000
  batch_size: 2
  num_bev_groups: 1
  num_workers: 2
  num_ids: 70
  classes: [
    "person",
    "gr1_t2",
    "agility_digit",
    "nova_carter",
  ]
  type: "omniverse_3d_det_track"
  data_root: ???
  train_dataset:
    ann_file: ???
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
  val_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  test_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  augmentation:
    resize_lim: [0.7, 0.77]
    final_dim: [512, 1408]
    bot_pct_lim: [0.0, 0.0]
    rot_lim: [-5.4, 5.4]
    image_size: [1080, 1920]
    rand_flip: true
    rot3d_range: [-0.3925, 0.3925]
  normalize:
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    to_rgb: true
  sequences:
    split_num: 100
    keep_consistent_aug: true
    same_scene_in_batch: true

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`type`	string	Dataset type	omniverse_3d_det_track
`batch_size`	int	Batch size	2	1	infinity
`use_h5_file_for_rgb`	bool	Use H5 file for RGB images	False
`use_h5_file_for_depth`	bool	Use H5 file for depth maps	True
`num_frames`	int	Number of frames	200	1	infinity
`num_bev_groups`	int	Number of BEV groups	1	1	infinity
`data_root`	string	Path to data root	???
`anno_root`	string	Path to annotation root	???
`classes`	list	Classes to detect	['person', 'humanoid', 'nova_carter', 'transporter', 'forklift', 'box', 'pallet', 'crate']			false
`num_workers`	int	Number of workers	4	0	infinity
`num_ids`	int	Number of IDs	70	1	infinity
`augmentation`	collection	Augmentation config				false
`normalize`	collection	Normalize config				false
`sequences`	collection	Sequences config				false
`train_dataset`	collection	Train dataset config				false
`val_dataset`	collection	Val dataset config				false
`test_dataset`	collection	Test dataset config				false

Dynamic Resampling and Dataset-Loading Options#

Starting in TAO 7.0.1, the Sparse4D dataloader supports lazy pickle loading, balanced per-epoch dynamic resampling, and FPS-drop augmentation. These options are most useful when training on large multi-camera annotation sets that are split across many .pkl files.

These options are set as additional keys directly under the dataset block (alongside type, batch_size, and the other fields shown above). Unlike the fields in the Omniverse3DDetTrackDatasetConfig table, they are not part of the typed dataset schema: the dataloader reads them dynamically from the dataset configuration with per-key defaults, so they do not appear in the generated default_specs and are not Hydra-validated. As a result, a misspelled key is silently ignored (it falls back to its default) rather than raising an error, so take care to spell these keys exactly as shown. The keys and their defaults are:

Key	value_type	description	default_value
`lazy_load`	bool	Enable lazy `.pkl` loading	False
`lazy_load_cache_size`	int	Lazy-load LRU cache size	50
`pkl_sample_size`	int	`.pkl` files sampled per epoch (0 = disabled)	0
`pkl_cam_counts_path`	string	Path to the pkl-to-camera-count mapping pickle	"" (treated as `None` at runtime)
`max_cameras`	int	Subsample to this many cameras/frame (-1 = all)	-1
`fps_drop_prob`	float	FPS-drop augmentation probability	0.0
`target_fps_choices`	list	Candidate target FPS values for FPS-drop	[30, 20, 15, 10, 6, 5, 3, 2, 1]
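
Because a misspelled key falls back to its default without any error, it can help to check the dataset block before launching a job. The following is a minimal sketch of such a check, not part of TAO: the helper name find_suspect_keys and the similarity cutoff are illustrative assumptions, and the check only flags keys that closely resemble one of the dynamic keys in the table above.

```python
import difflib

# Dynamic dataset keys listed in the table above.
DYNAMIC_KEYS = [
    "lazy_load", "lazy_load_cache_size", "pkl_sample_size",
    "pkl_cam_counts_path", "max_cameras", "fps_drop_prob",
    "target_fps_choices",
]

def find_suspect_keys(dataset_cfg, cutoff=0.8):
    """Map each near-miss key to the dynamic key it most resembles."""
    suspects = {}
    for key in dataset_cfg:
        if key in DYNAMIC_KEYS:
            continue  # spelled correctly
        close = difflib.get_close_matches(key, DYNAMIC_KEYS, n=1, cutoff=cutoff)
        if close:
            suspects[key] = close[0]
    return suspects

# Hypothetical dataset block with two typos.
cfg = {"batch_size": 4, "lazyload": True, "fps_drop_prop": 0.2}
suspects = find_suspect_keys(cfg)
print(suspects)
```

Keys from the typed schema, such as batch_size, are left alone because they are not similar enough to any dynamic key.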

Lazy loading. When lazy_load is set to True, the dataloader does not load every annotation .pkl file into memory up front. Instead it loads a pre-built frame index and reads the underlying .pkl files on demand through a least-recently-used (LRU) cache whose maximum size is controlled by lazy_load_cache_size. This keeps the host-memory footprint bounded regardless of how many .pkl files the dataset spans.
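
To illustrate the idea of on-demand loading through an LRU cache, here is a minimal sketch. It is not the TAO dataloader: the class name LRUPickleCache and its methods are hypothetical, and the demo uses throwaway files in a temporary directory.

```python
import os
import pickle
import tempfile
from collections import OrderedDict

class LRUPickleCache:
    """Load .pkl files on demand, keeping at most `max_size` in memory."""

    def __init__(self, max_size=50):
        self.max_size = max_size
        self._cache = OrderedDict()  # pkl_path -> unpickled object

    def get(self, pkl_path):
        if pkl_path in self._cache:
            self._cache.move_to_end(pkl_path)  # mark as most recently used
            return self._cache[pkl_path]
        with open(pkl_path, "rb") as f:
            data = pickle.load(f)
        self._cache[pkl_path] = data
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return data

    def cached_paths(self):
        return list(self._cache)

# Demo with three tiny throwaway files and a cache that holds two of them.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"scene_{i}.pkl")
    with open(path, "wb") as f:
        pickle.dump({"scene": i}, f)
    paths.append(path)

cache = LRUPickleCache(max_size=2)
cache.get(paths[0])
cache.get(paths[1])
cache.get(paths[0])   # refreshes scene_0
cache.get(paths[2])   # evicts scene_1, the least recently used
```

The memory footprint is bounded by max_size files, independent of how many files the index references.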

Lazy loading requires that the ann_file of the training dataset is either a directory of .pkl files or a .txt file listing one .pkl path per line, and that a lazy-index cache file is present alongside the annotation file before training starts:

If ann_file is a directory, the cache file is expected at <ann_file>/_lazy_index.pkl.
If ann_file is a .txt list, the cache file is expected at the same path with the .txt extension replaced by _lazy_index.pkl (for example, train.txt → train_lazy_index.pkl).
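
The two path rules above can be written as a small helper, which is handy for checking that the cache file exists before starting a job. This is a sketch: the function name lazy_index_path is hypothetical, and it decides between the cases from the file extension alone rather than inspecting the filesystem.

```python
import os

def lazy_index_path(ann_file):
    """Return where the lazy-index cache is expected for a given ann_file."""
    if ann_file.endswith(".pkl"):
        # A single .pkl file is loaded normally, so no index file is needed.
        return None
    if ann_file.endswith(".txt"):
        # train.txt -> train_lazy_index.pkl
        return ann_file[: -len(".txt")] + "_lazy_index.pkl"
    # Otherwise treat ann_file as a directory of .pkl files.
    return os.path.join(ann_file, "_lazy_index.pkl")

print(lazy_index_path("train.txt"))  # train_lazy_index.pkl
print(lazy_index_path("/data/sparse4d/train"))
```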

The dataloader does not build this index automatically. If lazy loading is enabled and the cache file is missing, the dataloader raises a FileNotFoundError and training stops, so you must generate the _lazy_index.pkl file yourself beforehand.

The index file is a pickle containing a dict with a frame_index key. The value of frame_index is a list of per-frame entries, where each entry is a dict that must include at least:

pkl_path — path to the .pkl file that holds that frame’s annotations. The dataloader reads these files on demand using the recorded pkl_path, so the paths must resolve to the actual annotation files at training time (use the same absolute or relative form that the pkl_cam_counts_path mapping below uses).
scene_name — the scene the frame belongs to (used to sort and group frames).
timestamp — the frame timestamp (used to sort frames within a scene).

The dict may optionally include a metadata key, which is surfaced as the dataset version.
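
A minimal sketch of writing an index file with this layout follows. The helper name write_lazy_index and the sample frames are hypothetical. How you obtain the scene name and timestamp of each frame depends on your annotation files, which this sketch does not attempt to parse.

```python
import os
import pickle
import tempfile

def write_lazy_index(frames, out_path, metadata=None):
    """Write a lazy index from (pkl_path, scene_name, timestamp) tuples."""
    frame_index = [
        {"pkl_path": p, "scene_name": s, "timestamp": t} for p, s, t in frames
    ]
    # Pre-sorting is a convenience of this sketch, not a stated requirement.
    frame_index.sort(key=lambda e: (e["scene_name"], e["timestamp"]))
    index = {"frame_index": frame_index}
    if metadata is not None:
        index["metadata"] = metadata  # optional key
    with open(out_path, "wb") as f:
        pickle.dump(index, f)
    return index

# Hypothetical frames; in practice these come from your annotation files.
frames = [
    ("/data/sparse4d/train/scene_0002.pkl", "scene_0002", 0.0),
    ("/data/sparse4d/train/scene_0001.pkl", "scene_0001", 0.033333),
    ("/data/sparse4d/train/scene_0001.pkl", "scene_0001", 0.0),
]
out_path = os.path.join(tempfile.mkdtemp(), "_lazy_index.pkl")
write_lazy_index(frames, out_path)

with open(out_path, "rb") as f:
    loaded = pickle.load(f)
print(len(loaded["frame_index"]))  # 3
```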

If ann_file points to a single .pkl file, the dataloader ignores lazy loading and automatically falls back to normal (non-lazy) loading, so no index file is needed in that case.

Dynamic resampling. When pkl_sample_size is greater than 0 (and lazy loading is enabled), each epoch draws a fresh, balanced subset of pkl_sample_size .pkl files from the full index instead of training on all files. The subset is balanced by the number of cameras per .pkl file, so that scenes captured with different camera counts are represented evenly. When pkl_sample_size is 0 (the default), resampling is disabled and lazy loading uses the full index.

The per-camera-count mapping is read from the pickle file pointed to by pkl_cam_counts_path. This file is a pickle containing a single dict that maps each .pkl path to the number of cameras in that .pkl file, for example:

{
    "/data/sparse4d/train/scene_0001.pkl": 6,
    "/data/sparse4d/train/scene_0002.pkl": 12,
    # ... one entry per .pkl file referenced by the lazy index
}

There is no built-in tool to produce this file, so you must generate it yourself (for example, with a short script that opens each .pkl file and counts its cameras). The keys in this mapping must match the pkl_path values stored in the lazy index exactly (same absolute-vs-relative form and same string). Every .pkl path present in the lazy index must have an entry in this mapping; if any path is missing, resampling raises a KeyError that lists the unmatched paths. pkl_cam_counts_path is only consulted when pkl_sample_size is greater than 0.
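
Since a mismatch between the two files only surfaces as a KeyError once resampling runs, a pre-flight check can save a failed job. The sketch below compares the lazy index with the camera-count mapping. The helper name missing_cam_count_paths and the sample data are hypothetical.

```python
def missing_cam_count_paths(frame_index, cam_counts):
    """Return lazy-index pkl paths that have no entry in the mapping."""
    index_paths = {entry["pkl_path"] for entry in frame_index}
    return sorted(index_paths - set(cam_counts))

# Hypothetical data: the second path is relative in the index but
# absolute in the mapping, so the strings do not match.
frame_index = [
    {"pkl_path": "/data/sparse4d/train/scene_0001.pkl",
     "scene_name": "scene_0001", "timestamp": 0.0},
    {"pkl_path": "train/scene_0002.pkl",
     "scene_name": "scene_0002", "timestamp": 0.0},
]
cam_counts = {
    "/data/sparse4d/train/scene_0001.pkl": 6,
    "/data/sparse4d/train/scene_0002.pkl": 12,
}

missing = missing_cam_count_paths(frame_index, cam_counts)
print(missing)  # ['train/scene_0002.pkl']
```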

Resampling is driven automatically during training: a PklResampleCallback re-runs the balanced sampling at every epoch boundary, then refreshes the in-batch sequence sampler so it picks up the new sequence/scene grouping. The sampling is seeded reproducibly per epoch, so each epoch sees a different but deterministic subset.
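
The exact balancing strategy is internal to the dataloader. As an illustration of the general idea only, the sketch below groups files by camera count and draws from the groups in round-robin order with a per-epoch seed. The function name sample_balanced, the round-robin scheme, and the seeding formula are all assumptions of this sketch, not the PklResampleCallback implementation.

```python
import random
from collections import defaultdict

def sample_balanced(cam_counts, sample_size, epoch, base_seed=0):
    """Draw up to `sample_size` pkl paths, cycling over camera-count groups."""
    groups = defaultdict(list)
    for path, n_cams in sorted(cam_counts.items()):
        groups[n_cams].append(path)

    rng = random.Random(base_seed + epoch)  # deterministic per epoch
    for paths in groups.values():
        rng.shuffle(paths)

    picked = []
    # Round-robin across camera counts so each count contributes evenly
    # until its files run out.
    while len(picked) < sample_size and any(groups.values()):
        for n_cams in sorted(groups):
            if groups[n_cams] and len(picked) < sample_size:
                picked.append(groups[n_cams].pop())
    return picked

# Hypothetical mapping: eight 6-camera files and two 12-camera files.
cam_counts = {f"six_{i}.pkl": 6 for i in range(8)}
cam_counts.update({f"twelve_{i}.pkl": 12 for i in range(2)})

subset = sample_balanced(cam_counts, sample_size=4, epoch=0)
print(sorted(cam_counts[p] for p in subset))  # [6, 6, 12, 12]
```

Without balancing, a uniform draw from this mapping would be dominated by the 6-camera files.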

FPS-drop augmentation. fps_drop_prob sets the probability of temporally downsampling a scene during loading to one of the frame rates listed in target_fps_choices, which exposes the model to a range of effective frame rates. Setting fps_drop_prob to 0 (the default) disables this augmentation.
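
A minimal sketch of this kind of temporal downsampling is shown below. It is not the TAO implementation: the function name maybe_drop_fps, the explicit source_fps argument, and the even-spacing rule are assumptions made for illustration.

```python
import random

def maybe_drop_fps(frames, source_fps, fps_drop_prob, target_fps_choices, rng):
    """Temporally downsample `frames` with probability `fps_drop_prob`."""
    if rng.random() >= fps_drop_prob:
        return frames  # augmentation not applied
    # Only targets at or below the source rate make sense for downsampling.
    candidates = [fps for fps in target_fps_choices if fps <= source_fps]
    target_fps = rng.choice(candidates)
    n_keep = len(frames) * target_fps // source_fps
    # Integer arithmetic spreads the kept frames evenly, including for
    # non-integer strides such as 30 -> 20 FPS.
    return [frames[i * source_fps // target_fps] for i in range(n_keep)]

frames = list(range(30))  # one second of frame indices at 30 FPS
dropped = maybe_drop_fps(frames, 30, 1.0, [10], random.Random(0))
print(dropped)  # [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```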

Camera subsampling. When max_cameras is greater than 0, each training frame is randomly subsampled to at most that many cameras. The default of -1 keeps all cameras.
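
The behavior described above can be sketched as follows. The function name subsample_cameras is hypothetical, and preserving the original camera order is a choice of this sketch rather than documented behavior.

```python
import random

def subsample_cameras(camera_names, max_cameras, rng):
    """Keep at most `max_cameras` cameras; a value <= 0 keeps all of them."""
    n = len(camera_names)
    if max_cameras <= 0 or n <= max_cameras:
        return list(camera_names)
    # Sample indices, then sort them so the kept cameras stay in order.
    keep = sorted(rng.sample(range(n), max_cameras))
    return [camera_names[i] for i in keep]

cams = [f"cam_{i}" for i in range(6)]
kept = subsample_cameras(cams, 4, random.Random(0))
print(len(kept))  # 4
```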

The following example enables lazy loading with balanced dynamic resampling of 500 .pkl files per epoch and FPS-drop augmentation:

dataset:
  type: "omniverse_3d_det_track"
  data_root: ???
  batch_size: 4
  num_workers: 4
  lazy_load: true
  lazy_load_cache_size: 50
  pkl_sample_size: 500
  pkl_cam_counts_path: /data/sparse4d/_pkl_cam_counts.pkl
  fps_drop_prob: 0.2
  target_fps_choices: [30, 20, 15, 10, 6, 5, 3, 2, 1]
  max_cameras: -1
  train_dataset:
    ann_file: /data/sparse4d/train     # directory of .pkl files (or a .txt list)
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true

Train Dataset Configuration (dataset.train_dataset)#

Configuration for the training dataset.

Field	value_type	description	default_value	valid_min	valid_max
`ann_file`	string	Path to annotation file	???
`test_mode`	bool	Test mode	False
`use_valid_flag`	bool	Use valid flag	True
`with_seq_flag`	bool	With sequence flag	True
`sequences_split_num`	int	Number of sequences	100	1	infinity
`keep_consistent_seq_aug`	bool	Keep consistent sequence augmentation	True
`same_scene_in_batch`	bool	Same scene in batch	True

Validation Dataset Configuration (dataset.val_dataset)#

Configuration for the validation dataset.

Field	value_type	description	default_value	valid_min	valid_max
`ann_file`	string	Path to annotation pickle files/folders	???
`test_mode`	bool	Test mode	False
`use_valid_flag`	bool	Use valid flag	True
`tracking`	bool	Tracking	True
`tracking_threshold`	float	Tracking threshold	0.2	0	1
`same_scene_in_batch`	bool	Same scene in batch	True

Test Dataset Configuration (dataset.test_dataset)#

Configuration for the test dataset.

Field	value_type	description	default_value	valid_min	valid_max
`ann_file`	string	Path to annotation pickle files/folders	???
`test_mode`	bool	Test mode	True
`use_valid_flag`	bool	Use valid flag	True
`tracking`	bool	Tracking	True
`tracking_threshold`	float	Tracking threshold	0.2	0	1
`same_scene_in_batch`	bool	Same scene in batch	True

Augmentation Configuration (dataset.augmentation)#

Configuration for data augmentation.

Field	value_type	description	default_value	automl_enabled
`resize_lim`	list	Resize limits	[0.7, 0.77]	false
`final_dim`	list	Final dimensions	[512, 1408]	false
`bot_pct_lim`	list	Bottom percentage limits	[0.0, 0.0]	false
`rot_lim`	list	Rotation limits in degrees	[-5.4, 5.4]	false
`image_size`	list	Original image size	[1080, 1920]	false
`rand_flip`	bool	Random flip	True
`rot3d_range`	list	3D rotation range in radians	[-0.3925, 0.3925]	false

Normalize Configuration (dataset.normalize)#

Configuration for image normalization.

Field	value_type	description	default_value	automl_enabled
`mean`	list	Mean values for normalization	[123.675, 116.28, 103.53]	false
`std`	list	Standard deviation values for normalization	[58.395, 57.12, 57.375]	false
`to_rgb`	bool	Convert to RGB	True
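
The default mean and std are the standard ImageNet statistics scaled to the 0-255 range. The sketch below shows the per-pixel arithmetic, assuming the common convention that to_rgb converts a BGR input to RGB before the RGB-ordered mean and std are applied. The helper name normalize_pixel is hypothetical, and real pipelines operate on whole image arrays.

```python
MEAN = [123.675, 116.28, 103.53]   # per-channel mean, RGB order
STD = [58.395, 57.12, 57.375]      # per-channel std, RGB order

def normalize_pixel(pixel, mean=MEAN, std=STD, to_rgb=True):
    """Normalize one 3-channel pixel; `pixel` is BGR when to_rgb is True."""
    if to_rgb:
        pixel = pixel[::-1]  # BGR -> RGB (assumed convention)
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]

# A BGR pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel([103.53, 116.28, 123.675]))  # [0.0, 0.0, 0.0]
```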

Sequences Configuration (dataset.sequences)#

Configuration for handling image sequences.

Field	value_type	description	default_value	valid_min	valid_max
`split_num`	int	Number of sequence splits	100	1	infinity
`keep_consistent_aug`	bool	Keep consistent augmentation	True
`same_scene_in_batch`	bool	Keep same scene in batch	True

model#

The model parameter provides options to change the Sparse4D architecture.

model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1 ,1, 1, 1, 1, 1, 1, 1]

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`type`	string	Model type	sparse4d
`embed_dims`	int	Embedding dimensions	256	1	infinity
`use_grid_mask`	bool	Use grid mask	True
`use_deformable_func`	bool	Use deformable function	True
`input_shape`	list	Input image shape	[1408, 512]			false
`backbone`	collection	Backbone config				false
`neck`	collection	Neck config				false
`depth_branch`	collection	Depth branch config				false
`head`	collection	Head config				false
`use_temporal_align`	bool	Use temporal alignment	False

Backbone Configuration (model.backbone)#

Configuration for the model’s backbone network. Currently, only resnet_101 is supported.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	string	Backbone type	resnet_101			resnet_101

Head Configuration (model.head)#

Top-level configuration for the detection and tracking head.

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`type`	string	Head type	sparse4d
`num_output`	int	Number of output instances	300	1	infinity
`cls_threshold_to_reg`	float	Classification threshold for regression	0.05	0	1
`decouple_attn`	bool	Decouple attention	True
`return_feature`	bool	Return instance features	True
`use_reid_sampling`	bool	Use Re-ID sampling	False
`embed_dims`	int	Embedding dimensions	256	1	infinity
`reid_dims`	int	Re-ID dimensions	0	0	infinity
`num_groups`	int	Number of groups	8	1	infinity
`num_decoder`	int	Number of decoder layers	6	1	infinity
`num_single_frame_decoder`	int	Number of single-frame decoder layers	1	1	infinity
`drop_out`	float	Dropout rate	0.1	0	1
`temporal`	bool	Enable temporal modeling	True
`with_quality_estimation`	bool	Enable quality estimation	True
`operation_order`	list	Operation order	['deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine']			false
`visibility_net`	collection	Visibility net config				false
`instance_bank`	collection	Instance bank config				false
`anchor_encoder`	collection	Anchor encoder config				false
`sampler`	collection	Sampler config				false
`reg_weights`	list	Regression weights	[2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]			false
`loss`	collection	Loss config				false
`bnneck`	collection	BN neck config				false
`deformable_model`	collection	Deformable model config				false
`refine_layer`	collection	Refine layer config				false
`valid_vel_weight`	float	Valid velocity weight	-1	-1	infinity
`graph_model`	collection	Graph model config				false
`temp_graph_model`	collection	Temp graph model config				false
`decoder`	collection	Decoder config				false
`norm_layer`	collection	Norm layer config				false
`ffn`	collection	FFN config				false
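
The default operation_order in the table above is a repeating pattern, which is easier to see when it is generated than when it is written out. The sketch below rebuilds the 39-entry list. Reading each deformable/ffn/norm/refine group as one decoder layer (six in total, matching num_decoder: 6 in the example) is an interpretation of this sketch, not something the schema enforces.

```python
block = ["deformable", "ffn", "norm", "refine"]
bridge = ["temp_gnn", "gnn", "norm"]

num_decoder = 6
operation_order = []
for layer in range(num_decoder):
    operation_order += block
    if layer < num_decoder - 1:
        operation_order += bridge  # no attention ops after the final refine

print(len(operation_order))             # 39
print(operation_order.count("refine"))  # 6
```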

Deformable Model Configuration (model.head.deformable_model)#

Configuration for the deformable attention mechanism.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`embed_dims`	int	Embedding dimensions	256	1	infinity
`num_groups`	int	Number of groups	8	1	infinity
`num_levels`	int	Number of levels	4	1	infinity
`attn_drop`	float	Attention dropout	0.15	0	1
`use_deformable_func`	bool	Use deformable function	True
`use_camera_embed`	bool	Use camera embedding	False
`residual_mode`	categorical	Residual mode	cat			cat,add
`num_cams`	int	Number of cameras	6	1	infinity
`max_num_cams`	int	Maximum number of cameras	20	1	infinity
`proj_drop`	float	Projection dropout	0.0	0	1
`kps_generator`	collection	KPS generator config					false

Instance Bank Configuration (model.head.instance_bank)#

Configuration for managing object instances over time.

Field	value_type	description	default_value	valid_min	valid_max
`num_anchor`	int	Number of anchors	900	1	infinity
`anchor`	string	Path to anchor file
`num_temp_instances`	int	Number of temporal instances	600	0	infinity
`confidence_decay`	float	Confidence decay factor	0.8	0	1
`feat_grad`	bool	Enable gradients for features	False
`default_time_interval`	float	Default time interval	0.033333	0	infinity
`embed_dims`	int	Embedding dimensions	256	1	infinity
`use_temporal_align`	bool	Use temporal alignment	False
`grid_size`	float	Grid size

Anchor Encoder Configuration (model.head.anchor_encoder)#

Configuration for encoding anchor information.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	string	Anchor encoder type	SparseBox3DEncoder
`vel_dims`	int	Velocity dimensions	3	1	infinity
`embed_dims`	list	Embedding dimensions	[128, 32, 32, 64]				false
`mode`	categorical	Mode	cat			cat,add
`output_fc`	bool	Fully Connected Layer	False
`in_loops`	int	In loops	1	1	infinity
`out_loops`	int	Out loops	4	1	infinity
`pos_embed_only`	bool	Pos embed only	False

Sampler Configuration (model.head.sampler)#

Configuration for sampling positive and negative examples during training.

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`num_dn_groups`	int	Number of De-Noising groups	5	1	infinity
`num_temp_dn_groups`	int	Number of temporal DN groups	3	0	infinity
`dn_noise_scale`	list	De-Noising scale	[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]			false
`max_dn_gt`	int	Maximum DN ground truth	128	1	infinity
`add_neg_dn`	bool	Add negative DN	True
`cls_weight`	float	Classification weight	2.0	0	infinity
`box_weight`	float	Box weight	0.25	0	infinity
`reg_weights`	list	Regression weights	[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]			false
`use_temporal_align`	bool	Use temporal alignment	False
`gt_assign_threshold`	float	Ground Truth assign threshold	0.5	0	1

Loss Configuration (model.head.loss)#

This section details the different loss components used in the model head.

Field	value_type	description	automl_enabled
`cls`	collection	Classification loss config	false
`reg`	collection	Regression loss config	false
`id`	collection	ID loss config	false

Classification Loss (model.head.loss.cls)#

Configuration for the classification loss.

Field	value_type	description	default_value	valid_min	valid_max
`type`	string	Classification loss type	focal
`use_sigmoid`	bool	Use sigmoid	True
`gamma`	float	Focal loss gamma	2.0	0	infinity
`alpha`	float	Focal loss alpha	0.25	0	1
`loss_weight`	float	Loss weight	2.0	0	infinity

Regression Loss (model.head.loss.reg)#

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`type`	string	Regression loss type	sparse_box_3d
`box_weight`	float	Box loss weight	0.25	0	infinity
`cls_allow_reverse`	list	Class allow reverse	[]			false

ID Loss (model.head.loss.id)#

Configuration for the ID / Re-ID loss.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	string	ID loss type	cross_entropy_label_smooth
`num_ids`	int	Number of IDs	70	1	infinity

BNNeck Configuration (model.head.bnneck)#

Field	value_type	description	default_value	valid_min	valid_max
`type`	string	Batch Normalization Neck	bnneck
`feat_dim`	int	Feature dimension	256	1	infinity
`num_ids`	int	Number of IDs	70	1	infinity

KPS Generator Configuration (model.head.deformable_model.kps_generator)#

Configuration for the keypoint (sampling) generator.

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`embed_dims`	int	Embedding dimensions	256	1	infinity
`num_learnable_pts`	int	Number of learnable points	6	1	infinity
`fix_scale`	list	Fixed scale	[[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0], [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45], [0, 0, -0.45]]			false

Refine Layer Configuration (model.head.refine_layer)#

Field	value_type	description	default_value	valid_min	valid_max
`type`	string	Refine layer type	sparse_box_3d_refinement_module
`embed_dims`	int	Embedding dimensions	256	1	infinity
`refine_yaw`	bool	Refine yaw	True
`with_quality_estimation`	bool	With quality estimation	True

Graph Model Configuration (model.head.graph_model and model.head.temp_graph_model)#

Configuration for graph-based modeling (e.g., GNN or attention) used for spatial and temporal relations.

Field	value_type	description	default_value	valid_min	valid_max
`type`	string	Graph model type	MultiheadAttention
`embed_dims`	int	Embedding dimensions	512	1	infinity
`num_heads`	int	Number of heads	8	1	infinity
`batch_first`	bool	Batch first	True
`dropout`	float	Dropout rate	0.1	0	1

Decoder Configuration (model.head.decoder)#

Configuration for the final output decoder.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	string	Decoder type	SparseBox3DDecoder
`score_threshold`	float	Score threshold	0.05	0	1

Norm Layer Configuration (model.head.norm_layer and model.head.ffn.pre_norm)#

Configuration for normalization layers.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	string	Norm layer type	LN
`normalized_shape`	int	Normalized shape	256	1	infinity

FFN Configuration (model.head.ffn)#

Configuration for Feed-Forward Networks used in the decoder layers.

Field	value_type	description	default_value	valid_min	valid_max	automl_enabled
`type`	string	FFN type	AsymmetricFFN
`in_channels`	int	In channels	512	1	infinity
`pre_norm`	collection	Pre-norm config				false
`embed_dims`	int	Embedding dimensions	256	1	infinity
`feedforward_channels`	int	Feedforward channels	1024	1	infinity
`num_fcs`	int	Number of fully connected layers	2	1	infinity
`ffn_drop`	float	FFN dropout	0.1	0	1
`act_cfg`	collection	Activation config				false

Activation Configuration (model.head.ffn.act_cfg)#

Configuration for activation functions.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	string	Activation type	ReLU
`inplace`	bool	Inplace	True

Visibility Net Configuration (model.head.visibility_net)#

Configuration for the visibility prediction network.

Field	value_type	description	default_value	valid_min	valid_max
`type`	string	VisibilityNet type	visibility_net
`embedding_dim`	int	Embedding dimension	256	1	infinity
`hidden_channels`	int	Hidden channels	32	1	infinity

Neck Configuration (model.neck)#

Configuration for the model’s neck (Feature Pyramid Network).

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	categorical	Neck - Feature Pyramid Network	FPN			FPN
`num_outs`	int	Number of output scales	4	1	infinity
`start_level`	int	Start level for FPN	0	0	infinity
`out_channels`	int	Output channels	256	1	infinity
`in_channels`	list	Input channels	[256, 512, 1024, 2048]				false
`add_extra_convs`	categorical	Type of extra conv	on_output			on_input,on_lateral,on_output,False
`relu_before_extra_convs`	bool	Apply ReLU before extra convs	True

Depth Branch Configuration (model.depth_branch)#

Configuration for the depth estimation branch.

Field	value_type	description	default_value	valid_min	valid_max
`type`	string	Depth branch type	dense_depth
`embed_dims`	int	Embedding dimensions	256	1	infinity
`num_depth_layers`	int	Number of depth layers	3	1	infinity
`loss_weight`	float	Weight for depth loss	0.2	0	infinity

train#

The train config contains the parameters related to training. They are described as follows:

train:
  num_epochs: 5
  num_nodes: 1
  num_gpus: 1
  validation_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: ???
  precision: bf16
  optim:
    type: "adamw"
    lr: 0.0001
    weight_decay: 0.001
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.25
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: "cosine"
      warmup: "linear"
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`num_gpus`	int	The number of GPUs to run the train job	1	1
`gpu_ids`	list	List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus	[0]				false
`num_nodes`	int	Number of nodes to run the training on. If > 1, then multi-node is enabled	1	1
`seed`	int	The seed for the initializer in PyTorch. If < 0, disable fixed seed	1234	-1	infinity
`cudnn`	collection						false
`num_epochs`	int	Number of epochs to run the training	10	1	infinity
`checkpoint_interval`	float	Checkpoint interval in epochs	0.5	0	infinity
`validation_interval`	float	Validation interval in epochs	0.5	0	infinity
`resume_training_checkpoint_path`	string	Path to the checkpoint to resume training from
`results_dir`	string	Path to where all the assets generated from a task are stored
`pretrained_model_path`	string	Path to pretrained model
`optim`	collection	Optimizer configuration					false
`precision`	categorical	Precision	bf16			bf16,fp16,fp32

optim#

The optim parameter defines the optimizer configuration for training (AdamW by default), including the learning rate, learning rate scheduler, and weight decay.

Field	value_type	description	default_value	valid_min	valid_max	valid_options	automl_enabled
`type`	categorical	Optimizer type	adamw			adamw,adam,sgd
`lr`	float	Learning rate	5e-05	0	infinity		TRUE
`weight_decay`	float	Weight decay coefficient	0.001
`momentum`	float	Momentum for SGD	0.9
`paramwise_cfg`	collection	Parameter-wise configuration	{'custom_keys': {'img_backbone': {'lr_mult': 0.2}}}				false
`grad_clip`	collection	Gradient clipping configuration	{'max_norm': 25, 'norm_type': 'L2'}				false
`lr_scheduler`	collection	Learning rate scheduler configuration	{'policy': 'cosine', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.333333, 'min_lr_ratio': 0.001}				false
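
To give a feel for how the lr_scheduler values interact, the sketch below computes a cosine schedule with linear warmup using the example values (lr 0.0001, warmup_iters 500, warmup_ratio 0.333333, min_lr_ratio 0.001). The exact formula used by the trainer is not documented here, so this follows a common convention and should be treated as an approximation. The function name learning_rate and the total_steps value are hypothetical.

```python
import math

def learning_rate(step, total_steps, base_lr=1e-4, warmup_iters=500,
                  warmup_ratio=0.333333, min_lr_ratio=0.001):
    """Cosine decay to base_lr * min_lr_ratio with a linear warmup ramp."""
    min_lr = base_lr * min_lr_ratio
    progress = min(step / total_steps, 1.0)
    lr = min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    if step < warmup_iters:
        # Scale factor ramps linearly from warmup_ratio to 1.
        scale = warmup_ratio + (1.0 - warmup_ratio) * step / warmup_iters
        lr *= scale
    return lr

total_steps = 10000  # hypothetical run length in iterations
for step in (0, 250, 500, 5000, 10000):
    print(step, learning_rate(step, total_steps))

# Assuming lr_mult scales the rate for parameters matching img_backbone:
backbone_lr = learning_rate(5000, total_steps) * 0.25
```

Under these assumptions the rate starts at about one third of lr, reaches the cosine curve after 500 iterations, and ends at lr multiplied by min_lr_ratio.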

evaluate#

The evaluate config contains the parameters related to evaluation. Currently, we only support evaluation on a single GPU with batch size 1. The parameters are described as follows:

evaluate:
  checkpoint: ${results_dir}/train/sparse4d_model_latest.pth

Field	value_type	description	default_value	automl_enabled
`checkpoint`	string	Path to the checkpoint used for evaluation	???
`results_dir`	string	Path to where all the assets generated from a task are stored
`metrics`	list	Metrics to evaluate	['detection']	false
`tracking`	collection	Tracking config		false

Set the evaluation checkpoint path with the checkpoint field of the evaluate specification.

visualize#

The visualize config contains the parameters related to visualization. They are described as follows:

Field	value_type	description	default_value	valid_min	valid_max
`show`	bool	Show visualization	True
`vis_dir`	string	Visualization directory	./vis
`vis_score_threshold`	float	Visualization score threshold	0.25	0	1
`n_images_col`	int	Number of images per column	6	1	infinity
`viz_down_sample`	int	Visualization down sample	3	1	infinity

inference#

The inference config contains the parameters related to inference. Currently, we only support inference on a single GPU with batch size 1. They are described as follows:

inference:
  checkpoint: ???
  output_nvschema: true
  jsonfile_prefix: "sparse4d_pred"

Field	value_type	description	default_value	automl_enabled
`checkpoint`	string	Path to checkpoint file	???
`results_dir`	string	Path to where all the assets generated from a task are stored
`jsonfile_prefix`	string	JSON file prefix	sparse4d_pred
`output_nvschema`	bool	Output NVSchema	True
`tracking`	collection	Tracking config		false

Set the inference checkpoint path in the inference specification, as in the example above.

export#

The export config contains the parameters related to export. Currently, we only support export with batch size 1 and a dynamic number of camera sensors. They are described as follows:

export:
  results_dir: ???
  checkpoint: ???
  onnx_file: ???

Field	value_type	description	default_value
`results_dir`	string	Path to where all the assets generated from a task are stored
`checkpoint`	string	Path to the checkpoint file to run export	???
`onnx_file`	string	Path to the ONNX model file	???

Set the export checkpoint path in the export specification, as in the example above.

Training the Model#

Use the following command to run Sparse4D training:

Evaluating the Model#

The evaluation metrics for Sparse4D are the mean average precision and ranked accuracy.

Use the following command to run Sparse4D evaluation:

Running Inference on the Model#

Use the following command to run inference on Sparse4D with the .pth model.

The output is a file of JSON logs containing the object detection and tracking results for each frame.

The expected output is as follows:

{
    "version": "4.0",
    "id": "1", # Frame ID
    "sensorId": "bev-sensor-zone-c4", # BEV Sensor ID
    "timestamp": "2025-01-15T10:30:00.123Z", # Timestamp
    "objects": [
      {
        "id": "1", # Object ID
        "type": "Person", # Object Type
        "confidence": 0.887, # Object Confidence Score
        "coordinate": {
          "x": -1.5, # Object Center X Coordinate
          "y": 3.2, # Object Center Y Coordinate
          "z": 0.75 # Object Center Z Coordinate
        },
        "bbox3d": {
          "coordinates": [
            -1.5, # Object Centroid X Coordinate
            3.2, # Object Centroid Y Coordinate
            0.75, # Object Centroid Z Coordinate
            0.5, # Object Width
            0.5, # Object Length
            0.5, # Object Height
            0.0, # Object Pitch
            0.0, # Object Roll
            1.57 # Object Yaw
          ],
          "embedding": [
            {} # Object Embedding
          ],
          "confidence": 0.887 # Object Confidence Score
        }
      },
      {
        "id": "2",
        "type": "Humanoid",
        "confidence": 0.752,
        "coordinate": {
          "x": 5.1,
          "y": -2.8,
          "z": 0.15
        },
        "bbox3d": {
          "coordinates": [
            5.1,
            -2.8,
            0.15,
            1.2,
            1.0,
            0.2,
            0.0,
            0.0,
            -1.04
          ],
          "embedding": [
            {}
          ],
          "confidence": 0.752
        }
      }
    ]
  }
  {
    # ... more frames
  }
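
The layout of `bbox3d.coordinates` shown above (center, size, then pitch, roll, and yaw) can be unpacked into named fields. The following is a minimal Python sketch, assuming each frame record has already been parsed into a dictionary (for example with `json.loads`) and that the real output omits the explanatory `#` comments shown in the sample. `Box3D` and `boxes_from_frame` are hypothetical helper names, not part of TAO.

```python
from typing import Any, Dict, List, NamedTuple


class Box3D(NamedTuple):
    """Named view of the nine values in bbox3d.coordinates, in documented order."""
    x: float
    y: float
    z: float
    width: float
    length: float
    height: float
    pitch: float
    roll: float
    yaw: float


def boxes_from_frame(frame: Dict[str, Any], min_confidence: float = 0.0) -> List[Dict[str, Any]]:
    """Collect the objects of one frame whose confidence is at least min_confidence."""
    results = []
    for obj in frame.get("objects", []):
        if obj["confidence"] < min_confidence:
            continue
        results.append({
            "id": obj["id"],
            "type": obj["type"],
            "confidence": obj["confidence"],
            "box": Box3D(*obj["bbox3d"]["coordinates"]),
        })
    return results


# Sample frame built from the values in the expected output above.
sample_frame = {
    "version": "4.0",
    "id": "1",
    "sensorId": "bev-sensor-zone-c4",
    "timestamp": "2025-01-15T10:30:00.123Z",
    "objects": [
        {"id": "1", "type": "Person", "confidence": 0.887,
         "bbox3d": {"coordinates": [-1.5, 3.2, 0.75, 0.5, 0.5, 0.5, 0.0, 0.0, 1.57]}},
        {"id": "2", "type": "Humanoid", "confidence": 0.752,
         "bbox3d": {"coordinates": [5.1, -2.8, 0.15, 1.2, 1.0, 0.2, 0.0, 0.0, -1.04]}},
    ],
}

if __name__ == "__main__":
    for item in boxes_from_frame(sample_frame, min_confidence=0.8):
        print(item["id"], item["type"], item["box"].yaw)
```

How individual frame records are delimited in the output file is not specified here, so the sketch deliberately starts from an already parsed record.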

Exporting the Model#

Use the following command to export Sparse4D to .onnx format for deployment:

Quantization#

Sparse4D supports PTQ via TAO Quant using either the torchao (weight-only) or modelopt (static PTQ) backends.

Add a quantize section to your experiment specification (see TAO Quant documentation for schema and backend options).
Run:
Use the quantized checkpoint by setting evaluate.is_quantized: true or inference.is_quantized: true and pointing to the artifact saved under results_dir (for example, quantized_model_torchao.pth or quantized_model_modelopt.pth). For ModelOpt artifacts, the model weights are stored under model_state_dict.
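
When inspecting a quantized artifact outside of TAO, the note about `model_state_dict` matters. The following Python sketch shows one way to retrieve the weights from an already loaded checkpoint dictionary. `extract_state_dict` is a hypothetical helper, and the fallback of treating the whole dictionary as the weights when the key is absent is an assumption, since the layout of other artifacts is not described here.

```python
from typing import Any, Mapping


def extract_state_dict(checkpoint: Mapping[str, Any]) -> Mapping[str, Any]:
    """Return the model weights from a loaded checkpoint dictionary.

    ModelOpt artifacts keep the weights under "model_state_dict" (per the text
    above). Assumption: if that key is missing, the dictionary itself holds
    the weights.
    """
    if "model_state_dict" in checkpoint:
        return checkpoint["model_state_dict"]
    return checkpoint


# Usage sketch (requires PyTorch; the path is a placeholder):
#   import torch
#   checkpoint = torch.load("/path/to/results_dir/quantized_model_modelopt.pth",
#                           map_location="cpu")
#   weights = extract_state_dict(checkpoint)
```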

Notes#

For modelopt static PTQ, ensure that your dataset configuration provides a representative calibration loader.
For torchao, activation settings in the configuration are ignored.

Calibration Dataset (ModelOpt)#

When you use the modelopt backend (static PTQ), provide a calibration dataset via dataset.quant_calibration_dataset.

Minimal example:

quantize:
  backend: "modelopt"
  mode: "static_ptq"
  algorithm: "minmax"
dataset:
  quant_calibration_dataset:
    images_dir: "/path/to/calib/images"

See also: TAO Quant overview and its Configuration and backend pages.

TensorRT engine generation and deploying to DeepStream#

Refer to the NVIDIA Spatial AI documentation page for more information about deploying a Sparse4D model to DeepStream via TensorRT engine generation.