Sparse4D#

Sparse4D is a multi-camera 3D detection and tracking model with 4D (spatial-temporal) capabilities. It takes synchronized images from multiple cameras, together with the camera calibration matrices, and outputs 3D bounding boxes with temporally consistent tracking IDs. The model is based on ResNet-101, a general-purpose backbone for computer vision.

Each batch in Sparse4D is trained on a group of cameras, called a bird's-eye view (BEV) group. A BEV group is a collection of cameras with overlapping fields of view.

The expected data requirements and time to fine-tune the Sparse4D model on a single scene of the MTMC Tracking 2025 dataset are as follows:

Estimated time for fine-tuning Sparse4D on a single scene of the MTMC Tracking 2025 dataset#

| Backbone type | GPU type | Image size | No. of BEV groups | No. of cameras in each BEV group | No. of frames in each camera | Total no. of epochs | Total training time |
|---|---|---|---|---|---|---|---|
| ResNet-101 | 8 x NVIDIA H100 80GB SXM | 3x512x1408 | 3 (minimum) | 4-12 | 9000 (5 min @ 30 FPS) | 5 | 10 hours |

Sparse4D supports the following tasks:

  • train

  • evaluate

  • inference

  • export

SPECS=$(tao-client sparse4d get-spec --action <sub_task> --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client sparse4d experiment-run-action --action <sub_task> --id $EXPERIMENT_ID --specs "$SPECS")

Required Arguments

  • --id: The unique identifier of the experiment on which to run the action
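
The spec returned by get-spec is a JSON document that you can inspect or edit before launching the action. The following is a minimal sketch of that workflow for the train action, assuming jq is installed and $EXPERIMENT_ID is set:

# Fetch the default train spec and pretty-print it for inspection (assumes jq is installed).
SPECS=$(tao-client sparse4d get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
echo "$SPECS" | jq .

# Launch the action with the (optionally edited) spec.
JOB_ID=$(tao-client sparse4d experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")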

See also

For information on how to create an experiment using the FTMS client, refer to the Creating an experiment section in the Remote Client Overview and Examples.

Data Input for Sparse4D#

The Sparse4D apps in TAO use the MTMC Tracking 2025 dataset for training, validation, and testing.

Refer to the MTMC Tracking 2025 dataset page in the PhysicalAI-SmartSpaces HuggingFace repository for more information about the raw dataset format. The dataset is converted into pickle format and stored in the data/sparse4d/ directory.
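
As a quick sanity check, you can confirm that a converted annotation pickle loads correctly before training. The file name below is hypothetical; substitute the pickle produced by your conversion step.

# Illustrative check only: confirm a converted annotation pickle can be loaded.
python -c "import pickle; print(type(pickle.load(open('data/sparse4d/train_ann.pkl', 'rb'))))"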

Creating an Experiment Spec File#

The spec file for Sparse4D includes model, dataset, train, visualize, evaluate, and inference parameters. The following is an example spec file for training a Sparse4D model on one scene of the MTMC Tracking 2025 dataset; this example uses the Warehouse_014 scene.

SPECS=$(tao-client sparse4d get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

The experiment specification consists of several main components:

  • dataset

  • model

  • train

  • evaluate

  • inference

  • export

  • visualize

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation. An example dataset config is provided below. This section describes the main parameters of the Omniverse3DDetTrackDatasetConfig.

dataset:
  use_h5_file_for_rgb: false
  use_h5_file_for_depth: true
  num_frames: 9000
  batch_size: 2
  num_bev_groups: 1
  num_workers: 2
  num_ids: 70
  classes: [
    "person",
    "gr1_t2",
    "agility_digit",
    "nova_carter",
  ]
  type: "omniverse_3d_det_track"
  data_root: ???
  train_dataset:
    ann_file: ???
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
  val_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  test_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  augmentation:
    resize_lim: [0.7, 0.77]
    final_dim: [512, 1408]
    bot_pct_lim: [0.0, 0.0]
    rot_lim: [-5.4, 5.4]
    image_size: [1080, 1920]
    rand_flip: true
    rot3d_range: [-0.3925, 0.3925]
  normalize:
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    to_rgb: true
  sequences:
    split_num: 100
    keep_consistent_aug: true
    same_scene_in_batch: true
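
Fields set to ??? are mandatory and have no default; they must be supplied before training. As a sketch (assuming jq, and that the FTMS JSON spec mirrors the YAML structure above), the required paths can be filled in on the fetched spec; the paths shown are hypothetical:

# Hypothetical dataset paths for illustration; point these at your converted data.
SPECS=$(echo "$SPECS" | jq '.dataset.data_root = "/data/sparse4d"
  | .dataset.train_dataset.ann_file = "/data/sparse4d/train_ann.pkl"
  | .dataset.val_dataset.ann_file = "/data/sparse4d/val_ann.pkl"
  | .dataset.test_dataset.ann_file = "/data/sparse4d/test_ann.pkl"')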

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Dataset type | omniverse_3d_det_track | | | | |
| batch_size | int | Batch size | 2 | 1 | infinity | | |
| use_h5_file_for_rgb | bool | Use H5 file for RGB images | False | | | | |
| use_h5_file_for_depth | bool | Use H5 file for depth maps | True | | | | |
| num_frames | int | Number of frames | 200 | 1 | infinity | | |
| num_bev_groups | int | Number of BEV groups | 1 | 1 | infinity | | |
| data_root | string | Path to data root | ??? | | | | |
| anno_root | string | Path to annotation root | ??? | | | | |
| classes | list | Classes to detect | ['person', 'humanoid', 'nova_carter', 'transporter', 'forklift', 'box', 'pallet', 'crate'] | | | | false |
| num_workers | int | Number of workers | 4 | 0 | infinity | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |
| augmentation | collection | Augmentation config | | | | | false |
| normalize | collection | Normalize config | | | | | false |
| sequences | collection | Sequences config | | | | | false |
| train_dataset | collection | Train dataset config | | | | | false |
| val_dataset | collection | Val dataset config | | | | | false |
| test_dataset | collection | Test dataset config | | | | | false |

Note

For the FTMS Client, these parameters are set in JSON format.

Train Dataset Configuration (dataset.train_dataset)#

Configuration for the training dataset.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation file | ??? | | | | |
| test_mode | bool | Test mode | False | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| with_seq_flag | bool | With sequence flag | True | | | | |
| sequences_split_num | int | Number of sequence splits | 100 | 1 | infinity | | |
| keep_consistent_seq_aug | bool | Keep consistent sequence augmentation | True | | | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |

Validation Dataset Configuration (dataset.val_dataset)#

Configuration for the validation dataset.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation pickle files/folders | ??? | | | | |
| test_mode | bool | Test mode | False | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| tracking | bool | Tracking | True | | | | |
| tracking_threshold | float | Tracking threshold | 0.2 | 0 | 1 | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |

Test Dataset Configuration (dataset.test_dataset)#

Configuration for the test dataset.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation pickle files/folders | ??? | | | | |
| test_mode | bool | Test mode | True | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| tracking | bool | Tracking | True | | | | |
| tracking_threshold | float | Tracking threshold | 0.2 | 0 | 1 | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |

Augmentation Configuration (dataset.augmentation)#

Configuration for data augmentation.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| resize_lim | list | Resize limits | [0.7, 0.77] | | | | false |
| final_dim | list | Final dimensions | [512, 1408] | | | | false |
| bot_pct_lim | list | Bottom percentage limits | [0.0, 0.0] | | | | false |
| rot_lim | list | Rotation limits in degrees | [-5.4, 5.4] | | | | false |
| image_size | list | Original image size | [1080, 1920] | | | | false |
| rand_flip | bool | Random flip | True | | | | |
| rot3d_range | list | 3D rotation range in radians | [-0.3925, 0.3925] | | | | false |

Normalize Configuration (dataset.normalize)#

Configuration for image normalization.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| mean | list | Mean values for normalization | [123.675, 116.28, 103.53] | | | | false |
| std | list | Standard deviation values for normalization | [58.395, 57.12, 57.375] | | | | false |
| to_rgb | bool | Convert to RGB | True | | | | |

Sequences Configuration (dataset.sequences)#

Configuration for handling image sequences.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| split_num | int | Number of sequence splits | 100 | 1 | infinity | | |
| keep_consistent_aug | bool | Keep consistent augmentation | True | | | | |
| same_scene_in_batch | bool | Keep same scene in batch | True | | | | |

model#

The model parameter provides options to change the Sparse4D architecture.

model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
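
Note that model.head.instance_bank.anchor is a mandatory (???) field and must point to an anchor file before training. A sketch of setting it on the fetched JSON spec, assuming jq and that the JSON keys mirror the YAML above (the file name is hypothetical):

# Hypothetical anchor file path; substitute the anchor file prepared for your dataset.
SPECS=$(echo "$SPECS" | jq '.model.head.instance_bank.anchor = "/data/sparse4d/anchor_900.npy"')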

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Model type | sparse4d | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| use_grid_mask | bool | Use grid mask | True | | | | |
| use_deformable_func | bool | Use deformable function | True | | | | |
| input_shape | list | Input image shape | [1408, 512] | | | | false |
| backbone | collection | Backbone config | | | | | false |
| neck | collection | Neck config | | | | | false |
| depth_branch | collection | Depth branch config | | | | | false |
| head | collection | Head config | | | | | false |
| use_temporal_align | bool | Use temporal alignment | False | | | | |

Note

For FTMS Client, these parameters are set in JSON format.

Backbone Configuration (model.backbone)#

Configuration for the model’s backbone network. Currently, only resnet_101 is supported.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Backbone type | resnet_101 | | | resnet_101 | |

Head Configuration (model.head)#

Top-level configuration for the detection and tracking head.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Head type | sparse4d | | | | |
| num_output | int | Number of output instances | 300 | 1 | infinity | | |
| cls_threshold_to_reg | float | Classification threshold for regression | 0.05 | 0 | 1 | | |
| decouple_attn | bool | Decouple attention | True | | | | |
| return_feature | bool | Return instance features | True | | | | |
| use_reid_sampling | bool | Use Re-ID sampling | False | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| reid_dims | int | Re-ID dimensions | 0 | 0 | infinity | | |
| num_groups | int | Number of groups | 8 | 1 | infinity | | |
| num_decoder | int | Number of decoder layers | 6 | 1 | infinity | | |
| num_single_frame_decoder | int | Number of single-frame decoder layers | 1 | 1 | infinity | | |
| drop_out | float | Dropout rate | 0.1 | 0 | 1 | | |
| temporal | bool | Enable temporal modeling | True | | | | |
| with_quality_estimation | bool | Enable quality estimation | True | | | | |
| operation_order | list | Operation order | ['deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine'] | | | | false |
| visibility_net | collection | Visibility net config | | | | | false |
| instance_bank | collection | Instance bank config | | | | | false |
| anchor_encoder | collection | Anchor encoder config | | | | | false |
| sampler | collection | Sampler config | | | | | false |
| reg_weights | list | Regression weights | [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] | | | | false |
| loss | collection | Loss config | | | | | false |
| bnneck | collection | BN neck config | | | | | false |
| deformable_model | collection | Deformable model config | | | | | false |
| refine_layer | collection | Refine layer config | | | | | false |
| valid_vel_weight | float | Valid velocity weight | -1 | -1 | infinity | | |
| graph_model | collection | Graph model config | | | | | false |
| temp_graph_model | collection | Temp graph model config | | | | | false |
| decoder | collection | Decoder config | | | | | false |
| norm_layer | collection | Norm layer config | | | | | false |
| ffn | collection | FFN config | | | | | false |

Deformable Model Configuration (model.head.deformable_model)#

Configuration for the deformable attention mechanism.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_groups | int | Number of groups | 8 | 1 | infinity | | |
| num_levels | int | Number of levels | 4 | 1 | infinity | | |
| attn_drop | float | Attention dropout | 0.15 | 0 | 1 | | |
| use_deformable_func | bool | Use deformable function | True | | | | |
| use_camera_embed | bool | Use camera embedding | False | | | | |
| residual_mode | categorical | Residual mode | cat | | | cat,add | |
| num_cams | int | Number of cameras | 6 | 1 | infinity | | |
| max_num_cams | int | Maximum number of cameras | 20 | 1 | infinity | | |
| proj_drop | float | Projection dropout | 0.0 | 0 | 1 | | |
| kps_generator | collection | KPS generator config | | | | | false |

Instance Bank Configuration (model.head.instance_bank)#

Configuration for managing object instances over time.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_anchor | int | Number of anchors | 900 | 1 | infinity | | |
| anchor | string | Path to anchor file | | | | | |
| num_temp_instances | int | Number of temporal instances | 600 | 0 | infinity | | |
| confidence_decay | float | Confidence decay factor | 0.8 | 0 | 1 | | |
| feat_grad | bool | Enable gradients for features | False | | | | |
| default_time_interval | float | Default time interval | 0.033333 | 0 | infinity | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
| grid_size | float | Grid size | | | | | |

Anchor Encoder Configuration (model.head.anchor_encoder)#

Configuration for encoding anchor information.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Anchor encoder type | SparseBox3DEncoder | | | | |
| vel_dims | int | Velocity dimensions | 3 | 1 | infinity | | |
| embed_dims | list | Embedding dimensions | [128, 32, 32, 64] | | | | false |
| mode | categorical | Mode | cat | | | cat,add | |
| output_fc | bool | Fully connected layer | False | | | | |
| in_loops | int | In loops | 1 | 1 | infinity | | |
| out_loops | int | Out loops | 4 | 1 | infinity | | |
| pos_embed_only | bool | Pos embed only | False | | | | |

Sampler Configuration (model.head.sampler)#

Configuration for sampling positive and negative examples during training.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_dn_groups | int | Number of de-noising (DN) groups | 5 | 1 | infinity | | |
| num_temp_dn_groups | int | Number of temporal DN groups | 3 | 0 | infinity | | |
| dn_noise_scale | list | De-noising scale | [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] | | | | false |
| max_dn_gt | int | Maximum DN ground truth | 128 | 1 | infinity | | |
| add_neg_dn | bool | Add negative DN | True | | | | |
| cls_weight | float | Classification weight | 2.0 | 0 | infinity | | |
| box_weight | float | Box weight | 0.25 | 0 | infinity | | |
| reg_weights | list | Regression weights | [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0] | | | | false |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
| gt_assign_threshold | float | Ground truth assign threshold | 0.5 | 0 | 1 | | |

Loss Configuration (model.head.loss)#

This section details the different loss components used in the model head.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| cls | collection | Classification loss config | | | | | false |
| reg | collection | Regression loss config | | | | | false |
| id | collection | ID loss config | | | | | false |

Classification Loss (model.head.loss.cls)#

Configuration for the classification loss.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Classification loss type | focal | | | | |
| use_sigmoid | bool | Use sigmoid | True | | | | |
| gamma | float | Focal loss gamma | 2.0 | 0 | infinity | | |
| alpha | float | Focal loss alpha | 0.25 | 0 | 1 | | |
| loss_weight | float | Loss weight | 2.0 | 0 | infinity | | |

Regression Loss (model.head.loss.reg)#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Regression loss type | sparse_box_3d | | | | |
| box_weight | float | Box loss weight | 0.25 | 0 | infinity | | |
| cls_allow_reverse | list | Classes for which reversed box orientation is allowed | [] | | | | false |

ID Loss (model.head.loss.id)#

Configuration for the ID / Re-ID loss.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | ID loss type | cross_entropy_label_smooth | | | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |

BNNeck Configuration (model.head.bnneck)#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Batch normalization neck type | bnneck | | | | |
| feat_dim | int | Feature dimension | 256 | 1 | infinity | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |

KPS Generator Configuration (model.head.deformable_model.kps_generator)#

Configuration for the keypoint sampling (KPS) generator.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_learnable_pts | int | Number of learnable points | 6 | 1 | infinity | | |
| fix_scale | list | Fixed scale | [[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0], [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45], [0, 0, -0.45]] | | | | false |

Refine Layer Configuration (model.head.refine_layer)#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Refine layer type | sparse_box_3d_refinement_module | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| refine_yaw | bool | Refine yaw | True | | | | |
| with_quality_estimation | bool | With quality estimation | True | | | | |

Graph Model Configuration (model.head.graph_model and model.head.temp_graph_model)#

Configuration for graph-based modeling (e.g., GNN or attention) used for spatial and temporal relations.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Graph model type | MultiheadAttention | | | | |
| embed_dims | int | Embedding dimensions | 512 | 1 | infinity | | |
| num_heads | int | Number of heads | 8 | 1 | infinity | | |
| batch_first | bool | Batch first | True | | | | |
| dropout | float | Dropout rate | 0.1 | 0 | 1 | | |

Decoder Configuration (model.head.decoder)#

Configuration for the final output decoder.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Decoder type | SparseBox3DDecoder | | | | |
| score_threshold | float | Score threshold | 0.05 | 0 | 1 | | |

Norm Layer Configuration (model.head.norm_layer and model.head.ffn.pre_norm)#

Configuration for normalization layers.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Norm layer type | LN | | | | |
| normalized_shape | int | Normalized shape | 256 | 1 | infinity | | |

FFN Configuration (model.head.ffn)#

Configuration for Feed-Forward Networks used in the decoder layers.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | FFN type | AsymmetricFFN | | | | |
| in_channels | int | In channels | 512 | 1 | infinity | | |
| pre_norm | collection | Pre-norm config | | | | | false |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| feedforward_channels | int | Feedforward channels | 1024 | 1 | infinity | | |
| num_fcs | int | Number of fully connected layers | 2 | 1 | infinity | | |
| ffn_drop | float | FFN dropout | 0.1 | 0 | 1 | | |
| act_cfg | collection | Activation config | | | | | false |

Activation Configuration (model.head.ffn.act_cfg)#

Configuration for activation functions.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Activation type | ReLU | | | | |
| inplace | bool | Inplace | True | | | | |

Visibility Net Configuration (model.head.visibility_net)#

Configuration for the visibility prediction network.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | VisibilityNet type | visibility_net | | | | |
| embedding_dim | int | Embedding dimension | 256 | 1 | infinity | | |
| hidden_channels | int | Hidden channels | 32 | 1 | infinity | | |

Neck Configuration (model.neck)#

Configuration for the model’s neck (Feature Pyramid Network).

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Neck type (Feature Pyramid Network) | FPN | | | FPN | |
| num_outs | int | Number of output feature levels | 4 | 1 | infinity | | |
| start_level | int | Start level for FPN | 0 | 0 | infinity | | |
| out_channels | int | Output channels | 256 | 1 | infinity | | |
| in_channels | list | Input channels | [256, 512, 1024, 2048] | | | | false |
| add_extra_convs | categorical | Type of extra conv | on_output | | | on_input,on_lateral,on_output,False | |
| relu_before_extra_convs | bool | Apply ReLU before extra convs | True | | | | |

Depth Branch Configuration (model.depth_branch)#

Configuration for the depth estimation branch.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Depth branch type | dense_depth | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_depth_layers | int | Number of depth layers | 3 | 1 | infinity | | |
| loss_weight | float | Weight for depth loss | 0.2 | 0 | infinity | | |

train#

The train config contains the parameters related to training. They are described as follows:

train:
  num_epochs: 5
  num_nodes: 1
  num_gpus: 1
  validation_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: ???
  precision: bf16
  optim:
    type: "adamw"
    lr: 0.0001
    weight_decay: 0.001
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.25
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: "cosine"
      warmup: "linear"
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001
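
To match the 8-GPU setup from the estimate table above, you can adjust the training parameters on the fetched spec before launching the job; a sketch assuming jq and that the JSON keys mirror the YAML above:

# Example: train for 5 epochs on 8 GPUs on a single node.
SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 5 | .train.num_gpus = 8 | .train.num_nodes = 1')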

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the train job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus | [0] | | | | false |
| num_nodes | int | Number of nodes to run the training on. If > 1, then multi-node is enabled | 1 | 1 | | | |
| seed | int | The seed for the initializer in PyTorch. If < 0, disable fixed seed | 1234 | -1 | infinity | | |
| cudnn | collection | cuDNN configuration | | | | | false |
| num_epochs | int | Number of epochs to run the training | 10 | 1 | infinity | | |
| checkpoint_interval | float | Checkpoint interval in epochs | 0.5 | 0 | infinity | | |
| validation_interval | float | Validation interval in epochs | 0.5 | 0 | infinity | | |
| resume_training_checkpoint_path | string | Path to the checkpoint to resume training from | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| pretrained_model_path | string | Path to pretrained model | | | | | |
| optim | collection | Optimizer configuration | | | | | false |
| precision | categorical | Precision | bf16 | | | bf16,fp16,fp32 | |

Note

For FTMS Client, these parameters are set in JSON format.

optim#

The optim parameter defines the config for the AdamW optimizer in training, including the learning rate, learning scheduler, and weight decay.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Optimizer type | adamw | | | adamw,adam,sgd | |
| lr | float | Learning rate | 5e-05 | 0 | infinity | | TRUE |
| weight_decay | float | Weight decay coefficient | 0.001 | | | | |
| momentum | float | Momentum for SGD | 0.9 | | | | |
| paramwise_cfg | collection | Parameter-wise configuration | {'custom_keys': {'img_backbone': {'lr_mult': 0.2}}} | | | | false |
| grad_clip | collection | Gradient clipping configuration | {'max_norm': 25, 'norm_type': 'L2'} | | | | false |
| lr_scheduler | collection | Learning rate scheduler configuration | {'policy': 'cosine', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.333333, 'min_lr_ratio': 0.001} | | | | false |

evaluate#

The evaluate config contains the parameters related to evaluation. Currently, we only support evaluation on a single GPU with batch size 1. The parameters are described as follows:

evaluate:
  checkpoint: ${results_dir}/train/sparse4d_model_latest.pth

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to the checkpoint used for evaluation | ??? | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| metrics | list | Metrics to evaluate | ['detection'] | | | | false |
| tracking | collection | Tracking config | | | | | false |

Note

For FTMS Client these parameters are set in JSON format, and the evaluate checkpoint is deduced from the previous train job ID as specified with the --parent_job_id argument. For TAO Launcher, you must set the path in the evaluate specification.

visualize#

The visualize config contains the parameters related to visualization. They are described as follows:

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| show | bool | Show visualization | True | | | | |
| vis_dir | string | Visualization directory | ./vis | | | | |
| vis_score_threshold | float | Visualization score threshold | 0.25 | 0 | 1 | | |
| n_images_col | int | Number of images per column | 6 | 1 | infinity | | |
| viz_down_sample | int | Visualization down sample | 3 | 1 | infinity | | |

inference#

The inference config contains the parameters related to inference. Currently, we only support inference on a single GPU with batch size 1. The parameters are described as follows:

inference:
  checkpoint: ???
  output_nvschema: true
  jsonfile_prefix: "sparse4d_pred"

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to checkpoint file | ??? | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| jsonfile_prefix | string | JSON file prefix | sparse4d_pred | | | | |
| output_nvschema | bool | Output NVSchema | True | | | | |
| tracking | collection | Tracking config | | | | | false |

Note

For FTMS Client these parameters are set in JSON format, and the inference checkpoint is deduced from the previous train job ID as specified with the --parent_job_id argument. For TAO Launcher, you must set the path in the inference specification.

export#

The export config contains the parameters related to export. Currently, we only support export with batch size 1 and a dynamic number of camera sensors. The parameters are described as follows:

export:
  results_dir: ???
  checkpoint: ???
  onnx_file: ???

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| checkpoint | string | Path to the checkpoint file to run export | ??? | | | | |
| onnx_file | string | Path to the ONNX model file | ??? | | | | |

Note

For FTMS Client these parameters are set in JSON format, and the export checkpoint is deduced from the previous train job ID as specified with the --parent_job_id argument. For TAO Launcher, you must set the path in the export specification.

Training the Model#

Use the following command to run Sparse4D training:

TRAIN_JOB_ID=$(tao-client sparse4d experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

Evaluating the Model#

The evaluation metrics for Sparse4D are the mean average precision and ranked accuracy.

Use the following command to run Sparse4D evaluation:

EVALUATE_JOB_ID=$(tao-client sparse4d experiment-run-action --action evaluate --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)

Running Inference on the Model#

Use the following command to run inference on Sparse4D with the .pth model.

The output is a file of JSON records containing the object detection and tracking results for each frame.

INFERENCE_JOB_ID=$(tao-client sparse4d experiment-run-action --action inference --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)

The expected output is as follows:

{
    "version": "4.0",
    "id": "1", # Frame ID
    "sensorId": "bev-sensor-zone-c4", # BEV Sensor ID
    "timestamp": "2025-01-15T10:30:00.123Z", # Timestamp
    "objects": [
      {
        "id": "1", # Object ID
        "type": "Person", # Object Type
        "confidence": 0.887, # Object Confidence Score
        "coordinate": {
          "x": -1.5, # Object Center X Coordinate
          "y": 3.2, # Object Center Y Coordinate
          "z": 0.75 # Object Center Z Coordinate
        },
        "bbox3d": {
          "coordinates": [
            -1.5, # Object Centroid X Coordinate
            3.2, # Object Centroid Y Coordinate
            0.75, # Object Centroid Z Coordinate
            0.5, # Object Width
            0.5, # Object Length
            0.5, # Object Height
            0.0, # Object Pitch
            0.0, # Object Roll
            1.57 # Object Yaw
          ],
          "embedding": [
            {} # Object Embedding
          ],
          "confidence": 0.887 # Object Confidence Score
        }
      },
      {
        "id": "2",
        "type": "Humanoid",
        "confidence": 0.752,
        "coordinate": {
          "x": 5.1,
          "y": -2.8,
          "z": 0.15
        },
        "bbox3d": {
          "coordinates": [
            5.1,
            -2.8,
            0.15,
            1.2,
            1.0,
            0.2,
            0.0,
            0.0,
            -1.04
          ],
          "embedding": [
            {}
          ],
          "confidence": 0.752
        }
      }
    ]
  }
  {
    # ... more frames
  }
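
Because the results follow the schema above, they can be post-processed with standard JSON tooling. For example, the following sketch counts the detected objects per frame; it assumes jq is installed and that the output file is named after the jsonfile_prefix setting (the actual file name and location may differ):

# Print frame ID, sensor ID, and object count for each frame record.
jq -r '"frame \(.id) sensor \(.sensorId): \(.objects | length) objects"' sparse4d_pred.json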

Exporting the Model#

Use the following command to export Sparse4D to .onnx format for deployment:

EXPORT_JOB_ID=$(tao-client sparse4d experiment-run-action --action export --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)

TensorRT Engine Generation and Deployment to DeepStream#

Refer to the NVIDIA Spatial AI documentation for more information about generating a TensorRT engine from a Sparse4D model and deploying it to DeepStream.
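
As a general sketch only (refer to the documentation above for the exact flags, particularly for handling the dynamic camera dimension), a TensorRT engine can typically be built from the exported ONNX file with trtexec:

# Minimal trtexec invocation; file names and the precision flag are illustrative.
trtexec --onnx=sparse4d_model.onnx --saveEngine=sparse4d_model.engine --fp16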