Sparse4D#

Sparse4D is a multi-camera 3D detection and tracking model with 4D (spatial-temporal) capabilities. It takes synchronized images from multiple cameras, together with the camera calibration matrices, and outputs 3D bounding boxes with temporally consistent tracking IDs. The model is based on ResNet-101, a general-purpose backbone for computer vision.

Each batch in Sparse4D is trained on a group of cameras, called a bird's-eye view (BEV) group. A BEV group is a collection of cameras with overlapping fields of view.

The expected data requirements and time to fine-tune the Sparse4D model on a single scene of the MTMC Tracking 2025 dataset are as follows:

Estimated time for fine-tuning Sparse4D on a single scene of the MTMC Tracking 2025 dataset#

| Backbone type | GPU type | Image size | No. of BEV groups | No. of cameras in each BEV group | No. of frames in each camera | Total no. of epochs | Total training time |
|---|---|---|---|---|---|---|---|
| ResNet-101 | 8 x NVIDIA H100 80GB SXM | 3x512x1408 | 3 (minimum) | 4-12 | 9000 (5 min @ 30 FPS) | 5 | 10 hours |

Sparse4D supports the following tasks:

  • train

  • evaluate

  • inference

  • export

SPECS=$(tao-client sparse4d get-spec --action <sub_task> --job_type experiment --id $EXPERIMENT_ID)

JOB_ID=$(tao-client sparse4d experiment-run-action --action <sub_task> --id $EXPERIMENT_ID --specs "$SPECS")

Required Arguments

  • --id: The unique identifier of the experiment on which to run the action
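
The spec returned by get-spec is a JSON document that you can inspect or edit before launching the action. The following is a minimal sketch of that workflow for the train action, assuming jq is installed and $EXPERIMENT_ID is set:

# Fetch the default train spec and pretty-print it for inspection (assumes jq is installed).
SPECS=$(tao-client sparse4d get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
echo "$SPECS" | jq .

# Launch the action with the (optionally edited) spec.
JOB_ID=$(tao-client sparse4d experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")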

See also

For information on how to create an experiment using the FTMS client, refer to the Creating an experiment section in the Remote Client Overview and Examples.

Data Input for Sparse4D#

The Sparse4D apps in TAO use the MTMC Tracking 2025 dataset for training, validation, and testing.

Refer to the MTMC Tracking 2025 dataset page in the PhysicalAI-SmartSpaces HuggingFace repository for more information about the raw dataset format. The dataset is converted into pickle format and stored in the data/sparse4d/ directory.
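
As a quick sanity check, you can confirm that a converted annotation pickle loads correctly before training. The file name below is hypothetical; substitute the pickle produced by your conversion step.

# Illustrative check only: confirm a converted annotation pickle can be loaded.
python -c "import pickle; print(type(pickle.load(open('data/sparse4d/train_ann.pkl', 'rb'))))"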

Creating an Experiment Spec File#

The spec file for Sparse4D includes model, dataset, train, visualize, evaluate, and inference parameters. The following is an example spec file for training a Sparse4D model on one scene of the MTMC Tracking 2025 dataset; this example uses the Warehouse_014 scene.

SPECS=$(tao-client sparse4d get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

The experiment specification consists of several main components:

  • dataset

  • model

  • train

  • evaluate

  • inference

  • export

  • visualize

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation. An example dataset config is provided below. This section describes the main parameters of the Omniverse3DDetTrackDatasetConfig.

dataset:
  use_h5_file_for_rgb: false
  use_h5_file_for_depth: true
  num_frames: 9000
  batch_size: 2
  num_bev_groups: 1
  num_workers: 2
  num_ids: 70
  classes: [
    "person",
    "gr1_t2",
    "agility_digit",
    "nova_carter",
  ]
  type: "omniverse_3d_det_track"
  data_root: ???
  train_dataset:
    ann_file: ???
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
  val_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  test_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  augmentation:
    resize_lim: [0.7, 0.77]
    final_dim: [512, 1408]
    bot_pct_lim: [0.0, 0.0]
    rot_lim: [-5.4, 5.4]
    image_size: [1080, 1920]
    rand_flip: true
    rot3d_range: [-0.3925, 0.3925]
  normalize:
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    to_rgb: true
  sequences:
    split_num: 100
    keep_consistent_aug: true
    same_scene_in_batch: true
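
Fields set to ??? are mandatory and have no default; they must be supplied before training. As a sketch (assuming jq, and that the FTMS JSON spec mirrors the YAML structure above), the required paths can be filled in on the fetched spec; the paths shown are hypothetical:

# Hypothetical dataset paths for illustration; point these at your converted data.
SPECS=$(echo "$SPECS" | jq '.dataset.data_root = "/data/sparse4d"
  | .dataset.train_dataset.ann_file = "/data/sparse4d/train_ann.pkl"
  | .dataset.val_dataset.ann_file = "/data/sparse4d/val_ann.pkl"
  | .dataset.test_dataset.ann_file = "/data/sparse4d/test_ann.pkl"')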

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Dataset type | omniverse_3d_det_track | | | | |
| batch_size | int | Batch size | 2 | 1 | infinity | | |
| use_h5_file_for_rgb | bool | Use H5 file for RGB images | False | | | | |
| use_h5_file_for_depth | bool | Use H5 file for depth maps | True | | | | |
| num_frames | int | Number of frames | 200 | 1 | infinity | | |
| num_bev_groups | int | Number of BEV groups | 1 | 1 | infinity | | |
| data_root | string | Path to data root | ??? | | | | |
| anno_root | string | Path to annotation root | ??? | | | | |
| classes | list | Classes to detect | ['person', 'humanoid', 'nova_carter', 'transporter', 'forklift', 'box', 'pallet', 'crate'] | | | | false |
| num_workers | int | Number of workers | 4 | 0 | infinity | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |
| augmentation | collection | Augmentation config | | | | | false |
| normalize | collection | Normalize config | | | | | false |
| sequences | collection | Sequences config | | | | | false |
| train_dataset | collection | Train dataset config | | | | | false |
| val_dataset | collection | Val dataset config | | | | | false |
| test_dataset | collection | Test dataset config | | | | | false |

Note

For the FTMS Client, these parameters are set in JSON format.

Train Dataset Configuration (dataset.train_dataset)#

Configuration for the training dataset.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation file | ??? | | | | |
| test_mode | bool | Test mode | False | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| with_seq_flag | bool | With sequence flag | True | | | | |
| sequences_split_num | int | Number of sequence splits | 100 | 1 | infinity | | |
| keep_consistent_seq_aug | bool | Keep consistent sequence augmentation | True | | | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |

Validation Dataset Configuration (dataset.val_dataset)#

Configuration for the validation dataset.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation pickle files/folders | ??? | | | | |
| test_mode | bool | Test mode | False | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| tracking | bool | Tracking | True | | | | |
| tracking_threshold | float | Tracking threshold | 0.2 | 0 | 1 | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |

Test Dataset Configuration (dataset.test_dataset)#

Configuration for the test dataset.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation pickle files/folders | ??? | | | | |
| test_mode | bool | Test mode | True | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| tracking | bool | Tracking | True | | | | |
| tracking_threshold | float | Tracking threshold | 0.2 | 0 | 1 | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |

Augmentation Configuration (dataset.augmentation)#

Configuration for data augmentation.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| resize_lim | list | Resize limits | [0.7, 0.77] | | | | false |
| final_dim | list | Final dimensions | [512, 1408] | | | | false |
| bot_pct_lim | list | Bottom percentage limits | [0.0, 0.0] | | | | false |
| rot_lim | list | Rotation limits in degrees | [-5.4, 5.4] | | | | false |
| image_size | list | Original image size | [1080, 1920] | | | | false |
| rand_flip | bool | Random flip | True | | | | |
| rot3d_range | list | 3D rotation range in radians | [-0.3925, 0.3925] | | | | false |

Normalize Configuration (dataset.normalize)#

Configuration for image normalization.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| mean | list | Mean values for normalization | [123.675, 116.28, 103.53] | | | | false |
| std | list | Standard deviation values for normalization | [58.395, 57.12, 57.375] | | | | false |
| to_rgb | bool | Convert to RGB | True | | | | |

Sequences Configuration (dataset.sequences)#

Configuration for handling image sequences.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| split_num | int | Number of sequence splits | 100 | 1 | infinity | | |
| keep_consistent_aug | bool | Keep consistent augmentation | True | | | | |
| same_scene_in_batch | bool | Keep same scene in batch | True | | | | |

model#

The model parameter provides options to change the Sparse4D architecture.

model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
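
Note that model.head.instance_bank.anchor is a mandatory (???) field and must point to an anchor file before training. A sketch of setting it on the fetched JSON spec, assuming jq and that the JSON keys mirror the YAML above (the file name is hypothetical):

# Hypothetical anchor file path; substitute the anchor file prepared for your dataset.
SPECS=$(echo "$SPECS" | jq '.model.head.instance_bank.anchor = "/data/sparse4d/anchor_900.npy"')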

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Model type | sparse4d | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| use_grid_mask | bool | Use grid mask | True | | | | |
| use_deformable_func | bool | Use deformable function | True | | | | |
| input_shape | list | Input image shape | [1408, 512] | | | | false |
| backbone | collection | Backbone config | | | | | false |
| neck | collection | Neck config | | | | | false |
| depth_branch | collection | Depth branch config | | | | | false |
| head | collection | Head config | | | | | false |
| use_temporal_align | bool | Use temporal alignment | False | | | | |

Note

For FTMS Client, these parameters are set in JSON format.

Backbone Configuration (model.backbone)#

Configuration for the model’s backbone network. Currently, only resnet_101 is supported.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Backbone type | resnet_101 | | | resnet_101 | |

Head Configuration (model.head)#

Top-level configuration for the detection and tracking head.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Head type | sparse4d | | | | |
| num_output | int | Number of output instances | 300 | 1 | infinity | | |
| cls_threshold_to_reg | float | Classification threshold for regression | 0.05 | 0 | 1 | | |
| decouple_attn | bool | Decouple attention | True | | | | |
| return_feature | bool | Return instance features | True | | | | |
| use_reid_sampling | bool | Use Re-ID sampling | False | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| reid_dims | int | Re-ID dimensions | 0 | 0 | infinity | | |
| num_groups | int | Number of groups | 8 | 1 | infinity | | |
| num_decoder | int | Number of decoder layers | 6 | 1 | infinity | | |
| num_single_frame_decoder | int | Number of single-frame decoder layers | 1 | 1 | infinity | | |
| drop_out | float | Dropout rate | 0.1 | 0 | 1 | | |
| temporal | bool | Enable temporal modeling | True | | | | |
| with_quality_estimation | bool | Enable quality estimation | True | | | | |
| operation_order | list | Operation order | ['deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine'] | | | | false |
| visibility_net | collection | Visibility net config | | | | | false |
| instance_bank | collection | Instance bank config | | | | | false |
| anchor_encoder | collection | Anchor encoder config | | | | | false |
| sampler | collection | Sampler config | | | | | false |
| reg_weights | list | Regression weights | [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] | | | | false |
| loss | collection | Loss config | | | | | false |
| bnneck | collection | BN neck config | | | | | false |
| deformable_model | collection | Deformable model config | | | | | false |
| refine_layer | collection | Refine layer config | | | | | false |
| valid_vel_weight | float | Valid velocity weight | -1 | -1 | infinity | | |
| graph_model | collection | Graph model config | | | | | false |
| temp_graph_model | collection | Temp graph model config | | | | | false |
| decoder | collection | Decoder config | | | | | false |
| norm_layer | collection | Norm layer config | | | | | false |
| ffn | collection | FFN config | | | | | false |

Deformable Model Configuration (model.head.deformable_model)#

Configuration for the deformable attention mechanism.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_groups | int | Number of groups | 8 | 1 | infinity | | |
| num_levels | int | Number of levels | 4 | 1 | infinity | | |
| attn_drop | float | Attention dropout | 0.15 | 0 | 1 | | |
| use_deformable_func | bool | Use deformable function | True | | | | |
| use_camera_embed | bool | Use camera embedding | False | | | | |
| residual_mode | categorical | Residual mode | cat | | | cat,add | |
| num_cams | int | Number of cameras | 6 | 1 | infinity | | |
| max_num_cams | int | Maximum number of cameras | 20 | 1 | infinity | | |
| proj_drop | float | Projection dropout | 0.0 | 0 | 1 | | |
| kps_generator | collection | KPS generator config | | | | | false |

Instance Bank Configuration (model.head.instance_bank)#

Configuration for managing object instances over time.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_anchor | int | Number of anchors | 900 | 1 | infinity | | |
| anchor | string | Path to anchor file | | | | | |
| num_temp_instances | int | Number of temporal instances | 600 | 0 | infinity | | |
| confidence_decay | float | Confidence decay factor | 0.8 | 0 | 1 | | |
| feat_grad | bool | Enable gradients for features | False | | | | |
| default_time_interval | float | Default time interval | 0.033333 | 0 | infinity | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
| grid_size | float | Grid size | | | | | |

Anchor Encoder Configuration (model.head.anchor_encoder)#

Configuration for encoding anchor information.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Anchor encoder type | SparseBox3DEncoder | | | | |
| vel_dims | int | Velocity dimensions | 3 | 1 | infinity | | |
| embed_dims | list | Embedding dimensions | [128, 32, 32, 64] | | | | false |
| mode | categorical | Mode | cat | | | cat,add | |
| output_fc | bool | Fully connected layer | False | | | | |
| in_loops | int | In loops | 1 | 1 | infinity | | |
| out_loops | int | Out loops | 4 | 1 | infinity | | |
| pos_embed_only | bool | Pos embed only | False | | | | |

Sampler Configuration (model.head.sampler)#

Configuration for sampling positive and negative examples during training.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_dn_groups | int | Number of de-noising (DN) groups | 5 | 1 | infinity | | |
| num_temp_dn_groups | int | Number of temporal DN groups | 3 | 0 | infinity | | |
| dn_noise_scale | list | De-noising scale | [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] | | | | false |
| max_dn_gt | int | Maximum DN ground truth | 128 | 1 | infinity | | |
| add_neg_dn | bool | Add negative DN | True | | | | |
| cls_weight | float | Classification weight | 2.0 | 0 | infinity | | |
| box_weight | float | Box weight | 0.25 | 0 | infinity | | |
| reg_weights | list | Regression weights | [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0] | | | | false |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
| gt_assign_threshold | float | Ground truth assign threshold | 0.5 | 0 | 1 | | |

Loss Configuration (model.head.loss)#

This section details the different loss components used in the model head.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| cls | collection | Classification loss config | | | | | false |
| reg | collection | Regression loss config | | | | | false |
| id | collection | ID loss config | | | | | false |

Classification Loss (model.head.loss.cls)#

Configuration for the classification loss.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Classification loss type | focal | | | | |
| use_sigmoid | bool | Use sigmoid | True | | | | |
| gamma | float | Focal loss gamma | 2.0 | 0 | infinity | | |
| alpha | float | Focal loss alpha | 0.25 | 0 | 1 | | |
| loss_weight | float | Loss weight | 2.0 | 0 | infinity | | |

Regression Loss (model.head.loss.reg)#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Regression loss type | sparse_box_3d | | | | |
| box_weight | float | Box loss weight | 0.25 | 0 | infinity | | |
| cls_allow_reverse | list | Classes for which reversed box orientation is allowed | [] | | | | false |

ID Loss (model.head.loss.id)#

Configuration for the ID / Re-ID loss.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | ID loss type | cross_entropy_label_smooth | | | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |

BNNeck Configuration (model.head.bnneck)#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Batch normalization neck type | bnneck | | | | |
| feat_dim | int | Feature dimension | 256 | 1 | infinity | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |

KPS Generator Configuration (model.head.deformable_model.kps_generator)#

Configuration for the keypoint sampling (KPS) generator.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_learnable_pts | int | Number of learnable points | 6 | 1 | infinity | | |
| fix_scale | list | Fixed scale | [[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0], [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45], [0, 0, -0.45]] | | | | false |

Refine Layer Configuration (model.head.refine_layer)#

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Refine layer type | sparse_box_3d_refinement_module | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| refine_yaw | bool | Refine yaw | True | | | | |
| with_quality_estimation | bool | With quality estimation | True | | | | |

Graph Model Configuration (model.head.graph_model and model.head.temp_graph_model)#

Configuration for graph-based modeling (e.g., GNN or attention) used for spatial and temporal relations.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Graph model type | MultiheadAttention | | | | |
| embed_dims | int | Embedding dimensions | 512 | 1 | infinity | | |
| num_heads | int | Number of heads | 8 | 1 | infinity | | |
| batch_first | bool | Batch first | True | | | | |
| dropout | float | Dropout rate | 0.1 | 0 | 1 | | |

Decoder Configuration (model.head.decoder)#

Configuration for the final output decoder.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Decoder type | SparseBox3DDecoder | | | | |
| score_threshold | float | Score threshold | 0.05 | 0 | 1 | | |

Norm Layer Configuration (model.head.norm_layer and model.head.ffn.pre_norm)#

Configuration for normalization layers.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Norm layer type | LN | | | | |
| normalized_shape | int | Normalized shape | 256 | 1 | infinity | | |

FFN Configuration (model.head.ffn)#

Configuration for Feed-Forward Networks used in the decoder layers.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | FFN type | AsymmetricFFN | | | | |
| in_channels | int | In channels | 512 | 1 | infinity | | |
| pre_norm | collection | Pre-norm config | | | | | false |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| feedforward_channels | int | Feedforward channels | 1024 | 1 | infinity | | |
| num_fcs | int | Number of fully connected layers | 2 | 1 | infinity | | |
| ffn_drop | float | FFN dropout | 0.1 | 0 | 1 | | |
| act_cfg | collection | Activation config | | | | | false |

Activation Configuration (model.head.ffn.act_cfg)#

Configuration for activation functions.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Activation type | ReLU | | | | |
| inplace | bool | Inplace | True | | | | |

Visibility Net Configuration (model.head.visibility_net)#

Configuration for the visibility prediction network.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | VisibilityNet type | visibility_net | | | | |
| embedding_dim | int | Embedding dimension | 256 | 1 | infinity | | |
| hidden_channels | int | Hidden channels | 32 | 1 | infinity | | |

Neck Configuration (model.neck)#

Configuration for the model’s neck (Feature Pyramid Network).

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Neck type (Feature Pyramid Network) | FPN | | | FPN | |
| num_outs | int | Number of output feature levels | 4 | 1 | infinity | | |
| start_level | int | Start level for FPN | 0 | 0 | infinity | | |
| out_channels | int | Output channels | 256 | 1 | infinity | | |
| in_channels | list | Input channels | [256, 512, 1024, 2048] | | | | false |
| add_extra_convs | categorical | Type of extra conv | on_output | | | on_input,on_lateral,on_output,False | |
| relu_before_extra_convs | bool | Apply ReLU before extra convs | True | | | | |

Depth Branch Configuration (model.depth_branch)#

Configuration for the depth estimation branch.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Depth branch type | dense_depth | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_depth_layers | int | Number of depth layers | 3 | 1 | infinity | | |
| loss_weight | float | Weight for depth loss | 0.2 | 0 | infinity | | |

train#

The train config contains the parameters related to training. They are described as follows:

train:
  num_epochs: 5
  num_nodes: 1
  num_gpus: 1
  validation_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: ???
  precision: bf16
  optim:
    type: "adamw"
    lr: 0.0001
    weight_decay: 0.001
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.25
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: "cosine"
      warmup: "linear"
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001
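
To match the 8-GPU setup from the estimate table above, you can adjust the training parameters on the fetched spec before launching the job; a sketch assuming jq and that the JSON keys mirror the YAML above:

# Example: train for 5 epochs on 8 GPUs on a single node.
SPECS=$(echo "$SPECS" | jq '.train.num_epochs = 5 | .train.num_gpus = 8 | .train.num_nodes = 1')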

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the train job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus | [0] | | | | false |
| num_nodes | int | Number of nodes to run the training on. If > 1, then multi-node is enabled | 1 | 1 | | | |
| seed | int | The seed for the initializer in PyTorch. If < 0, disable fixed seed | 1234 | -1 | infinity | | |
| cudnn | collection | cuDNN configuration | | | | | false |
| num_epochs | int | Number of epochs to run the training | 10 | 1 | infinity | | |
| checkpoint_interval | float | Checkpoint interval in epochs | 0.5 | 0 | infinity | | |
| validation_interval | float | Validation interval in epochs | 0.5 | 0 | infinity | | |
| resume_training_checkpoint_path | string | Path to the checkpoint to resume training from | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| pretrained_model_path | string | Path to pretrained model | | | | | |
| optim | collection | Optimizer configuration | | | | | false |
| precision | categorical | Precision | bf16 | | | bf16,fp16,fp32 | |

Note

For FTMS Client, these parameters are set in JSON format.

optim#

The optim parameter defines the config for the AdamW optimizer in training, including the learning rate, learning scheduler, and weight decay.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Optimizer type | adamw | | | adamw,adam,sgd | |
| lr | float | Learning rate | 5e-05 | 0 | infinity | | TRUE |
| weight_decay | float | Weight decay coefficient | 0.001 | | | | |
| momentum | float | Momentum for SGD | 0.9 | | | | |
| paramwise_cfg | collection | Parameter-wise configuration | {'custom_keys': {'img_backbone': {'lr_mult': 0.2}}} | | | | false |
| grad_clip | collection | Gradient clipping configuration | {'max_norm': 25, 'norm_type': 'L2'} | | | | false |
| lr_scheduler | collection | Learning rate scheduler configuration | {'policy': 'cosine', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.333333, 'min_lr_ratio': 0.001} | | | | false |

evaluate#

The evaluate config contains the parameters related to evaluation. Currently, we only support evaluation on a single GPU with batch size 1. The parameters are described as follows:

evaluate:
  checkpoint: ${results_dir}/train/sparse4d_model_latest.pth

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to the checkpoint used for evaluation | ??? | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| metrics | list | Metrics to evaluate | ['detection'] | | | | false |
| tracking | collection | Tracking config | | | | | false |

Note

For FTMS Client these parameters are set in JSON format, and the evaluate checkpoint is deduced from the previous train job ID as specified with the --parent_job_id argument. For TAO Launcher, you must set the path in the evaluate specification.

visualize#

The visualize config contains the parameters related to visualization. They are described as follows:

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| show | bool | Show visualization | True | | | | |
| vis_dir | string | Visualization directory | ./vis | | | | |
| vis_score_threshold | float | Visualization score threshold | 0.25 | 0 | 1 | | |
| n_images_col | int | Number of images per column | 6 | 1 | infinity | | |
| viz_down_sample | int | Visualization down sample | 3 | 1 | infinity | | |

inference#

The inference config contains the parameters related to inference. Currently, we only support inference on a single GPU with batch size 1. The parameters are described as follows:

inference:
  checkpoint: ???
  output_nvschema: true
  jsonfile_prefix: "sparse4d_pred"

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to checkpoint file | ??? | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| jsonfile_prefix | string | JSON file prefix | sparse4d_pred | | | | |
| output_nvschema | bool | Output NVSchema | True | | | | |
| tracking | collection | Tracking config | | | | | false |

Note

For FTMS Client these parameters are set in JSON format, and the inference checkpoint is deduced from the previous train job ID as specified with the --parent_job_id argument. For TAO Launcher, you must set the path in the inference specification.

export#

The export config contains the parameters related to export. Currently, we only support export with batch size 1 and a dynamic number of camera sensors. The parameters are described as follows:

export:
  results_dir: ???
  checkpoint: ???
  onnx_file: ???

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| checkpoint | string | Path to the checkpoint file to run export | ??? | | | | |
| onnx_file | string | Path to the ONNX model file | ??? | | | | |

Note

For FTMS Client these parameters are set in JSON format, and the export checkpoint is deduced from the previous train job ID as specified with the --parent_job_id argument. For TAO Launcher, you must set the path in the export specification.

Training the Model#

Use the following command to run Sparse4D training:

TRAIN_JOB_ID=$(tao-client sparse4d experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

Evaluating the Model#

The evaluation metrics for Sparse4D are the mean average precision and ranked accuracy.

Use the following command to run Sparse4D evaluation:

EVALUATE_JOB_ID=$(tao-client sparse4d experiment-run-action --action evaluate --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)

Running Inference on the Model#

Use the following command to run inference on Sparse4D with the .pth model.

The output is a file of JSON records containing the object detection and tracking results for each frame.

INFERENCE_JOB_ID=$(tao-client sparse4d experiment-run-action --action inference --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)

The expected output is as follows:

{
    "version": "4.0",
    "id": "1", # Frame ID
    "sensorId": "bev-sensor-zone-c4", # BEV Sensor ID
    "timestamp": "2025-01-15T10:30:00.123Z", # Timestamp
    "objects": [
      {
        "id": "1", # Object ID
        "type": "Person", # Object Type
        "confidence": 0.887, # Object Confidence Score
        "coordinate": {
          "x": -1.5, # Object Center X Coordinate
          "y": 3.2, # Object Center Y Coordinate
          "z": 0.75 # Object Center Z Coordinate
        },
        "bbox3d": {
          "coordinates": [
            -1.5, # Object Centroid X Coordinate
            3.2, # Object Centroid Y Coordinate
            0.75, # Object Centroid Z Coordinate
            0.5, # Object Width
            0.5, # Object Length
            0.5, # Object Height
            0.0, # Object Pitch
            0.0, # Object Roll
            1.57 # Object Yaw
          ],
          "embedding": [
            {} # Object Embedding
          ],
          "confidence": 0.887 # Object Confidence Score
        }
      },
      {
        "id": "2",
        "type": "Humanoid",
        "confidence": 0.752,
        "coordinate": {
          "x": 5.1,
          "y": -2.8,
          "z": 0.15
        },
        "bbox3d": {
          "coordinates": [
            5.1,
            -2.8,
            0.15,
            1.2,
            1.0,
            0.2,
            0.0,
            0.0,
            -1.04
          ],
          "embedding": [
            {}
          ],
          "confidence": 0.752
        }
      }
    ]
  }
  {
    # ... more frames
  }
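
Because the results follow the schema above, they can be post-processed with standard JSON tooling. For example, the following sketch counts the detected objects per frame; it assumes jq is installed and that the output file is named after the jsonfile_prefix setting (the actual file name and location may differ):

# Print frame ID, sensor ID, and object count for each frame record.
jq -r '"frame \(.id) sensor \(.sensorId): \(.objects | length) objects"' sparse4d_pred.json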

Exporting the Model#

Use the following command to export Sparse4D to .onnx format for deployment:

EXPORT_JOB_ID=$(tao-client sparse4d experiment-run-action --action export --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)

TensorRT Engine Generation and Deployment to DeepStream#

Refer to the NVIDIA Spatial AI documentation for more information about generating a TensorRT engine from a Sparse4D model and deploying it to DeepStream.
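
As a general sketch only (refer to the documentation above for the exact flags, particularly for handling the dynamic camera dimension), a TensorRT engine can typically be built from the exported ONNX file with trtexec:

# Minimal trtexec invocation; file names and the precision flag are illustrative.
trtexec --onnx=sparse4d_model.onnx --saveEngine=sparse4d_model.engine --fp16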