Sparse4D#
Sparse4D is a multi-camera 3D detection and tracking model with 4D (spatial-temporal) capabilities. It takes synchronized input images from multiple cameras, along with their calibration matrices, and outputs 3D bounding boxes with temporally consistent tracking IDs. The model is built on a ResNet-101 backbone, a general-purpose feature extractor for computer vision.
Each batch in Sparse4D is trained on a group of cameras, called a bird's-eye view (BEV) group. A BEV group is a collection of multiple cameras with overlapping fields of view.
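The grouping is controlled from the dataset section of the experiment spec described later on this page. As a sketch, the relevant parameters can be overridden on the command line in the same style as the other examples in this document (the values below are illustrative, not recommendations):

tao model sparse4d train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR \
    dataset.num_bev_groups=3 \
    dataset.batch_size=2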
The expected data requirements and time to fine-tune the Sparse4D model on a single scene of the MTMC Tracking 2025 dataset are as follows:
| Backbone type | GPU type | Image size | No. of BEV groups | No. of cameras in each BEV group | No. of frames in each camera | Total no. of epochs | Total training time |
|---|---|---|---|---|---|---|---|
| ResNet-101 | 8 x NVIDIA H100 80GB SXM | 3x512x1408 | 3 (minimum BEV groups) | 4-12 | 9000 (5 mins @ 30 FPS) | 5 | 10 hours |
Sparse4D supports the following tasks:
train
evaluate
inference
export
SPECS=$(tao-client sparse4d get-spec --action <sub_task> --job_type experiment --id $EXPERIMENT_ID)
JOB_ID=$(tao-client sparse4d experiment-run-action --action <sub_task> --id $EXPERIMENT_ID --specs "$SPECS")
Required Arguments
--id
: The unique identifier of the experiment from which to train the model
See also
For information on how to create an experiment using the FTMS client, refer to the Creating an experiment section in the Remote Client Overview and Examples.
tao model sparse4d <sub_task> <args_per_subtask>
where `args_per_subtask` are the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.
Data Input for Sparse4D#
The Sparse4D apps in TAO utilize the MTMC Tracking 2025 dataset for training, validation, and testing.
Refer to the MTMC Tracking 2025 dataset page in the PhysicalAI-SmartSpaces HuggingFace repository for more information about the raw dataset format.
The dataset is converted into pickle format and stored in the `data/sparse4d/` directory.
Creating an Experiment Spec File#
The spec file for Sparse4D includes `model`, `dataset`, `train`, `visualize`, `evaluate`, and `inference` parameters.
The following is an example spec file for training a Sparse4D model on one scene of the MTMC Tracking 2025 dataset.
We will utilize the Warehouse_014 scene from the MTMC Tracking 2025 dataset for training.
SPECS=$(tao-client sparse4d get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
results_dir: /results
train:
  num_gpus: 1
  gpu_ids:
  - 0
  num_nodes: 1
  seed: 1234
  cudnn:
    benchmark: false
    deterministic: true
  num_epochs: 10
  checkpoint_interval: 0.5
  validation_interval: 0.5
  resume_training_checkpoint_path: null
  results_dir: null
  pretrained_model_path: null
  optim:
    type: adamw
    lr: 5.0e-05
    weight_decay: 0.001
    momentum: 0.9
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.2
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: cosine
      warmup: linear
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001
model:
  type: sparse4d
  embed_dims: 256
  use_grid_mask: true
  use_deformable_func: true
  input_shape:
  - 1408
  - 512
  backbone:
    type: resnet_101
  neck:
    type: FPN
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels:
    - 256
    - 512
    - 1024
    - 2048
    add_extra_convs: on_output
    relu_before_extra_convs: true
  depth_branch:
    type: dense_depth
    embed_dims: 256
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: sparse4d
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: 256
    reid_dims: 0
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    operation_order:
    - deformable
    - ffn
    - norm
    - refine
    - temp_gnn
    - gnn
    - norm
    - deformable
    - ffn
    - norm
    - refine
    - temp_gnn
    - gnn
    - norm
    - deformable
    - ffn
    - norm
    - refine
    - temp_gnn
    - gnn
    - norm
    - deformable
    - ffn
    - norm
    - refine
    - temp_gnn
    - gnn
    - norm
    - deformable
    - ffn
    - norm
    - refine
    - temp_gnn
    - gnn
    - norm
    - deformable
    - ffn
    - norm
    - refine
    visibility_net:
      type: visibility_net
      embedding_dim: 256
      hidden_channels: 32
    instance_bank:
      num_anchor: 900
      anchor: ''
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: 256
      use_temporal_align: false
      grid_size: null
    anchor_encoder:
      type: SparseBox3DEncoder
      vel_dims: 3
      embed_dims:
      - 128
      - 32
      - 32
      - 64
      mode: cat
      output_fc: false
      in_loops: 1
      out_loops: 4
      pos_embed_only: false
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale:
      - 2.0
      - 2.0
      - 2.0
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      - 0.5
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights:
      - 2.0
      - 2.0
      - 2.0
      - 0.5
      - 0.5
      - 0.5
      - 0.0
      - 0.0
      - 0.0
      - 0.0
      - 0.0
      use_temporal_align: false
      gt_assign_threshold: 0.5
    reg_weights:
    - 2.0
    - 2.0
    - 2.0
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    loss:
      cls:
        type: focal
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      reg:
        type: sparse_box_3d
        box_weight: 0.25
        cls_allow_reverse: []
      id:
        type: cross_entropy_label_smooth
        num_ids: 70
    bnneck:
      type: bnneck
      feat_dim: 256
      num_ids: 70
    deformable_model:
      embed_dims: 256
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: cat
      num_cams: 6
      max_num_cams: 20
      proj_drop: 0.0
      kps_generator:
        embed_dims: 256
        num_learnable_pts: 6
        fix_scale:
        - - 0
          - 0
          - 0
        - - 0.45
          - 0
          - 0
        - - -0.45
          - 0
          - 0
        - - 0
          - 0.45
          - 0
        - - 0
          - -0.45
          - 0
        - - 0
          - 0
          - 0.45
        - - 0
          - 0
          - -0.45
    refine_layer:
      type: sparse_box_3d_refinement_module
      embed_dims: 256
      refine_yaw: true
      with_quality_estimation: true
    valid_vel_weight: -1.0
    graph_model:
      type: MultiheadAttention
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    temp_graph_model:
      type: MultiheadAttention
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    decoder:
      type: SparseBox3DDecoder
      score_threshold: 0.05
    norm_layer:
      type: LN
      normalized_shape: 256
    ffn:
      type: AsymmetricFFN
      in_channels: 512
      pre_norm:
        type: LN
        normalized_shape: 256
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: ReLU
        inplace: true
  use_temporal_align: false
inference:
  num_gpus: 1
  gpu_ids:
  - 0
  num_nodes: 1
  checkpoint: ???
  trt_engine: null
  results_dir: null
  jsonfile_prefix: sparse4d_pred
  output_nvschema: true
  tracking:
    enabled: true
    threshold: 0.2
evaluate:
  num_gpus: 1
  gpu_ids:
  - 0
  num_nodes: 1
  checkpoint: ???
  trt_engine: null
  results_dir: null
  metrics:
  - detection
  tracking:
    enabled: true
    threshold: 0.2
export:
  results_dir: null
  gpu_id: 0
  checkpoint: ???
  onnx_file: ???
  on_cpu: false
  input_channel: 3
  input_width: 960
  input_height: 544
  opset_version: 17
  batch_size: -1
  verbose: false
  format: onnx
visualize:
  show: true
  vis_dir: ./vis
  vis_score_threshold: 0.25
  n_images_col: 6
  viz_down_sample: 3
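With the FTMS client, you can capture the generated spec, inspect or edit it, and pass the edited version back when launching the job. A minimal sketch using the commands already shown above (the file name train_specs.json is hypothetical):

SPECS=$(tao-client sparse4d get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
echo "$SPECS" > train_specs.json   # inspect or edit the default values
JOB_ID=$(tao-client sparse4d experiment-run-action --action train --id $EXPERIMENT_ID --specs "$(cat train_specs.json)")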
The experiment specification consists of several main components:
dataset
model
train
evaluate
inference
export
visualize
dataset#
The `dataset` parameter defines the dataset source, training batch size, and augmentation. An example `dataset` config is provided below. This section describes the main parameters of the `Omniverse3DDetTrackDatasetConfig`.
dataset:
  use_h5_file_for_rgb: false
  use_h5_file_for_depth: true
  num_frames: 9000
  batch_size: 2
  num_bev_groups: 1
  num_workers: 2
  num_ids: 70
  classes: [
    "person",
    "gr1_t2",
    "agility_digit",
    "nova_carter",
  ]
  type: "omniverse_3d_det_track"
  data_root: ???
  train_dataset:
    ann_file: ???
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
  val_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  test_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  augmentation:
    resize_lim: [0.7, 0.77]
    final_dim: [512, 1408]
    bot_pct_lim: [0.0, 0.0]
    rot_lim: [-5.4, 5.4]
    image_size: [1080, 1920]
    rand_flip: true
    rot3d_range: [-0.3925, 0.3925]
  normalize:
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    to_rgb: true
  sequences:
    split_num: 100
    keep_consistent_aug: true
    same_scene_in_batch: true
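The mandatory fields above (marked ???) must point to your converted dataset before a job can run. As a sketch, they can be supplied as TAO Launcher command-line overrides, in the same style as the other examples on this page (the paths below are hypothetical):

tao model sparse4d train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR \
    dataset.data_root=/data/mtmc_tracking_2025/Warehouse_014 \
    dataset.train_dataset.ann_file=/data/sparse4d/train_infos.pkl \
    dataset.val_dataset.ann_file=/data/sparse4d/val_infos.pkl \
    dataset.test_dataset.ann_file=/data/sparse4d/test_infos.pkl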
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Dataset type | omniverse_3d_det_track | | | | |
| batch_size | int | Batch size | 2 | 1 | infinity | | |
| use_h5_file_for_rgb | bool | Use H5 file | False | | | | |
| use_h5_file_for_depth | bool | Use H5 file | True | | | | |
| num_frames | int | Number of frames | 200 | 1 | infinity | | |
| num_bev_groups | int | Number of BEV groups | 1 | 1 | infinity | | |
| data_root | string | Path to data root | ??? | | | | |
| | string | Path to annotation root | ??? | | | | |
| classes | list | Classes to detect | ['person', 'humanoid', 'nova_carter', 'transporter', 'forklift', 'box', 'pallet', 'crate'] | | | | false |
| num_workers | int | Number of workers | 4 | 0 | infinity | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |
| augmentation | collection | Augmentation config | | | | | false |
| normalize | collection | Normalize config | | | | | false |
| sequences | collection | Sequences config | | | | | false |
| train_dataset | collection | Train dataset config | | | | | false |
| val_dataset | collection | Val dataset config | | | | | false |
| test_dataset | collection | Test dataset config | | | | | false |
Note
For the FTMS Client, these parameters are set in JSON format.
Train Dataset Configuration (dataset.train_dataset)#
Configuration for the training dataset.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation file | ??? | | | | |
| test_mode | bool | Test mode | False | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| with_seq_flag | bool | With sequence flag | True | | | | |
| sequences_split_num | int | Number of sequences | 100 | 1 | infinity | | |
| keep_consistent_seq_aug | bool | Keep consistent sequence augmentation | True | | | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |
Validation Dataset Configuration (dataset.val_dataset)#
Configuration for the validation dataset.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation pickle files/folders | ??? | | | | |
| test_mode | bool | Test mode | False | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| tracking | bool | Tracking | True | | | | |
| tracking_threshold | float | Tracking threshold | 0.2 | 0 | 1 | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |
Test Dataset Configuration (dataset.test_dataset)#
Configuration for the test dataset.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| ann_file | string | Path to annotation pickle files/folders | ??? | | | | |
| test_mode | bool | Test mode | True | | | | |
| use_valid_flag | bool | Use valid flag | True | | | | |
| tracking | bool | Tracking | True | | | | |
| tracking_threshold | float | Tracking threshold | 0.2 | 0 | 1 | | |
| same_scene_in_batch | bool | Same scene in batch | True | | | | |
Augmentation Configuration (dataset.augmentation)#
Configuration for data augmentation.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| resize_lim | list | Resize limits | [0.7, 0.77] | | | | false |
| final_dim | list | Final dimensions | [512, 1408] | | | | false |
| bot_pct_lim | list | Bottom percentage limits | [0.0, 0.0] | | | | false |
| rot_lim | list | Rotation limits in degrees | [-5.4, 5.4] | | | | false |
| image_size | list | Original image size | [1080, 1920] | | | | false |
| rand_flip | bool | Random flip | True | | | | |
| rot3d_range | list | 3D rotation range in radians | [-0.3925, 0.3925] | | | | false |
Normalize Configuration (dataset.normalize)#
Configuration for image normalization.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| mean | list | Mean values for normalization | [123.675, 116.28, 103.53] | | | | false |
| std | list | Standard deviation values for normalization | [58.395, 57.12, 57.375] | | | | false |
| to_rgb | bool | Convert to RGB | True | | | | |
Sequences Configuration (dataset.sequences)#
Configuration for handling image sequences.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| split_num | int | Number of sequence splits | 100 | 1 | infinity | | |
| keep_consistent_aug | bool | Keep consistent augmentation | True | | | | |
| same_scene_in_batch | bool | Keep same scene in batch | True | | | | |
model#
The `model` parameter provides options to change the Sparse4D architecture.
model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
        - [0, 0, 0]
        - [0.45, 0, 0]
        - [-0.45, 0, 0]
        - [0, 0.45, 0]
        - [0, -0.45, 0]
        - [0, 0, 0.45]
        - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
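Note that `model.head.instance_bank.anchor` is mandatory (???) in this example: it must point to a pre-computed anchor file before training can start. A sketch of supplying it as a command-line override (the path below is hypothetical):

tao model sparse4d train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR \
    model.head.instance_bank.anchor=/workspace/anchors/sparse4d_anchors.npy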
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Model type | sparse4d | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| use_grid_mask | bool | Use grid mask | True | | | | |
| use_deformable_func | bool | Use deformable function | True | | | | |
| input_shape | list | Input image shape | [1408, 512] | | | | false |
| backbone | collection | Backbone config | | | | | false |
| neck | collection | Neck config | | | | | false |
| depth_branch | collection | Depth branch config | | | | | false |
| head | collection | Head config | | | | | false |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
Note
For FTMS Client, these parameters are set in JSON format.
Backbone Configuration (model.backbone)#
Configuration for the model’s backbone network. Currently, only resnet_101 is supported.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Backbone type | resnet_101 | | | resnet_101 | |
Head Configuration (model.head)#
Top-level configuration for the detection and tracking head.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Head type | sparse4d | | | | |
| num_output | int | Number of output instances | 300 | 1 | infinity | | |
| cls_threshold_to_reg | float | Classification threshold for regression | 0.05 | 0 | 1 | | |
| decouple_attn | bool | Decouple attention | True | | | | |
| return_feature | bool | Return instance features | True | | | | |
| use_reid_sampling | bool | Use Re-ID sampling | False | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| reid_dims | int | Re-ID dimensions | 0 | 0 | infinity | | |
| num_groups | int | Number of groups | 8 | 1 | infinity | | |
| num_decoder | int | Number of decoder layers | 6 | 1 | infinity | | |
| num_single_frame_decoder | int | Number of single-frame decoder layers | 1 | 1 | infinity | | |
| drop_out | float | Dropout rate | 0.1 | 0 | 1 | | |
| temporal | bool | Enable temporal modeling | True | | | | |
| with_quality_estimation | bool | Enable quality estimation | True | | | | |
| operation_order | list | Operation order | ['deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine', 'temp_gnn', 'gnn', 'norm', 'deformable', 'ffn', 'norm', 'refine'] | | | | false |
| visibility_net | collection | Visibility net config | | | | | false |
| instance_bank | collection | Instance bank config | | | | | false |
| anchor_encoder | collection | Anchor encoder config | | | | | false |
| sampler | collection | Sampler config | | | | | false |
| reg_weights | list | Regression weights | [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] | | | | false |
| loss | collection | Loss config | | | | | false |
| bnneck | collection | BN neck config | | | | | false |
| deformable_model | collection | Deformable model config | | | | | false |
| refine_layer | collection | Refine layer config | | | | | false |
| valid_vel_weight | float | Valid velocity weight | -1 | -1 | infinity | | |
| graph_model | collection | Graph model config | | | | | false |
| temp_graph_model | collection | Temp graph model config | | | | | false |
| decoder | collection | Decoder config | | | | | false |
| norm_layer | collection | Norm layer config | | | | | false |
| ffn | collection | FFN config | | | | | false |
Deformable Model Configuration (model.head.deformable_model)#
Configuration for the deformable attention mechanism.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_groups | int | Number of groups | 8 | 1 | infinity | | |
| num_levels | int | Number of levels | 4 | 1 | infinity | | |
| attn_drop | float | Attention dropout | 0.15 | 0 | 1 | | |
| use_deformable_func | bool | Use deformable function | True | | | | |
| use_camera_embed | bool | Use camera embedding | False | | | | |
| residual_mode | categorical | Residual mode | cat | | | cat,add | |
| num_cams | int | Number of cameras | 6 | 1 | infinity | | |
| max_num_cams | int | Maximum number of cameras | 20 | 1 | infinity | | |
| proj_drop | float | Projection dropout | 0.0 | 0 | 1 | | |
| kps_generator | collection | KPS generator config | | | | | false |
Instance Bank Configuration (model.head.instance_bank)#
Configuration for managing object instances over time.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_anchor | int | Number of anchors | 900 | 1 | infinity | | |
| anchor | string | Path to anchor file | | | | | |
| num_temp_instances | int | Number of temporal instances | 600 | 0 | infinity | | |
| confidence_decay | float | Confidence decay factor | 0.8 | 0 | 1 | | |
| feat_grad | bool | Enable gradients for features | False | | | | |
| default_time_interval | float | Default time interval | 0.033333 | 0 | infinity | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
| grid_size | float | Grid size | | | | | |
Anchor Encoder Configuration (model.head.anchor_encoder)#
Configuration for encoding anchor information.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Anchor encoder type | SparseBox3DEncoder | | | | |
| vel_dims | int | Velocity dimensions | 3 | 1 | infinity | | |
| embed_dims | list | Embedding dimensions | [128, 32, 32, 64] | | | | false |
| mode | categorical | Mode | cat | | | cat,add | |
| output_fc | bool | Fully connected layer | False | | | | |
| in_loops | int | In loops | 1 | 1 | infinity | | |
| out_loops | int | Out loops | 4 | 1 | infinity | | |
| pos_embed_only | bool | Pos embed only | False | | | | |
Sampler Configuration (model.head.sampler)#
Configuration for sampling positive and negative examples during training.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_dn_groups | int | Number of de-noising (DN) groups | 5 | 1 | infinity | | |
| num_temp_dn_groups | int | Number of temporal DN groups | 3 | 0 | infinity | | |
| dn_noise_scale | list | De-noising scale | [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] | | | | false |
| max_dn_gt | int | Maximum DN ground truth | 128 | 1 | infinity | | |
| add_neg_dn | bool | Add negative DN | True | | | | |
| cls_weight | float | Classification weight | 2.0 | 0 | infinity | | |
| box_weight | float | Box weight | 0.25 | 0 | infinity | | |
| reg_weights | list | Regression weights | [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0] | | | | false |
| use_temporal_align | bool | Use temporal alignment | False | | | | |
| gt_assign_threshold | float | Ground truth assign threshold | 0.5 | 0 | 1 | | |
Loss Configuration (model.head.loss)#
This section details the different loss components used in the model head.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| cls | collection | Classification loss config | | | | | false |
| reg | collection | Regression loss config | | | | | false |
| id | collection | ID loss config | | | | | false |
Classification Loss (model.head.loss.cls)#
Configuration for the classification loss.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Classification loss type | focal | | | | |
| use_sigmoid | bool | Use sigmoid | True | | | | |
| gamma | float | Focal loss gamma | 2.0 | 0 | infinity | | |
| alpha | float | Focal loss alpha | 0.25 | 0 | 1 | | |
| loss_weight | float | Loss weight | 2.0 | 0 | infinity | | |
Regression Loss (model.head.loss.reg)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Regression loss type | sparse_box_3d | | | | |
| box_weight | float | Box loss weight | 0.25 | 0 | infinity | | |
| cls_allow_reverse | list | Class allow reverse | [] | | | | false |
ID Loss (model.head.loss.id)#
Configuration for the ID / Re-ID loss.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | ID loss type | cross_entropy_label_smooth | | | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |
BNNeck Configuration (model.head.bnneck)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Batch normalization neck type | bnneck | | | | |
| feat_dim | int | Feature dimension | 256 | 1 | infinity | | |
| num_ids | int | Number of IDs | 70 | 1 | infinity | | |
KPS Generator Configuration (model.head.deformable_model.kps_generator)#
Configuration for KeyPoint (Sampling) Generator.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_learnable_pts | int | Number of learnable points | 6 | 1 | infinity | | |
| fix_scale | list | Fixed scale | [[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0], [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45], [0, 0, -0.45]] | | | | false |
Refine Layer Configuration (model.head.refine_layer)#
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Refine layer type | sparse_box_3d_refinement_module | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| refine_yaw | bool | Refine yaw | True | | | | |
| with_quality_estimation | bool | With quality estimation | True | | | | |
Graph Model Configuration (model.head.graph_model and model.head.temp_graph_model)#
Configuration for graph-based modeling (e.g., GNN or attention) used for spatial and temporal relations.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Graph model type | MultiheadAttention | | | | |
| embed_dims | int | Embedding dimensions | 512 | 1 | infinity | | |
| num_heads | int | Number of heads | 8 | 1 | infinity | | |
| batch_first | bool | Batch first | True | | | | |
| dropout | float | Dropout rate | 0.1 | 0 | 1 | | |
Decoder Configuration (model.head.decoder)#
Configuration for the final output decoder.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Decoder type | SparseBox3DDecoder | | | | |
| score_threshold | float | Score threshold | 0.05 | 0 | 1 | | |
Norm Layer Configuration (model.head.norm_layer and model.head.ffn.pre_norm)#
Configuration for normalization layers.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Norm layer type | LN | | | | |
| normalized_shape | int | Normalized shape | 256 | 1 | infinity | | |
FFN Configuration (model.head.ffn)#
Configuration for Feed-Forward Networks used in the decoder layers.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | FFN type | AsymmetricFFN | | | | |
| in_channels | int | In channels | 512 | 1 | infinity | | |
| pre_norm | collection | Pre-norm config | | | | | false |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| feedforward_channels | int | Feedforward channels | 1024 | 1 | infinity | | |
| num_fcs | int | Number of fully connected layers | 2 | 1 | infinity | | |
| ffn_drop | float | FFN dropout | 0.1 | 0 | 1 | | |
| act_cfg | collection | Activation config | | | | | false |
Activation Configuration (model.head.ffn.act_cfg)#
Configuration for activation functions.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Activation type | ReLU | | | | |
| inplace | bool | Inplace | True | | | | |
Visibility Net Configuration (model.head.visibility_net)#
Configuration for the visibility prediction network.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | VisibilityNet type | visibility_net | | | | |
| embedding_dim | int | Embedding dimension | 256 | 1 | infinity | | |
| hidden_channels | int | Hidden channels | 32 | 1 | infinity | | |
Neck Configuration (model.neck)#
Configuration for the model’s neck (Feature Pyramid Network).
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Neck - Feature Pyramid Network | FPN | | | FPN | |
| num_outs | int | Number of FPN output levels | 4 | 1 | infinity | | |
| start_level | int | Start level for FPN | 0 | 0 | infinity | | |
| out_channels | int | Output channels | 256 | 1 | infinity | | |
| in_channels | list | Input channels | [256, 512, 1024, 2048] | | | | false |
| add_extra_convs | categorical | Type of extra conv | on_output | | | on_input,on_lateral,on_output,False | |
| relu_before_extra_convs | bool | Apply ReLU before extra convs | True | | | | |
Depth Branch Configuration (model.depth_branch)#
Configuration for the depth estimation branch.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Depth branch type | dense_depth | | | | |
| embed_dims | int | Embedding dimensions | 256 | 1 | infinity | | |
| num_depth_layers | int | Number of depth layers | 3 | 1 | infinity | | |
| loss_weight | float | Weight for depth loss | 0.2 | 0 | infinity | | |
train#
The train config contains the parameters related to training. They are described as follows:
train:
  num_epochs: 5
  num_nodes: 1
  num_gpus: 1
  validation_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: ???
  precision: bf16
  optim:
    type: "adamw"
    lr: 0.0001
    weight_decay: 0.001
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.25
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: "cosine"
      warmup: "linear"
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the train job | 1 | 1 | | | |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must be equal to the number of GPUs in train.num_gpus | [0] | | | | false |
| num_nodes | int | Number of nodes to run the training on. If > 1, multi-node is enabled | 1 | 1 | | | |
| seed | int | The seed for the initializer in PyTorch. If < 0, disable fixed seed | 1234 | -1 | infinity | | |
| cudnn | collection | CUDNN config | | | | | false |
| num_epochs | int | Number of epochs to run the training | 10 | 1 | infinity | | |
| checkpoint_interval | float | Checkpoint interval in epochs | 0.5 | 0 | infinity | | |
| validation_interval | float | Validation interval in epochs | 0.5 | 0 | infinity | | |
| resume_training_checkpoint_path | string | Path to the checkpoint to resume training from | | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| pretrained_model_path | string | Path to pretrained model | | | | | |
| optim | collection | Optimizer configuration | | | | | false |
| precision | categorical | Precision | bf16 | | | bf16,fp16,fp32 | |
Note
For FTMS Client, these parameters are set in JSON format.
optim#
The `optim` parameter defines the config for the AdamW optimizer in training, including the learning rate, learning-rate scheduler, and weight decay.
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | categorical | Optimizer type | adamw | | | adamw,adam,sgd | |
| lr | float | Learning rate | 5e-05 | 0 | infinity | | TRUE |
| weight_decay | float | Weight decay coefficient | 0.001 | | | | |
| momentum | float | Momentum for SGD | 0.9 | | | | |
| paramwise_cfg | collection | Parameter-wise configuration | {'custom_keys': {'img_backbone': {'lr_mult': 0.2}}} | | | | false |
| grad_clip | collection | Gradient clipping configuration | {'max_norm': 25, 'norm_type': 'L2'} | | | | false |
| lr_scheduler | collection | Learning rate scheduler configuration | {'policy': 'cosine', 'warmup': 'linear', 'warmup_iters': 500, 'warmup_ratio': 0.333333, 'min_lr_ratio': 0.001} | | | | false |
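For example, a two-GPU training run with a modified learning rate can be expressed entirely as command-line overrides; the values below are illustrative, not recommendations:

tao model sparse4d train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR \
    train.num_gpus=2 train.gpu_ids=[0,1] \
    train.optim.lr=1.0e-4 train.precision=bf16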
evaluate#
The evaluate config contains the parameters related to evaluation. Currently, we only support evaluation on a single GPU with batch size 1. The parameters are described as follows:
evaluate:
  checkpoint: ${results_dir}/train/sparse4d_model_latest.pth
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to the checkpoint used for evaluation | ??? | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| metrics | list | Metrics to evaluate | ['detection'] | | | | false |
| tracking | collection | Tracking config | | | | | false |
Note
For the FTMS Client, these parameters are set in JSON format, and the evaluate checkpoint is deduced from the previous train job ID specified with the `--parent_job_id` argument. For TAO Launcher, you must set the path in the evaluate specification, as shown in the example above.
visualize#
The visualize config contains the parameters related to visualization. They are described as follows:
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| show | bool | Show visualization | True | | | | |
| vis_dir | string | Visualization directory | ./vis | | | | |
| vis_score_threshold | float | Visualization score threshold | 0.25 | 0 | 1 | | |
| n_images_col | int | Number of images per column | 6 | 1 | infinity | | |
| viz_down_sample | int | Visualization down sample | 3 | 1 | infinity | | |
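Since these parameters only affect rendering, they can be tuned between runs without retraining. A sketch, assuming the visualize settings are consumed by the inference action (values illustrative):

tao model sparse4d inference -e $DEFAULT_SPEC results_dir=$RESULTS_DIR \
    inference.checkpoint=$TRAINED_TLT_MODEL \
    visualize.vis_score_threshold=0.4 \
    visualize.vis_dir=$RESULTS_DIR/vis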
inference#
The inference config contains the parameters related to inference. Currently, we only support inference on a single GPU with batch size 1. They are described as follows:
inference:
  checkpoint: ???
  output_nvschema: true
  jsonfile_prefix: "sparse4d_pred"
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| checkpoint | string | Path to checkpoint file | ??? | | | | |
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| jsonfile_prefix | string | JSON file prefix | sparse4d_pred | | | | |
| output_nvschema | bool | Output NVSchema | True | | | | |
| tracking | collection | Tracking config | | | | | false |
Note
For the FTMS Client, these parameters are set in JSON format, and the inference checkpoint is deduced from the previous train job ID specified with the `--parent_job_id` argument. For TAO Launcher, you must set the path in the inference specification, as shown in the example above.
export#
The export config contains the parameters related to export. Currently, we only support export with batch size 1 and a dynamic number of camera sensors. They are described as follows:
export:
  results_dir: ???
  checkpoint: ???
  onnx_file: ???
| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored | | | | | |
| checkpoint | string | Path to the checkpoint file to run export | ??? | | | | |
| onnx_file | string | Path to the onnx model file | ??? | | | | |
Note
For the FTMS Client, these parameters are set in JSON format, and the export checkpoint is deduced from the previous train job ID specified with the `--parent_job_id` argument. For TAO Launcher, you must set the path in the export specification, as shown in the example above.
Training the Model#
Use the following command to run Sparse4D training:
TRAIN_JOB_ID=$(tao-client sparse4d experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")
tao model sparse4d train -e <experiment_spec_file>
                         results_dir=<results_dir>
                         [train.gpu_ids=<gpu id list>]
Required Arguments
The following arguments are required:
`-e, --experiment_spec_file`: The path to the experiment spec file
`results_dir`: The path to a folder where the experiment outputs are to be written
Optional Arguments
The following arguments are optional.
train.gpu_ids
: A list of GPU indices to use for training. If you set more than one GPU ID, multi-GPU training will be triggered automatically.
Here’s an example of using the Sparse4D training command:
tao model sparse4d train -e $DEFAULT_SPEC results_dir=$RESULTS_DIR
Evaluating the Model#
The evaluation metrics for Sparse4D are the mean average precision and ranked accuracy.
Use the following command to run Sparse4D evaluation:
EVALUATE_JOB_ID=$(tao-client sparse4d experiment-run-action --action evaluate --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)
tao model sparse4d evaluate -e <experiment_spec_file>
                            results_dir=<results_dir>
                            evaluate.checkpoint=<model to be evaluated>
                            [evaluate.gpu_id=<gpu index>]
Required Arguments
The following arguments are required:
`-e, --experiment_spec_file`: The experiment spec file to set up the evaluation experiment
`results_dir`: The path to a folder where the experiment outputs are to be written
`evaluate.checkpoint`: The `.pth` model to be evaluated
Optional Arguments
evaluate.gpu_id
: The GPU index used to run evaluation (when the machine has multiple GPUs installed). Note that evaluation can only run on a single GPU.
Here’s an example of using the Sparse4D evaluation command:
tao model sparse4d evaluate -e $DEFAULT_SPEC results_dir=$RESULTS_DIR evaluate.checkpoint=$TRAINED_TLT_MODEL evaluate.test_dataset=$TEST_DATA
Running Inference on the Model#
Use the following command to run inference on Sparse4D with the `.pth` model. The output is a file with JSON logs consisting of object detection and tracking results for each frame.
INFERENCE_JOB_ID=$(tao-client sparse4d experiment-run-action --action inference --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)
tao model sparse4d inference -e <experiment_spec>
                             results_dir=<results_dir>
                             inference.checkpoint=<inference model>
                             [inference.gpu_id=<gpu index>]
Required Arguments
The following arguments are required:
`-e, --experiment_spec`: The experiment spec file to set up inference
`results_dir`: The path to a folder where the experiment outputs are to be written
`inference.checkpoint`: The `.pth` model to perform inference with
Optional Arguments
The following arguments are optional.
inference.gpu_id
: The index of the GPU that will be used to run inference (when the machine has multiple GPUs installed). Note that inference can only run on a single GPU.
Here’s an example of using the Sparse4D inference command:
tao model sparse4d inference -e $DEFAULT_SPEC results_dir=$RESULTS_DIR inference.checkpoint=$TRAINED_TLT_MODEL
The expected output is as follows:
{
  "version": "4.0",
  "id": "1",                                # Frame ID
  "sensorId": "bev-sensor-zone-c4",         # BEV Sensor ID
  "timestamp": "2025-01-15T10:30:00.123Z",  # Timestamp
  "objects": [
    {
      "id": "1",                            # Object ID
      "type": "Person",                     # Object Type
      "confidence": 0.887,                  # Object Confidence Score
      "coordinate": {
        "x": -1.5,                          # Object Center X Coordinate
        "y": 3.2,                           # Object Center Y Coordinate
        "z": 0.75                           # Object Center Z Coordinate
      },
      "bbox3d": {
        "coordinates": [
          -1.5,                             # Object Centroid X Coordinate
          3.2,                              # Object Centroid Y Coordinate
          0.75,                             # Object Centroid Z Coordinate
          0.5,                              # Object Width
          0.5,                              # Object Length
          0.5,                              # Object Height
          0.0,                              # Object Pitch
          0.0,                              # Object Roll
          1.57                              # Object Yaw
        ],
        "embedding": [
          {}                                # Object Embedding
        ],
        "confidence": 0.887                 # Object Confidence Score
      }
    },
    {
      "id": "2",
      "type": "Humanoid",
      "confidence": 0.752,
      "coordinate": {
        "x": 5.1,
        "y": -2.8,
        "z": 0.15
      },
      "bbox3d": {
        "coordinates": [
          5.1,
          -2.8,
          0.15,
          1.2,
          1.0,
          0.2,
          0.0,
          0.0,
          -1.04
        ],
        "embedding": [
          {}
        ],
        "confidence": 0.752
      }
    }
  ]
}
{
  # ... more frames
}
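Assuming the logs are written one JSON object per frame, a quick way to spot-check the detections is with jq; the output file name below is hypothetical and derived from the `jsonfile_prefix` setting:

jq -r '.objects[] | [.id, .type, .confidence] | @tsv' $RESULTS_DIR/inference/sparse4d_pred.json | head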
Exporting the Model#
Use the following command to export Sparse4D to `.onnx` format for deployment:
EXPORT_JOB_ID=$(tao-client sparse4d experiment-run-action --action export --id $EXPERIMENT_ID --specs "$SPECS" --parent_job_id $TRAIN_JOB_ID)
tao model sparse4d export -e <experiment_spec>
                          results_dir=<results_dir>
                          export.checkpoint=<tlt checkpoint to be exported>
                          [export.onnx_file=<path to exported file>]
                          [export.gpu_id=<gpu index>]
Required Arguments
The following arguments are required:
`-e, --experiment_spec`: The experiment spec file to configure export
`results_dir`: The path to a folder where the experiment outputs are to be written
`export.checkpoint`: The `.pth` model to be exported
Optional Arguments
The following arguments are optional.
`export.onnx_file`: The path to save the exported model to. The default path is in the same directory as the `*.pth` model.
`export.gpu_id`: The index of the GPU that will be used to run the export (when the machine has multiple GPUs installed). Note that export can only run on a single GPU.
Here’s an example of using the Sparse4D export command:
tao model sparse4d export -e $DEFAULT_SPEC results_dir=$RESULTS_DIR export.checkpoint=$TRAINED_TLT_MODEL
TensorRT engine generation and deploying to DeepStream#
Refer to the NVIDIA Spatial AI documentation page for more information about deploying a Sparse4D model to DeepStream via TensorRT engine generation.
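As a rough sketch, a TensorRT engine can be generated from the exported ONNX file with trtexec; the file names below are hypothetical, and the Spatial AI documentation referenced above remains the authoritative source:

trtexec --onnx=$RESULTS_DIR/export/sparse4d_model.onnx \
        --saveEngine=$RESULTS_DIR/export/sparse4d_model.engine \
        --fp16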