Sparse4D#
Sparse4D is a multi-camera 3D detection and tracking model with 4D (spatio-temporal) capabilities. It takes synchronized images from multiple cameras, together with their calibration matrices, and outputs 3D bounding boxes and temporally consistent tracking IDs. The model uses ResNet-101, a general-purpose computer vision backbone.
Each batch in Sparse4D is trained on a group of cameras, called a bird’s-eye view (BEV) group. A BEV group is a collection of multiple overlapping cameras.
The expected data requirements and time to fine-tune the Sparse4D model on a single scene of the MTMC Tracking 2025 dataset are as follows:
| Backbone type | GPU type | Image size | No. of BEV groups | No. of cameras in each BEV group | No. of frames in each camera | Total no. of epochs | Total training time |
|---|---|---|---|---|---|---|---|
| ResNet-101 | 8 x NVIDIA H100 - 80GB SXM | 3x512x1408 | 3 (minimum BEV groups) | 4-12 | 9000 (5 mins @ 30 FPS) | 5 | 10 hours |
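The frame count in the table follows directly from the clip length and frame rate. The short sketch below reproduces that arithmetic and extends it to the number of images in one BEV group. The helper names are illustrative only, and reading the 1/30 s frame interval as the source of `default_time_interval: 0.033333` in the model spec is an assumption, not something the table states.

```python
# Sketch: frame-count arithmetic behind the requirements table.
FPS = 30            # frame rate stated in the table
CLIP_MINUTES = 5    # clip length stated in the table


def frames_per_camera(minutes: float, fps: int) -> int:
    """Number of frames recorded by a single camera."""
    return int(round(minutes * 60 * fps))


def images_per_group(minutes: float, fps: int, num_cameras: int) -> int:
    """Total images in one BEV group (one image per camera per frame)."""
    return frames_per_camera(minutes, fps) * num_cameras


per_camera = frames_per_camera(CLIP_MINUTES, FPS)
smallest_group = images_per_group(CLIP_MINUTES, FPS, num_cameras=4)
largest_group = images_per_group(CLIP_MINUTES, FPS, num_cameras=12)
frame_interval_s = round(1.0 / FPS, 6)  # assumed to correspond to default_time_interval

print(per_camera, smallest_group, largest_group, frame_interval_s)
# prints: 9000 36000 108000 0.033333
```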
Sparse4D supports the following tasks:
train, evaluate, inference, export, quantize
Data Input for Sparse4D#
The Sparse4D apps in TAO use the MTMC Tracking 2025 dataset for training, validation, and testing.
Refer to the MTMC Tracking 2025 dataset page in the PhysicalAI-SmartSpaces HuggingFace repository for more about the raw dataset format.
The dataset is converted into pickle format and stored in the data/sparse4d/ directory.
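Before training, it can help to check what a converted pickle file actually contains. The sketch below is a generic inspection helper, not part of TAO. It makes no assumption about the annotation schema and only reports the container types it finds. The sample data written to the temporary file is a made-up stand-in, and the keys `infos` and `metadata` are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


def summarize(obj: Any, depth: int = 0, max_depth: int = 2) -> list[str]:
    """Return indented lines describing the container structure of obj."""
    pad = "  " * depth
    if isinstance(obj, dict):
        lines = [f"{pad}dict with {len(obj)} keys"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.append(f"{pad}  {key!r}:")
                lines.extend(summarize(value, depth + 2, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} of length {len(obj)}"]
        if obj and depth < max_depth:
            # Describe only the first element; the rest are assumed similar.
            lines.extend(summarize(obj[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(obj).__name__}"]


def inspect_pickle(path: Path) -> list[str]:
    """Load a pickle file (only from a trusted source) and summarize it."""
    with path.open("rb") as handle:
        return summarize(pickle.load(handle))


# Demo with a stand-in file; the keys below are hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.pkl"
    with sample_path.open("wb") as handle:
        pickle.dump({"infos": [{"token": "a"}], "metadata": {"version": "x"}}, handle)
    summary = inspect_pickle(sample_path)

print("\n".join(summary))
```

To inspect a real file, point `inspect_pickle` at a pickle under the data/sparse4d/ directory. Only unpickle files you generated yourself or otherwise trust, since loading a pickle can execute arbitrary code.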
Creating an Experiment Specification File#
The specification file for Sparse4D includes model, dataset, train parameters, visualize parameters, evaluate parameters and inference parameters.
The following is an example specification file for training a Sparse4D model on one scene of the MTMC Tracking 2025 dataset.
We will utilize the Warehouse_014 scene from the MTMC Tracking 2025 dataset for training.
The experiment specification consists of several main components:
dataset, model, train, evaluate, inference, export, visualize
dataset#
The dataset parameter defines the dataset source, training batch size, and
augmentation. An example dataset is provided below. This section describes the main parameters of the Omniverse3DDetTrackDatasetConfig.
dataset:
  use_h5_file_for_rgb: false
  use_h5_file_for_depth: true
  num_frames: 9000
  batch_size: 2
  num_bev_groups: 1
  num_workers: 2
  num_ids: 70
  classes: [
      "person",
      "gr1_t2",
      "agility_digit",
      "nova_carter",
  ]
  type: "omniverse_3d_det_track"
  data_root: ???
  train_dataset:
    ann_file: ???
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
  val_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  test_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  augmentation:
    resize_lim: [0.7, 0.77]
    final_dim: [512, 1408]
    bot_pct_lim: [0.0, 0.0]
    rot_lim: [-5.4, 5.4]
    image_size: [1080, 1920]
    rand_flip: true
    rot3d_range: [-0.3925, 0.3925]
  normalize:
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    to_rgb: true
  sequences:
    split_num: 100
    keep_consistent_aug: true
    same_scene_in_batch: true
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Dataset type |
omniverse_3d_det_track |
||||
|
int |
Batch size |
2 |
1 |
infinity |
||
|
bool |
Use H5 file |
False |
||||
|
bool |
Use H5 file |
True |
||||
|
int |
Number of frames |
200 |
1 |
infinity |
||
|
int |
Number of BEV groups |
1 |
1 |
infinity |
||
|
string |
Path to data root |
??? |
||||
|
string |
Path to annotation root |
??? |
||||
|
list |
Classes to detect |
[‘person’, ‘humanoid’, ‘nova_carter’, ‘transporter’, ‘forklift’, ‘box’, ‘pallet’, ‘crate’] |
false |
|||
|
int |
Number of workers |
4 |
0 |
infinity |
||
|
int |
Number of IDs |
70 |
1 |
infinity |
||
|
collection |
Augmentation config |
false |
||||
|
collection |
Normalize config |
false |
||||
|
collection |
Sequences config |
false |
||||
|
collection |
Train dataset config |
false |
||||
|
collection |
Val dataset config |
false |
||||
|
collection |
Test dataset config |
false |
Train Dataset Configuration (dataset.train_dataset)#
Configuration for the training dataset.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to annotation file |
??? |
||||
|
bool |
Test mode |
False |
||||
|
bool |
Use valid flag |
True |
||||
|
bool |
With sequence flag |
True |
||||
|
int |
Number of sequences |
100 |
1 |
infinity |
||
|
bool |
Keep consistent sequence augmentation |
True |
||||
|
bool |
Same scene in batch |
True |
Validation Dataset Configuration (dataset.val_dataset)#
Configuration for the validation dataset.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to annotation pickle files/folders |
??? |
||||
|
bool |
Test mode |
False |
||||
|
bool |
Use valid flag |
True |
||||
|
bool |
Tracking |
True |
||||
|
float |
Tracking threshold |
0.2 |
0 |
1 |
||
|
bool |
Same scene in batch |
True |
Test Dataset Configuration (dataset.test_dataset)#
Configuration for the test dataset.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to annotation pickle files/folders |
??? |
||||
|
bool |
Test mode |
True |
||||
|
bool |
Use valid flag |
True |
||||
|
bool |
Tracking |
True |
||||
|
float |
Tracking threshold |
0.2 |
0 |
1 |
||
|
bool |
Same scene in batch |
True |
Augmentation Configuration (dataset.augmentation)#
Configuration for data augmentation.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
list |
Resize limits |
[0.7, 0.77] |
false |
|||
|
list |
Final dimensions |
[512, 1408] |
false |
|||
|
list |
Bottom percentage limits |
[0.0, 0.0] |
false |
|||
|
list |
Rotation limits in degrees |
[-5.4, 5.4] |
false |
|||
|
list |
Original image size |
[1080, 1920] |
false |
|||
|
bool |
Random flip |
True |
||||
|
list |
3D rotation range in radians |
[-0.3925, 0.3925] |
false |
Normalize Configuration (dataset.normalize)#
Configuration for image normalization.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
list |
Mean values for normalization |
[123.675, 116.28, 103.53] |
false |
|||
|
list |
Standard deviation values for normalization |
[58.395, 57.12, 57.375] |
false |
|||
|
bool |
Convert to RGB |
True |
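The normalization settings can be read as a per-channel standardization of 0-255 pixel values. The sketch below applies the default mean and standard deviation to a single pixel. It assumes that `to_rgb: true` means the loader delivers pixels in BGR order and that channels are swapped to RGB before the statistics are applied, which is a common convention but is not confirmed here for the TAO implementation. The function name is illustrative only.

```python
MEAN = [123.675, 116.28, 103.53]  # dataset.normalize.mean (assumed R, G, B order)
STD = [58.395, 57.12, 57.375]     # dataset.normalize.std
TO_RGB = True                     # dataset.normalize.to_rgb


def normalize_pixel(pixel, mean=MEAN, std=STD, to_rgb=TO_RGB):
    """Normalize one 3-channel pixel with values in the 0-255 range."""
    # Assumption: input is BGR when to_rgb is set, so reverse the channel order.
    channels = list(reversed(pixel)) if to_rgb else list(pixel)
    return [(value - m) / s for value, m, s in zip(channels, mean, std)]


# A pixel equal to the mean color, given in BGR order, maps to zero.
bgr_mean_pixel = [103.53, 116.28, 123.675]
zeroed = normalize_pixel(bgr_mean_pixel)
print(zeroed)  # prints: [0.0, 0.0, 0.0]

# A pure white pixel ends up a little over two standard deviations above zero.
white = normalize_pixel([255, 255, 255])
print([round(v, 3) for v in white])
```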
Sequences Configuration (dataset.sequences)#
Configuration for handling image sequences.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of sequence splits |
100 |
1 |
infinity |
||
|
bool |
Keep consistent augmentation |
True |
||||
|
bool |
Keep same scene in batch |
True |
model#
The model parameter provides options to change the Sparse4D architecture.
model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
        "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
        "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
        "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
        "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
        "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
        "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Model type |
sparse4d |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
bool |
Use grid mask |
True |
||||
|
bool |
Use deformable function |
True |
||||
|
list |
Input image shape |
[1408, 512] |
false |
|||
|
collection |
Backbone config |
false |
||||
|
collection |
Neck config |
false |
||||
|
collection |
Depth branch config |
false |
||||
|
collection |
Head config |
false |
||||
|
bool |
Use temporal alignment |
False |
Backbone Configuration (model.backbone)#
Configuration for the model’s backbone network. Currently, only resnet_101 is supported.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Backbone type |
resnet_101 |
resnet_101 |
Head Configuration (model.head)#
Top-level configuration for the detection and tracking head.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Head type |
sparse4d |
||||
|
int |
Number of output instances |
300 |
1 |
infinity |
||
|
float |
Classification threshold for regression |
0.05 |
0 |
1 |
||
|
bool |
Decouple attention |
True |
||||
|
bool |
Return instance features |
True |
||||
|
bool |
Use Re-ID sampling |
False |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Re-ID dimensions |
0 |
0 |
infinity |
||
|
int |
Number of groups |
8 |
1 |
infinity |
||
|
int |
Number of decoder layers |
6 |
1 |
infinity |
||
|
int |
Number of single-frame decoder layers |
1 |
1 |
infinity |
||
|
float |
Dropout rate |
0.1 |
0 |
1 |
||
|
bool |
Enable temporal modeling |
True |
||||
|
bool |
Enable quality estimation |
True |
||||
|
list |
Operation order |
[‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’] |
false |
|||
|
collection |
Visibility net config |
false |
||||
|
collection |
Instance bank config |
false |
||||
|
collection |
Anchor encoder config |
false |
||||
|
collection |
Sampler config |
false |
||||
|
list |
Regression weights |
[2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] |
false |
|||
|
collection |
Loss config |
false |
||||
|
collection |
BN neck config |
false |
||||
|
collection |
Deformable model config |
false |
||||
|
collection |
Refine layer config |
false |
||||
|
float |
Valid velocity weight |
-1 |
-1 |
infinity |
||
|
collection |
Graph model config |
false |
||||
|
collection |
Temp graph model config |
false |
||||
|
collection |
Decoder config |
false |
||||
|
collection |
Norm layer config |
false |
||||
|
collection |
FFN config |
false |
Deformable Model Configuration (model.head.deformable_model)#
Configuration for the deformable attention mechanism.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Number of groups |
8 |
1 |
infinity |
||
|
int |
Number of levels |
4 |
1 |
infinity |
||
|
float |
Attention dropout |
0.15 |
0 |
1 |
||
|
bool |
Use deformable function |
True |
||||
|
bool |
Use camera embedding |
False |
||||
|
categorical |
Residual mode |
cat |
cat,add |
|||
|
int |
Number of cameras |
6 |
1 |
infinity |
||
|
int |
Maximum number of cameras |
20 |
1 |
infinity |
||
|
float |
Projection dropout |
0.0 |
0 |
1 |
||
|
collection |
KPS generator config |
false |
Instance Bank Configuration (model.head.instance_bank)#
Configuration for managing object instances over time.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of anchors |
900 |
1 |
infinity |
||
|
string |
Path to anchor file |
|||||
|
int |
Number of temporal instances |
600 |
0 |
infinity |
||
|
float |
Confidence decay factor |
0.8 |
0 |
1 |
||
|
bool |
Enable gradients for features |
False |
||||
|
float |
Default time interval |
0.033333 |
0 |
infinity |
||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
bool |
Use temporal alignment |
False |
||||
|
float |
Grid size |
Anchor Encoder Configuration (model.head.anchor_encoder)#
Configuration for encoding anchor information.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Anchor encoder type |
SparseBox3DEncoder |
||||
|
int |
Velocity dimensions |
3 |
1 |
infinity |
||
|
list |
Embedding dimensions |
[128, 32, 32, 64] |
false |
|||
|
categorical |
Mode |
cat |
cat,add |
|||
|
bool |
Fully Connected Layer |
False |
||||
|
int |
In loops |
1 |
1 |
infinity |
||
|
int |
Out loops |
4 |
1 |
infinity |
||
|
bool |
Pos embed only |
False |
Sampler Configuration (model.head.sampler)#
Configuration for sampling positive and negative examples during training.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of De-Noising groups |
5 |
1 |
infinity |
||
|
int |
Number of temporal DN groups |
3 |
0 |
infinity |
||
|
list |
De-Noising scale |
[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] |
false |
|||
|
int |
Maximum DN ground truth |
128 |
1 |
infinity |
||
|
bool |
Add negative DN |
True |
||||
|
float |
Classification weight |
2.0 |
0 |
infinity |
||
|
float |
Box weight |
0.25 |
0 |
infinity |
||
|
list |
Regression weights |
[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0] |
false |
|||
|
bool |
Use temporal alignment |
False |
||||
|
float |
Ground Truth assign threshold |
0.5 |
0 |
1 |
Loss Configuration (model.head.loss)#
This section details the different loss components used in the model head.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
collection |
Classification loss config |
false |
||||
|
collection |
Regression loss config |
false |
||||
|
collection |
ID loss config |
false |
Classification Loss (model.head.loss.cls)#
Configuration for the classification loss.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Classification loss type |
focal |
||||
|
bool |
Use sigmoid |
True |
||||
|
float |
Focal loss gamma |
2.0 |
0 |
infinity |
||
|
float |
Focal loss alpha |
0.25 |
0 |
1 |
||
|
float |
Loss weight |
2.0 |
0 |
infinity |
Regression Loss (model.head.loss.reg)#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Regression loss type |
sparse_box_3d |
||||
|
float |
Box loss weight |
0.25 |
0 |
infinity |
||
|
list |
Class allow reverse |
[] |
false |
ID Loss (model.head.loss.id)#
Configuration for the ID / Re-ID loss.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
ID loss type |
cross_entropy_label_smooth |
||||
|
int |
Number of IDs |
70 |
1 |
infinity |
BNNeck Configuration (model.head.bnneck)#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Batch Normalization Neck |
bnneck |
||||
|
int |
Feature dimension |
256 |
1 |
infinity |
||
|
int |
Number of IDs |
70 |
1 |
infinity |
KPS Generator Configuration (model.head.deformable_model.kps_generator)#
Configuration for KeyPoint (Sampling) Generator.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Number of learnable points |
6 |
1 |
infinity |
||
|
list |
Fixed scale |
[[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0], [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45], [0, 0, -0.45]] |
false |
Refine Layer Configuration (model.head.refine_layer)#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Refine layer type |
sparse_box_3d_refinement_module |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
bool |
Refine yaw |
True |
||||
|
bool |
With quality estimation |
True |
Graph Model Configuration (model.head.graph_model and model.head.temp_graph_model)#
Configuration for graph-based modeling (e.g., GNN or attention) used for spatial and temporal relations.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Graph model type |
MultiheadAttention |
||||
|
int |
Embedding dimensions |
512 |
1 |
infinity |
||
|
int |
Number of heads |
8 |
1 |
infinity |
||
|
bool |
Batch first |
True |
||||
|
float |
Dropout rate |
0.1 |
0 |
1 |
Decoder Configuration (model.head.decoder)#
Configuration for the final output decoder.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Decoder type |
SparseBox3DDecoder |
||||
|
float |
Score threshold |
0.05 |
0 |
1 |
Norm Layer Configuration (model.head.norm_layer and model.head.ffn.pre_norm)#
Configuration for normalization layers.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Norm layer type |
LN |
||||
|
int |
Normalized shape |
256 |
1 |
infinity |
FFN Configuration (model.head.ffn)#
Configuration for Feed-Forward Networks used in the decoder layers.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
FFN type |
AsymmetricFFN |
||||
|
int |
In channels |
512 |
1 |
infinity |
||
|
collection |
Pre-norm config |
false |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Feedforward channels |
1024 |
1 |
infinity |
||
|
int |
Number of fully connected layers in the FFN |
2 |
1 |
infinity |
||
|
float |
FFN dropout |
0.1 |
0 |
1 |
||
|
collection |
Activation config |
false |
Activation Configuration (model.head.ffn.act_cfg)#
Configuration for activation functions.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Activation type |
ReLU |
||||
|
bool |
Inplace |
True |
Visibility Net Configuration (model.head.visibility_net)#
Configuration for the visibility prediction network.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
VisibilityNet type |
visibility_net |
||||
|
int |
Embedding dimension |
256 |
1 |
infinity |
||
|
int |
Hidden channels |
32 |
1 |
infinity |
Neck Configuration (model.neck)#
Configuration for the model’s neck (Feature Pyramid Network).
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Neck - Feature Pyramid Network |
FPN |
FPN |
|||
|
int |
4 |
1 |
infinity |
|||
|
int |
Start level for FPN |
0 |
0 |
infinity |
||
|
int |
Output channels |
256 |
1 |
infinity |
||
|
list |
Input channels |
[256, 512, 1024, 2048] |
false |
|||
|
categorical |
Type of extra conv |
on_output |
on_input,on_lateral,on_output,False |
|||
|
bool |
Apply ReLU before extra convs |
True |
Depth Branch Configuration (model.depth_branch)#
Configuration for the depth estimation branch.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Depth branch type |
dense_depth |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Number of depth layers |
3 |
1 |
infinity |
||
|
float |
Weight for depth loss |
0.2 |
0 |
infinity |
train#
The train config contains the parameters related to training. They are described as follows:
train:
  num_epochs: 5
  num_nodes: 1
  num_gpus: 1
  validation_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: ???
  precision: bf16
  optim:
    type: "adamw"
    lr: 0.0001
    weight_decay: 0.001
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.25
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: "cosine"
      warmup: "linear"
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
The number of GPUs to run the train job |
1 |
1 |
|||
|
list |
List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus |
[0] |
false |
|||
|
int |
Number of nodes to run the training on. If > 1, then multi-node is enabled |
1 |
1 |
|||
|
int |
The seed for the initializer in PyTorch. If < 0, disable fixed seed |
1234 |
-1 |
infinity |
||
|
collection |
false |
|||||
|
int |
Number of epochs to run the training |
10 |
1 |
infinity |
||
|
float |
Checkpoint interval in epochs |
0.5 |
0 |
infinity |
||
|
float |
Validation interval in epochs |
0.5 |
0 |
infinity |
||
|
string |
Path to the checkpoint to resume training from |
|||||
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
string |
Path to pretrained model |
|||||
|
collection |
Optimizer configuration |
false |
||||
|
categorical |
Precision |
bf16 |
bf16,fp16,fp32 |
optim#
The optim parameter defines the optimizer configuration used in training (AdamW by default), including the
learning rate, learning rate scheduler, and weight decay.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Optimizer type |
adamw |
adamw,adam,sgd |
|||
|
float |
Learning rate |
5e-05 |
0 |
infinity |
TRUE |
|
|
float |
Weight decay coefficient |
0.001 |
||||
|
float |
Momentum for SGD |
0.9 |
||||
|
collection |
Parameter-wise configuration |
{‘custom_keys’: {‘img_backbone’: {‘lr_mult’: 0.2}}} |
false |
|||
|
collection |
Gradient clipping configuration |
{‘max_norm’: 25, ‘norm_type’: ‘L2’} |
false |
|||
|
collection |
Learning rate scheduler configuration |
{‘policy’: ‘cosine’, ‘warmup’: ‘linear’, ‘warmup_iters’: 500, ‘warmup_ratio’: 0.333333, ‘min_lr_ratio’: 0.001} |
false |
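To see how the scheduler settings interact, the sketch below computes the learning rate at a given step for a cosine policy with linear warmup, using the values from the example spec. It assumes the widely used mmcv-style formulas (warmup scales the annealed rate, starting at `warmup_ratio` times its value, and the cosine curve ends at `lr * min_lr_ratio`). The actual TAO scheduler may differ in detail. The function name and the total step count of 10,000 are hypothetical.

```python
import math

BASE_LR = 1e-4            # train.optim.lr in the example spec
BACKBONE_LR_MULT = 0.25   # paramwise_cfg.custom_keys.img_backbone.lr_mult
WARMUP_ITERS = 500        # lr_scheduler.warmup_iters
WARMUP_RATIO = 0.333333   # lr_scheduler.warmup_ratio
MIN_LR_RATIO = 0.001      # lr_scheduler.min_lr_ratio


def lr_at(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Learning rate at a step, assuming mmcv-style cosine + linear warmup."""
    progress = min(step / total_steps, 1.0)
    target = base_lr * MIN_LR_RATIO
    # Cosine annealing from base_lr down to target.
    lr = target + 0.5 * (base_lr - target) * (1.0 + math.cos(math.pi * progress))
    if step < WARMUP_ITERS:
        # Linear warmup: scale factor rises from WARMUP_RATIO to 1.
        k = (1.0 - step / WARMUP_ITERS) * (1.0 - WARMUP_RATIO)
        lr *= 1.0 - k
    return lr


TOTAL_STEPS = 10_000  # hypothetical; depends on dataset size, batch size, and epochs
for step in (0, 250, 500, 5_000, 10_000):
    head_lr = lr_at(step, TOTAL_STEPS)
    backbone_lr = head_lr * BACKBONE_LR_MULT  # backbone trains at a reduced rate
    print(f"step {step:>6}: lr={head_lr:.3e}, img_backbone lr={backbone_lr:.3e}")
```

Under these assumptions the rate starts at roughly one third of `lr`, reaches the cosine curve after 500 iterations, and decays to `lr * 0.001` by the last step, with the image backbone always at a quarter of that value.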
evaluate#
The evaluate config contains the parameters related to evaluation. Currently, we only support evaluation on a single GPU with batch size 1. The parameters are described as follows:
evaluate:
  checkpoint: ${results_dir}/train/sparse4d_model_latest.pth
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to the checkpoint used for evaluation |
??? |
||||
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
list |
Metrics to evaluate |
[‘detection’] |
false |
|||
|
collection |
Tracking config |
false |
Set the evaluate checkpoint path in the evaluate specification:
visualize#
The visualize config contains the parameters related to visualization. They are described as follows:
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
bool |
Show visualization |
True |
||||
|
string |
Visualization directory |
./vis |
||||
|
float |
Visualization score threshold |
0.25 |
0 |
1 |
||
|
int |
Number of images per column |
6 |
1 |
infinity |
||
|
int |
Visualization down sample |
3 |
1 |
infinity |
inference#
The inference config contains the parameters related to inference. Currently, we only support inference on a single GPU with batch size 1. They are described as follows:
inference:
  checkpoint: ???
  output_nvschema: true
  jsonfile_prefix: "sparse4d_pred"
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to checkpoint file |
??? |
||||
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
string |
JSON file prefix |
sparse4d_pred |
||||
|
bool |
Output NVSchema |
True |
||||
|
collection |
Tracking config |
false |
Set the inference checkpoint path in the inference specification:
export#
The export config contains the parameters related to export. Currently, we only support export with batch size 1 and dynamic number of camera sensors. They are described as follows:
export:
  results_dir: ???
  checkpoint: ???
  onnx_file: ???
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
string |
Path to the checkpoint file to run export |
??? |
||||
|
string |
Path to the onnx model file |
??? |
Set the export checkpoint path in the export specification:
Training the Model#
Use the following command to run Sparse4D training:
Evaluating the Model#
The evaluation metrics for Sparse4D are the mean average precision and ranked accuracy.
Use the following command to run Sparse4D evaluation:
Running Inference on the Model#
Use the following command to run inference on Sparse4D with the .pth model.
The output will be a file with JSON logs consisting of object detection and tracking results for each frame.
The expected output is as follows:
{
  "version": "4.0",
  "id": "1",                                 # Frame ID
  "sensorId": "bev-sensor-zone-c4",          # BEV Sensor ID
  "timestamp": "2025-01-15T10:30:00.123Z",   # Timestamp
  "objects": [
    {
      "id": "1",                             # Object ID
      "type": "Person",                      # Object Type
      "confidence": 0.887,                   # Object Confidence Score
      "coordinate": {
        "x": -1.5,                           # Object Center X Coordinate
        "y": 3.2,                            # Object Center Y Coordinate
        "z": 0.75                            # Object Center Z Coordinate
      },
      "bbox3d": {
        "coordinates": [
          -1.5,                              # Object Centroid X Coordinate
          3.2,                               # Object Centroid Y Coordinate
          0.75,                              # Object Centroid Z Coordinate
          0.5,                               # Object Width
          0.5,                               # Object Length
          0.5,                               # Object Height
          0.0,                               # Object Pitch
          0.0,                               # Object Roll
          1.57                               # Object Yaw
        ],
        "embedding": [
          {}                                 # Object Embedding
        ],
        "confidence": 0.887                  # Object Confidence Score
      }
    },
    {
      "id": "2",
      "type": "Humanoid",
      "confidence": 0.752,
      "coordinate": {
        "x": 5.1,
        "y": -2.8,
        "z": 0.15
      },
      "bbox3d": {
        "coordinates": [
          5.1,
          -2.8,
          0.15,
          1.2,
          1.0,
          0.2,
          0.0,
          0.0,
          -1.04
        ],
        "embedding": [
          {}
        ],
        "confidence": 0.752
      }
    }
  ]
}
{
  # ... more frames
}
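The per-frame structure above can be flattened into one record per object for downstream analysis. The sketch below parses a single frame and names the nine `bbox3d.coordinates` values in the order given by the annotations (center, width, length, height, pitch, roll, yaw). The function and field names are illustrative, the sample frame is a trimmed copy of the example with the comments removed, and how multiple frames are delimited in the real output file is not assumed here.

```python
import json

# Order of bbox3d.coordinates as annotated in the example output.
BOX_FIELDS = ("x", "y", "z", "width", "length", "height", "pitch", "roll", "yaw")


def parse_frame(frame: dict, min_confidence: float = 0.0) -> list[dict]:
    """Flatten one frame into a list of per-object records."""
    records = []
    for obj in frame.get("objects", []):
        if obj["confidence"] < min_confidence:
            continue  # drop low-confidence detections
        box = dict(zip(BOX_FIELDS, obj["bbox3d"]["coordinates"]))
        records.append({
            "frame_id": frame["id"],
            "object_id": obj["id"],
            "type": obj["type"],
            "confidence": obj["confidence"],
            **box,
        })
    return records


sample_frame = json.loads("""
{
  "version": "4.0", "id": "1", "sensorId": "bev-sensor-zone-c4",
  "timestamp": "2025-01-15T10:30:00.123Z",
  "objects": [
    {"id": "1", "type": "Person", "confidence": 0.887,
     "coordinate": {"x": -1.5, "y": 3.2, "z": 0.75},
     "bbox3d": {"coordinates": [-1.5, 3.2, 0.75, 0.5, 0.5, 0.5, 0.0, 0.0, 1.57],
                "embedding": [{}], "confidence": 0.887}},
    {"id": "2", "type": "Humanoid", "confidence": 0.752,
     "coordinate": {"x": 5.1, "y": -2.8, "z": 0.15},
     "bbox3d": {"coordinates": [5.1, -2.8, 0.15, 1.2, 1.0, 0.2, 0.0, 0.0, -1.04],
                "embedding": [{}], "confidence": 0.752}}
  ]
}
""")

all_records = parse_frame(sample_frame)
confident_records = parse_frame(sample_frame, min_confidence=0.8)
for record in all_records:
    print(record["object_id"], record["type"], record["yaw"])
```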
Exporting the Model#
Use the following command to export Sparse4D to .onnx format for deployment:
Quantization#
Sparse4D supports PTQ via TAO Quant using either the torchao (weight-only) or modelopt (static PTQ) backends.
Add a quantize section to your experiment specification (see TAO Quant documentation for schema and backend options).
Run:
Use the quantized checkpoint by setting evaluate.is_quantized: true or inference.is_quantized: true and pointing to the artifact saved under results_dir (for example, quantized_model_torchao.pth or quantized_model_modelopt.pth). For ModelOpt artifacts, the model weights are stored under model_state_dict.
Notes#
For modelopt static PTQ, ensure that your dataset configuration provides a representative calibration loader.
For torchao, activation settings in the configuration are ignored.
Calibration Dataset (ModelOpt)#
When you use the modelopt backend (static PTQ), provide a calibration dataset via dataset.quant_calibration_dataset.
Minimal example:
quantize:
  backend: "modelopt"
  mode: "static_ptq"
  algorithm: "minmax"
dataset:
  quant_calibration_dataset:
    images_dir: "/path/to/calib/images"
See also: TAO Quant overview and its Configuration and backend pages.
TensorRT engine generation and deploying to DeepStream#
Refer to the NVIDIA Spatial AI documentation page for more information about generating a TensorRT engine from a Sparse4D model and deploying it to DeepStream.