Sparse4D#
Sparse4D is a multi-camera 3D detection and tracking model with 4D (spatio-temporal) capabilities. It takes synchronized images from multiple cameras, together with their calibration matrices, and outputs 3D bounding boxes with temporally consistent tracking IDs. The model is based on ResNet-101, a general-purpose backbone for computer vision.
Each batch in Sparse4D is trained on a group of cameras. Each group is called a bird’s-eye view (BEV) group and is a collection of multiple overlapping cameras.
The expected data requirements and time to fine-tune the Sparse4D model on a single scene of the MTMC Tracking 2025 dataset are as follows:
| Backbone type | GPU type | Image size | No. of BEV groups | No. of cameras in each BEV group | No. of frames in each camera | Total no. of epochs | Total training time |
|---|---|---|---|---|---|---|---|
| ResNet-101 | 8 x NVIDIA H100 80GB SXM | 3x512x1408 | 3 (minimum BEV groups) | 4-12 | 9000 (5 min @ 30 FPS) | 5 | 10 hours |
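The frame count in the table follows from the clip length and frame rate. The short sketch below is plain arithmetic on the table values, not a TAO utility; the per-epoch image range additionally assumes that every frame of every camera is visited once per epoch.

```python
# Values taken from the table above.
clip_minutes = 5
fps = 30
num_bev_groups = 3
cameras_per_group = (4, 12)  # fewest and most cameras in a BEV group

# Frames recorded by one camera over the clip.
frames_per_camera = clip_minutes * 60 * fps

# Assumption: each camera frame is seen once per epoch.
images_per_epoch = tuple(
    num_bev_groups * cams * frames_per_camera for cams in cameras_per_group
)

print(frames_per_camera)  # 9000
print(images_per_epoch)   # (108000, 324000)
```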
Sparse4D supports the following tasks:
- train
- evaluate
- inference
- export
- quantize
Data Input for Sparse4D#
The Sparse4D apps in TAO use the MTMC Tracking 2025 dataset for training, validation, and testing.
Refer to the MTMC Tracking 2025 dataset page in the PhysicalAI-SmartSpaces HuggingFace repository for more about the raw dataset format.
The dataset is converted into pickle format and stored in the data/sparse4d/ directory.
Creating an Experiment Specification File#
The specification file for Sparse4D includes model, dataset, train parameters, visualize parameters, evaluate parameters and inference parameters.
The following is an example specification file for training a Sparse4D model on one scene of the MTMC Tracking 2025 dataset.
We will utilize the Warehouse_014 scene from the MTMC Tracking 2025 dataset for training.
The experiment specification consists of several main components:
- dataset
- model
- train
- evaluate
- inference
- export
- visualize
dataset#
The dataset parameter defines the dataset source, training batch size, and
augmentation. An example dataset is provided below. This section describes the main parameters of the Omniverse3DDetTrackDatasetConfig.
dataset:
  use_h5_file_for_rgb: false
  use_h5_file_for_depth: true
  num_frames: 9000
  batch_size: 2
  num_bev_groups: 1
  num_workers: 2
  num_ids: 70
  classes: [
    "person",
    "gr1_t2",
    "agility_digit",
    "nova_carter",
  ]
  type: "omniverse_3d_det_track"
  data_root: ???
  train_dataset:
    ann_file: ???
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
  val_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  test_dataset:
    ann_file: ???
    test_mode: true
    use_valid_flag: true
    tracking: true
    tracking_threshold: 0.2
  augmentation:
    resize_lim: [0.7, 0.77]
    final_dim: [512, 1408]
    bot_pct_lim: [0.0, 0.0]
    rot_lim: [-5.4, 5.4]
    image_size: [1080, 1920]
    rand_flip: true
    rot3d_range: [-0.3925, 0.3925]
  normalize:
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    to_rgb: true
  sequences:
    split_num: 100
    keep_consistent_aug: true
    same_scene_in_batch: true
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Dataset type |
omniverse_3d_det_track |
||||
|
int |
Batch size |
2 |
1 |
infinity |
||
|
bool |
Use H5 file |
False |
||||
|
bool |
Use H5 file |
True |
||||
|
int |
Number of frames |
200 |
1 |
infinity |
||
|
int |
Number of BEV groups |
1 |
1 |
infinity |
||
|
string |
Path to data root |
??? |
||||
|
string |
Path to annotation root |
??? |
||||
|
list |
Classes to detect |
[‘person’, ‘humanoid’, ‘nova_carter’, ‘transporter’, ‘forklift’, ‘box’, ‘pallet’, ‘crate’] |
false |
|||
|
int |
Number of workers |
4 |
0 |
infinity |
||
|
int |
Number of IDs |
70 |
1 |
infinity |
||
|
collection |
Augmentation config |
false |
||||
|
collection |
Normalize config |
false |
||||
|
collection |
Sequences config |
false |
||||
|
collection |
Train dataset config |
false |
||||
|
collection |
Val dataset config |
false |
||||
|
collection |
Test dataset config |
false |
Dynamic Resampling and Dataset-Loading Options#
Starting in TAO 7.0.1, the Sparse4D dataloader supports lazy pickle loading, balanced
per-epoch dynamic resampling, and FPS-drop augmentation. These options are most useful when
training on large multi-camera annotation sets that are split across many .pkl files.
These options are set as additional keys directly under the dataset block (alongside
type, batch_size, and the other fields shown above). Unlike the fields in the
Omniverse3DDetTrackDatasetConfig table, they are not part of the typed dataset schema:
the dataloader reads them dynamically from the dataset configuration with per-key defaults,
so they do not appear in the generated default_specs and are not Hydra-validated. As a
result, a misspelled key is silently ignored (it falls back to its default) rather than raising
an error, so take care to spell these keys exactly as shown. The keys and their defaults are:
| Key | value_type | description | default_value |
|---|---|---|---|
| lazy_load | bool | Enable lazy loading | False |
| lazy_load_cache_size | int | Lazy-load LRU cache size | 50 |
| pkl_sample_size | int | Number of .pkl files sampled per epoch (0 disables resampling) | 0 |
| pkl_cam_counts_path | string | Path to the pkl-to-camera-count mapping pickle | "" (empty string) |
| max_cameras | int | Subsample to this many cameras/frame (-1 = all) | -1 |
| fps_drop_prob | float | FPS-drop augmentation probability | 0.0 |
| target_fps_choices | list | Candidate target FPS values for FPS-drop | [30, 20, 15, 10, 6, 5, 3, 2, 1] |
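Because a misspelled dynamic key is silently ignored, it can help to check the dataset block before launching a job. The following is a minimal sketch, not part of TAO: `find_suspect_keys` is a hypothetical helper, and the set of typed keys is only what appears in the example dataset block in this section, so extend it with any other typed fields you use.

```python
import difflib

# Dynamically read keys described in this section.
DYNAMIC_KEYS = {
    "lazy_load", "lazy_load_cache_size", "pkl_sample_size",
    "pkl_cam_counts_path", "max_cameras", "fps_drop_prob",
    "target_fps_choices",
}
# Typed keys that appear in the example dataset block (assumed incomplete).
TYPED_KEYS = {
    "type", "batch_size", "use_h5_file_for_rgb", "use_h5_file_for_depth",
    "num_frames", "num_bev_groups", "num_workers", "num_ids", "classes",
    "data_root", "train_dataset", "val_dataset", "test_dataset",
    "augmentation", "normalize", "sequences",
}

def find_suspect_keys(dataset_cfg):
    """Map each unknown top-level key to its closest known key (or None)."""
    known = sorted(DYNAMIC_KEYS | TYPED_KEYS)
    suspects = {}
    for key in dataset_cfg:
        if key not in known:
            match = difflib.get_close_matches(key, known, n=1, cutoff=0.8)
            suspects[key] = match[0] if match else None
    return suspects

# "pkl_sampel_size" is a deliberate typo for illustration.
cfg = {"type": "omniverse_3d_det_track", "lazy_load": True, "pkl_sampel_size": 500}
print(find_suspect_keys(cfg))  # {'pkl_sampel_size': 'pkl_sample_size'}
```

The dictionary passed in would normally be the parsed `dataset` block of your specification file.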
Lazy loading. When lazy_load is set to True, the dataloader does not
load every annotation .pkl file into memory up front. Instead it loads a pre-built
frame index and reads the underlying .pkl files on demand through a least-recently-used
(LRU) cache whose maximum size is controlled by lazy_load_cache_size. This keeps
the host-memory footprint bounded regardless of how many .pkl files the dataset spans.
Lazy loading requires that the ann_file of the training dataset is either a
directory of .pkl files or a .txt file listing one .pkl path per line, and that a
lazy-index cache file is present alongside the annotation file before training starts:
- If ann_file is a directory, the cache file is expected at <ann_file>/_lazy_index.pkl.
- If ann_file is a .txt list, the cache file is expected at the same path with the .txt extension replaced by _lazy_index.pkl (for example, train.txt → train_lazy_index.pkl).
The dataloader does not build this index automatically. If lazy loading is enabled and the
cache file is missing, the dataloader raises a FileNotFoundError and training stops, so you
must generate the _lazy_index.pkl file yourself beforehand.
The index file is a pickle containing a dict with a frame_index key. The value of
frame_index is a list of per-frame entries, where each entry is a dict that must include at
least:
- pkl_path: path to the .pkl file that holds that frame’s annotations. The dataloader reads these files on demand using the recorded pkl_path, so the paths must resolve to the actual annotation files at training time (use the same absolute or relative form that the pkl_cam_counts_path mapping below uses).
- scene_name: the scene the frame belongs to (used to sort and group frames).
- timestamp: the frame timestamp (used to sort frames within a scene).
The dict may optionally include a metadata key, which is surfaced as the dataset version.
If ann_file points to a single .pkl file, the dataloader ignores lazy loading and
automatically falls back to normal (non-lazy) loading, so no index file is needed in that case.
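Since the index must be generated by hand, the sketch below shows one way to write a `_lazy_index.pkl` with the structure described above for a directory of annotation files. It is an illustration under stated assumptions, not a TAO tool: `build_lazy_index` is a hypothetical helper, and because the internal layout of the annotation `.pkl` files is not described here, the caller supplies `iter_frames`, a hypothetical function that yields `(scene_name, timestamp)` for each frame in a file.

```python
import pickle
from pathlib import Path

def build_lazy_index(ann_dir, iter_frames, metadata=None):
    """Write <ann_dir>/_lazy_index.pkl and return its path.

    iter_frames(pkl_path) is caller-supplied (hypothetical) and must yield
    one (scene_name, timestamp) pair per frame stored in that file.
    """
    ann_dir = Path(ann_dir)
    frame_index = []
    for pkl_path in sorted(ann_dir.glob("*.pkl")):
        if pkl_path.name.startswith("_"):
            continue  # skip helper files such as the index itself
        for scene_name, timestamp in iter_frames(pkl_path):
            frame_index.append({
                "pkl_path": str(pkl_path),
                "scene_name": scene_name,
                "timestamp": timestamp,
            })
    # Pre-sorting is a convenience for inspection only.
    frame_index.sort(key=lambda e: (e["scene_name"], e["timestamp"]))

    index = {"frame_index": frame_index}
    if metadata is not None:
        index["metadata"] = metadata  # optional key described above

    out_path = ann_dir / "_lazy_index.pkl"
    with open(out_path, "wb") as f:
        pickle.dump(index, f)
    return out_path
```

Whatever form of `pkl_path` you store (absolute or relative) should be the same form you later use as keys in the camera-count mapping.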
Dynamic resampling. When pkl_sample_size is greater than 0 (and lazy loading is
enabled), each epoch draws a fresh, balanced subset of pkl_sample_size .pkl files
from the full index instead of training on all files. The subset is balanced by the number
of cameras per .pkl file, so that scenes captured with different camera counts are
represented evenly. When pkl_sample_size is 0 (the default), resampling is disabled
and lazy loading uses the full index.
The per-camera-count mapping is read from the pickle file pointed to by
pkl_cam_counts_path. This file is a pickle containing a single dict that maps each
.pkl path to the number of cameras in that .pkl file, for example:
{
    "/data/sparse4d/train/scene_0001.pkl": 6,
    "/data/sparse4d/train/scene_0002.pkl": 12,
    # ... one entry per .pkl file referenced by the lazy index
}
There is no built-in tool to produce this file, so you must generate it yourself (for
example, with a short script that opens each .pkl file and counts its cameras). The keys
in this mapping must match the pkl_path values stored in the lazy index exactly
(same absolute-vs-relative form and same string). Every .pkl path present in the lazy
index must have an entry in this mapping; if any path is missing, resampling raises a
KeyError that lists the unmatched paths. pkl_cam_counts_path is only consulted when
pkl_sample_size is greater than 0.
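To catch a mismatch before training raises a KeyError, you can compare the two files yourself. The sketch below assumes only the structures described above (a `frame_index` list of dicts with `pkl_path`, and a path-to-count dict); `missing_cam_count_paths` is a hypothetical helper name, and the sample data is made up for illustration.

```python
import pickle

def load_pickle(path):
    """Load a pickle file from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

def missing_cam_count_paths(lazy_index, cam_counts):
    """Return indexed pkl_path values that have no camera-count entry."""
    indexed = {entry["pkl_path"] for entry in lazy_index["frame_index"]}
    return sorted(indexed - set(cam_counts))

# Illustrative in-memory data; in practice load both files with load_pickle().
lazy_index = {"frame_index": [
    {"pkl_path": "/data/sparse4d/train/scene_0001.pkl",
     "scene_name": "scene_0001", "timestamp": 0.0},
    {"pkl_path": "/data/sparse4d/train/scene_0002.pkl",
     "scene_name": "scene_0002", "timestamp": 0.0},
]}
cam_counts = {"/data/sparse4d/train/scene_0001.pkl": 6}

print(missing_cam_count_paths(lazy_index, cam_counts))
# ['/data/sparse4d/train/scene_0002.pkl']
```

An empty result means every path in the index has a matching key, including the same absolute-versus-relative form.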
Resampling is driven automatically during training: a PklResampleCallback re-runs the
balanced sampling at every epoch boundary, then refreshes the in-batch sequence sampler so
it picks up the new sequence/scene grouping. The sampling is seeded reproducibly per epoch,
so each epoch sees a different but deterministic subset.
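To make the idea of balanced, per-epoch seeded sampling concrete, here is an illustrative sketch. It is not the PklResampleCallback implementation: the round-robin strategy, the `base_seed + epoch` seeding, and the name `sample_balanced` are all assumptions chosen to show the behavior described above.

```python
import random
from collections import defaultdict

def sample_balanced(cam_counts, sample_size, epoch, base_seed=0):
    """Pick up to sample_size paths, cycling over camera-count buckets."""
    rng = random.Random(base_seed + epoch)  # same epoch -> same subset

    # Group .pkl paths by their camera count.
    buckets = defaultdict(list)
    for path, n_cams in sorted(cam_counts.items()):
        buckets[n_cams].append(path)
    for paths in buckets.values():
        rng.shuffle(paths)

    # Take one path from each bucket in turn until enough are chosen
    # or every bucket is empty.
    chosen = []
    keys = sorted(buckets)
    while len(chosen) < sample_size and any(buckets[k] for k in keys):
        for k in keys:
            if buckets[k] and len(chosen) < sample_size:
                chosen.append(buckets[k].pop())
    return chosen
```

With ten 6-camera files and two 12-camera files, a sample of four contains two files of each kind, whereas uniform sampling would usually favor the 6-camera files.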
FPS-drop augmentation. fps_drop_prob sets the probability of temporally
downsampling a scene during loading to one of the frame rates listed in
target_fps_choices, which exposes the model to a range of effective frame rates.
Setting fps_drop_prob to 0 (the default) disables this augmentation.
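The effect of FPS-drop can be pictured as keeping every k-th frame of a scene. The sketch below is a simplified illustration, not the dataloader's code: `maybe_drop_fps` and its `source_fps` argument are hypothetical, and the integer stride means target rates that do not divide the source rate are only approximated.

```python
import random

def maybe_drop_fps(frames, source_fps, fps_drop_prob, target_fps_choices, rng=random):
    """With probability fps_drop_prob, keep every k-th frame of a scene."""
    if rng.random() >= fps_drop_prob:
        return frames  # augmentation not applied this time

    # Only rates at or below the source rate can be reached by dropping frames.
    candidates = [fps for fps in target_fps_choices if fps <= source_fps]
    if not candidates:
        return frames

    target_fps = rng.choice(candidates)
    stride = max(1, round(source_fps / target_fps))
    return frames[::stride]

frames = list(range(9))
print(maybe_drop_fps(frames, 30, 1.0, [10]))  # [0, 3, 6]
```

With `fps_drop_prob` set to 0 the function always returns the input unchanged, which matches the documented default behavior.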
Camera subsampling. When max_cameras is greater than 0, each training frame is
randomly subsampled to at most that many cameras. The default of -1 keeps all cameras.
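A minimal sketch of the camera-subsampling rule, assuming a frame is represented by a list of camera identifiers (`subsample_cameras` is a hypothetical name, and the real dataloader may select cameras differently):

```python
import random

def subsample_cameras(camera_ids, max_cameras, rng=random):
    """Keep at most max_cameras cameras; non-positive values keep all."""
    camera_ids = list(camera_ids)
    if max_cameras <= 0 or len(camera_ids) <= max_cameras:
        return camera_ids
    return sorted(rng.sample(camera_ids, max_cameras))

print(subsample_cameras(["cam_0", "cam_1", "cam_2"], -1))
# ['cam_0', 'cam_1', 'cam_2']
```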
The following example enables lazy loading with balanced dynamic resampling of 500 .pkl
files per epoch and FPS-drop augmentation:
dataset:
  type: "omniverse_3d_det_track"
  data_root: ???
  batch_size: 4
  num_workers: 4
  lazy_load: true
  lazy_load_cache_size: 50
  pkl_sample_size: 500
  pkl_cam_counts_path: /data/sparse4d/_pkl_cam_counts.pkl
  fps_drop_prob: 0.2
  target_fps_choices: [30, 20, 15, 10, 6, 5, 3, 2, 1]
  max_cameras: -1
  train_dataset:
    ann_file: /data/sparse4d/train  # directory of .pkl files (or a .txt list)
    test_mode: false
    use_valid_flag: true
    with_seq_flag: true
    sequences_split_num: 100
    keep_consistent_seq_aug: true
    same_scene_in_batch: true
Train Dataset Configuration (dataset.train_dataset)#
Configuration for the training dataset.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to annotation file |
??? |
||||
|
bool |
Test mode |
False |
||||
|
bool |
Use valid flag |
True |
||||
|
bool |
With sequence flag |
True |
||||
|
int |
Number of sequences |
100 |
1 |
infinity |
||
|
bool |
Keep consistent sequence augmentation |
True |
||||
|
bool |
Same scene in batch |
True |
Validation Dataset Configuration (dataset.val_dataset)#
Configuration for the validation dataset.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to annotation pickle files/folders |
??? |
||||
|
bool |
Test mode |
False |
||||
|
bool |
Use valid flag |
True |
||||
|
bool |
Tracking |
True |
||||
|
float |
Tracking threshold |
0.2 |
0 |
1 |
||
|
bool |
Same scene in batch |
True |
Test Dataset Configuration (dataset.test_dataset)#
Configuration for the test dataset.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to annotation pickle files/folders |
??? |
||||
|
bool |
Test mode |
True |
||||
|
bool |
Use valid flag |
True |
||||
|
bool |
Tracking |
True |
||||
|
float |
Tracking threshold |
0.2 |
0 |
1 |
||
|
bool |
Same scene in batch |
True |
Augmentation Configuration (dataset.augmentation)#
Configuration for data augmentation.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
list |
Resize limits |
[0.7, 0.77] |
false |
|||
|
list |
Final dimensions |
[512, 1408] |
false |
|||
|
list |
Bottom percentage limits |
[0.0, 0.0] |
false |
|||
|
list |
Rotation limits in degrees |
[-5.4, 5.4] |
false |
|||
|
list |
Original image size |
[1080, 1920] |
false |
|||
|
bool |
Random flip |
True |
||||
|
list |
3D rotation range in radians |
[-0.3925, 0.3925] |
false |
Normalize Configuration (dataset.normalize)#
Configuration for image normalization.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
list |
Mean values for normalization |
[123.675, 116.28, 103.53] |
false |
|||
|
list |
Standard deviation values for normalization |
[58.395, 57.12, 57.375] |
false |
|||
|
bool |
Convert to RGB |
True |
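The mean and standard deviation defaults are on the 0-255 pixel scale. The sketch below shows the usual per-channel normalization for a single pixel. It assumes the loader delivers pixels in BGR order and that `to_rgb` reverses the channels before normalization, as in common OpenCV-based pipelines; this ordering is an assumption, not something the tables above state.

```python
MEAN = [123.675, 116.28, 103.53]  # per-channel mean, RGB order
STD = [58.395, 57.12, 57.375]     # per-channel standard deviation, RGB order

def normalize_pixel(pixel, to_rgb=True):
    """Normalize one pixel; the input is assumed to be in BGR order."""
    channels = list(reversed(pixel)) if to_rgb else list(pixel)
    return [(value - mean) / std for value, mean, std in zip(channels, MEAN, STD)]

# A BGR pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel([103.53, 116.28, 123.675]))  # [0.0, 0.0, 0.0]
```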
Sequences Configuration (dataset.sequences)#
Configuration for handling image sequences.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of sequence splits |
100 |
1 |
infinity |
||
|
bool |
Keep consistent augmentation |
True |
||||
|
bool |
Keep same scene in batch |
True |
model#
The model parameter provides options to change the Sparse4D architecture.
model:
  type: "sparse4d"
  use_grid_mask: true
  use_deformable_func: true
  use_temporal_align: true
  input_shape: [1408, 512]
  embed_dims: 256
  neck:
    type: "FPN"
    num_outs: 4
    start_level: 0
    out_channels: 256
    in_channels: [256, 512, 1024, 2048]
    add_extra_convs: "on_output"
    relu_before_extra_convs: true
  depth_branch:
    type: "dense_depth"
    embed_dims: "${model.embed_dims}"
    num_depth_layers: 3
    loss_weight: 0.2
  head:
    type: "sparse4d"
    num_output: 300
    cls_threshold_to_reg: 0.05
    decouple_attn: true
    return_feature: true
    use_reid_sampling: false
    embed_dims: "${model.embed_dims}"
    num_groups: 8
    num_decoder: 6
    num_single_frame_decoder: 1
    drop_out: 0.1
    temporal: true
    with_quality_estimation: true
    instance_bank:
      num_anchor: 900
      anchor: ???
      num_temp_instances: 600
      confidence_decay: 0.8
      feat_grad: false
      default_time_interval: 0.033333
      embed_dims: "${model.embed_dims}"
      use_temporal_align: "${model.use_temporal_align}"
    anchor_encoder:
      type: 'SparseBox3DEncoder'
      vel_dims: 3
      embed_dims: [128, 32, 32, 64]
      mode: 'cat'
      output_fc: false
      in_loops: 1
      out_loops: 4
    operation_order: [
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm",
      "deformable", "ffn", "norm", "refine"
    ]
    temp_graph_model:
      type: "MultiheadAttention"
      embed_dims: 512
      num_heads: 8
      batch_first: true
      dropout: 0.1
    graph_model:
      type: "MultiheadAttention"
      embed_dims: "${model.head.temp_graph_model.embed_dims}"
      num_heads: "${model.head.temp_graph_model.num_heads}"
      batch_first: true
      dropout: "${model.head.temp_graph_model.dropout}"
    norm_layer:
      type: "LN"
      normalized_shape: "${model.embed_dims}"
    ffn:
      type: "AsymmetricFFN"
      in_channels: 512
      pre_norm:
        type: "LN"
      embed_dims: 256
      feedforward_channels: 1024
      num_fcs: 2
      ffn_drop: 0.1
      act_cfg:
        type: "ReLU"
        inplace: true
    deformable_model:
      embed_dims: "${model.embed_dims}"
      num_groups: 8
      num_levels: 4
      attn_drop: 0.15
      use_deformable_func: true
      use_camera_embed: false
      residual_mode: "cat"
      kps_generator:
        embed_dims: "${model.embed_dims}"
        num_learnable_pts: 6
        fix_scale:
          - [0, 0, 0]
          - [0.45, 0, 0]
          - [-0.45, 0, 0]
          - [0, 0.45, 0]
          - [0, -0.45, 0]
          - [0, 0, 0.45]
          - [0, 0, -0.45]
    refine_layer:
      type: "SparseBox3DRefinementModule"
      embed_dims: "${model.embed_dims}"
      refine_yaw: true
      with_quality_estimation: true
    sampler:
      num_dn_groups: 5
      num_temp_dn_groups: 3
      dn_noise_scale: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
      max_dn_gt: 128
      add_neg_dn: true
      cls_weight: 2.0
      box_weight: 0.25
      reg_weights: [2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
      use_temporal_align: "${model.use_temporal_align}"
    visibility_net:
      type: "visibility_net"
      embedding_dim: 256
      hidden_channels: 32
    loss:
      reg:
        type: "sparse_box_3d"
        box_weight: 0.25
        cls_allow_reverse: [5, 6, 7]
      cls:
        type: "focal"
        use_sigmoid: true
        gamma: 2.0
        alpha: 0.25
        loss_weight: 2.0
      id:
        type: "cross_entropy_label_smooth"
        num_ids: "${dataset.num_ids}"
    bnneck:
      type: "bnneck"
      feat_dim: 256
      num_ids: "${dataset.num_ids}"
    decoder:
      type: "SparseBox3DDecoder"
      score_threshold: 0.05
    reg_weights: [2.0, 2.0, 2.0, 1, 1, 1, 1, 1, 1, 1, 1]
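The long operation_order list in the example follows a regular pattern: a seven-step block repeated five times, followed by the first four steps once more, so that "refine" appears six times, matching num_decoder: 6. Reading this as one refine step per decoder layer is an inference from the example, not a documented rule. Under that assumption, the list can be generated rather than typed by hand (`build_operation_order` is a hypothetical helper):

```python
NUM_DECODER = 6  # model.head.num_decoder in the example above

# One repeated block, copied from the example specification.
BLOCK = ["deformable", "ffn", "norm", "refine", "temp_gnn", "gnn", "norm"]

def build_operation_order(num_decoder):
    """Repeat the block and stop right after the final refine step."""
    return BLOCK * (num_decoder - 1) + BLOCK[:4]

operation_order = build_operation_order(NUM_DECODER)
print(len(operation_order))             # 39
print(operation_order.count("refine"))  # 6
```

The generated list is identical to the 39-entry list in the example, which makes it easier to keep operation_order consistent if you experiment with a different number of decoder layers.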
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Model type |
sparse4d |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
bool |
Use grid mask |
True |
||||
|
bool |
Use deformable function |
True |
||||
|
list |
Input image shape |
[1408, 512] |
false |
|||
|
collection |
Backbone config |
false |
||||
|
collection |
Neck config |
false |
||||
|
collection |
Depth branch config |
false |
||||
|
collection |
Head config |
false |
||||
|
bool |
Use temporal alignment |
False |
Backbone Configuration (model.backbone)#
Configuration for the model’s backbone network. Currently, only resnet_101 is supported.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Backbone type |
resnet_101 |
resnet_101 |
Head Configuration (model.head)#
Top-level configuration for the detection and tracking head.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Head type |
sparse4d |
||||
|
int |
Number of output instances |
300 |
1 |
infinity |
||
|
float |
Classification threshold for regression |
0.05 |
0 |
1 |
||
|
bool |
Decouple attention |
True |
||||
|
bool |
Return instance features |
True |
||||
|
bool |
Use Re-ID sampling |
False |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Re-ID dimensions |
0 |
0 |
infinity |
||
|
int |
Number of groups |
8 |
1 |
infinity |
||
|
int |
Number of decoder layers |
6 |
1 |
infinity |
||
|
int |
Number of single-frame decoder layers |
1 |
1 |
infinity |
||
|
float |
Dropout rate |
0.1 |
0 |
1 |
||
|
bool |
Enable temporal modeling |
True |
||||
|
bool |
Enable quality estimation |
True |
||||
|
list |
Operation order |
[‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’, ‘temp_gnn’, ‘gnn’, ‘norm’, ‘deformable’, ‘ffn’, ‘norm’, ‘refine’] |
false |
|||
|
collection |
Visibility net config |
false |
||||
|
collection |
Instance bank config |
false |
||||
|
collection |
Anchor encoder config |
false |
||||
|
collection |
Sampler config |
false |
||||
|
list |
Regression weights |
[2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] |
false |
|||
|
collection |
Loss config |
false |
||||
|
collection |
BN neck config |
false |
||||
|
collection |
Deformable model config |
false |
||||
|
collection |
Refine layer config |
false |
||||
|
float |
Valid velocity weight |
-1 |
-1 |
infinity |
||
|
collection |
Graph model config |
false |
||||
|
collection |
Temp graph model config |
false |
||||
|
collection |
Decoder config |
false |
||||
|
collection |
Norm layer config |
false |
||||
|
collection |
FFN config |
false |
Deformable Model Configuration (model.head.deformable_model)#
Configuration for the deformable attention mechanism.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Number of groups |
8 |
1 |
infinity |
||
|
int |
Number of levels |
4 |
1 |
infinity |
||
|
float |
Attention dropout |
0.15 |
0 |
1 |
||
|
bool |
Use deformable function |
True |
||||
|
bool |
Use camera embedding |
False |
||||
|
categorical |
Residual mode |
cat |
cat,add |
|||
|
int |
Number of cameras |
6 |
1 |
infinity |
||
|
int |
Maximum number of cameras |
20 |
1 |
infinity |
||
|
float |
Projection dropout |
0.0 |
0 |
1 |
||
|
collection |
KPS generator config |
false |
Instance Bank Configuration (model.head.instance_bank)#
Configuration for managing object instances over time.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of anchors |
900 |
1 |
infinity |
||
|
string |
Path to anchor file |
|||||
|
int |
Number of temporal instances |
600 |
0 |
infinity |
||
|
float |
Confidence decay factor |
0.8 |
0 |
1 |
||
|
bool |
Enable gradients for features |
False |
||||
|
float |
Default time interval |
0.033333 |
0 |
infinity |
||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
bool |
Use temporal alignment |
False |
||||
|
float |
Grid size |
Anchor Encoder Configuration (model.head.anchor_encoder)#
Configuration for encoding anchor information.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Anchor encoder type |
SparseBox3DEncoder |
||||
|
int |
Velocity dimensions |
3 |
1 |
infinity |
||
|
list |
Embedding dimensions |
[128, 32, 32, 64] |
false |
|||
|
categorical |
Mode |
cat |
cat,add |
|||
|
bool |
Fully Connected Layer |
False |
||||
|
int |
In loops |
1 |
1 |
infinity |
||
|
int |
Out loops |
4 |
1 |
infinity |
||
|
bool |
Pos embed only |
False |
Sampler Configuration (model.head.sampler)#
Configuration for sampling positive and negative examples during training.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Number of De-Noising groups |
5 |
1 |
infinity |
||
|
int |
Number of temporal DN groups |
3 |
0 |
infinity |
||
|
list |
De-Noising scale |
[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5] |
false |
|||
|
int |
Maximum DN ground truth |
128 |
1 |
infinity |
||
|
bool |
Add negative DN |
True |
||||
|
float |
Classification weight |
2.0 |
0 |
infinity |
||
|
float |
Box weight |
0.25 |
0 |
infinity |
||
|
list |
Regression weights |
[2.0, 2.0, 2.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0] |
false |
|||
|
bool |
Use temporal alignment |
False |
||||
|
float |
Ground Truth assign threshold |
0.5 |
0 |
1 |
Loss Configuration (model.head.loss)#
This section details the different loss components used in the model head.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
collection |
Classification loss config |
false |
||||
|
collection |
Regression loss config |
false |
||||
|
collection |
ID loss config |
false |
Classification Loss (model.head.loss.cls)#
Configuration for the classification loss.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Classification loss type |
focal |
||||
|
bool |
Use sigmoid |
True |
||||
|
float |
Focal loss gamma |
2.0 |
0 |
infinity |
||
|
float |
Focal loss alpha |
0.25 |
0 |
1 |
||
|
float |
Loss weight |
2.0 |
0 |
infinity |
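For reference, the sketch below evaluates the standard sigmoid focal loss for a single logit using the default gamma and alpha from the table. It is the textbook formula written out in plain Python, not the TAO implementation, and it leaves out the separate loss_weight multiplier.

```python
import math

def sigmoid_focal_loss(logit, target, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction (target is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))           # sigmoid probability
    p_t = p if target == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor down-weights easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An uncertain prediction (p = 0.5) for a positive and a negative target.
print(round(sigmoid_focal_loss(0.0, 1), 4))  # 0.0433
print(round(sigmoid_focal_loss(0.0, 0), 4))  # 0.13
```

With alpha at 0.25, negatives receive the larger weight (0.75), which is why the second value is three times the first.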
Regression Loss (model.head.loss.reg)#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Regression loss type |
sparse_box_3d |
||||
|
float |
Box loss weight |
0.25 |
0 |
infinity |
||
|
list |
Class allow reverse |
[] |
false |
ID Loss (model.head.loss.id)#
Configuration for the ID / Re-ID loss.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
ID loss type |
cross_entropy_label_smooth |
||||
|
int |
Number of IDs |
70 |
1 |
infinity |
BNNeck Configuration (model.head.bnneck)#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Batch Normalization Neck |
bnneck |
||||
|
int |
Feature dimension |
256 |
1 |
infinity |
||
|
int |
Number of IDs |
70 |
1 |
infinity |
KPS Generator Configuration (model.head.deformable_model.kps_generator)#
Configuration for KeyPoint (Sampling) Generator.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Number of learnable points |
6 |
1 |
infinity |
||
|
list |
Fixed scale |
[[0, 0, 0], [0.45, 0, 0], [-0.45, 0, 0], [0, 0.45, 0], [0, -0.45, 0], [0, 0, 0.45], [0, 0, -0.45]] |
false |
Refine Layer Configuration (model.head.refine_layer)#
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Refine layer type |
sparse_box_3d_refinement_module |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
bool |
Refine yaw |
True |
||||
|
bool |
With quality estimation |
True |
Graph Model Configuration (model.head.graph_model and model.head.temp_graph_model)#
Configuration for graph-based modeling (e.g., GNN or attention) used for spatial and temporal relations.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Graph model type |
MultiheadAttention |
||||
|
int |
Embedding dimensions |
512 |
1 |
infinity |
||
|
int |
Number of heads |
8 |
1 |
infinity |
||
|
bool |
Batch first |
True |
||||
|
float |
Dropout rate |
0.1 |
0 |
1 |
Decoder Configuration (model.head.decoder)#
Configuration for the final output decoder.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Decoder type |
SparseBox3DDecoder |
||||
|
float |
Score threshold |
0.05 |
0 |
1 |
Norm Layer Configuration (model.head.norm_layer and model.head.ffn.pre_norm)#
Configuration for normalization layers.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Norm layer type |
LN |
||||
|
int |
Normalized shape |
256 |
1 |
infinity |
FFN Configuration (model.head.ffn)#
Configuration for Feed-Forward Networks used in the decoder layers.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
FFN type |
AsymmetricFFN |
||||
|
int |
In channels |
512 |
1 |
infinity |
||
|
collection |
Pre-norm config |
false |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Feedforward channels |
1024 |
1 |
infinity |
||
|
int |
Number of fully connected layers |
2 |
1 |
infinity |
||
|
float |
FFN dropout |
0.1 |
0 |
1 |
||
|
collection |
Activation config |
false |
Activation Configuration (model.head.ffn.act_cfg)#
Configuration for activation functions.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Activation type |
ReLU |
||||
|
bool |
Inplace |
True |
Visibility Net Configuration (model.head.visibility_net)#
Configuration for the visibility prediction network.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
VisibilityNet type |
visibility_net |
||||
|
int |
Embedding dimension |
256 |
1 |
infinity |
||
|
int |
Hidden channels |
32 |
1 |
infinity |
Neck Configuration (model.neck)#
Configuration for the model’s neck (Feature Pyramid Network).
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Neck - Feature Pyramid Network |
FPN |
FPN |
|||
|
int |
4 |
1 |
infinity |
|||
|
int |
Start level for FPN |
0 |
0 |
infinity |
||
|
int |
Output channels |
256 |
1 |
infinity |
||
|
list |
Input channels |
[256, 512, 1024, 2048] |
false |
|||
|
categorical |
Type of extra conv |
on_output |
on_input,on_lateral,on_output,False |
|||
|
bool |
Apply ReLU before extra convs |
True |
Depth Branch Configuration (model.depth_branch)#
Configuration for the depth estimation branch.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Depth branch type |
dense_depth |
||||
|
int |
Embedding dimensions |
256 |
1 |
infinity |
||
|
int |
Number of depth layers |
3 |
1 |
infinity |
||
|
float |
Weight for depth loss |
0.2 |
0 |
infinity |
train#
The train config contains the parameters related to training. They are described as follows:
train:
  num_epochs: 5
  num_nodes: 1
  num_gpus: 1
  validation_interval: 1
  checkpoint_interval: 1
  pretrained_model_path: ???
  precision: bf16
  optim:
    type: "adamw"
    lr: 0.0001
    weight_decay: 0.001
    paramwise_cfg:
      custom_keys:
        img_backbone:
          lr_mult: 0.25
    grad_clip:
      max_norm: 25
      norm_type: 2
    lr_scheduler:
      policy: "cosine"
      warmup: "linear"
      warmup_iters: 500
      warmup_ratio: 0.333333
      min_lr_ratio: 0.001
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
int |
The number of GPUs to run the train job |
1 |
1 |
|||
|
list |
List of GPU IDs to run the training on. The length of this list must be equal to the number of gpus in train.num_gpus |
[0] |
false |
|||
|
int |
Number of nodes to run the training on. If > 1, then multi-node is enabled |
1 |
1 |
|||
|
int |
The seed for the initializer in PyTorch. If < 0, disable fixed seed |
1234 |
-1 |
infinity |
||
|
collection |
false |
|||||
|
int |
Number of epochs to run the training |
10 |
1 |
infinity |
||
|
float |
Checkpoint interval in epochs |
0.5 |
0 |
infinity |
||
|
float |
Validation interval in epochs |
0.5 |
0 |
infinity |
||
|
string |
Path to the checkpoint to resume training from |
|||||
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
string |
Path to pretrained model |
|||||
|
collection |
Optimizer configuration |
false |
||||
|
categorical |
Precision |
bf16 |
bf16,fp16,fp32 |
optim#
The optim parameter defines the optimizer configuration for training (AdamW by default), including the
learning rate, learning rate scheduler, and weight decay.
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
categorical |
Optimizer type |
adamw |
adamw,adam,sgd |
|||
|
float |
Learning rate |
5e-05 |
0 |
infinity |
TRUE |
|
|
float |
Weight decay coefficient |
0.001 |
||||
|
float |
Momentum for SGD |
0.9 |
||||
|
collection |
Parameter-wise configuration |
{‘custom_keys’: {‘img_backbone’: {‘lr_mult’: 0.2}}} |
false |
|||
|
collection |
Gradient clipping configuration |
{‘max_norm’: 25, ‘norm_type’: ‘L2’} |
false |
|||
|
collection |
Learning rate scheduler configuration |
{‘policy’: ‘cosine’, ‘warmup’: ‘linear’, ‘warmup_iters’: 500, ‘warmup_ratio’: 0.333333, ‘min_lr_ratio’: 0.001} |
false |
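To give a feel for how the scheduler settings interact, the sketch below computes a learning rate under one common convention: cosine annealing from lr down to lr × min_lr_ratio, with a linear warmup that starts at warmup_ratio of the scheduled rate. Whether TAO uses exactly this formula is an assumption, and `lr_at` and `total_steps` are hypothetical names.

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup_iters=500,
          warmup_ratio=0.333333, min_lr_ratio=0.001):
    """Learning rate at a given step: linear warmup on top of cosine decay."""
    min_lr = base_lr * min_lr_ratio
    progress = min(step / total_steps, 1.0)
    cosine_lr = min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    if step < warmup_iters:
        # Scale rises linearly from warmup_ratio (step 0) to 1 (end of warmup).
        k = (1.0 - step / warmup_iters) * (1.0 - warmup_ratio)
        return cosine_lr * (1.0 - k)
    return cosine_lr

for step in (0, 500, 5000, 10000):
    print(step, lr_at(step, total_steps=10000))
```

Under this convention the rate starts at about one third of lr, reaches roughly lr at the end of warmup, and decays to lr × 0.001 at the final step. Parameters matched by the img_backbone custom key would additionally be scaled by lr_mult.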
evaluate#
The evaluate config contains the parameters related to evaluation. Currently, we only support evaluation on a single GPU with batch size 1. The parameters are described as follows:
evaluate:
  checkpoint: ${results_dir}/train/sparse4d_model_latest.pth
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to the checkpoint used for evaluation |
??? |
||||
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
list |
Metrics to evaluate |
['detection'] |
false |
|||
|
collection |
Tracking config |
false |
Set the evaluation checkpoint path with evaluate.checkpoint in the experiment specification.
visualize#
The visualize config contains the parameters related to visualization. They are described as follows:
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
bool |
Show visualization |
True |
||||
|
string |
Visualization directory |
./vis |
||||
|
float |
Visualization score threshold |
0.25 |
0 |
1 |
||
|
int |
Number of images per column |
6 |
1 |
infinity |
||
|
int |
Visualization down sample |
3 |
1 |
infinity |
inference#
The inference config contains the parameters related to inference. Currently, we only support inference on a single GPU with batch size 1. The parameters are described as follows:
inference:
  checkpoint: ???
  output_nvschema: true
  jsonfile_prefix: "sparse4d_pred"
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to checkpoint file |
??? |
||||
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
string |
JSON file prefix |
sparse4d_pred |
||||
|
bool |
Output NVSchema |
True |
||||
|
collection |
Tracking config |
false |
Set the inference checkpoint path with inference.checkpoint in the experiment specification.
export#
The export config contains the parameters related to export. Currently, we only support export with batch size 1 and a dynamic number of camera sensors. The parameters are described as follows:
export:
  results_dir: ???
  checkpoint: ???
  onnx_file: ???
Field |
value_type |
description |
default_value |
valid_min |
valid_max |
valid_options |
automl_enabled |
|---|---|---|---|---|---|---|---|
|
string |
Path to where all the assets generated from a task are stored |
|||||
|
string |
Path to the checkpoint file to run export |
??? |
||||
|
string |
Path to the ONNX model file |
??? |
Set the export checkpoint path with export.checkpoint in the experiment specification.
Training the Model#
Use the following command to run Sparse4D training:
Evaluating the Model#
The evaluation metrics for Sparse4D are the mean average precision and ranked accuracy.
Use the following command to run Sparse4D evaluation:
Running Inference on the Model#
Use the following command to run inference on Sparse4D with the .pth model.
The output will be a file with JSON logs consisting of object detection and tracking results for each frame.
The expected output is as follows:
{
  "version": "4.0",
  "id": "1", # Frame ID
  "sensorId": "bev-sensor-zone-c4", # BEV Sensor ID
  "timestamp": "2025-01-15T10:30:00.123Z", # Timestamp
  "objects": [
    {
      "id": "1", # Object ID
      "type": "Person", # Object Type
      "confidence": 0.887, # Object Confidence Score
      "coordinate": {
        "x": -1.5, # Object Center X Coordinate
        "y": 3.2, # Object Center Y Coordinate
        "z": 0.75 # Object Center Z Coordinate
      },
      "bbox3d": {
        "coordinates": [
          -1.5, # Object Centroid X Coordinate
          3.2, # Object Centroid Y Coordinate
          0.75, # Object Centroid Z Coordinate
          0.5, # Object Width
          0.5, # Object Length
          0.5, # Object Height
          0.0, # Object Pitch
          0.0, # Object Roll
          1.57 # Object Yaw
        ],
        "embedding": [
          {} # Object Embedding
        ],
        "confidence": 0.887 # Object Confidence Score
      }
    },
    {
      "id": "2",
      "type": "Humanoid",
      "confidence": 0.752,
      "coordinate": {
        "x": 5.1,
        "y": -2.8,
        "z": 0.15
      },
      "bbox3d": {
        "coordinates": [
          5.1,
          -2.8,
          0.15,
          1.2,
          1.0,
          0.2,
          0.0,
          0.0,
          -1.04
        ],
        "embedding": [
          {}
        ],
        "confidence": 0.752
      }
    }
  ]
}
{
  # ... more frames
}
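As a rough illustration of consuming this output, the sketch below unpacks one frame into flat records with named box fields, following the order annotated in the example (center, width, length, height, pitch, roll, yaw). The helper `unpack_objects` and the `BOX_FIELDS` tuple are hypothetical names. The sample frame is the example above with the `#` annotations removed, since those annotations are explanatory and would not parse as JSON. How frames are delimited in the real output file is not shown here and would need to be checked against an actual run.

```python
import json

# Order of the nine bbox3d values, as annotated in the example output.
BOX_FIELDS = ("x", "y", "z", "width", "length", "height", "pitch", "roll", "yaw")


def unpack_objects(frame, min_confidence=0.0):
    """Flatten one frame's detections into a list of dicts."""
    rows = []
    for obj in frame.get("objects", []):
        if obj["confidence"] < min_confidence:
            continue  # drop low-confidence detections
        box = dict(zip(BOX_FIELDS, obj["bbox3d"]["coordinates"]))
        rows.append({
            "frame_id": frame["id"],
            "sensor_id": frame["sensorId"],
            "object_id": obj["id"],
            "type": obj["type"],
            "confidence": obj["confidence"],
            **box,
        })
    return rows


sample_frame = json.loads("""
{
  "version": "4.0", "id": "1", "sensorId": "bev-sensor-zone-c4",
  "timestamp": "2025-01-15T10:30:00.123Z",
  "objects": [
    {"id": "1", "type": "Person", "confidence": 0.887,
     "coordinate": {"x": -1.5, "y": 3.2, "z": 0.75},
     "bbox3d": {"coordinates": [-1.5, 3.2, 0.75, 0.5, 0.5, 0.5, 0.0, 0.0, 1.57],
                "embedding": [{}], "confidence": 0.887}},
    {"id": "2", "type": "Humanoid", "confidence": 0.752,
     "coordinate": {"x": 5.1, "y": -2.8, "z": 0.15},
     "bbox3d": {"coordinates": [5.1, -2.8, 0.15, 1.2, 1.0, 0.2, 0.0, 0.0, -1.04],
                "embedding": [{}], "confidence": 0.752}}
  ]
}
""")

rows = unpack_objects(sample_frame, min_confidence=0.8)
for row in rows:
    print(row["object_id"], row["type"], row["yaw"])
```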
Exporting the Model#
Use the following command to export Sparse4D to .onnx format for deployment:
Quantization#
Sparse4D supports PTQ via TAO Quant using either the torchao (weight-only) or modelopt (static PTQ) backends.
Add a quantize section to your experiment specification (see TAO Quant documentation for schema and backend options).
Run:
Use the quantized checkpoint by setting evaluate.is_quantized: true or inference.is_quantized: true and pointing to the artifact saved under results_dir (for example, quantized_model_torchao.pth or quantized_model_modelopt.pth). For ModelOpt artifacts, the model weights are stored under model_state_dict.
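If you inspect a quantized artifact yourself, the note about `model_state_dict` matters because the weights may sit one level down in the loaded object. The sketch below shows one way to handle both layouts. The helper `unwrap_state_dict` is a hypothetical name, plain dictionaries stand in for loaded checkpoints, and the assumption that other artifacts hold the weights at the top level is not confirmed by this page.

```python
def unwrap_state_dict(checkpoint):
    """Return the weights mapping from a loaded checkpoint object.

    ModelOpt artifacts keep weights under "model_state_dict" (per the text
    above); anything else is assumed to already be the weights mapping.
    """
    if isinstance(checkpoint, dict) and "model_state_dict" in checkpoint:
        return checkpoint["model_state_dict"]
    return checkpoint


# Stand-ins for loaded checkpoints; real ones would come from a loader
# such as torch.load, which is not shown here.
modelopt_like = {"model_state_dict": {"head.weight": [0.1, 0.2]}}
flat_like = {"head.weight": [0.1, 0.2]}

print(sorted(unwrap_state_dict(modelopt_like)))
print(sorted(unwrap_state_dict(flat_like)))
```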
Notes#
For modelopt static PTQ, ensure that your dataset configuration provides a representative calibration loader.
For torchao, activation settings in the configuration are ignored.
Calibration Dataset (ModelOpt)#
When you use the modelopt backend (static PTQ), provide a calibration dataset via dataset.quant_calibration_dataset.
Minimal example:
quantize:
  backend: "modelopt"
  mode: "static_ptq"
  algorithm: "minmax"
dataset:
  quant_calibration_dataset:
    images_dir: "/path/to/calib/images"
See also: TAO Quant overview and its Configuration and backend pages.
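A quick pre-flight check can catch a missing calibration dataset before a quantization run starts. The sketch below checks a specification that has already been parsed into a dictionary (for example, by a YAML loader, not shown). The function `check_quant_spec` is a hypothetical helper rather than part of TAO, and it only checks the one requirement stated above.

```python
def check_quant_spec(spec):
    """Return a list of problems found in a parsed experiment specification."""
    problems = []
    quant = spec.get("quantize") or {}
    if quant.get("backend") == "modelopt":
        dataset = spec.get("dataset") or {}
        # Static PTQ with modelopt needs calibration data.
        if not dataset.get("quant_calibration_dataset"):
            problems.append(
                "modelopt backend requires dataset.quant_calibration_dataset"
            )
    return problems


good_spec = {
    "quantize": {"backend": "modelopt", "mode": "static_ptq", "algorithm": "minmax"},
    "dataset": {"quant_calibration_dataset": {"images_dir": "/path/to/calib/images"}},
}
bad_spec = {"quantize": {"backend": "modelopt", "mode": "static_ptq"}}

print(check_quant_spec(good_spec))
print(check_quant_spec(bad_spec))
```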
TensorRT engine generation and deploying to DeepStream#
Refer to the NVIDIA Spatial AI documentation page for more information about deploying a Sparse4D model to DeepStream via TensorRT engine generation.