NVIDIA TAO v5.5.0

BEVFusion

BEVFusion is a 3D object-detection model included in TAO. It supports the following tasks:

  • convert

  • train

  • evaluate

  • inference

A sample inference result is shown below.

Figure: TAO BEVFusion inference image (bevfusion_sample.png)

These tasks may be invoked from the TAO Launcher using the following convention on the command line:


tao model bevfusion <sub_task> <args_per_subtask>

where <args_per_subtask> are the command-line arguments required for a given subtask. Each of these subtasks is explained below.

The dataset for BEVFusion contains point cloud data, RGB images, and the corresponding annotations of 3D objects. The directory should be organized in the KITTI directory structure:


/kitti
    /training
        /calib
            000000.txt
            000001.txt
            ...
            N.txt
        /image_2
            000000.png
            000001.png
            ...
            N.png
        /label_2
            000000.txt
            000001.txt
            ...
            N.txt
        /velodyne
            000000.bin
            000001.bin
            ...
            N.bin
    /ImageSets
        train.txt
        val.txt
        test.txt

Each .bin file should store the lidar points of a frame as float32 values in (x, y, z, intensity) order, and each .txt label file should comply with the KITTI label format.
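
For reference, each line of a KITTI label file describes one object with 15 fields: class name, truncation, occlusion, observation angle (alpha), 2D bounding box (left, top, right, bottom), 3D dimensions (height, width, length, in meters), 3D location (x, y, z, in camera coordinates, in meters), and rotation around the camera y-axis. An illustrative pedestrian label (the values are placeholders, not from a real dataset):

Pedestrian 0.00 0 -0.20 712.40 143.00 810.73 307.92 1.89 0.48 1.20 1.84 1.47 8.41 0.01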

Below is a sample BEVFusion spec file. It has five components (model, inference, evaluate, dataset, and train) as well as several global parameters, which are described below. The spec file is in YAML format.

Here’s a sample of the BEVFusion spec file:


results_dir: /results/bevfusion
dataset:
  type: KittiPersonDataset
  root_dir: /data/
  gt_box_type: camera
  default_cam_key: CAM2
  train_dataset:
    repeat_time: 2
    ann_file: /data/kitti_person_infos_train.pkl
    data_prefix:
      pts: training/velodyne_reduced
      img: training/image_2
    batch_size: 4
    num_workers: 8
  val_dataset:
    ann_file: /data/kitti_person_infos_val.pkl
    data_prefix:
      pts: training/velodyne_reduced
      img: training/image_2
    batch_size: 2
    num_workers: 4
  test_dataset:
    ann_file: /data/kitti_person_infos_val.pkl
    data_prefix:
      pts: training/velodyne_reduced
      img: training/image_2
    batch_size: 4
    num_workers: 4
model:
  type: BEVFusion
  point_cloud_range: [0, -40, -3, 70.4, 40, 1]
  voxel_size: [0.05, 0.05, 0.1]
  grid_size: [1440, 1440, 41]
train:
  num_gpus: 1
  num_nodes: 1
  validation_interval: 1
  num_epochs: 5
  optimizer:
    type: AdamW
    lr: 0.0002
  lr_scheduler:
    - type: LinearLR
      start_factor: 0.33333333
      by_epoch: False
      begin: 0
      end: 500
    - type: CosineAnnealingLR
      T_max: 10
      begin: 0
      end: 10
      by_epoch: True
      eta_min_ratio: 1e-4
    - type: CosineAnnealingMomentum
      eta_min: 0.8947
      begin: 0
      end: 2.4
      by_epoch: True
    - type: CosineAnnealingMomentum
      eta_min: 1
      begin: 2.4
      end: 10
      by_epoch: True
inference:
  num_gpus: 1
  conf_threshold: 0.3
  checkpoint: /results/train/bevfusion_model.pth
evaluate:
  num_gpus: 1
  checkpoint: /results/train/bevfusion_model.pth

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| results_dir | string | Path to where all the assets generated from a task are stored. | /results | | | | FALSE |
| default_scope | string | Default scope to use | mmdet3d | | | mmdet3d | FALSE |
| default_hooks | collection | Default hooks for MMLab | {'timer': {'type': 'IterTimerHook'}, 'logger': {'type': 'LoggerHook', 'interval': 1, 'log_metric_by_epoch': True}, 'param_scheduler': {'type': 'ParamSchedulerHook'}, 'checkpoint': {'type': 'CheckpointHook', 'by_epoch': True, 'interval': 1}, 'sampler_seed': {'type': 'DistSamplerSeedHook'}, 'visualization': {'type': 'Det3DVisualizationHook'}} | | | | FALSE |
| logger_hook | string | Default logger hook type | TAOBEVFusionLoggerHook | | | | FALSE |
| manual_seed | int | Optional manual seed. The seed is set when a value is given in the spec file. | | | | | FALSE |
| input_modality | collection | Input modality for the model. Set True for each modality to use. | {'use_lidar': True, 'use_camera': True, 'use_radar': False, 'use_map': False, 'use_external': False} | | | | FALSE |
| model | collection | Configurable parameters to construct the model for a BEVFusion experiment. | | | | | FALSE |
| dataset | collection | Configurable parameters to construct the dataset for a BEVFusion experiment. | | | | | FALSE |
| train | collection | Configurable parameters to construct the trainer for a BEVFusion experiment. | | | | | FALSE |
| evaluate | collection | Configurable parameters to construct the evaluator for a BEVFusion experiment. | | | | | FALSE |
| inference | collection | Configurable parameters to construct the inferencer for a BEVFusion experiment. | | | | | FALSE |

Data Preprocessor Config

The data preprocessor configuration (data_preprocessor) defines the input pre-processing hyperparameters, such as image normalization and voxelization.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Name of the data pre-processor for 3D fusion | Det3DDataPreprocessor | | | | FALSE |
| mean | list | The input mean for RGB frames | [123.675, 116.28, 103.53] | | | | FALSE |
| std | list | The input standard deviation per pixel for RGB frames | [58.395, 57.12, 57.375] | | | | FALSE |
| bgr_to_rgb | bool | Whether to convert images from BGR to RGB. | False | | | | FALSE |
| pad_size_divisor | int | The value the padded image size must be divisible by. | 32 | | | | FALSE |
| voxelize_cfg | collection | Voxelization settings. | {'max_num_points': 10, 'max_voxels': [120000, 160000], 'voxelize_reduce': True} | | | | FALSE |
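
A minimal sketch of overriding the data preprocessor in the spec file, assuming the defaults above (data_preprocessor is nested under model, per the Model Config table):

model:
  data_preprocessor:
    type: Det3DDataPreprocessor
    mean: [123.675, 116.28, 103.53]
    std: [58.395, 57.12, 57.375]
    pad_size_divisor: 32
    voxelize_cfg:
      max_num_points: 10             # maximum points kept per voxel
      max_voxels: [120000, 160000]   # voxel caps (training, testing)
      voxelize_reduce: True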

Dataset Config

The dataset configuration (dataset) defines the dataset directories, annotation files, and batch sizes for the train, val, and test splits.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Dataset type for 3D fusion | KittiPersonDataset | | | TAO3DSyntheticDataset,TAO3DDataset,KittiPersonDataset | FALSE |
| root_dir | string | A path to the root directory of the given dataset | /data/ | | | | FALSE |
| classes | list | A list of the classes to be trained. | ['person'] | | | | FALSE |
| box_type_3d | string | The 3D bounding box type to use during training. | lidar | | | lidar,camera | FALSE |
| gt_box_type | string | The 3D bounding box type in the ground truth. | camera | | | lidar,camera | FALSE |
| origin | list | The origin of the given center point in ground truth 3D bounding boxes. | [0.5, 1.0, 0.5] | | | | FALSE |
| default_cam_key | string | Default camera name in the dataset | CAM0 | | | | FALSE |
| per_sequence | bool | Whether to save results in per-sequence format. | False | | | | FALSE |
| num_views | int | Number of camera views in the dataset. | 1 | | | | FALSE |
| point_cloud_dim | int | Input lidar point cloud data dimension | 4 | | | | FALSE |
| train_dataset | collection | Configurable parameters to construct the train dataset. | | | | | FALSE |
| val_dataset | collection | Configurable parameters to construct the validation dataset. | | | | | FALSE |
| test_dataset | collection | Configurable parameters to construct the test dataset. | | | | | FALSE |
| img_file | string | Image file for single-file inference | | | | | FALSE |
| pc_file | string | Point cloud file for single-file inference | | | | | FALSE |
| cam2img | list | Camera intrinsic matrix for single-file inference | | | | | FALSE |
| lidar2cam | list | Lidar-to-camera extrinsic matrix for single-file inference | | | | | FALSE |
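
The last four fields enable single-file inference. A hypothetical sketch (the paths and calibration values below are placeholders following the KITTI calibration convention, not defaults):

dataset:
  img_file: /data/training/image_2/000000.png
  pc_file: /data/training/velodyne_reduced/000000.bin
  cam2img:                       # 3x4 camera projection matrix (placeholder)
    - [721.54, 0.0, 609.56, 0.0]
    - [0.0, 721.54, 172.85, 0.0]
    - [0.0, 0.0, 1.0, 0.0]
  lidar2cam:                     # 4x4 lidar-to-camera extrinsic (placeholder)
    - [0.0, -1.0, 0.0, 0.0]
    - [0.0, 0.0, -1.0, 0.0]
    - [1.0, 0.0, 0.0, 0.0]
    - [0.0, 0.0, 0.0, 1.0]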

Model Config

The model configuration (model) defines the BEVFusion model structure. This model is used for training, evaluation, and inference. A detailed description is included in the table below.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Model name | BEVFusion | | | BEVFusion | FALSE |
| point_cloud_range | list | The point cloud range | [0, -40, -3, 70.4, 40, 1] | | | | FALSE |
| voxel_size | list | The voxel size for voxelization | [0.05, 0.05, 0.1] | | | | FALSE |
| post_center_range | list | The post-processing center filter range | [-61.2, -61.2, -20.0, 61.2, 61.2, 20.0] | | | | FALSE |
| grid_size | list | The grid size for the BEVFusion model | [1440, 1440, 41] | | | | FALSE |
| data_preprocessor | collection | Configurable parameters to construct the preprocessor for the BEVFusion model. | | | | | FALSE |
| img_backbone | collection | Configurable parameters to construct the camera image backbone for the BEVFusion model. | | | | | FALSE |
| img_neck | collection | Configurable parameters to construct the camera image neck for the BEVFusion model. | | | | | FALSE |
| view_transform | collection | Configurable parameters to construct the camera view transform for the BEVFusion model. | | | | | FALSE |
| pts_backbone | collection | Configurable parameters to construct the lidar point cloud backbone for the BEVFusion model. | | | | | FALSE |
| pts_voxel_encoder | collection | Configurable parameters to construct the lidar point cloud voxel encoder for the BEVFusion model. | {'type': 'HardSimpleVFE', 'num_features': 4} | | | | FALSE |
| pts_middle_encoder | collection | Configurable parameters to construct the lidar encoder for the BEVFusion model. | | | | | FALSE |
| pts_neck | collection | Configurable parameters to construct the lidar neck for the BEVFusion model. | | | | | FALSE |
| fusion_layer | collection | Configurable parameters to construct the fusion layer for the BEVFusion model. | | | | | FALSE |
| bbox_head | collection | Configurable parameters to construct the bounding box head for the BEVFusion model. | | | | | FALSE |

Image Backbone Config

The backbone configuration (img_backbone) defines the image backbone structure. A detailed description is included in the table below. Currently, BEVFusion supports only the Swin Transformer and ResNet-50 image backbones.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Name of the image backbone for 3D fusion | mmdet.SwinTransformer | | | | FALSE |
| embed_dims | int | Number of input channels. | 96 | | | | FALSE |
| depths | list | Depths of each Swin Transformer stage. | [2, 2, 6, 2] | | | | FALSE |
| num_heads | list | Number of attention heads for each stage. | [3, 6, 12, 24] | | | | FALSE |
| window_size | int | Window size for the Swin Transformer. | 7 | | | | FALSE |
| mlp_ratio | int | Ratio of the MLP hidden dim to the embedding dim. | 4 | | | | FALSE |
| qkv_bias | bool | If True, add a learnable bias to the query, key, and value. | True | | | | FALSE |
| qk_scale | string | Overrides the default qk scale of head_dim ** -0.5 if set. | | | | | FALSE |
| drop_rate | float | Dropout rate. | 0.0 | | | | FALSE |
| attn_drop_rate | float | Attention dropout rate. | 0.0 | | | | FALSE |
| drop_path_rate | float | Stochastic depth drop rate. | 0.2 | | | | FALSE |
| patch_norm | bool | If True, add normalization after patch embedding. | True | | | | FALSE |
| out_indices | list | The stages to output from. | [1, 2, 3] | | | | FALSE |
| with_cp | bool | Whether to use checkpointing. Checkpointing saves some memory while slowing down training. | False | | | | FALSE |
| convert_weights | bool | Whether the pre-trained model is from the original repo. | True | | | | FALSE |
| init_cfg | collection | Configuration for initialization. | | | | | FALSE |
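
For example, to trade some training speed for lower GPU memory, you might enable activation checkpointing on the default Swin backbone; a sketch using the defaults above, where with_cp: True is the only override:

model:
  img_backbone:
    type: mmdet.SwinTransformer
    embed_dims: 96
    depths: [2, 2, 6, 2]
    num_heads: [3, 6, 12, 24]
    window_size: 7
    out_indices: [1, 2, 3]
    with_cp: True   # checkpoint activations to save memory, at some speed cost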

Image Neck Config

The neck configuration (img_neck) defines the image neck structure. A detailed description is included in the table below. Currently, BEVFusion supports only the GeneralizedLSSFPN image neck.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Image neck name | GeneralizedLSSFPN | | | | FALSE |
| in_channels | list | The number of input channels for the image neck. | [192, 384, 768] | | | | FALSE |
| out_channels | int | The number of output channels for the image neck. | 256 | | | | FALSE |
| start_level | int | The starting level for the image neck. | 0 | | | | FALSE |
| num_outs | int | The number of outputs for the image neck. | 0 | | | | FALSE |
| norm_cfg | collection | The normalization configuration for the image neck. | {'type': 'BN2d', 'requires_grad': True} | | | | FALSE |
| act_cfg | collection | The activation configuration for the image neck. | {'type': 'ReLU', 'inplace': True} | | | | FALSE |
| upsample_cfg | collection | The upsampling configuration for the image neck. | {'mode': 'bilinear', 'align_corners': False} | | | | FALSE |

View Transform Config

The configuration (view_transform) defines the view transform structure for camera input. A detailed description is included in the table below. Currently, BEVFusion supports only the DepthLSSTransform and LSSTransform view transforms.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Image view transform name. | DepthLSSTransform | | | DepthLSSTransform,LSSTransform | FALSE |
| in_channels | int | The number of input channels for the view transform. | 256 | | | | FALSE |
| out_channels | int | The number of output channels for the view transform. | 80 | | | | FALSE |
| image_size | list | Image size for the view transform. | [256, 704] | | | | FALSE |
| feature_size | list | Feature size for the view transform. | [32, 88] | | | | FALSE |
| xbound | list | The grid range for the x-axis. | [-54.0, 54.0, 0.3] | | | | FALSE |
| ybound | list | The grid range for the y-axis. | [-54.0, 54.0, 0.3] | | | | FALSE |
| zbound | list | The grid range for the z-axis. | [-10.0, 10.0, 20.0] | | | | FALSE |
| dbound | list | The grid range for depth. | [1.0, 60.0, 0.5] | | | | FALSE |
| downsample | int | The downsampling ratio. | 2 | | | | FALSE |
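
Each bound is interpreted as a [lower, upper, step] triplet in meters (the LSS convention), so xbound: [-54.0, 54.0, 0.3] spans 108 m at 0.3 m resolution (360 BEV cells), and dbound: [1.0, 60.0, 0.5] yields 118 depth bins. A sketch restating the defaults above:

model:
  view_transform:
    type: DepthLSSTransform
    in_channels: 256
    out_channels: 80
    image_size: [256, 704]
    feature_size: [32, 88]
    xbound: [-54.0, 54.0, 0.3]    # 360 cells along x
    ybound: [-54.0, 54.0, 0.3]    # 360 cells along y
    zbound: [-10.0, 10.0, 20.0]   # a single z bin
    dbound: [1.0, 60.0, 0.5]      # 118 depth bins
    downsample: 2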

Lidar Backbone Config

The backbone configuration (pts_backbone) defines the lidar backbone structure. A detailed description is included in the table below. Currently, BEVFusion supports only the SECOND lidar backbone.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | The lidar backbone name. | SECOND | | | | FALSE |
| in_channels | int | The number of input channels for the lidar backbone. | 256 | | | | FALSE |
| out_channels | list | The number of output channels for the lidar backbone. | [128, 256] | | | | FALSE |
| layer_nums | list | The number of layers in each stage of the lidar backbone. | [5, 5] | | | | FALSE |
| layer_strides | list | The stride of each stage of the lidar backbone. | [1, 2] | | | | FALSE |
| norm_cfg | collection | The normalization configuration for the lidar backbone. | {'type': 'BN', 'eps': 0.001, 'momentum': 0.01} | | | | FALSE |
| conv_cfg | collection | The convolution layer configuration for the lidar backbone. | {'type': 'Conv2d', 'bias': False} | | | | FALSE |

Lidar Encoder Config

The encoder configuration (pts_middle_encoder) defines the lidar encoder structure. A detailed description is included in the table below. Currently, BEVFusion supports only the BEVFusionSparseEncoder structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | The lidar encoder name. | BEVFusionSparseEncoder | | | | FALSE |
| in_channels | int | The number of input channels for the lidar encoder. | 4 | | | | FALSE |
| sparse_shape | list | The sparse shape of the input tensor. | [1440, 1440, 41] | | | | FALSE |
| order | list | The order of the conv module. | ['conv', 'norm', 'act'] | | | | FALSE |
| norm_cfg | collection | The normalization configuration for the lidar encoder. | {'type': 'BN1d', 'eps': 0.001, 'momentum': 0.01} | | | | FALSE |
| encoder_channels | | | | | | | |
| encoder_paddings | | | | | | | |
| block_type | string | Type of the block to use. | basicblock | | | | FALSE |
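
In the sample spec, sparse_shape mirrors model.grid_size and in_channels mirrors dataset.point_cloud_dim; a minimal sketch that keeps them in sync:

model:
  grid_size: [1440, 1440, 41]
  pts_middle_encoder:
    type: BEVFusionSparseEncoder
    in_channels: 4                   # matches dataset.point_cloud_dim
    sparse_shape: [1440, 1440, 41]   # keep in sync with model.grid_size
    order: ['conv', 'norm', 'act']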

Lidar Neck Config

The configuration (pts_neck) defines the lidar neck structure. A detailed description is included in the table below. Currently, BEVFusion supports only the SECONDFPN structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | The lidar neck name. | SECONDFPN | | | | FALSE |
| in_channels | list | The number of input channels for the lidar neck. | [128, 256] | | | | FALSE |
| out_channels | list | The number of output channels for the lidar neck. | [256, 256] | | | | FALSE |
| upsample_strides | list | The strides used to upsample the feature map for the lidar neck. | [1, 2] | | | | FALSE |
| norm_cfg | collection | The normalization configuration for the lidar neck. | {'type': 'BN', 'eps': 0.001, 'momentum': 0.01} | | | | FALSE |
| upsample_cfg | collection | The upsample layer configuration for the lidar neck. | {'type': 'deconv', 'bias': False} | | | | FALSE |
| use_conv_for_no_stride | bool | Whether to use conv when the stride is 1. | True | | | | FALSE |

Fusion Layer Config

The configuration (fusion_layer) defines the fusion layer structure. A detailed description is included in the table below. Currently, BEVFusion supports only the ConvFuser structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | The fusion layer name. | ConvFuser | | | | FALSE |
| in_channels | list | The number of input channels for the fusion layer. | [80, 256] | | | | FALSE |
| out_channels | int | The number of output channels for the fusion layer. | 256 | | | | FALSE |
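
ConvFuser concatenates the camera and lidar BEV features before fusing them with a convolution, so in_channels lists the per-modality channel counts: 80 matches view_transform.out_channels, and 256 is the lidar BEV feature width. A sketch with the defaults above:

model:
  fusion_layer:
    type: ConvFuser
    in_channels: [80, 256]   # [camera BEV channels, lidar BEV channels]
    out_channels: 256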

BBoxHead Config

The configuration (bbox_head) defines the bbox prediction head structure. A detailed description is included in the table below. Currently, BEVFusion supports only the BEVFusionHead structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Prediction head name. | BEVFusionHead | | | BEVFusionHead | FALSE |
| num_proposals | int | Number of proposals. | 200 | | | | FALSE |
| auxiliary | bool | Whether to enable auxiliary training. | True | | | | FALSE |
| in_channels | int | Number of channels in the input feature map. | 512 | | | | FALSE |
| hidden_channel | int | Number of hidden channels. | 128 | | | | FALSE |
| num_classes | int | Number of classes. | 1 | | | | FALSE |
| nms_kernel_size | int | NMS kernel size. | 3 | | | | FALSE |
| bn_momentum | float | Batch norm momentum. | 0.1 | | | | FALSE |
| num_decoder_layers | int | Number of decoder layers. | 1 | | | | FALSE |
| out_size_factor | int | Output size factor. | 8 | | | | FALSE |
| bbox_coder | collection | The configuration for the bounding box coder. | | | | | FALSE |
| decoder_layer | collection | The configuration for the decoder layer. | | | | | FALSE |
| code_weights | list | Weights for the box encoder. | [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] | | | | FALSE |
| nms_type | string | The type of NMS. | | | | | FALSE |
| assigner | collection | The configuration for the assigner. | {'type': 'HungarianAssigner3D', 'iou_calculator': {'type': 'BboxOverlaps3D', 'coordinate': 'lidar'}, 'cls_cost': {'type': 'mmdet.FocalLossCost', 'gamma': 2.0, 'alpha': 0.25, 'weight': 0.15}, 'reg_cost': {'type': 'BBoxBEVL1Cost', 'weight': 0.25}, 'iou_cost': {'type': 'IoU3DCost', 'weight': 0.25}} | | | | FALSE |
| common_heads | collection | The configuration for the common heads. | {'center': [2, 2], 'height': [1, 2], 'dim': [3, 2], 'rot': [6, 2]} | | | | FALSE |
| loss_cls | collection | The configuration for the classification loss. | {'type': 'mmdet.FocalLoss', 'use_sigmoid': True, 'gamma': 2.0, 'alpha': 0.25, 'reduction': 'mean', 'loss_weight': 1.0} | | | | FALSE |
| loss_heatmap | collection | The configuration for the heatmap loss. | {'type': 'mmdet.GaussianFocalLoss', 'reduction': 'mean', 'loss_weight': 1.0} | | | | FALSE |
| loss_bbox | collection | The configuration for the bounding box loss. | {'type': 'mmdet.L1Loss', 'reduction': 'mean', 'loss_weight': 0.25} | | | | FALSE |

Train Config

The train configuration defines the hyperparameters of the training process.


train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 10
  validation_interval: 10
  num_epochs: 50
  optimizer:
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the train job. | 1 | 1 | | | FALSE |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must be equal to train.num_gpus. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the training on. If > 1, multi-node training is enabled. | 1 | | | | FALSE |
| seed | int | The seed for the initializer in PyTorch. If < 0, the fixed seed is disabled. | 1234 | -1 | inf | | FALSE |
| cudnn | collection | | | | | | FALSE |
| num_epochs | int | Number of epochs to run the training. | 10 | 1 | inf | | TRUE |
| checkpoint_interval | int | The interval (in epochs) at which a checkpoint is saved. Helps resume training. | 1 | 1 | | | FALSE |
| validation_interval | int | The interval (in epochs) at which an evaluation is triggered on the validation dataset. | 1 | 1 | | | FALSE |
| resume_training_checkpoint_path | string | Path to the checkpoint to resume training from. | | | | | FALSE |
| results_dir | string | Path to where all the assets generated from a task are stored. | | | | | FALSE |
| by_epoch | bool | Whether EpochBasedRunner is used. | True | | | | FALSE |
| logging_interval | int | The logging interval, in iterations. | 1 | | | | FALSE |
| resume | bool | Whether to resume the training. | False | | | | FALSE |
| pretrained_checkpoint | string | Path to a pre-trained BEVFusion model to initialize the current training from. | | | | | FALSE |
| optimizer | collection | Hyperparameters to configure the optimizer. | | | | | FALSE |
| lr_scheduler | list | Hyperparameters to configure the learning rate scheduler. | [{'type': 'LinearLR', 'start_factor': 0.33333333, 'by_epoch': False, 'begin': 0, 'end': 500}, {'type': 'CosineAnnealingLR', 'T_max': 10, 'eta_min_ratio': 0.0001, 'begin': 0, 'end': 10, 'by_epoch': True}, {'type': 'CosineAnnealingMomentum', 'eta_min': 0.8947, 'begin': 0, 'end': 2.4, 'by_epoch': True}, {'type': 'CosineAnnealingMomentum', 'eta_min': 1, 'begin': 2.4, 'end': 10, 'by_epoch': True}] | | | | FALSE |

Optimizer config

The optimizer parameter defines the optimizer config used in training, including the learning rate, learning rate scheduler, and weight decay.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| type | string | Type of optimizer used to train the network. | AdamW | | | | FALSE |
| lr | float | The initial learning rate for training the model. | 0.0002 | | | | FALSE |
| weight_decay | float | The weight decay coefficient. | 0.01 | | | | FALSE |
| betas | list | The moving-average parameters for the adaptive learning rate. | [0.9, 0.999] | | | | FALSE |
| clip_grad | collection | Clips the gradient norm of an iterable of parameters. | {'max_norm': 35, 'norm_type': 2} | | | | FALSE |
| wrapper_type | string | The optimizer wrapper in MMEngine. AmpOptimWrapper enables mixed-precision training. | OptimWrapper | | | | FALSE |
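
A sketch of a fuller optimizer block, assuming the defaults above (the AmpOptimWrapper line is an optional override for mixed-precision training):

train:
  optimizer:
    type: AdamW
    lr: 0.0002
    weight_decay: 0.01
    betas: [0.9, 0.999]
    clip_grad:
      max_norm: 35    # clip gradients to this norm
      norm_type: 2    # L2 norm
    wrapper_type: AmpOptimWrapper   # optional: enable mixed-precision training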

Evaluation Config

The evaluate parameter defines the hyperparameters of the evaluation process.


evaluate:
  checkpoint: /path/to/model.pth
  num_gpus: 1

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the evaluation job. | 1 | | | | FALSE |
| gpu_ids | list | List of GPU IDs to run the evaluation on. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the evaluation on. | 1 | | | | FALSE |
| checkpoint | string | Path to the model checkpoint to evaluate. | ??? | | | | FALSE |
| results_dir | string | Path to where the evaluation results are stored. | | | | | FALSE |

Inference Config

The inference parameter defines the hyperparameters of the inference process.


inference:
  checkpoint: /path/to/model.pth
  num_gpus: 1

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
|---|---|---|---|---|---|---|---|
| num_gpus | int | The number of GPUs to run the inference job. | 1 | | | | FALSE |
| gpu_ids | list | List of GPU IDs to run the inference on. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the inference on. | 1 | | | | FALSE |
| checkpoint | string | Path to the model checkpoint for inference. | ??? | | | | FALSE |
| results_dir | string | Path to where the inference results are stored. | | | | | FALSE |
| conf_threshold | float | Confidence threshold | 0.5 | | | | FALSE |
| show | bool | Whether to show the 3D visualization on screen | False | | | | FALSE |

Training the Model

To train a BEVFusion model, use this command:


tao model bevfusion train [-h] -e <experiment_spec> [-r <results_dir>]

Required Arguments

  • -e, --experiment_spec: The experiment specification file to set up the training experiment

Optional Arguments

  • -r, --results_dir: The path to the folder where the experiment outputs should be written. If this argument is not specified, the results_dir from the spec file is used.

  • --gpus: The number of GPUs used to run training

  • --num_nodes: The number of nodes used to run training. If this value is larger than 1, distributed multi-node training is enabled.

  • -h, --help: Show this help message and exit.

Sample Usage

Here’s an example of the train command:


tao model bevfusion train -e /path/to/spec.yaml

Evaluating the Model

To run evaluation with a BEVFusion model, use this command:


tao model bevfusion evaluate [-h] -e <experiment_spec> [-r <results_dir>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the evaluation experiment

Optional Arguments

  • -r, --results_dir: The directory where the evaluation result is stored

Sample Usage

Here’s an example of using the evaluate command:


tao model bevfusion evaluate -e /path/to/spec.yaml -r /path/to/results/ evaluate.checkpoint=/path/to/model.pth

Running Inference

Use the following command to run inference on BEVFusion with a .pth model:


tao model bevfusion inference [-h] -e <experiment spec file> [-r <results_dir>]

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the inference experiment

Optional Arguments

  • -r, --results_dir: The directory where the inference result is stored

Sample Usage

Here’s an example of using the inference command:


tao model bevfusion inference -e /path/to/spec.yaml -r /path/to/results/ inference.checkpoint=/path/to/model.pth
