The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Dataset Format#

The dataset for BEVFusion contains point cloud data, RGB images, and the corresponding annotations of 3D objects. The directory structure should follow the KITTI directory structure.

    /kitti
        /training
            /calib
                000000.txt
                000001.txt
                ...
                N.txt
            /image_2
                000000.png
                000001.png
                ...
                N.png
            /label_2
                000000.txt
                000001.txt
                ...
                N.txt
            /velodyne
                000000.bin
                000001.bin
                ...
                N.bin
        /ImageSets
            train.txt
            val.txt
            test.txt

Each .bin file should comply with the format described above. Each .txt label file should comply with the KITTI format.
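The layout above implies that every frame ID has one file in each of the four training sub-directories. The following is a minimal sketch of a consistency check under that assumption; the function name `find_incomplete_frames` and the `EXPECTED` mapping are hypothetical helpers for illustration and are not part of TAO.

```python
from pathlib import Path

# Sub-directory name -> expected file extension, taken from the layout above.
EXPECTED = {"calib": ".txt", "image_2": ".png", "label_2": ".txt", "velodyne": ".bin"}


def find_incomplete_frames(training_dir):
    """Return {frame_id: [sub-directories missing that frame]}.

    Hypothetical helper: it only compares file names and does not
    validate the content of any file.
    """
    training_dir = Path(training_dir)
    ids_per_dir = {
        name: {p.stem for p in (training_dir / name).glob(f"*{ext}")}
        for name, ext in EXPECTED.items()
    }
    all_ids = set().union(*ids_per_dir.values())
    report = {}
    for frame_id in sorted(all_ids):
        missing = [name for name, ids in ids_per_dir.items() if frame_id not in ids]
        if missing:
            report[frame_id] = missing
    return report
```

Calling `find_incomplete_frames("/kitti/training")` returns an empty dictionary when every frame is complete; any other result lists the frames that need attention before training.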

Creating a Configuration File#

Below is a sample BEVFusion spec file. It has five components (model, inference, evaluate, dataset, and train) as well as several global parameters, which are described below. The spec file is in YAML format.

TAO Client (v2 API)

Use the following command to get an experiment spec file for BEVFusion:

    BASE_EXPERIMENT_ID=$(tao bevfusion list-base-experiments | jq -r '.[0].id')
    SPECS=$(tao bevfusion get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

TAO Launcher

    results_dir: /results/bevfusion
    dataset:
      type: KittiPersonDataset
      root_dir: /data/
      gt_box_type: camera
      default_cam_key: CAM2
      train_dataset:
        repeat_time: 2
        ann_file: /data/kitti_person_infos_train.pkl
        data_prefix:
          pts: training/velodyne_reduced
          img: training/image_2
        batch_size: 4
        num_workers: 8
      val_dataset:
        ann_file: /data/kitti_person_infos_val.pkl
        data_prefix:
          pts: training/velodyne_reduced
          img: training/image_2
        batch_size: 2
        num_workers: 4
      test_dataset:
        ann_file: /data/kitti_person_infos_val.pkl
        data_prefix:
          pts: training/velodyne_reduced
          img: training/image_2
        batch_size: 4
        num_workers: 4
    model:
      type: BEVFusion
      point_cloud_range: [0, -40, -3, 70.4, 40, 1]
      voxel_size: [0.05, 0.05, 0.1]
      grid_size: [1440, 1440, 41]
    train:
      num_gpus: 1
      num_nodes: 1
      validation_interval: 1
      num_epochs: 5
      optimizer:
        type: AdamW
        lr: 0.0002
      lr_scheduler:
        - type: LinearLR
          start_factor: 0.33333333
          by_epoch: False
          begin: 0
          end: 500
        - type: CosineAnnealingLR
          T_max: 10
          begin: 0
          end: 10
          by_epoch: True
          eta_min_ratio: 1e-4
        - type: CosineAnnealingMomentum
          eta_min: 0.8947
          begin: 0
          end: 2.4
          by_epoch: True
        - type: CosineAnnealingMomentum
          eta_min: 1
          begin: 2.4
          end: 10
          by_epoch: True
    inference:
      num_gpus: 1
      conf_threshold: 0.3
      checkpoint: /results/train/bevfusion_model.pth
    evaluate:
      num_gpus: 1
      checkpoint: /results/train/bevfusion_model.pth

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| results_dir | string | | /results | | | | FALSE |
| default_scope | string | Default scope to use mmdet3d | mmdet3d | | | | FALSE |
| default_hooks | collection | Default hooks for mmlabs | {'timer': {'type': 'IterTimerHook'}, 'logger': {'type': 'LoggerHook', 'interval': 1, 'log_metric_by_epoch': True}, 'param_scheduler': {'type': 'ParamSchedulerHook'}, 'checkpoint': {'type': 'CheckpointHook', 'by_epoch': True, 'interval': 1}, 'sampler_seed': {'type': 'DistSamplerSeedHook'}, 'visualization': {'type': 'Det3DVisualizationHook'}} | | | | FALSE |
| logger_hook | string | Default logger hook type | TAOBEVFusionLoggerHook | | | | FALSE |
| manual_seed | int | Optional manual seed. The seed is set when a value is given in the spec file. | | | | | FALSE |
| input_modality | collection | Input modality for the model. Set True for each modality to use. | {'use_lidar': True, 'use_camera': True, 'use_radar': False, 'use_map': False, 'use_external': False} | | | | FALSE |
| model | collection | Configurable parameters to construct the model for a BEVFusion experiment. | | | | | FALSE |
| dataset | collection | Configurable parameters to construct the dataset for a BEVFusion experiment. | | | | | FALSE |
| train | collection | Configurable parameters to construct the trainer for a BEVFusion experiment. | | | | | FALSE |
| evaluate | collection | Configurable parameters to construct the evaluator for a BEVFusion experiment. | | | | | FALSE |
| inference | collection | Configurable parameters to construct the inferencer for a BEVFusion experiment. | | | | | FALSE |

Data Preprocessor Config#

The data preprocessor configuration (data_preprocessor) defines the input pre-processing hyperparameters.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Name of Data Pre-processor for 3D Fusion | Det3DDataPreprocessor | | | | FALSE |
| mean | list | The input mean for RGB frames | [123.675, 116.28, 103.53] | | | | FALSE |
| std | list | The input standard deviation per pixel for RGB frames | [58.395, 57.12, 57.375] | | | | FALSE |
| bgr_to_rgb | bool | Whether to convert the image from BGR to RGB. | 32 | | | | FALSE |
| pad_size_divisor | int | The padded image size must be divisible by this value. | 32 | | | | FALSE |
| voxelize_cfg | collection | | {'max_num_points': 10, 'max_voxels': [120000, 160000], 'voxelize_reduce': True} | | | | FALSE |

Dataset Config#

The dataset configuration (dataset) defines the dataset directories, annotation file, and batch size for each of train, val, and test.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Dataset types for 3D Fusion | KittiPersonDataset | | | TAO3DSyntheticDataset,TAO3DDataset,KittiPersonDataset | FALSE |
| root_dir | string | A path to the root directory of the given dataset | /data/ | | | | FALSE |
| classes | list | A list of the classes to be trained. | ['person'] | | | | FALSE |
| box_type_3d | string | 3D bounding box type to be used when training. | lidar | | | lidar,camera | FALSE |
| gt_box_type | string | 3D bounding box type in ground truth. | camera | | | lidar,camera | FALSE |
| origin | list | The origin of the given center point in ground truth 3D bounding boxes. | [0.5, 1.0, 0.5] | | | | FALSE |
| default_cam_key | string | Default camera name in dataset | CAM0 | | | | FALSE |
| per_sequence | bool | Whether to save results in per sequence format. | False | | | | FALSE |
| num_views | int | Number of camera views in dataset. | 1 | | | | FALSE |
| point_cloud_dim | int | Input lidar point cloud data dimension | 4 | | | | FALSE |
| train_dataset | collection | Configurable parameters to construct the train dataset. | | | | | FALSE |
| val_dataset | collection | Configurable parameters to construct the validation dataset. | | | | | FALSE |
| test_dataset | collection | Configurable parameters to construct the test dataset. | | | | | FALSE |
| img_file | string | Image file for single file inference | | | | | FALSE |
| pc_file | string | Point cloud file for single file inference | | | | | FALSE |
| cam2img | list | Camera intrinsic matrix for single file inference | | | | | FALSE |
| lidar2cam | list | Lidar to camera extrinsic matrix for single file inference | | | | | FALSE |

Model Config#

The model configuration (model) defines the BEVFusion model structure. This model is used for training, evaluation, and inference. A detailed description is included in the table below.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Model name | BEVFusion | | | BEVFusion | FALSE |
| point_cloud_range | list | Point cloud range | [0, -40, -3, 70.4, 40, 1] | | | | FALSE |
| voxel_size | list | Voxel size in voxelization | [0.05, 0.05, 0.1] | | | | FALSE |
| post_center_range | list | Post-processing center filter range | [-61.2, -61.2, -20.0, 61.2, 61.2, 20.0] | | | | FALSE |
| grid_size | list | Grid size for the BEVFusion model | [1440, 1440, 41] | | | | FALSE |
| data_preprocessor | collection | Configurable parameters to construct the preprocessor for the BEVFusion model. | | | | | FALSE |
| img_backbone | collection | Configurable parameters to construct the camera image backbone for the BEVFusion model. | | | | | FALSE |
| img_neck | collection | Configurable parameters to construct the camera image neck for the BEVFusion model. | | | | | FALSE |
| view_transform | collection | Configurable parameters to construct the camera view transform for the BEVFusion model. | | | | | FALSE |
| pts_backbone | collection | Configurable parameters to construct the lidar point cloud backbone for the BEVFusion model. | | | | | FALSE |
| pts_voxel_encoder | collection | Configurable parameters to construct the lidar point cloud voxel encoder for the BEVFusion model. | {'type': 'HardSimpleVFE', 'num_features': 4} | | | | FALSE |
| pts_middle_encoder | collection | Configurable parameters to construct the lidar encoder for the BEVFusion model. | | | | | FALSE |
| pts_neck | collection | Configurable parameters to construct the lidar neck for the BEVFusion model. | | | | | FALSE |
| fusion_layer | collection | Configurable parameters to construct the fusion layer for the BEVFusion model. | | | | | FALSE |
| bbox_head | collection | Configurable parameters to construct the bounding box head for the BEVFusion model. | | | | | FALSE |

Image Backbone Config#

The backbone configuration (img_backbone) defines the image backbone structure. A detailed description is included in the table below. Currently, BEVFusion supports only the Swin Transformer and ResNet50 image backbones.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Name of Image Backbone for 3D Fusion | mmdet.SwinTransformer | | | | FALSE |
| embed_dims | int | Number of input channels. | 96 | | | | FALSE |
| depths | list | Depths of each Swin Transformer stage. | [2, 2, 6, 2] | | | | FALSE |
| num_heads | list | Number of attention heads of each stage. | [3, 6, 12, 24] | | | | FALSE |
| window_size | int | Window size for Swin Transformer. | 7 | | | | FALSE |
| mlp_ratio | int | Ratio of MLP hidden dim to embedding dim. | 4 | | | | FALSE |
| qkv_bias | bool | If True, add a learnable bias to query, key, value. | True | | | | FALSE |
| qk_scale | string | Override default qk scale of head_dim ** -0.5 if set. | | | | | FALSE |
| drop_rate | float | Dropout rate. | 0.0 | | | | FALSE |
| attn_drop_rate | float | Attention dropout rate. | 0.0 | | | | FALSE |
| drop_path_rate | float | Stochastic drop rate | 0.2 | | | | FALSE |
| patch_norm | bool | If True, add normalization after patch embedding. | True | | | | FALSE |
| out_indices | list | Output from which stages. | [1, 2, 3] | | | | FALSE |
| with_cp | bool | Use checkpoint or not. Using checkpoint saves some memory while slowing down training. | False | | | | FALSE |
| convert_weights | bool | Flag indicating whether the pre-trained model is from the original repo. | True | | | | FALSE |
| init_cfg | collection | Configuration for initialization. | | | | | FALSE |

Image Neck Config#

The neck configuration (img_neck) defines the image neck structure. A detailed description is included in the table below. Currently, BEVFusion supports only the GeneralizedLSSFPN image neck.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Image Neck Name | GeneralizedLSSFPN | | | | FALSE |
| in_channels | list | The number of input channels for image neck. | [192, 384, 768] | | | | FALSE |
| out_channels | int | The number of output channels for image neck. | 256 | | | | FALSE |
| start_level | int | Starting level for image neck. | 0 | | | | FALSE |
| num_outs | int | The number of outputs for image neck. | 0 | | | | FALSE |
| norm_cfg | collection | The configuration of normalization for image neck. | {'type': 'BN2d', 'requires_grad': True} | | | | FALSE |
| act_cfg | collection | The configuration of activation for image neck. | {'type': 'ReLU', 'inplace': True} | | | | FALSE |
| upsample_cfg | collection | The configuration of upsampling for image neck. | {'mode': 'bilinear', 'align_corners': False} | | | | FALSE |

View Transform Config#

The configuration (view_transform) defines the view transform structure for camera input. A detailed description is included in the table below. Currently, BEVFusion supports only the DepthLSSTransform and LSSTransform view transforms.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Image view transform name. | DepthLSSTransform | | | DepthLSSTransform,LSSTransform | FALSE |
| in_channels | int | The number of input channels for view transform. | 256 | | | | FALSE |
| out_channels | int | The number of output channels for view transform. | 80 | | | | FALSE |
| image_size | list | Image size for view transform. | [256, 704] | | | | FALSE |
| feature_size | list | Feature size for view transform. | [32, 88] | | | | FALSE |
| xbound | list | The grid range for x-axis. | [-54.0, 54.0, 0.3] | | | | FALSE |
| ybound | list | The grid range for y-axis. | [-54.0, 54.0, 0.3] | | | | FALSE |
| zbound | list | The grid range for z-axis. | [-10.0, 10.0, 20.0] | | | | FALSE |
| dbound | list | The grid range for depth. | [1.0, 60.0, 0.5] | | | | FALSE |
| downsample | int | The ratio for downsampling. | 2 | | | | FALSE |

Lidar Backbone Config#

The backbone configuration (pts_backbone) defines the lidar backbone structure. A detailed description is included in the table below. Currently, BEVFusion supports only the SECOND lidar backbone.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | The lidar backbone name. | SECOND | | | | FALSE |
| in_channels | int | The number of input channels for lidar backbone. | 256 | | | | FALSE |
| out_channels | list | The number of output channels for lidar backbone. | [128, 256] | | | | FALSE |
| layer_nums | list | The number of layers in each stage for lidar backbone. | [5, 5] | | | | FALSE |
| layer_strides | list | The stride of each stage for lidar backbone. | [1, 2] | | | | FALSE |
| norm_cfg | collection | The configuration of normalization for lidar backbone. | {'type': 'BN', 'eps': 0.001, 'momentum': 0.01} | | | | FALSE |
| conv_cfg | collection | The configuration of convolution layers for lidar backbone. | {'type': 'Conv2d', 'bias': False} | | | | FALSE |

Lidar Encoder Config#

The encoder configuration (pts_middle_encoder) defines the lidar encoder structure. A detailed description is included in the table below. Currently, BEVFusion supports only the BEVFusionSparseEncoder structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | The lidar encoder name. | BEVFusionSparseEncoder | | | | FALSE |
| in_channels | int | The number of input channels for lidar encoder. | 4 | | | | FALSE |
| sparse_shape | list | The sparse shape of input tensor. | [1440, 1440, 41] | | | | FALSE |
| order | list | Order of conv module. | ['conv', 'norm', 'act'] | | | | FALSE |
| norm_cfg | collection | The configuration of normalization for lidar encoder. | {'type': 'BN1d', 'eps': 0.001, 'momentum': 0.01} | | | | FALSE |
| encoder_channels | | | | | | | |
| encoder_paddings | | | | | | | |
| block_type | string | Type of the block to use. | basicblock | | | | FALSE |

Lidar Neck Config#

The configuration (pts_neck) defines the lidar neck structure. A detailed description is included in the table below. Currently, BEVFusion supports only the SECONDFPN structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | The lidar neck name. | SECONDFPN | | | | FALSE |
| in_channels | list | The number of input channels for lidar neck. | [128, 256] | | | | FALSE |
| out_channels | list | The number of output channels for lidar neck. | [256, 256] | | | | FALSE |
| upsample_strides | list | Strides used to upsample the feature map for lidar neck. | [1, 2] | | | | FALSE |
| norm_cfg | collection | The configuration of normalization for lidar neck. | {'type': 'BN', 'eps': 0.001, 'momentum': 0.01} | | | | FALSE |
| upsample_cfg | collection | The configuration of upsample layers for lidar neck. | {'type': 'deconv', 'bias': False} | | | | FALSE |
| use_conv_for_no_stride | bool | Whether to use conv when stride is 1. | True | | | | FALSE |

Fusion Layer Config#

The configuration (fusion_layer) defines the fusion layer structure. A detailed description is included in the table below. Currently, BEVFusion supports only the ConvFuser structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | The fusion layer name. | ConvFuser | | | | FALSE |
| in_channels | list | The number of input channels for fusion layer. | [80, 256] | | | | FALSE |
| out_channels | int | The number of output channels for fusion layer. | 256 | | | | FALSE |

BBoxHead Config#

The configuration (bbox_head) defines the bounding box prediction head structure. A detailed description is included in the table below. Currently, BEVFusion supports only the BEVFusionHead structure.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Prediction head name. | BEVFusionHead | | | BEVFusionHead | FALSE |
| num_proposals | int | Number of proposals. | 200 | | | | FALSE |
| auxiliary | bool | Whether to enable auxiliary training. | True | | | | FALSE |
| in_channels | int | Number of channels in the input feature map. | 512 | | | | FALSE |
| hidden_channel | int | Number of hidden channels. | 128 | | | | FALSE |
| num_classes | int | Number of classes. | 1 | | | | FALSE |
| nms_kernel_size | int | NMS kernel size. | 3 | | | | FALSE |
| bn_momentum | float | Batch Norm momentum. | 0.1 | | | | FALSE |
| num_decoder_layers | int | Number of decoder layers. | 1 | | | | FALSE |
| out_size_factor | int | Output size factor. | 8 | | | | FALSE |
| bbox_coder | collection | The configuration for bounding box encoder. | | | | | FALSE |
| decoder_layer | collection | The configuration for decoder layer. | | | | | FALSE |
| code_weights | list | Weights for box encoder. | [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] | | | | FALSE |
| nms_type | string | The type of NMS. | | | | | FALSE |
| assigner | collection | The configuration for assigner. | {'type': 'HungarianAssigner3D', 'iou_calculator': {'type': 'BboxOverlaps3D', 'coordinate': 'lidar'}, 'cls_cost': {'type': 'mmdet.FocalLossCost', 'gamma': 2.0, 'alpha': 0.25, 'weight': 0.15}, 'reg_cost': {'type': 'BBoxBEVL1Cost', 'weight': 0.25}, 'iou_cost': {'type': 'IoU3DCost', 'weight': 0.25}} | | | | FALSE |
| common_heads | collection | The configuration for common heads. | {'center': [2, 2], 'height': [1, 2], 'dim': [3, 2], 'rot': [6, 2]} | | | | FALSE |
| loss_cls | collection | The configuration for classification loss. | {'type': 'mmdet.FocalLoss', 'use_sigmoid': True, 'gamma': 2.0, 'alpha': 0.25, 'reduction': 'mean', 'loss_weight': 1.0} | | | | FALSE |
| loss_heatmap | collection | The configuration for heatmap loss. | {'type': 'mmdet.GaussianFocalLoss', 'reduction': 'mean', 'loss_weight': 1.0} | | | | FALSE |
| loss_bbox | collection | The configuration for bounding box loss. | {'type': 'mmdet.L1Loss', 'reduction': 'mean', 'loss_weight': 0.25} | | | | FALSE |

Train Config#

The train configuration defines the hyperparameters of the training process.

TAO Client (v2 API)

    BASE_EXPERIMENT_ID=$(tao bevfusion list-base-experiments | jq -r '.[0].id')
    SPECS=$(tao bevfusion get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

TAO Launcher

    train:
      precision: 'fp16'
      num_gpus: 1
      checkpoint_interval: 10
      validation_interval: 10
      num_epochs: 50
      optimizer:
        type: "AdamW"
        lr: 0.0001
        weight_decay: 0.05

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| num_gpus | int | The number of GPUs to run the train job. | 1 | 1 | | | FALSE |
| gpu_ids | list | List of GPU IDs to run the training on. The length of this list must be equal to the number of GPUs in train.num_gpus. | [0] | | | | FALSE |
| num_nodes | int | Number of nodes to run the training on. If > 1, then multi-node is enabled. | 1 | | | | FALSE |
| seed | int | The seed for the initializer in PyTorch. If < 0, disable fixed seed. | 1234 | -1 | inf | | FALSE |
| cudnn | collection | | | | | | FALSE |
| num_epochs | int | Number of epochs to run the training. | 10 | 1 | inf | | TRUE |
| checkpoint_interval | int | The interval (in epochs) at which a checkpoint will be saved. Helps resume training. | 1 | 1 | | | FALSE |
| validation_interval | int | The interval (in epochs) at which an evaluation will be triggered on the validation dataset. | 1 | 1 | | | FALSE |
| resume_training_checkpoint_path | string | Path to the checkpoint to resume training from. | | | | | FALSE |
| results_dir | string | Path to where all the assets generated from a task are stored. | | | | | FALSE |
| by_epoch | bool | Whether EpochBasedRunner is used. | True | | | | FALSE |
| logging_interval | int | Logging interval every k iterations. | 1 | | | | FALSE |
| resume | bool | Whether to resume the training or not. | False | | | | FALSE |
| pretrained_checkpoint | string | Path to a pre-trained BEVFusion model to initialize the current training from. | | | | | FALSE |
| optimizer | collection | Hyperparameters to configure the optimizer | | | | | FALSE |
| lr_scheduler | list | Hyperparameters to configure the learning rate scheduler. | [{'type': 'LinearLR', 'start_factor': 0.33333333, 'by_epoch': False, 'begin': 0, 'end': 500}, {'type': 'CosineAnnealingLR', 'T_max': 10, 'eta_min_ratio': 0.0001, 'begin': 0, 'end': 10, 'by_epoch': True}, {'type': 'CosineAnnealingMomentum', 'eta_min': 0.8947, 'begin': 0, 'end': 2.4, 'by_epoch': True}, {'type': 'CosineAnnealingMomentum', 'eta_min': 1, 'begin': 2.4, 'end': 10, 'by_epoch': True}] | | | | FALSE |

Optimizer Config#

The optimizer parameter defines the config for the optimizer in training, including the learning rate and weight decay.

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| type | string | Type of optimizer used to train the network. | AdamW | | | | FALSE |
| lr | float | The initial learning rate for training the model. | 0.0002 | | | | FALSE |
| weight_decay | float | The weight decay coefficient. | 0.01 | | | | FALSE |
| betas | list | The moving average parameter for adaptive learning rate. | [0.9, 0.999] | | | | FALSE |
| clip_grad | collection | Clip the gradient norm of an iterable of parameters. | {'max_norm': 35, 'norm_type': 2} | | | | FALSE |
| wrapper_type | string | Optimizer wrapper in MMEngine. Use AmpOptimWrapper to enable mixed precision training. | OptimWrapper | | | | FALSE |

Evaluation Config#

The evaluate parameter defines the hyperparameters of the evaluation process.

TAO Client (v2 API)

    BASE_EXPERIMENT_ID=$(tao bevfusion list-base-experiments | jq -r '.[0].id')
    SPECS=$(tao bevfusion get-job-schema --action evaluate --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

TAO Launcher

    evaluate:
      checkpoint: /path/to/model.pth
      num_gpus: 1

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| num_gpus | int | | 1 | | | | FALSE |
| gpu_ids | list | | [0] | | | | FALSE |
| num_nodes | int | | 1 | | | | FALSE |
| checkpoint | string | | ??? | | | | FALSE |
| results_dir | string | | | | | | FALSE |

Inference Config#

The inference parameter defines the hyperparameters of the inference process.

TAO Client (v2 API)

    BASE_EXPERIMENT_ID=$(tao bevfusion list-base-experiments | jq -r '.[0].id')
    SPECS=$(tao bevfusion get-job-schema --action inference --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

TAO Launcher

    inference:
      checkpoint: /path/to/model.pth
      num_gpus: 1

| Field | value_type | description | default_value | valid_min | valid_max | valid_options | automl_enabled |
| --- | --- | --- | --- | --- | --- | --- | --- |
| num_gpus | int | | 1 | | | | FALSE |
| gpu_ids | list | | [0] | | | | FALSE |
| num_nodes | int | | 1 | | | | FALSE |
| checkpoint | string | | ??? | | | | FALSE |
| results_dir | string | | | | | | FALSE |
| conf_threshold | float | Confidence threshold | 0.5 | | | | FALSE |
| show | bool | Whether to show the 3D visualization on screen | False | | | | FALSE |
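The conf_threshold field sets the confidence cut-off applied to predictions at inference time. As a rough illustration of its effect, the sketch below filters a list of detections by score. The detection records, the `filter_by_confidence` helper, and the choice of an inclusive comparison are assumptions made for this example; they do not describe how TAO implements the filter internally.

```python
def filter_by_confidence(detections, conf_threshold=0.5):
    """Keep detections whose score is at least conf_threshold.

    The default of 0.5 mirrors the conf_threshold default in the table;
    the inclusive comparison (>=) is an assumption of this sketch.
    """
    return [det for det in detections if det["score"] >= conf_threshold]


# Hypothetical detections, for illustration only.
detections = [
    {"label": "person", "score": 0.92},
    {"label": "person", "score": 0.41},
    {"label": "person", "score": 0.18},
]

kept_default = filter_by_confidence(detections)                     # threshold 0.5
kept_sample = filter_by_confidence(detections, conf_threshold=0.3)  # value used in the sample spec
print(len(kept_default), len(kept_sample))
```

Lowering the threshold keeps more low-score boxes, which typically trades precision for recall.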

Training the Model#

To train a BEVFusion model, use this command:

TAO Client (v2 API)

    TRAIN_JOB_ID=$(tao bevfusion create-job \
      --kind experiment \
      --name "bevfusion_train" \
      --action train \
      --workspace-id $WORKSPACE_ID \
      --specs "$TRAIN_SPECS" \
      --train-datasets '["'$DATASET_ID'"]' \
      --eval-dataset "$DATASET_ID" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model bevfusion train [-h] -e <experiment_spec> [-r <results_dir>]

Required Arguments

The following argument is required to run the command.

-e, --experiment_spec: The experiment specification file to set up the training experiment.

Optional Arguments

The following arguments are optional.

-r, --results_dir: The path to the folder where the experiment outputs should be written. If this argument is not specified, the results_dir from the spec file is used.

--gpus: The number of GPUs used to run training.

--num_nodes: The number of nodes used to run training. If this value is larger than 1, distributed multi-node training is enabled.

-h, --help: Show this help message and exit.

Sample Usage

Here's an example of the train command:

    tao model bevfusion train -e /path/to/spec.yaml

Evaluating the Model#

To run evaluation with a BEVFusion model, use this command:

TAO Client (v2 API)

    EVAL_JOB_ID=$(tao bevfusion create-job \
      --kind experiment \
      --name "bevfusion_evaluate" \
      --action evaluate \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --eval-dataset "$DATASET_ID" \
      --specs "$EVALUATE_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model bevfusion evaluate [-h] -e <experiment_spec> [-r <results_dir>]

Required Arguments

The following argument is required.

-e, --experiment_spec: The experiment spec file to set up the evaluation experiment.

Optional Arguments

The following argument is optional.

-r, --results_dir: The directory where the evaluation result is stored.

Sample Usage

Here's an example of using the evaluate command:

    tao model bevfusion evaluate -e /path/to/spec.yaml -r /path/to/results/ evaluate.checkpoint=/path/to/model.pth
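The sample usage ends with an evaluate.checkpoint assignment, which reads as a dotted-key override of the checkpoint field inside the evaluate section of the spec. The sketch below illustrates that idea on a plain dictionary. The `apply_override` helper is hypothetical and is not the launcher's implementation; in particular, it keeps every value as a string and does no type conversion.

```python
import copy


def apply_override(spec, override):
    """Return a copy of spec with one 'dotted.key=value' override applied.

    Hypothetical helper for illustration; values are stored as strings.
    """
    key, sep, value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    result = copy.deepcopy(spec)
    node = result
    *parents, leaf = key.strip().split(".")
    for name in parents:
        # Create intermediate sections that the spec does not define yet.
        node = node.setdefault(name, {})
    node[leaf] = value.strip()
    return result


# '???' is the default shown for evaluate.checkpoint in the Evaluation Config table.
spec = {"evaluate": {"num_gpus": 1, "checkpoint": "???"}}
updated = apply_override(spec, "evaluate.checkpoint=/path/to/model.pth")
print(updated["evaluate"]["checkpoint"])
```

The original `spec` is left untouched, so the same base spec can be reused with different checkpoints.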