# Object Detection with DetectNetv2

The object detection workflow in the Isaac SDK uses the NVIDIA object detection DNN architecture, DetectNetv2. It is available on NVIDIA NGC and is trained on a real image dataset. Tools integrated with the Isaac SDK enable you to generate your own synthetic training dataset and fine-tune the DNN with the Transfer Learning Toolkit (TLT). The fine-tuned DetectNetv2 can then be used for inference in your robotics applications.

The following sections explain how to:

1. Generate a KITTI dataset from Isaac Sim.
2. Fine-tune a pre-trained DetectNetv2 model on the generated dataset.
3. Run inference on various inputs using the Isaac TensorRT Inference codelet.

## Object Detection Training Workflow with Isaac SDK and TLT

Training a DetectNetv2 model involves generating simulated data and using TLT to train a model on this data. Isaac SDK provides sample models that are used in applications, for example the Cart Delivery application. The following step-by-step instructions walk through the process of how one such model was trained for the Industrial Dolly (the figure below shows training samples from the Factory of the Future environment). Use these steps as guidelines to train models on your own objects.

## Data Generation

This section describes how to 1) run the simulation to generate data, 2) run the Isaac application alongside the simulator to capture the data and save it to a dataset, and 3) verify the generated dataset by visual inspection.

### Generating Simulated Data with Unity

#### Generating data for sample objects from scene binary: Industrial Dolly and Industrial Box

A sample Factory of the Future scene binary that generates the above data for two objects, industrial dolly and industrial box, is available in the isaac_sim_unity3d repository in builds. The Cart Delivery application and the shuffle box applications use models trained on data from this scene.

A subset of the scenarios in the scene are data generation scenes for object detection. Scenarios 7, 13, 14, and 15 provide training data for industrial dolly detection. Scenario 9 provides training data for industrial box detection. To start the scene with scenario 7, for example, run the following command from the Isaac Sim release folder:


./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 7


#### Generating data for custom objects from scene source file

To generate data for custom objects, a sample scene is provided in the isaac_sim_unity3d repository in packages/Nvidia/Samples/ObjectDetection/. This sample scene can generate data with randomized backgrounds, occluding objects, lighting conditions, and camera poses.

1. Objects are spawned in the procedural > objects GameObject. The list of objects for training is in packages/Nvidia/Samples/ObjectDetection/ObjectDetectionAssetGroup. By default, this AssetGroup contains the dolly prefab. Modify the list of GameObjects to match the list of objects you wish to train on by increasing the size of the ObjectDetectionAssetGroup and dragging each new prefab into this list. Each prefab in this list should contain a LabelSetter component that contains the name of the object.

If you would like each label from the prefab to be associated with the same instance, add an InstanceLabelGroup to the prefab as well. For example, if each wheel in the dolly prefab has the “wheel” label, an InstanceLabelGroup component in the game object containing all the wheels results in one bounding box containing all wheels, instead of four separate boxes, one per wheel.

2. Modify the MaxCount and MaxTrials parameters in the procedural > objects > Collider Asset Spawner component to reflect the number of objects to spawn each frame. The maxCount parameter specifies the number of objects to spawn. The maxPickTrials and maxPlaceTrials values denote how many times each object should be placed again if the initial spawning location is invalid. Additionally, the Dropout parameter under procedural > objects > Collider Asset Spawner represents the probability of an asset being “dropped out” of the frame (the default value is 0.2). Increasing this value results in a dataset with more negative samples, which should be present in the dataset to minimize false positives during inference.

3. Modify the ClassLabelManager game object in the scene. By default, it contains one class label rule (dolly) and two class labels (one for background, and one for dolly). Modify this so that there is one class label rule and one class label per object in your ObjectDetectionAssetGroup. Set the “name” and “expression” fields to the label of the object; this should match the string that was set as the label in LabelSetter in step 1. Make sure that the rule index of each object class label is the same as its class label index (for example, the dolly uses index 1 by default). The index value is written to the pixels of the label image that is later used to generate bounding boxes. Leave the “Default Label” field set to 0, as it is the value used to populate all the pixels that are not associated with objects (background pixels).
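To illustrate why each class needs its own index in the label image, the following minimal sketch derives one bounding box per class index from a small label image in which 0 marks background pixels. It is only an illustration under simplifying assumptions, not the Isaac SDK implementation: the function name `boxes_from_label_image` is hypothetical, the image is a plain list of lists, and all pixels of a class are merged into a single box (the actual pipeline also has instance information available).

```python
from typing import Dict, List, Tuple

# (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive.
Box = Tuple[int, int, int, int]


def boxes_from_label_image(label_image: List[List[int]],
                           background: int = 0) -> Dict[int, Box]:
    """Return the tightest box around all pixels of each non-background index."""
    boxes: Dict[int, Box] = {}
    for y, row in enumerate(label_image):
        for x, index in enumerate(row):
            if index == background:
                continue  # background pixels never contribute to a box
            if index not in boxes:
                boxes[index] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[index]
                boxes[index] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


# Tiny 5x4 label image: index 1 stands for the dolly class, 0 for background.
label_image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(boxes_from_label_image(label_image))  # {1: (1, 1, 3, 2)}
```

If two classes shared an index, their pixels would be indistinguishable in the label image, which is why the rule index and the class label index must agree.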

### Running the Isaac Application to Generate a KITTI Dataset

Configure parameters for the dataset in packages/ml/apps/generate_kitti_dataset/generate_kitti_dataset.app.json. Here the config can be modified to vary, among other parameters, the dataset output location, the output resolution of the images (for best results, use dimensions that are multiples of 16), number of training images, and number of testing images to create. The default application generates a dataset of 10k training images and 100 testing images; all images are in PNG format, with a resolution of 640x368.
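Before changing the output resolution, it can help to check that both dimensions are multiples of 16, as recommended above. The helpers below are a small sketch with hypothetical names (`is_multiple_of`, `round_up_to_multiple`); they are not part of the Isaac SDK.

```python
def is_multiple_of(value: int, multiple: int = 16) -> bool:
    """True if value is an exact multiple of `multiple`."""
    return value % multiple == 0


def round_up_to_multiple(value: int, multiple: int = 16) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to value."""
    return ((value + multiple - 1) // multiple) * multiple


# The default dataset resolution is 640x368, and both dimensions pass the check.
print(is_multiple_of(640), is_multiple_of(368))  # True True

# A height of 360 would need to be raised to 368.
print(round_up_to_multiple(360))  # 368
```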

Run the following application alongside a simulation to generate a dataset for input to the TLT training pipeline:




## Fine-tuning the pre-trained DetectNetv2 model

1. Create a local directory called tlt-experiments to mount in the docker container. Move the unity3d_kitti_dataset directory into this directory.

2. Follow these instructions from IVA to set up docker and NGC.

3. Start a docker container and mount the directory with the following command. With Isaac 2020.2, the v2.0_dp_py2 container is supported, and includes all the necessary files to train a DetectNetv2 model.


docker run --runtime=nvidia -it -v <path_to_tlt-experiments>:/workspace/tlt-experiments -p 8888:8888 nvcr.io/nvidia/tlt-streamanalytics:v2.0_dp_py2


4. Navigate to the /workspace/examples/detectnet_v2/ directory in the docker image.

5. Copy the /workspace/examples/detectnet_v2/specs folder into your /workspace/tlt-experiments folder. We will later modify these specs in the mounted folder so that the training specs persist after the docker container is terminated.

6. Start a Jupyter notebook server as described in the TLT documentation:


jupyter notebook --ip 0.0.0.0 --allow-root


7. Open the detectnet_v2.ipynb notebook and follow the instructions, taking into account these special instructions for each step.

1. Set up env variables:

• $KEY: Create a “key”, which is used to protect trained models and must be known at inference time to access model weights.
• $USER_EXPERIMENT_DIR: Leave this set to /workspace/tlt-experiments.
• $DATA_DOWNLOAD_DIR: Set this to the path of your unity3d_kitti_dataset.
• $SPECS_DIR: Set this to the path of the copied specs directory within the mounted folder from step 5.

2. Verify the downloaded dataset. Skip the first two cells, which download a KITTI object detection dataset into the $DATA_DOWNLOAD_DIR specified above. The simulated dataset from Unity3D should already be at this path, so run the last two cells of this section to validate your simulated dataset.

3. Prepare tf records from the KITTI format dataset. Modify the $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt file to reflect the correct dataset path. An example is provided below for training dolly detection.


kitti_config {
root_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
image_dir_name: "image_2"
label_dir_name: "label_2"
image_extension: ".png"
partition_mode: "random"
num_partitions: 2
val_split: 14
num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"


Then run the cells as instructed in the notebook. The cell containing the tlt-dataset-convert command outputs a message regarding the class map such as the one below. Note the “Label in tfrecords file” value. This value is used as the key when writing the training configuration in step 5.


2020-05-09 01:30:12,694 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
dolly: dolly
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.


4. Download the pre-trained model: Run the cells as instructed in the notebook.

5. Modify the training parameters for object classes in $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt for your use case:

   1. First, change the dataset_config > data_sources > image_directory_path and tfrecords_path to the training folder inside your generated dataset:

   dataset_config {
     data_sources {
       tfrecords_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/tfrecords/kitti_trainval/*"
       image_directory_path: "/workspace/tlt-experiments/unity3d_kitti_dataset/training"
     }

   2. Update the list of target_class_mapping parameters, adding one for each object class. For each object, the key field of this struct must exactly match the corresponding “Label in tfrecords file” value reported by tlt-dataset-convert in step 3.

   target_class_mapping {
     key: "dolly"
     value: "dolly"
   }

   3. Edit the output_image_width and output_image_height parameters under augmentation_config > preprocessing.

   preprocessing {
     output_image_width: 640
     output_image_height: 368
     ...
   }

   4. Under the postprocessing_config header, make sure there is one target_class_config configuration per object class. Leave the clustering_config set to default values.

   target_class_config {
     key: "dolly"
     value {
       clustering_config {
         ...
       }
     }

   5. Use the default values for the model_config section.

   6. Modify the evaluation_config section. Edit the validation_period_during_training parameter to change the number of epochs between validation steps. Make sure there is one minimum_detection_ground_truth_overlap and one evaluation_box_config struct for each object class, using the default values within the struct:

   evaluation_config {
     validation_period_during_training: 10
     first_validation_epoch: 1
     minimum_detection_ground_truth_overlap {
       key: "dolly"
       value: 0.5
     }
     evaluation_box_config {
       key: "dolly"
       value {
         ...
       }
     ...
   }

   7. In cost_function_config, make sure that there is one target_classes struct per object class, using the default values within the struct.

   Note: The cost_function_config section contains parameters for setting weights per class for calculation of the loss or cost.

   8. Modify the training_config section. In this example, the images are 640x368, so the batch_size_per_gpu can be increased to 16 for faster learning, which allows the num_epochs to be reduced to 100. Use the default values for the learning_rate, regularizer, optimizer, and cost_scaling parameters, keeping in mind that these can be adjusted if needed. By default, the training outputs a model checkpoint every 10 epochs; modify the checkpoint_interval parameter to change this frequency.

   9. Modify the bbox_rasterizer_config section to have one target_class_config per object class. For the dolly object, these values were used:

   bbox_rasterizer_config {
     target_class_config {
       key: "dolly"
       value: {
         cov_center_x: 0.5
         cov_center_y: 0.5
         cov_radius_x: 0.4
         cov_radius_y: 0.4
         bbox_min_radius: 1.0
       }
     }
     ...
   }

   For more guidance on these training parameters, see the TLT documentation and this blog post.

6. Run TLT training using the tlt-train command, as shown in the notebook.

7. Evaluate the trained model. Run the tlt-evaluate command as shown in the notebook to evaluate the final trained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.

8. Prune the trained model to reduce the number of parameters, thus decreasing inference runtimes and the overall size of the model. To prune, run the tlt-prune command as shown in the notebook. Read the pruning instructions and adjust the pruning threshold accordingly. A pth value of 0.01 is a good starting point for detectnet_v2 models. NVIDIA recommends a pruning ratio between 0.1 and 0.3.

9. Retrain the pruned model by modifying the $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt file, similar to $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt. Update the model_config so that the load_graph option is set to true. Make sure to also set the correct path to the pruned model from the previous step in the pretrained_model_file parameter under model_config.

10. Evaluate the retrained model. Run the tlt-evaluate command as shown in the notebook to evaluate the final trained model. You can also evaluate any of the checkpoint models using the -m flag with the path of the model.step-xxx.tlt files.

11. Edit the $SPECS_DIR/detectnet_v2_inference_kitti_tlt.txt file to set inference parameters. In the inferencer_config, set the target classes and inference dimensions accordingly, and provide the correct path to the model to be used for inference. In the bbox_handler_config, make sure there is one classwise_bbox_handler_config per class with the appropriate key in addition to the default classwise_bbox_handler_config.

12. Visualize inferences using the tlt-infer command as shown in the notebook. Update the -i flag to the testing directory of the simulated dataset and the -m flag to the path to the retrained model.

13. After the model is trained, pruned, and evaluated to your satisfaction, export it using the tlt-export command under the “Deploy!” section of the notebook. This provides you with a file in .etlt format, which you can then use for inference with Isaac.
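Several parts of the training spec described above need one entry per object class. When training on more than one class, generating those entries from a single list of labels helps keep the keys consistent with the “Label in tfrecords file” values. The following is a minimal sketch, not a TLT tool: the helper names are hypothetical, the “box” label is an assumed example of a second class, and only fields that appear in the spec excerpts above are emitted. The generated text still has to be pasted into the relevant sections of the spec file by hand.

```python
from typing import Iterable


def target_class_mappings(class_names: Iterable[str]) -> str:
    """One target_class_mapping stanza per class, with key equal to value."""
    stanzas = []
    for name in class_names:
        stanzas.append(
            "target_class_mapping {\n"
            f'  key: "{name}"\n'
            f'  value: "{name}"\n'
            "}"
        )
    return "\n".join(stanzas)


def ground_truth_overlaps(class_names: Iterable[str], overlap: float = 0.5) -> str:
    """One minimum_detection_ground_truth_overlap stanza per class."""
    stanzas = []
    for name in class_names:
        stanzas.append(
            "minimum_detection_ground_truth_overlap {\n"
            f'  key: "{name}"\n'
            f"  value: {overlap}\n"
            "}"
        )
    return "\n".join(stanzas)


# "dolly" comes from the class map above; "box" is an assumed second label.
classes = ["dolly", "box"]
print(target_class_mappings(classes))
print(ground_truth_overlaps(classes))
```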

## TensorRT Inference on TLT models

A sample DetectNetv2 model that was trained using the above workflow is provided. This model was trained on a different dolly than the one shown above, but with the same configuration. In addition, a sample inference application is provided in packages/detect_net/apps, utilizing the detect_net_inference subgraph located in the same folder. With this app, you can do the following:

• Run inference on a set of real images:

bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/apps:detect_net_inference_app -- --mode image --rows 480 --cols 848

• Run inference on a recorded Isaac log:

bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/apps:detect_net_inference_app -- --mode cask --rows 480 --cols 848


• Run inference on an image stream from Isaac Sim Unity3D:

bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/apps:detect_net_inference_app -- --mode sim

• Run inference on a camera feed from an Intel Realsense camera:

bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/apps:detect_net_inference_app -- --mode realsense


• Run inference on a camera feed from a V4L camera (be sure to adjust the framerate and resolution according to your camera):

bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/apps:detect_net_inference_app -- --mode v4l --fps 30 --rows 448 --cols 800

Another inference application, /packages/object_pose_estimation/detect_net/apps/detect_net_inference_deploy_app.json, is provided without the sample log data for the industrial dolly and box, which makes the package faster to deploy. Use this application instead of detect_net_inference_app.json when deploying the package. To use the app, replace detect_net_inference_app in the commands listed above with detect_net_inference_deploy_app to run inference in different modes. The app includes sample images for the industrial dolly and box to test the inference in image mode.

Note: When performing inference on the sample model, the resolution of input images must be greater than or equal to 640x368. The inference application uses the ColorCameraEncoder codelet to downscale input images to match the network input resolution, which is 640x368 for the provided sample dolly detection network. However, ColorCameraEncoder does not support upscaling, so images that are input to the inference applications cannot have a smaller resolution than the network input resolution in either dimension.

### Inference on custom models

These applications can be configured to run inference on your own trained models. The above sample applications use the configuration provided in sdk/packages/detect_net/apps/detect_net_industrial_dolly.config.json. Create a similar configuration file for your model with the appropriate ETLT model path and password. By default, inference supports a single-object model trained on a 640x368 input resolution. Pass this new config to the application using the --config command line parameter. Note that if the number of objects or the resolution of images differs from these defaults, the input tensor info under “detect_net_inference.tensor_encoder” must be updated in the detect_net_inference subgraph.
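Because input images can only be downscaled, it can be useful to check the --rows and --cols values against the network input resolution before launching an inference app. The sketch below assumes the 640x368 input of the sample dolly network as its default; the function name `input_resolution_ok` is hypothetical and not part of the Isaac SDK.

```python
def input_resolution_ok(rows: int, cols: int,
                        net_rows: int = 368, net_cols: int = 640) -> bool:
    """True if the input is at least as large as the network input in both dimensions."""
    return rows >= net_rows and cols >= net_cols


# Resolutions used in the sample commands above.
print(input_resolution_ok(480, 848))  # True
print(input_resolution_ok(448, 800))  # True

# Too small in height, so the image would need upscaling, which is not supported.
print(input_resolution_ok(360, 640))  # False
```

For a custom model, pass the input resolution the model was trained on as net_rows and net_cols.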
#### Detection Inference Parameters

During inference, a set of parameters dictates the post-processing of the raw detections that are output by the neural network. Note the following parameters of the detection decoder in particular. The same configuration file that holds the model path and password contains default values for these parameters, which should be tuned for each newly trained model based on the inference settings.

• confidence_threshold: Each detection has an associated confidence value, and the confidence threshold filters out all detections with a confidence below the threshold.
• non_maximum_suppression_threshold: To post-process the raw detection outputs from the DetectNetv2 model used for object detection, non-maximum suppression is used to eliminate multiple detections for a single object instance. Decrease the non-maximum suppression threshold to filter out detections that have high intersection-over-union overlap with other detections.

Note: This sample was trained on a limited dataset and is not guaranteed to work in every situation and lighting condition. To improve model accuracy in a custom environment, you can train your own model using the instructions provided above.

The object pose estimation pipeline is one of the many use cases for DetectNet. For more information about the pose estimation pipeline, refer to the 3D Object Pose Estimation with Pose CNN Decoder documentation.

## Evaluation of Object Detection Models

Evaluation of a model can help improve the model in several ways:

1. Data validation: A model is only as good as the data it was trained on. There are many aspects to a training dataset that can affect performance: data integrity, class balance/imbalance, etc.
2. Model improvement: Developers may wish to make incremental changes to model architectures, hyperparameters, etc. in order to explore their effects on performance.
One of the most common metrics used to evaluate object detection models is Average Precision (AP). It is based on precision, which is calculated as follows:

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

Average precision (AP) is the precision averaged over image frames. Average recall (AR) is also an important measure, where recall is:

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

Precision quantifies how well each prediction made by the network matches a ground truth object, while recall captures how many ground truth objects are identified by the network.

The basic values needed to calculate the above metrics are the true positive (TP), false positive (FP), and false negative (FN) scores. In other words, we need to build a confusion matrix for the inference results. To determine if a prediction and a ground truth bounding box match well enough to consider it a true positive, we use the IOU (Intersection over Union) threshold. IOU is a measure of how much two bounding boxes overlap (0 being no overlap, and 1 being an exact match). Setting a lower IOU threshold corresponds to higher tolerance for bounding box errors. We define true positives as the bounding box pairs for which the IOU score is greater than the IOU threshold. The following image shows the ground truth box in black and the predicted bounding box in green for a sample image.

The detection evaluation pipeline in Isaac SDK is set up to ingest data in Isaac log/cask format to ensure that the pipeline is agnostic to the source of evaluation data, i.e., simulation or real data. The evaluation workflow provided below walks through the process of collecting evaluation raw data and associated ground truth, collecting prediction data, and evaluating the metrics defined above by comparing ground truth against predictions.

### Evaluation Data Collection

In order to evaluate a model, a set of single image data samples and associated ground truth are required in Isaac log/cask format.
The evaluation pipeline requires one log/cask containing image data, and one containing associated ground truth data (where the timestamp associates the ground truth message with an image frame). The image data cask must contain ImageProto messages and the ground truth data cask must contain Detections2Proto messages. This data can be either real or simulated.

#### From simulation

Simulation provides easy access to an unlimited amount of labeled data for evaluation. The evaluation pipeline is set up to consume data in Isaac log/cask format, so data samples from simulation along with the ground truth pose are recorded in cask format using an Isaac SDK application:

bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/record_sim_ground_truth:record_sim_ground_truth -- --mode bounding_box


This command connects to Isaac Sim and collects RGB image samples along with ground truth bounding boxes of the dolly. By default, the image data cask is saved to /tmp/data/raw and contains one channel (‘color’). The ground truth data cask is saved to /tmp/data/ground_truth and contains one channel (‘bounding_boxes’). Note the following important arguments to record detection data:

• --mode: This should be set to bounding_box for detection data collection.
• --image_channel: The name of the image channel sending the RGB data in simulation. Defaults to “color”.
• --intrinsics_channel: The name of the image intrinsics channel containing the pinhole model of the camera. Defaults to “color_intrinsics”.
• --segmentation_channel: The base name of the segmentation channels containing the class, labels, and instance segmentation information from the camera. Defaults to “segmentation”.
• --runtime: Run time of the simulation to collect the data samples. The default is set to 20 seconds. Increase this value to collect more evaluation data samples.

Each run of the application saves one image and ground truth data cask pair. This way, multiple sets of data can be collected for evaluation in the form of multiple logs.

This recording application must be run alongside a simulation, similar to the generate_kitti_dataset described above. To collect evaluation data for a custom object, follow the steps listed in the above section titled “Generating data for custom objects from scene source file”.

For dolly detection evaluation, two evaluation scenes are provided in the Factory of the Future scene.

Scenario 17 spawns a cart in front of the camera at various angles and positions between 1.5 and 2.5 meters from the camera. To run Scenario 17, use the following command from within the Isaac Sim release folder:


./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 17


Scenario 18 spawns multiple carts along the robot’s path as it drives along the factory floor. To run Scenario 18, use the following command from within the Isaac Sim release folder:


./builds/factory_of_the_future.x86_64 --scene Factory01 --scenario 18


#### From real data

To evaluate on real data, collect an image data cask using the Isaac record component.

One way to collect appropriate ground truth data for this real data is to use the CVAT tool. CVAT XML data can then be converted to a ground truth cask for the evaluation pipeline. An example application to perform this conversion is provided.

The following are the important input arguments needed for this script:

• --cvat_xml: Path to the input CVAT XML file. This argument is required.
• --slice_mode: The slicing mode to determine which detections to extract. By default this is set to all, meaning that all the detections found in the XML are extracted. One other slicing mode is available (dolly), which can be used to slice out dolly detections from the sample CVAT file described below.
• --base_directory_gt: The directory to save the generated ground truth detections cask. By default this is /tmp/data/ground_truth.
• --raci_metadata: Saves JSON metadata along with cask to <app_uuid>_md.json. This is required to use the following steps in the evaluation pipeline.
• --no_raci_metadata: No metadata saved.

To run the application on a sample CVAT file with ground truth dolly detections, run the following command:


bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/evaluation:cvat_to_cask -- --cvat_xml external/detect_net_cvat_sample_data/data/cvat/0c2d809a-38cd-11ea-8bb7-79860d087101.labels.cvat.images.xml --slice_mode dolly

This application creates a cask and saves it to /tmp/data/ground_truth. The associated image cask is a single Isaac log located in the following location: isaac/sdk/bazel-sdk/external/detect_net_cvat_sample_data/data/raw/0c2d809a-38cd-11ea-8bb7-79860d087101. To stay consistent with the workspace organization in /tmp/data, move this cask directory to the /tmp/data/raw directory.

Note that this application serves as an example of data ingestion from the CVAT XML format. The sample CVAT file contains many detections of various classes, and the application slices out dolly detections to save to the cask. To write a custom slice mode, modify the slice_detections function for your use case.

### Collecting prediction data: Inference Recording

In this step, the application replays all the image cask files in a given input directory, runs the detection inference application, and records the inferred detections as a cask log for each input image cask file.

The following are the important input arguments needed for this script:

• --inference_app: The path to the application file that replays a log, performs inference, and records results. By default this is the detection inference application.
• --config: The config file to load for inference parameters for the above inference app. By default this is the dolly inference configuration used in the above sample inference apps.
• --raci_metadata: Saves JSON metadata along with cask to <app_uuid>_md.json. This is required to use the following steps in the evaluation pipeline.
• --no_raci_metadata: No metadata saved.
• --input_cask_workspace: The workspace containing the input cask files. Input image logs must be in the data/raw directory inside this workspace.
• --output_cask_workspace: The output cask files are written in data/<output_directory_name> inside this workspace. If this parameter is not set, it is assumed to be the same as the input_cask_workspace.
• --output_directory_name: Base directory name to which to write the predictions cask output. Cask files are created in <output_cask_workspace>/data/<output_directory_name>. The default is set to “predictions”.

Assuming that the image casks are stored in /tmp/data/raw, the predicted detections can be generated by running the following command from the Isaac SDK directory:

bob@desktop:~/isaac/sdk$ bazel run packages/ml/apps/evaluation_inference:evaluation_inference -- --input_cask_workspace /tmp


With the above command, the output prediction detection casks are stored in the path /tmp/data/predictions. The output detection casks are named with the same name as input image casks, with an additional tag so that the image cask and the corresponding pose casks can be associated in the next step for model evaluation.
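The way the output location follows from the three workspace arguments can be summarized in a few lines. This is only a sketch of the rule as described above, with a hypothetical function name; it does not reproduce the application's own code.

```python
from pathlib import PurePosixPath
from typing import Optional


def predictions_dir(input_cask_workspace: str,
                    output_cask_workspace: Optional[str] = None,
                    output_directory_name: str = "predictions") -> PurePosixPath:
    """Directory that receives prediction casks, following the documented defaults."""
    # The output workspace falls back to the input workspace when it is not set.
    workspace = output_cask_workspace or input_cask_workspace
    return PurePosixPath(workspace) / "data" / output_directory_name


print(predictions_dir("/tmp"))                                    # /tmp/data/predictions
print(predictions_dir("/tmp", output_directory_name="run_a"))     # /tmp/data/run_a
print(predictions_dir("/tmp", output_cask_workspace="/scratch"))  # /scratch/data/predictions
```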

Once the image, ground truth, and prediction data are collected, the evaluation metrics can be computed. This step reads the full list of casks in the image cask directory and their corresponding ground truth data and predictions, computes framewise metrics, and aggregates the metrics. The configuration file located in packages/detect_net/evaluation/detect_net_cask_evaluation.config.json is used to set the evaluation parameters including the IOU thresholds, outlier thresholds, and KPI thresholds. The default values in this file are for the dolly model.

The following are input arguments needed for this script:

• --config: The path to the config file for evaluation. The default is set to packages/detect_net/evaluation/detect_net_cask_evaluation.config.json.
• --image_cask_dir: The path to the image cask directory. Only the image logs must be placed in this directory. The data is aggregated over all the logs in this directory. The default path is set to /tmp/data/raw.
• --gt_cask_dir: The path to the ground truth pose cask directory corresponding to the image casks in the image_cask_dir. The default path is set to /tmp/data/ground_truth.
• --predicted_cask_dir: The path to the predicted pose cask directory corresponding to the image casks in the image_cask_dir. It is expected to contain 2D detections as well if use_2d_detections is set to true. The default path is set to /tmp/data/predictions.
• --results_dir: The path to store the evaluation results. The directory is created if not already present. The default path is set to /tmp/data/results.
• --save_outliers [true/false]: If true, saves the outliers to the disk under the results directory. By default, this value is false as it may take some time to save each frame to disk if there are many outliers.

To run the application, use the following command:


bob@desktop:~/isaac/sdk$ bazel run packages/detect_net/evaluation:detect_net_cask_evaluation


The evaluation results are stored as a JSON file in the specified results directory. Under the “results” tag, you will find the number of frames that were evaluated, the precisions, recalls, and confusion matrices per IOU, and a list of the outlier indices. The outliers are determined for a certain IOU threshold specified in the config file by outlier_iou_area_threshold. The three types of outliers that are extracted are images with false positives, false negatives, and large bounding box errors (the threshold for “large” is set by the large_bbox_iou_area_min parameter in the config file).

Finally, the mAP and mAR values across all the evaluated frames are computed and output at the end of the results section. The KPI_mAP and KPI_mAR values are computed over all IOUs over all the classes, as done in the COCO 2017 challenge. The KPI_mAP_lowest_IOU and KPI_mAR_lowest_IOU values are computed over a single IOU (specifically, the lowest one - 0.5 by default) over all the classes, as done in the PASCAL VOC2007 challenge. The final KPI_pass output is true if all of the four KPI values meet the thresholds specified in the config file.
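To make the metric definitions from this section concrete, the following minimal sketch computes IOU, counts TP/FP/FN for a single frame at a given IOU threshold, and derives precision and recall. It is an illustration under simplifying assumptions, not the Isaac SDK evaluation code: boxes are (x_min, y_min, x_max, y_max) tuples for a single class, matching is greedy and one-to-one, and all function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(ground_truth: List[Box], predictions: List[Box],
                  threshold: float) -> Tuple[int, int, int]:
    """Greedy one-to-one matching; returns (TP, FP, FN) for one frame."""
    unmatched = list(ground_truth)
    tp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        # A pair is a true positive only if its IOU exceeds the threshold.
        if best is not None and iou(pred, best) > threshold:
            unmatched.remove(best)
            tp += 1
    return tp, len(predictions) - tp, len(unmatched)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gt = [(0.0, 0.0, 10.0, 10.0)]
pred = [(0.0, 0.0, 10.0, 8.0), (20.0, 20.0, 30.0, 30.0)]  # IOU 0.8 and 0.0

for threshold in (0.5, 0.9):
    tp, fp, fn = count_matches(gt, pred, threshold)
    print(threshold, (tp, fp, fn), precision_recall(tp, fp, fn))
# 0.5 (1, 1, 0) (0.5, 1.0)
# 0.9 (0, 2, 1) (0.0, 0.0)
```

The same prediction counts as a true positive at the 0.5 threshold and as a false positive at 0.9, which is why metrics taken at the lowest IOU alone are more tolerant of bounding box errors than metrics aggregated over all IOU thresholds.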

This evaluation pipeline is provided as a tool to aid in improving model performance. If evaluation results do not meet standards, consider modifying the training data to better reflect the distribution of the data used for evaluation. The outlier results can aid in error analysis: Investigate the failure cases and outliers on which the model performs poorly, and use these cases to inspire the training scene in simulation. Also, consider re-training with data from multiple simulation environments (scenes) and then adjusting the amount of training data per scene based on the outlier analysis.