NVIDIA TAO Toolkit v2.0
NVIDIA TAO Release tlt.20

Preparing the Input Data Structure

This section provides instructions on preparing your data for use by the Transfer Learning Toolkit (TLT).

Classification expects a directory of images with the following structure, where each class has its own directory with the class name. The naming convention for train/val/test can be different because the path of each set is individually specified in the spec file. See the Specification File for Classification section for more information.

    |--dataset_root:
         |--train
              |--audi:
                   |--1.jpg
                   |--2.jpg
              |--bmw:
                   |--01.jpg
                   |--02.jpg
         |--val
              |--audi:
                   |--3.jpg
                   |--4.jpg
              |--bmw:
                   |--03.jpg
                   |--04.jpg
         |--test
              |--audi:
                   |--5.jpg
                   |--6.jpg
              |--bmw:
                   |--05.jpg
                   |--06.jpg
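Before training, it can help to confirm that every class directory actually contains images. The following is a minimal sketch, not part of TLT: the function name count_images_per_class and the set of accepted extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(split_dir):
    """Return {class_name: image_count} for one split directory (e.g. train)."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class directories
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling count_images_per_class on each of the train, val, and test paths from the spec file shows at a glance whether a class is missing or empty in any split.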

The object detection apps in TLT expect data in KITTI file format for training and evaluation. For DetectNet_v2, SSD, DSSD, YOLOv3, and FasterRCNN, this data is converted to TFRecords for training. TFRecords help iterate faster through the data. The steps to convert the data for TFRecords are covered in the Conversion to TFRecords section.

KITTI file format

Using the KITTI format requires data to be organized in this structure:

    .
    |--dataset root
      |-- images
          |-- 000000.jpg
          |-- 000001.jpg
                .
                .
          |-- xxxxxx.jpg
      |-- labels
          |-- 000000.txt
          |-- 000001.txt
                .
                .
          |-- xxxxxx.txt
      |-- kitti_seq_to_map.json

Here’s a description of the structure:

  • The images directory contains the images to train on.

  • The labels directory contains the labels for the corresponding images. Details of this file are included in the Label Files section.

    Note

    The images and labels have the same file IDs before the extension. The image-to-label correspondence is maintained through this file name.

  • The kitti_seq_to_map.json file contains a sequence-to-frame-ID mapping for the frames in the images directory. This file is optional and is useful if the data needs to be split into N folds sequence-wise. If the data is to be split randomly into an 80:20 train:val split, this file may be ignored.
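Because images and labels are paired only by file name, a quick consistency check can catch missing files early. This is a minimal sketch under the directory layout described above; find_unpaired_ids is a hypothetical helper, not a TLT tool, and the default image extension is an assumption.

```python
from pathlib import Path


def find_unpaired_ids(dataset_root, image_ext=".jpg"):
    """Return (images_without_label, labels_without_image) as sorted ID lists."""
    root = Path(dataset_root)
    # The file ID is the file name without its extension.
    image_ids = {p.stem for p in (root / "images").glob("*" + image_ext)}
    label_ids = {p.stem for p in (root / "labels").glob("*.txt")}
    return sorted(image_ids - label_ids), sorted(label_ids - image_ids)
```

Both returned lists should be empty for a dataset that is ready for conversion.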

Note

All the images and labels in the training dataset should be of the same resolution. For DetectNet_v2, SSD, DSSD, YOLOv3 and FasterRCNN notebooks, the tlt-train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly.
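When images are resized offline, each bounding box has to be scaled by the same horizontal and vertical factors as its image. The sketch below shows only that arithmetic; scale_bbox is a hypothetical helper, and the image resizing itself is left to whatever imaging tool you use.

```python
def scale_bbox(bbox, original_size, target_size):
    """Scale [xmin, ymin, xmax, ymax] from original (width, height) to target (width, height)."""
    scale_x = target_size[0] / original_size[0]
    scale_y = target_size[1] / original_size[1]
    xmin, ymin, xmax, ymax = bbox
    return [xmin * scale_x, ymin * scale_y, xmax * scale_x, ymax * scale_y]


# Halving a 1280x720 image halves every coordinate.
print(scale_bbox([100, 120, 180, 160], (1280, 720), (640, 360)))
# [50.0, 60.0, 90.0, 80.0]
```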


Label Files

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

| Num elements | Parameter name | Description | Type | Range | Example |
|---|---|---|---|---|---|
| 1 | Class names | The class to which the object belongs. | String | N/A | Person, car, Road_Sign |
| 1 | Truncation | How much of the object has left image boundaries. | Float | 0.0 to 1.0 | 0.0 |
| 1 | Occlusion | Occlusion state [0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown]. | Integer | [0,3] | 2 |
| 1 | Alpha | Observation angle of the object | Float | [-pi, pi] | 0.146 |
| 4 | Bounding box coordinates: [xmin, ymin, xmax, ymax] | Location of the object in the image | Float (0-based index) | [0 to image_width], [0 to image_height], [top_left, image_width], [bottom_right, image_height] | 100 120 180 160 |
| 3 | 3-D dimension | Height, width, length of the object (in meters) | Float | N/A | 1.65, 1.67, 3.64 |
| 3 | Location | 3-D object location x, y, z in camera coordinates (in meters) | Float | N/A | -0.65, 1.71, 46.7 |
| 1 | Rotation_y | Rotation ry around the Y-axis in camera coordinates | Float | [-pi, pi] | -1.59 |

Each object line therefore contains 15 elements in total. Here is a sample text file:

    car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
    cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
    pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03
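To make the field order concrete, here is a minimal parsing sketch that splits one label line into the eight parameters listed in the field description. The KittiObject type and parse_kitti_line function are hypothetical names introduced for illustration; they are not part of TLT.

```python
from typing import List, NamedTuple


class KittiObject(NamedTuple):
    class_name: str
    truncation: float
    occlusion: int
    alpha: float
    bbox: List[float]        # [xmin, ymin, xmax, ymax]
    dimensions: List[float]  # [height, width, length] in meters
    location: List[float]    # [x, y, z] in camera coordinates
    rotation_y: float


def parse_kitti_line(line):
    """Parse one whitespace-separated KITTI label line with exactly 15 fields."""
    fields = line.split()
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got {}".format(len(fields)))
    values = [float(v) for v in fields[1:]]  # the 14 numeric fields
    return KittiObject(
        class_name=fields[0],
        truncation=values[0],
        occlusion=int(values[1]),
        alpha=values[2],
        bbox=values[3:7],
        dimensions=values[7:10],
        location=values[10:13],
        rotation_y=values[13],
    )
```

Rejecting lines that do not have exactly 15 fields is a design choice of this sketch; it surfaces malformed annotations before conversion rather than during training.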

This file describes an image containing 3 objects, each with the parameters listed above. Currently, for detection, the toolkit only requires the class name and bounding box coordinate fields to be populated, because the TLT training pipeline supports training only for class and bounding box coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:

    car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
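For custom annotations that only have a class name and a bounding box, a label line with the unused fields zeroed can be generated as in the sketch below. make_kitti_line is a hypothetical helper, and the two-decimal formatting simply mirrors the sample file; it is an assumption, not a stated requirement.

```python
def make_kitti_line(class_name, bbox):
    """Build a 15-field KITTI label line with all non-required fields set to 0."""
    xmin, ymin, xmax, ymax = bbox
    fields = [class_name, "0.00", "0", "0.00"]  # truncation, occlusion, alpha
    fields += ["{:.2f}".format(v) for v in (xmin, ymin, xmax, ymax)]
    fields += ["0.00"] * 7  # 3-D dimension (3), location (3), rotation_y (1)
    return " ".join(fields)


print(make_kitti_line("car", [587.01, 173.33, 614.12, 200.12]))
# car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
```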


Sequence Mapping File

This is an optional JSON file that captures the mapping between the frames in the images directory and the names of the video sequences from which these frames were extracted. This information is needed when doing an N-fold split of the dataset. This way, frames from one sequence don't repeat in other folds, and one of the folds can be used for validation. Here is the general form of the JSON dictionary:

{ "video_sequence_name": [list of strings(frame idx)] }

Here’s an example of a kitti_seq_to_frames.json file with a sample dataset with six sequences:

    {
      "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838", "007320", "003476", "007308", "000337", "004165", "006573"],
      "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
      "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
      "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
      "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
      "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
    }
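The idea of a sequence-wise split can be sketched as follows: whole sequences are assigned to folds, so no sequence contributes frames to more than one fold. This is an illustration only and is not the algorithm used by the TLT converter; the function name and the round-robin, largest-first assignment are assumptions.

```python
def split_sequences_into_folds(seq_to_frames, num_folds):
    """Assign whole sequences to folds round-robin, largest sequence first."""
    folds = [[] for _ in range(num_folds)]
    # Sort by descending frame count, then by name for a deterministic order.
    ordered = sorted(seq_to_frames.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for index, (_, frames) in enumerate(ordered):
        folds[index % num_folds].extend(frames)
    return folds
```

With a mapping loaded from the JSON file (for example via json.load), each returned list holds the frame IDs of one fold, and any single fold can be held out for validation.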


The SSD, DSSD, YOLOv3, FasterRCNN, and DetectNet_v2 apps, as mentioned in the Data Input for Object Detection section, require KITTI format data to be converted to TFRecords. To do so, the Transfer Learning Toolkit includes the tlt-dataset-convert tool. This tool requires a configuration file as input. Configuration file details and sample usage examples are included in the following sections.

Configuration File for Dataset Converter

The dataio conversion tool takes a spec file as input to define the parameters required to convert KITTI format data to the TFRecords that the detection models ingest. This is a prototxt format file with two global parameters:

  • kitti_config field: This is a nested prototxt configuration with multiple input parameters.

  • image_directory_path: Path to the dataset root. The image_dir_name is appended to this path to get the input images. This path must be the same as the one specified in the experiment spec file.

Here are descriptions of the configurable parameters for the kitti_config field:

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| root_directory_path | string | | Path to the dataset root directory | |
| image_dir_name | string | | Relative path to the directory containing images from the path in root_directory_path | |
| label_dir_name | string | | Relative path to the directory containing labels from the path in root_directory_path | |
| partition_mode | string | | The method employed when partitioning the data into multiple folds. Two methods are supported. Random partitioning: the data is divided into 2 folds, train and val. This mode requires that the val_split parameter be set. Sequence-wise partitioning: the data is divided into n partitions (defined by the num_partitions parameter) based on the number of sequences available. | random, sequence |
| num_partitions | int | 2 (if partition_mode is random) | Number of partitions to split the data into (N folds). This field is ignored when the partition mode is set to random, because by default only 2 partitions (val and train) are generated. In sequence mode the data is split into n folds. The number of partitions should ideally be less than the total number of sequences in the kitti_sequence_to_frames_file. | n = 2 for random partition; n < number of sequences in the kitti_sequence_to_frames_file |
| image_extension | str | ".png" | The extension of the images in the image_dir_name parameter. | .png, .jpg, .jpeg |
| val_split | float | 20 | Percentage of data to be separated for validation. This only works under the "random" partition mode. This partition is available in fold 0 of the TFRecords generated. Set the validation fold to 0 in the dataset_config. | 0-100 |
| kitti_sequence_to_frames_file | str | | Name of the KITTI sequence-to-frame mapping file. This file must be present within the dataset root specified in root_directory_path. | |
| num_shards | int | 10 | Number of shards per fold. | 1-20 |

A sample configuration file to convert the Pascal VOC dataset with 80% training data and 20% validation data is shown below. This assumes that the data has been converted to KITTI format and is available for ingestion in the root directory path.

    kitti_config {
      root_directory_path: "/workspace/tlt-experiments/data/VOCtrainval_11-May-2012/VOCdevkit/VOC2012"
      image_dir_name: "JPEGImages_kitti/test"
      label_dir_name: "Annotations_kitti/test"
      image_extension: ".jpg"
      partition_mode: "random"
      num_partitions: 2
      val_split: 20
      num_shards: 10
    }
    image_directory_path: "/workspace/tlt-experiments/data/VOCtrainval_11-May-2012/VOCdevkit/VOC2012"
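If you generate spec files from a notebook or script, the text can be assembled from a few parameters. The sketch below renders a random-partition spec with the same fields as the sample; render_converter_spec is a hypothetical helper and its defaults simply repeat the documented defaults.

```python
def render_converter_spec(root, image_dir, label_dir, image_extension=".png",
                          val_split=20, num_shards=10):
    """Render a random-partition dataset converter spec as prototxt text."""
    lines = [
        "kitti_config {",
        '  root_directory_path: "{}"'.format(root),
        '  image_dir_name: "{}"'.format(image_dir),
        '  label_dir_name: "{}"'.format(label_dir),
        '  image_extension: "{}"'.format(image_extension),
        '  partition_mode: "random"',
        "  num_partitions: 2",
        "  val_split: {}".format(val_split),
        "  num_shards: {}".format(num_shards),
        "}",
        # The same root is reused here, as in the sample configuration.
        'image_directory_path: "{}"'.format(root),
    ]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file gives a spec that can be passed to the converter with the -d argument.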


Sample Usage of the Dataset Converter Tool

KITTI is the accepted dataset format for image detection. The KITTI dataset must be converted to the TFRecord file format before passing to detection training. Use this command to do the conversion:

tlt-dataset-convert [-h] -d DATASET_EXPORT_SPEC -o OUTPUT_FILENAME [-f VALIDATION_FOLD]

The command accepts these arguments (-d and -o are required):

  • -h, --help: Show this help message and exit.

  • -d, --dataset-export-spec: Path to the detection dataset spec containing the config for exporting .tfrecord files.

  • -o output_filename: Output file name.

  • -f, --validation-fold: Indicate the validation fold in 0-based indexing. This is required when modifying the training set but otherwise optional.

Here’s an example of using the command with the dataset:

tlt-dataset-convert -d <path_to_tfrecords_conversion_spec> -o <path_to_output_tfrecords>
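When the conversion is driven from a Python script or notebook, the same command can be assembled as an argument list and run with the standard subprocess module. This is a sketch that uses only the -d, -o, and -f arguments described above; the helper names are hypothetical, and the tool must be available on the PATH of the environment where it runs.

```python
import subprocess


def build_convert_command(spec_path, output_path, validation_fold=None):
    """Return the tlt-dataset-convert command as an argument list."""
    command = ["tlt-dataset-convert", "-d", spec_path, "-o", output_path]
    if validation_fold is not None:
        command += ["-f", str(validation_fold)]  # 0-based fold index
    return command


def run_conversion(spec_path, output_path, validation_fold=None):
    """Run the converter and raise CalledProcessError if it exits with an error."""
    subprocess.run(
        build_convert_command(spec_path, output_path, validation_fold),
        check=True,
    )
```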

Output log from executing tlt-dataset-convert:

    Using TensorFlow backend.
    2019-07-16 01:30:59,073 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
    2019-07-16 01:30:59,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in Train: 10786 Val: 2696
    2019-07-16 01:30:59,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validation set during training choose validation_fold 0.
    2019-07-16 01:30:59,251 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
    /usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:265: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
    2019-07-16 01:31:01,226 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
    .
    .
    sheep: 242
    bottle: 205
    ..
    boat: 171
    car: 418
    2019-07-16 01:31:20,772 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
    ..
    2019-07-16 01:32:40,338 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
    2019-07-16 01:32:49,063 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Wrote the following numbers of objects:
    sheep: 695
    ..
    car: 1770
    2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
    2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Wrote the following numbers of objects:
    sheep: 937
    ..
    car: 2188
    2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
    Label in GT: Label in tfrecords file
    sheep: sheep
    ..
    boat: boat
    For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.
    2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

Note

The tlt-dataset-convert tool converts the class names in the KITTI formatted data files to lowercase. Therefore, make sure to use the lowercase class names in the dataset_config section under target class mapping when configuring a training experiment. Using incorrect class names here can lead to invalid training experiments with 0 mAP.
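To see which lowercase names to expect, you can collect the class names from your label lines and lowercase them yourself. This is a minimal sketch; lowercase_class_names is a hypothetical helper that reproduces only the lowercasing described in the note, not any other processing the tool may do.

```python
def lowercase_class_names(label_lines):
    """Return the sorted, lowercased class names found in KITTI label lines."""
    return sorted(
        {line.split()[0].lower() for line in label_lines if line.strip()}
    )


lines = [
    "Person 0.00 0 0.00 10 10 20 20 0 0 0 0 0 0 0",
    "Road_Sign 0.00 0 0.00 30 30 40 40 0 0 0 0 0 0 0",
    "car 0.00 0 0.00 50 50 60 60 0 0 0 0 0 0 0",
]
print(lowercase_class_names(lines))
# ['car', 'person', 'road_sign']
```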

Note

When using the tool to create separate tfrecords for evaluation, which may be defined under the dataset_config using the parameter validation_data_source, we advise you to set partition_mode to random with 2 partitions, and an arbitrary val_split (1-100). The dataloader takes care of traversing through all the folds and generating the mAP accordingly.


Instance segmentation expects directories of images for training or validation and annotation files in COCO format. The naming convention for the train/val split can be different, because the path of each set is individually specified in the data preparation script in the IPython notebook example. The image data and the corresponding annotation file are then converted to TFRecords for training.

COCO format for Instance Segmentation

Using the COCO format requires data to be organized in this structure:

    annotation{
        "id": int,
        "image_id": int,
        "category_id": int,
        "segmentation": RLE or [polygon],
        "area": float,
        "bbox": [x,y,width,height],
        "iscrowd": 0 or 1,
    }

    image{
        "id": int,
        "width": int,
        "height": int,
        "file_name": str,
        "license": int,
        "flickr_url": str,
        "coco_url": str,
        "date_captured": datetime,
    }

    categories[{
        "id": int,
        "name": str,
        "supercategory": str,
    }]
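Note that a COCO bbox is stored as [x, y, width, height], unlike the KITTI [xmin, ymin, xmax, ymax] corners. The sketch below converts between the two and checks an annotation for the keys in the structure above; both helper names and the choice of required keys are assumptions made for illustration.

```python
# Keys taken from the annotation structure above.
REQUIRED_ANNOTATION_KEYS = (
    "id", "image_id", "category_id", "segmentation", "area", "bbox", "iscrowd",
)


def coco_bbox_to_corners(bbox):
    """Convert COCO [x, y, width, height] to [xmin, ymin, xmax, ymax]."""
    x, y, width, height = bbox
    return [x, y, x + width, y + height]


def missing_annotation_keys(annotation):
    """Return the required keys that are absent from one annotation dict."""
    return [key for key in REQUIRED_ANNOTATION_KEYS if key not in annotation]
```

Running missing_annotation_keys over every entry of a custom annotation file is a cheap way to catch incomplete records before converting to TFRecords.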

An example COCO annotation file is shown below:

    "annotations": [{
        "segmentation": [[510.66,423.01,511.72,420.03,510.45,416.0,510.34,413.02,510.77,410.26,510.77,407.5,510.34,405.16,511.51,402.83,511.41,400.49,510.24,398.16,509.39,397.31,504.61,399.22,502.17,399.64,500.89,401.66,500.47,402.08,499.09,401.87,495.79,401.98,490.59,401.77,488.79,401.77,485.39,398.58,483.9,397.31,481.56,396.35,478.48,395.93,476.68,396.03,475.4,396.77,473.92,398.79,473.28,399.96,473.49,401.87,474.56,403.47,473.07,405.59,473.39,407.71,476.68,409.41,479.23,409.73,481.56,410.69,480.4,411.85,481.35,414.93,479.86,418.65,477.32,420.03,476.04,422.58,479.02,422.58,480.29,423.01,483.79,419.93,486.66,416.21,490.06,415.57,492.18,416.85,491.65,420.24,492.82,422.9,493.56,424.39,496.43,424.6,498.02,423.01,498.13,421.31,497.07,420.03,497.07,415.15,496.33,414.51,501.1,411.96,502.06,411.32,503.02,415.04,503.33,418.12,501.1,420.24,498.98,421.63,500.47,424.39,505.03,423.32,506.2,421.31,507.69,419.5,506.31,423.32,510.03,423.01,510.45,423.01]],
        "area": 702.1057499999998,
        "iscrowd": 0,
        "image_id": 289343,
        "bbox": [473.07,395.93,38.65,28.67],
        "category_id": 18,
        "id": 1768
    }],
    "images": [{
        "license": 1,
        "file_name": "000000407646.jpg",
        "coco_url": "http://images.cocodataset.org/val2017/000000407646.jpg",
        "height": 400,
        "width": 500,
        "date_captured": "2013-11-23 03:58:53",
        "flickr_url": "http://farm4.staticflickr.com/3110/2855627782_17b93a684e_z.jpg",
        "id": 407646
    }],
    "categories": [
        {"supercategory": "person","id": 1,"name": "person"},
        {"supercategory": "vehicle","id": 2,"name": "bicycle"},
        {"supercategory": "vehicle","id": 3,"name": "car"},
        {"supercategory": "vehicle","id": 4,"name": "motorcycle"}
    ]

For more details, please check COCO format. A COCO dataset preparation script is provided in the TLT container, which automatically downloads and converts the dataset to TFRecords. In the MaskRCNN notebook, you can run the script as follows:

download_and_preprocess_coco.sh $data_dir

When using a custom dataset, follow the COCO format closely and convert the dataset to TFRecords using the following command (refer to lines 68-75 in download_and_preprocess_coco.sh for more detail).

    python create_coco_tf_record.py --logtostderr \
        --include_masks \
        --train_image_dir=$TRAIN_IMAGE_DIR \
        --val_image_dir=$VAL_IMAGE_DIR \
        --train_object_annotations_file=$TRAIN_COCO_ANNOTATION_FILE \
        --val_object_annotations_file=$VAL_ANNOTATION_FILE \
        --output_dir=$OUTPUT_DIR


© Copyright 2020, NVIDIA. Last updated on Nov 18, 2020.