Preparing the Input Data Structure

This section provides instructions on preparing your data for use by the Transfer Learning Toolkit (TLT).

Data Input for Classification

Classification expects a directory of images with the following structure, where each class has its own directory with the class name. The naming convention for train/val/test can be different because the path of each set is individually specified in the spec file. See the Specification File for Classification section for more information.
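For example, a dataset with two classes could be laid out as follows (class_1 and class_2 are placeholder class names, not TLT requirements):

|--dataset root
  |-- train
      |-- class_1
          |-- 0001.jpg
          |-- 0002.jpg
      |-- class_2
          |-- 0001.jpg
          |-- 0002.jpg
  |-- val
      |-- class_1
      |-- class_2
  |-- test
      |-- class_1
      |-- class_2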


Data Input for Object Detection

The object detection apps in TLT expect data in the KITTI file format for training and evaluation. For DetectNet_v2, SSD, DSSD, YOLOv3, and FasterRCNN, this data is converted to TFRecords for training, which allows faster iteration through the data. The steps to convert the data to TFRecords are covered in the Conversion to TFRecords section.

KITTI file format

Using the KITTI format requires data to be organized in this structure:

|--dataset root
  |-- images
      |-- 000000.jpg
      |-- 000001.jpg
      |-- xxxxxx.jpg
  |-- labels
      |-- 000000.txt
      |-- 000001.txt
      |-- xxxxxx.txt
  |-- kitti_seq_to_map.json

Here’s a description of the structure:

  • The images directory contains the images to train on.

  • The labels directory contains the labels for the corresponding images. Details of this file are included in the Label Files section.


    The images and labels must share the same file ID before the extension; this shared file name maintains the image-to-label correspondence.

  • The kitti_seq_to_map.json file contains a sequence-to-frame-ID mapping for the frames in the images directory. This file is optional; it is useful if the data needs to be split into N folds sequence-wise. If the data is simply split into a random 80:20 train:val split, this file may be omitted.


All the images and labels in the training dataset should be of the same resolution. For the DetectNet_v2, SSD, DSSD, YOLOv3, and FasterRCNN notebooks, the tlt-train tool does not support training on images of multiple resolutions or resizing images during training. All images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly.
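The bounding-box rework for this offline resize can be sketched in a few lines of Python. The function name is ours, not a TLT utility, and the image files themselves would be resized separately with any image tool:

```python
# Sketch: after resizing an image by (scale_x, scale_y), scale the bbox
# fields of each KITTI label line by the same factors. Fields 4-7 of a
# KITTI line are xmin, ymin, xmax, ymax in pixels.

def scale_kitti_line(line, scale_x, scale_y):
    fields = line.split()
    for i, s in ((4, scale_x), (5, scale_y), (6, scale_x), (7, scale_y)):
        fields[i] = f"{float(fields[i]) * s:.2f}"
    return " ".join(fields)

# Example: image resized to half size in both axes.
line = "car 0.00 0 0.00 100.00 120.00 180.00 160.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
print(scale_kitti_line(line, 0.5, 0.5))
# car 0.00 0 0.00 50.00 60.00 90.00 80.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
```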

Label Files

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

  • Class name (1 element, string): The class to which the object belongs, e.g. Person, car, Road_Sign.

  • Truncation (1 element, float): How much of the object has left the image boundaries, in the range [0, 1], e.g. 0.0, 0.1.

  • Occlusion (1 element, integer): Occlusion state [0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown].

  • Observation angle (1 element, float): Observation angle of the object, in the range [-pi, pi].

  • Bounding box coordinates (4 elements, float, 0-based index): [xmin, ymin, xmax, ymax] — the location of the object in the image, with x values in [0, image_width] and y values in [0, image_height], e.g. 100 120 180 160.

  • 3-D dimension (3 elements, float): Height, width, length of the object (in meters), e.g. 1.65, 1.67, 3.64.

  • Location (3 elements, float): 3-D object location x, y, z in camera coordinates (in meters), e.g. -0.65, 1.71, 46.7.

  • Rotation_y (1 element, float): Rotation ry around the Y-axis in camera coordinates, in the range [-pi, pi].

The total number of elements per object is 15. Here is a sample label file:

car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This indicates that the image contains three objects with the parameters shown above. Currently, for detection, the toolkit requires only the class name and bbox coordinates fields to be populated, because the TLT training pipeline supports training only on class and bbox coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:

car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
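Such a minimal label file can be generated programmatically. The following sketch emits one KITTI line with the unused fields zeroed; the kitti_line helper is hypothetical, not a TLT utility:

```python
# Sketch: build a 15-field KITTI label line from a class name and bbox,
# zeroing truncation, occlusion, alpha, and all 3-D fields as described above.

def kitti_line(cls, xmin, ymin, xmax, ymax):
    bbox = f"{xmin:.2f} {ymin:.2f} {xmax:.2f} {ymax:.2f}"
    # truncation, occlusion, alpha = 0; dimensions, location, rotation_y = 0
    return f"{cls} 0.00 0 0.00 {bbox} 0.00 0.00 0.00 0.00 0.00 0.00 0.00"

print(kitti_line("car", 587.01, 173.33, 614.12, 200.12))
# car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
```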

Sequence Mapping File

This is an optional JSON file that captures the mapping between the frames in the images directory and the names of the video sequences from which those frames were extracted. This information is needed when doing an N-fold split of the dataset, so that frames from one sequence do not repeat in other folds and one fold can be held out for validation. The file is a JSON dictionary of the form:

  "video_sequence_name": [list of strings(frame idx)]

Here’s an example of a kitti_seq_to_frames.json file with a sample dataset with six sequences:

  "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838",
  "007320", "003476", "007308", "000337", "004165", "006573"],
  "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
  "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
  "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
  "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
  "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]

Conversion to TFRecords

The SSD, DSSD, YOLOv3, FasterRCNN, and DetectNet_v2 apps, as mentioned in the Data Input for Object Detection section, require KITTI format data to be converted to TFRecords. To do so, the Transfer Learning Toolkit includes the tlt-dataset-convert tool. This tool requires a configuration file as input. Configuration file details and sample usage examples are included in the following sections.

Configuration File for Dataset Converter

The dataio conversion tool takes a spec file as input to define the parameters required to convert KITTI format data to the TFRecords that the detection models ingest. This is a prototxt format file with two global parameters:

  • kitti_config field: This is a nested prototxt configuration with multiple input parameters.

  • image_directory_path: Path to the dataset root. The image_dir_name is appended to this path to get the input images, and it must be the same path as mentioned in the experiment spec file.

Here are descriptions of the configurable parameters for the kitti_config field:

  • root_directory_path (string): Path to the dataset root directory.

  • image_dir_name (string): Relative path, from root_directory_path, to the directory containing images.

  • label_dir_name (string): Relative path, from root_directory_path, to the directory containing labels.

  • partition_mode (string; supported values: random, sequence): The method employed when partitioning the data into multiple folds. Two methods are supported: random partitioning, where the data is divided into two folds (train and val) and the val_split parameter must be set; and sequence-wise partitioning, where the data is divided into n partitions (defined by the num_partitions parameter) based on the number of sequences available.

  • num_partitions (int; default 2 when partition_mode is random): Number of partitions (folds) to split the data into. This field is ignored when partition_mode is set to random, since by default only two partitions (train and val) are generated. In sequence mode the data is split into n folds; the number of partitions should be less than the total number of sequences in the kitti_sequence_to_frames_file.

  • image_extension (string; supported values: .png, .jpg, .jpeg): The extension of the images in the image_dir_name directory.

  • val_split (float): Percentage of data to be separated for validation. This only works under random partition mode. This partition is written to fold 0 of the generated TFRecords, so set validation_fold to 0 in the dataset_config.

  • kitti_sequence_to_frames_file (string): Name of the KITTI sequence-to-frame mapping file. This file must be present within the dataset root, as mentioned in root_directory_path.

  • num_shards (int): Number of shards per fold.

A sample configuration file to convert the Pascal VOC dataset with 80% training data and 20% validation data is shown below. This assumes that the data has been converted to KITTI format and is available for ingestion at the root directory path.

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/VOCtrainval_11-May-2012/VOCdevkit/VOC2012"
  image_dir_name: "JPEGImages_kitti/test"
  label_dir_name: "Annotations_kitti/test"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/VOCtrainval_11-May-2012/VOCdevkit/VOC2012"

Sample Usage of the Dataset Converter Tool

KITTI is the accepted dataset format for object detection. The KITTI dataset must be converted to the TFRecord file format before being passed to detection training. Use this command to do the conversion:

tlt-dataset-convert [-h] -d DATASET_EXPORT_SPEC -o OUTPUT_FILENAME
                         [-f VALIDATION_FOLD]

You can use these optional arguments:

  • -h, --help: Show this help message and exit

  • -d, --dataset-export-spec: Path to the detection dataset spec containing the config for exporting .tfrecord files.

  • -o output_filename: Output file name.

  • -f, --validation-fold: Indicate the validation fold in 0-based indexing. This is required when modifying the training set, but otherwise optional.

Here’s an example of using the command with the dataset:

tlt-dataset-convert -d <path_to_tfrecords_conversion_spec> -o <path_to_output_tfrecords>

Output log from executing tlt-dataset-convert:

Using TensorFlow backend.
2019-07-16 01:30:59,073 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2019-07-16 01:30:59,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 10786    Val: 2696
2019-07-16 01:30:59,243 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validation set during training choose validation_fold 0.
2019-07-16 01:30:59,251 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/ VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2019-07-16 01:31:01,226 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
. .
sheep: 242
bottle: 205
boat: 171
car: 418
2019-07-16 01:31:20,772 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2019-07-16 01:32:40,338 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2019-07-16 01:32:49,063 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
sheep: 695
car: 1770

2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
sheep: 937
car: 2188
2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
sheep: sheep

boat: boat
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2019-07-16 01:32:49,064 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.


The tlt-dataset-convert tool converts the class names in the KITTI formatted data files to lowercase. Therefore, make sure to use the lowercase class names in the dataset_config section under target class mapping when configuring a training experiment. Using incorrect class names here can lead to invalid training experiments with 0 mAP.


When using the tool to create separate TFRecords for evaluation, which may be defined under the dataset_config using the validation_data_source parameter, we advise setting partition_mode to random with 2 partitions and an arbitrary val_split (1-100). The dataloader takes care of traversing all the folds and generating the mAP accordingly.

Data Input for Instance Segmentation

Instance segmentation expects directories of images for training or validation, along with annotation files in COCO format. The naming convention for the train/val split can be different, because the path of each set is individually specified in the data preparation script in the IPython notebook example. The image data and the corresponding annotation files are then converted to TFRecords for training.

COCO format for Instance Segmentation

Using the COCO format requires the annotations to be organized in this structure:

"id": int,
"image_id": int,
"category_id": int,
"segmentation": RLE or [polygon],
"area": float,
"bbox": [x,y,width,height],
"iscrowd": 0 or 1,

"id": int,
"width": int,
"height": int,
"file_name": str,
"license": int,
"flickr_url": str,
"coco_url": str,
"date_captured": datetime,

"id": int,
"name": str,
"supercategory": str,

An example COCO annotation file is shown below:

"annotations": [{"segmentation": [[510.66,423.01,511.72,420.03,510.45,416.0,510.34,413.02,510.77,410.26,510.77,407.5,510.34,405.16,511.51,402.83,511.41,400.49,510.24,398.16,509.39,397.31,504.61,399.22,502.17,399.64,500.89,401.66,500.47,402.08,499.09,401.87,495.79,401.98,490.59,401.77,488.79,401.77,485.39,398.58,483.9,397.31,481.56,396.35,478.48,395.93,476.68,396.03,475.4,396.77,473.92,398.79,473.28,399.96,473.49,401.87,474.56,403.47,473.07,405.59,473.39,407.71,476.68,409.41,479.23,409.73,481.56,410.69,480.4,411.85,481.35,414.93,479.86,418.65,477.32,420.03,476.04,422.58,479.02,422.58,480.29,423.01,483.79,419.93,486.66,416.21,490.06,415.57,492.18,416.85,491.65,420.24,492.82,422.9,493.56,424.39,496.43,424.6,498.02,423.01,498.13,421.31,497.07,420.03,497.07,415.15,496.33,414.51,501.1,411.96,502.06,411.32,503.02,415.04,503.33,418.12,501.1,420.24,498.98,421.63,500.47,424.39,505.03,423.32,506.2,421.31,507.69,419.5,506.31,423.32,510.03,423.01,510.45,423.01]],"area": 702.1057499999998,"iscrowd": 0,"image_id": 289343,"bbox": [473.07,395.93,38.65,28.67],"category_id": 18,"id": 1768}],
"images": [{"license": 1,"file_name": "000000407646.jpg","coco_url": "","height": 400,"width": 500,"date_captured": "2013-11-23 03:58:53","flickr_url": "","id": 407646}],
"categories": [{"supercategory": "person","id": 1,"name": "person"},{"supercategory": "vehicle","id": 2,"name": "bicycle"},{"supercategory": "vehicle","id": 3,"name": "car"},{"supercategory": "vehicle","id": 4,"name": "motorcycle"}]

For more details, please check the COCO format documentation. A COCO dataset preparation script is provided in the TLT container, which automatically downloads and converts the dataset to TFRecords. In the MaskRCNN notebook, you can run the script by passing it the data directory ($data_dir).

When using a custom dataset, you should follow the COCO format closely and convert the dataset to TFRecords in the same way; refer to the MaskRCNN notebook for more detail.