Data Input for Object Detection

The object detection apps in TLT expect data in KITTI file format for training and evaluation.

KITTI Format

Using the KITTI format requires data to be organized in this structure:

.
|--dataset root
  |-- images
      |-- 000000.jpg
      |-- 000001.jpg
            .
            .
      |-- xxxxxx.jpg
  |-- labels
      |-- 000000.txt
      |-- 000001.txt
            .
            .
      |-- xxxxxx.txt
  |-- kitti_seq_to_map.json

Here’s a description of the structure:

The images directory contains the images to train on.
The labels directory contains the label files for the corresponding images. The format of a label file is described in the Label Files section.

Note

Each image and its label file share the same file ID (the file name before the extension), for example 000000.jpg and 000000.txt. This shared file name is how the image-to-label correspondence is maintained.
The kitti_seq_to_map.json file contains a sequence-to-frame-ID mapping for the frames in the images directory. This file is optional. It is useful when the data needs to be split into N folds sequence-wise, so that every frame of a sequence lands in the same fold. If the data is to be split randomly, for example into an 80:20 train:val split, this file may be ignored.
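
Because the image-to-label correspondence depends only on matching file names, it can be worth checking a dataset for unpaired files before training. The following is a minimal sketch, not part of TLT: the function name find_unpaired_files and its parameters are hypothetical, and it assumes the images use the .jpg extension and the labels use the .txt extension, as in the structure shown above.

```python
from pathlib import Path


def find_unpaired_files(dataset_root, image_ext=".jpg", label_ext=".txt"):
    """Return (images_without_labels, labels_without_images) as sorted file IDs.

    Hypothetical helper: assumes the 'images' and 'labels' directories sit
    directly under dataset_root, as in the layout described in this section.
    """
    root = Path(dataset_root)
    # The file ID is the file name without its extension (Path.stem).
    image_ids = {p.stem for p in (root / "images").glob(f"*{image_ext}")}
    label_ids = {p.stem for p in (root / "labels").glob(f"*{label_ext}")}
    return sorted(image_ids - label_ids), sorted(label_ids - image_ids)
```

If both returned lists are empty, every image has a label file with the same file ID and vice versa.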

Label Files

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

Num elements	Parameter name	Description	Type	Range	Example
1	Class names	The class to which the object belongs.	String	N/A	Person, car, Road_Sign
1	Truncation	Fraction of the object that lies outside the image boundaries (0.0 = not truncated, 1.0 = fully truncated).	Float	[0.0, 1.0]	0.0
1	Occlusion	Occlusion state [ 0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown].	Integer	[0,3]	2
1	Alpha	Observation Angle of object	Float	[-pi, pi]	0.146
4	Bounding box coordinates: [xmin, ymin, xmax, ymax]	Location of the object in the image	Float (0-based index)	xmin: [0, image_width], ymin: [0, image_height], xmax: [xmin, image_width], ymax: [ymin, image_height]	100 120 180 160
3	3-D dimension	Height, width, length of the object (in meters)	Float	N/A	1.65, 1.67, 3.64
3	Location	3-D object location x, y, z in camera coordinates (in meters)	Float	N/A	-0.65, 1.71, 46.7
1	Rotation_y	Rotation ry around the Y-axis in camera coordinates	Float	[-pi, pi]	-1.59

Each object line therefore has 15 elements in total. Here is a sample label file:

car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This file indicates that the image contains three objects with the parameters described above. For detection, the toolkit currently requires only the class name and bounding box coordinate fields to be populated, because the TLT training pipeline supports training only on class and bounding box coordinates. The remaining fields may be set to 0. Here is a sample label file for a custom annotated dataset:

car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00
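
The 15-element layout from the table above can be read and written with a few lines of Python. The following is a minimal sketch rather than TLT's own parser: the names KittiObject, parse_kitti_line and format_detection_line are hypothetical, and the code assumes that fields are separated by whitespace and that class names contain no spaces.

```python
from dataclasses import dataclass

NUM_FIELDS = 15  # total number of elements per object line


@dataclass
class KittiObject:
    class_name: str
    truncation: float
    occlusion: int
    alpha: float
    bbox: tuple        # (xmin, ymin, xmax, ymax)
    dimensions: tuple  # (height, width, length) in meters
    location: tuple    # (x, y, z) in camera coordinates, in meters
    rotation_y: float


def parse_kitti_line(line):
    """Parse one label line into a KittiObject (assumes whitespace-separated fields)."""
    fields = line.split()
    if len(fields) != NUM_FIELDS:
        raise ValueError(f"expected {NUM_FIELDS} fields, got {len(fields)}")
    values = [float(v) for v in fields[1:]]  # everything after the class name
    return KittiObject(
        class_name=fields[0],
        truncation=values[0],
        occlusion=int(values[1]),
        alpha=values[2],
        bbox=tuple(values[3:7]),
        dimensions=tuple(values[7:10]),
        location=tuple(values[10:13]),
        rotation_y=values[13],
    )


def format_detection_line(class_name, xmin, ymin, xmax, ymax):
    """Build a label line with only class and bbox populated; other fields are 0."""
    bbox = " ".join(f"{v:.2f}" for v in (xmin, ymin, xmax, ymax))
    zeros = " ".join(["0.00"] * 7)  # 3 dimensions + 3 location + rotation_y
    return f"{class_name} 0.00 0 0.00 {bbox} {zeros}"
```

For example, format_detection_line("car", 587.01, 173.33, 614.12, 200.12) reproduces the first line of the custom annotated sample above.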

Sequence Mapping File

This is an optional JSON file that maps the frames in the images directory to the names of the video sequences from which they were extracted. This information is needed for an N-fold split of the dataset: it ensures that frames from one sequence do not appear in more than one fold, so that one of the folds can be used for validation. The JSON dictionary has the following general structure:

{
  "video_sequence_name": [list of strings(frame idx)]
}

Here’s an example of a kitti_seq_to_map.json file for a sample dataset with six sequences:

{
  "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838",
  "007320", "003476", "007308", "000337", "004165", "006573"],
  "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
  "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
  "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
  "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
  "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
}
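
A sequence-wise N-fold split based on this mapping can be sketched as follows. This is only an illustration of the idea, not TLT's implementation: the function names are hypothetical, and the greedy balancing strategy (largest sequence first, always into the currently smallest fold) is an assumption chosen for simplicity.

```python
import json


def load_sequence_map(path):
    """Load the sequence-to-frame-ID mapping from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def split_sequences_into_folds(seq_to_frames, num_folds):
    """Assign whole sequences to folds so that no sequence spans two folds.

    Greedy heuristic (an assumption, not TLT's method): sequences are taken
    largest first and each is added to the fold with the fewest frames so far.
    """
    if num_folds < 1:
        raise ValueError("num_folds must be at least 1")
    folds = [[] for _ in range(num_folds)]
    # Sort by descending frame count; break ties by name for determinism.
    ordered = sorted(seq_to_frames.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, frames in ordered:
        smallest = min(range(num_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(frames)
    return folds
```

Each returned fold is a list of frame IDs. One fold can then be held out for validation while the remaining folds are used for training.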