Data Input for Object Detection

The object detection apps in TLT expect data in KITTI file format for training and evaluation.

KITTI Format

Using the KITTI format requires data to be organized in this structure:

.
|--dataset root
  |-- images
      |-- 000000.jpg
      |-- 000001.jpg
            .
            .
      |-- xxxxxx.jpg
  |-- labels
      |-- 000000.txt
      |-- 000001.txt
            .
            .
      |-- xxxxxx.txt
  |-- kitti_seq_to_map.json

Here’s a description of the structure:

  • The images directory contains the images to train on.

  • The labels directory contains the labels to the corresponding images. Details of this file are included in the Label Files section.

    Note

    The images and labels have the same file IDs before the extension. The image to label correspondence is maintained using this file name.

  • The kitti_seq_to_map.json file contains a sequence to frame ID mapping for the frames in the images directory. This is an optional file and is useful if the data needs to be split into N folds sequence wise. In case the data is to be split into a random 80:20 train:val split, then this file may be ignored.

Label Files

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields. Here is a description of these fields:

Num elements

Parameter name

Description

Type

Range

Example

1

Class names

The class to which the object belongs.

String

N/A

Person, car, Road_Sign

1

Truncation

How much of the object has left image boundaries.

Float

0.0, 0.1

0.0

1

Occlusion

Occlusion state [ 0 = fully visible, 1 = partly visible, 2 = largely occluded, 3 = unknown].

Integer

[0,3]

2

1

Alpha

Observation Angle of object

Float

[-pi, pi]

0.146

4

Bounding box coordinates: [xmin, ymin, xmax, ymax]

Location of the object in the image

Float(0 based index)

[0 to image width],[0 to image_height], [top_left, image_width], [bottom_right, image_height]

100 120 180 160

3

3-D dimension

Height, width, length of the object (in meters)

Float

N/A

1.65, 1.67, 3.64

3

Location

3-D object location x, y, z in camera coordinates (in meters)

Float

N/A

-0.65,1.71, 46.7

1

Rotation_y

Rotation ry around the Y-axis in camera coordinates

Float

[-pi, pi]

-1.59

The sum of the total number of elements per object is 15. Here is a sample text file:

car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This indicates that in the image there are 3 objects with parameters as mentioned above. Currently, for detection the toolkit only requires the class name and bbox coordinates fields to be populated. This is because the TLT training pipe supports training only for class and bbox coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:

car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Sequence Mapping File

This is an optional JSON file that captures the mapping between the frames in the images directory and the names of video sequences from which these frames were extracted. This information is needed while doing an N-fold split of the dataset. This way frames from one sequence don’t repeat in other folds and one of the folds could be used for validation. Here’s an example of the json dictionary file.

{
  "video_sequence_name": [list of strings(frame idx)]
}

Here’s an example of a kitti_seq_to_frames.json file with a sample dataset with six sequences:

{
  "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838",
  "007320", "003476", "007308", "000337", "004165", "006573"],
  "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
  "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
  "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
  "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
  "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
}