.. _data_input_for_object_detection:

Data Input for Object Detection
===============================

The object detection apps in TLT expect data in the KITTI file format for training and evaluation.

KITTI Format
------------

Using the KITTI format requires the data to be organized in this structure:

.. code::

    .
    |--dataset root
        |-- images
            |-- 000000.jpg
            |-- 000001.jpg
                  .
                  .
            |-- xxxxxx.jpg
        |-- labels
            |-- 000000.txt
            |-- 000001.txt
                  .
                  .
            |-- xxxxxx.txt
        |-- kitti_seq_to_map.json

Here is a description of the structure:

* The :code:`images` directory contains the images to train on.
* The :code:`labels` directory contains the labels corresponding to those images. Details of the label file format are included in the :ref:`label_files` section.

  .. Note:: The images and labels share the same file ID before the extension; this file name maintains the image-to-label correspondence.

* The :code:`kitti_seq_to_map.json` file contains a sequence-to-frame-ID mapping for the frames in the :code:`images` directory. This file is optional, and is useful when the data needs to be split into N folds sequence-wise. If the data is to be split into a random 80:20 train/val split, this file may be ignored.

.. Note:: For DetectNet_v2 and FasterRCNN, the :code:`train` tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size, and the corresponding bounding boxes must be scaled accordingly. Online resizing is supported for the other detection model architectures.

.. _label_files:

Label Files
^^^^^^^^^^^

A KITTI format label file is a simple text file containing one line per object. Each line has multiple fields.
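Because the image-to-label correspondence relies purely on matching file IDs, a quick consistency check over the dataset root can catch missing or orphaned files before training. Below is a minimal sketch of such a check (the function name :code:`check_kitti_pairing` is mine, not part of TLT):

```python
import os

def check_kitti_pairing(dataset_root):
    """Report image files that lack a matching label file, and vice versa.

    `dataset_root` is assumed to contain `images/` and `labels/`
    subdirectories laid out as described above.
    """
    images = {os.path.splitext(f)[0]
              for f in os.listdir(os.path.join(dataset_root, "images"))
              if f.lower().endswith((".jpg", ".jpeg", ".png"))}
    labels = {os.path.splitext(f)[0]
              for f in os.listdir(os.path.join(dataset_root, "labels"))
              if f.endswith(".txt")}
    # IDs present on one side only indicate a broken correspondence.
    return sorted(images - labels), sorted(labels - images)
```

Running this before :code:`train` is purely a sanity check; an empty result on both sides means every image has a label file and vice versa.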
Here is a description of these fields:

+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| **Num elements** | **Parameter name**           | **Description**                        | **Type**               | **Range**                  | **Example**              |
+==================+==============================+========================================+========================+============================+==========================+
| 1                | Class names                  | The class to which the object belongs. | String                 | N/A                        | Person, car, Road_Sign   |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 1                | Truncation                   | How much of the object has left the    | Float                  | [0.0, 1.0]                 | 0.0                      |
|                  |                              | image boundaries.                      |                        |                            |                          |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 1                | Occlusion                    | Occlusion state: 0 = fully visible,    | Integer                | [0, 3]                     | 2                        |
|                  |                              | 1 = partly visible, 2 = largely        |                        |                            |                          |
|                  |                              | occluded, 3 = unknown.                 |                        |                            |                          |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 1                | Alpha                        | Observation angle of the object.       | Float                  | [-pi, pi]                  | 0.146                    |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 4                | Bounding box coordinates:    | Location of the object in the image.   | Float (0-based index)  | [0, image_width] for x;    | 100 120 180 160          |
|                  | [xmin, ymin, xmax, ymax]     |                                        |                        | [0, image_height] for y    |                          |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 3                | 3-D dimension                | Height, width, length of the object    | Float                  | N/A                        | 1.65, 1.67, 3.64         |
|                  |                              | (in meters).                           |                        |                            |                          |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 3                | Location                     | 3-D object location x, y, z in camera  | Float                  | N/A                        | -0.65, 1.71, 46.7        |
|                  |                              | coordinates (in meters).               |                        |                            |                          |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+
| 1                | Rotation_y                   | Rotation ry around the Y-axis in       | Float                  | [-pi, pi]                  | -1.59                    |
|                  |                              | camera coordinates.                    |                        |                            |                          |
+------------------+------------------------------+----------------------------------------+------------------------+----------------------------+--------------------------+

In total, each object is described by 15 elements per line. Here is a sample label file:

.. code::

    car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59
    cyclist 0.00 0 -2.46 665.45 160.00 717.93 217.99 1.72 0.47 1.65 2.45 1.35 22.10 -2.35
    pedestrian 0.00 2 0.21 423.17 173.67 433.17 224.03 1.60 0.38 0.30 -5.87 1.63 23.11 -0.03

This file describes an image containing three objects with the parameters listed above. Currently, for detection, the toolkit only requires the class name and bounding box coordinate fields to be populated, because the TLT training pipeline supports training only on the class and the bounding box coordinates. The remaining fields may be set to 0. Here is a sample file for a custom annotated dataset:

.. code::

    car 0.00 0 0.00 587.01 173.33 614.12 200.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    cyclist 0.00 0 0.00 665.45 160.00 717.93 217.99 0.00 0.00 0.00 0.00 0.00 0.00 0.00
    pedestrian 0.00 0 0.00 423.17 173.67 433.17 224.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Sequence Mapping File
^^^^^^^^^^^^^^^^^^^^^

This is an optional JSON file that captures the mapping between the frames in the :code:`images` directory and the names of the video sequences from which those frames were extracted.
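To make the field layout above concrete, here is a minimal sketch of a parser for one 15-element label line (the helper name :code:`parse_kitti_label_line` is mine, not part of the TLT tooling):

```python
def parse_kitti_label_line(line):
    """Parse one 15-element KITTI label line into a dictionary.

    Only the class name and bounding box are consumed by the TLT detection
    pipeline; the remaining fields are parsed here for completeness.
    """
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}: {line!r}")
    return {
        "class_name": fields[0],
        "truncation": float(fields[1]),
        "occlusion": int(fields[2]),
        "alpha": float(fields[3]),
        "bbox": [float(v) for v in fields[4:8]],         # xmin, ymin, xmax, ymax
        "dimensions": [float(v) for v in fields[8:11]],  # height, width, length (m)
        "location": [float(v) for v in fields[11:14]],   # x, y, z in camera coords (m)
        "rotation_y": float(fields[14]),
    }
```

Applied to the first line of the sample file above, this returns the class :code:`car` with bounding box :code:`[587.01, 173.33, 614.12, 200.12]`.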
This information is needed when doing an N-fold split of the dataset: frames from one sequence then never repeat across folds, and one of the folds can be used for validation. Here is the general structure of the JSON dictionary:

.. code::

    {
        "video_sequence_name": [list of strings (frame IDs)]
    }

Here is an example :code:`kitti_seq_to_map.json` file for a sample dataset with six sequences:

.. code::

    {
        "2011_09_28_drive_0165_sync": ["003193", "003185", "002857", "001864", "003838",
                                       "007320", "003476", "007308", "000337", "004165",
                                       "006573"],
        "2011_09_28_drive_0191_sync": ["005724", "002529", "004136", "005746"],
        "2011_09_28_drive_0179_sync": ["005107", "002485", "006089", "000695"],
        "2011_09_26_drive_0079_sync": ["005421", "000673", "002064", "000783", "003068"],
        "2011_09_28_drive_0035_sync": ["005540", "002424", "004949", "004996", "003969"],
        "2011_09_28_drive_0117_sync": ["007150", "003797", "002554", "001509"]
    }
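As an illustration of how this mapping can drive a sequence-wise N-fold split, here is a minimal sketch (the helper name :code:`sequence_folds` is mine and not part of TLT; it only demonstrates the idea that whole sequences, never individual frames, are assigned to folds):

```python
import json

def sequence_folds(seq_to_frames_path, num_folds):
    """Split frames into folds sequence-wise.

    All frames of a given video sequence land in the same fold, so frames
    from one sequence never repeat across train and validation folds.
    """
    with open(seq_to_frames_path) as f:
        seq_to_frames = json.load(f)
    folds = [[] for _ in range(num_folds)]
    # Assign whole sequences round-robin, largest first, to balance fold sizes.
    ordered = sorted(seq_to_frames, key=lambda s: -len(seq_to_frames[s]))
    for i, seq in enumerate(ordered):
        folds[i % num_folds].extend(seq_to_frames[seq])
    return folds
```

One fold can then be held out for validation while the rest are used for training.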