Free Space Segmentation

The goal of the free space Deep Neural Network (DNN) is to segment images into classes of interest like drivable space and obstacles. The input of the DNN is a monocular image, and the output is pixel-wise segmentation. This package makes it easy to train a free space DNN in simulation and use it to perform real-world inference. While this modular package can power various applications, this document illustrates the workflow with free space segmentation for indoors and sidewalk segmentation for outdoors.

This documentation first describes how to quickly start with inference and training. Details regarding data sources, network architecture, multi-GPU training, and application layouts are presented afterward.

One potential use case of free space segmentation is obstacle avoidance using a monocular camera. The costmap, or obstacle map for the robot’s environment, can be created from various sources such as Lidar and depth information from the camera. Fusing information from different sensors can refine the costmap and make the robot’s obstacle avoidance more robust. The free space determined by the path segmentation model can be projected onto the real-world coordinate system and used as input information for obstacle avoidance.
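
The projection step can be sketched in a few lines of Python. This is a minimal illustration and not the Isaac SDK implementation: it assumes a pinhole camera whose optical axis is parallel to a flat floor, and the function name and all parameters (focal_px, cx, cy, camera_height_m) are hypothetical.

```python
import numpy as np

def project_free_space_to_ground(mask, focal_px, cx, cy, camera_height_m):
    """Back-project free-space pixels onto a flat ground plane.

    Assumes a pinhole camera whose optical axis is parallel to the ground,
    with x to the right, y down, and z forward in the camera frame.
    Returns an (N, 2) array of (forward, lateral) positions in meters.
    """
    rows, cols = np.nonzero(mask)
    # Only pixels below the horizon row can intersect the ground.
    below = rows > cy
    rows, cols = rows[below], cols[below]
    # Ray direction for each pixel, scaled so that the forward component is 1.
    dir_x = (cols - cx) / focal_px
    dir_y = (rows - cy) / focal_px
    # Stretch each ray until it has dropped by the camera height.
    forward = camera_height_m / dir_y
    lateral = forward * dir_x
    return np.stack([forward, lateral], axis=1)
```

The resulting ground points could then be rasterized into a costmap layer. A tilted camera would need an additional rotation before the plane intersection.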

Quick Start

Inference

To begin, enter the command below:

bob@desktop:~/isaac$ bazel run packages/freespace_dnn/apps:freespace_dnn_inference_image -- --config inference:packages/freespace_dnn/apps/freespace_dnn_inference_medium_warehouse_tensorrt.config.json

Then open the Sight web interface at http://localhost:3000/. You should see inference being performed on a sample image with a pre-trained model for multi-class segmentation in the warehouse. The green, yellow, blue, and red colors on the left represent floor (freespace), obstacles, wall, and lane lines respectively.

Tip

To perform inference using TensorFlow instead of TensorRT, use the freespace_dnn_inference_medium_warehouse_tensorflow.config.json file instead of freespace_dnn_inference_medium_warehouse_tensorrt.config.json in the command above. The result should look the same, but TensorRT provides better performance.

../../../_images/inference_warehouse.jpg

To perform inference with a model trained for sidewalk segmentation, change the color_filename parameter in the freespace_dnn_inference_image.app.json file to ./external/path_segmentation_images/sidewalk1.png and run the command below:

bob@desktop:~/isaac$ bazel run packages/freespace_dnn/apps:freespace_dnn_inference_image -- --config inference:packages/freespace_dnn/apps/freespace_dnn_inference_sidewalk_tensorrt.config.json

In Sight, you should see inference being performed on a sample image with a pre-trained model for sidewalk segmentation. The green, yellow, blue, and red colors on the left represent road, curb, sidewalk, and grass respectively.

../../../_images/inference_sidewalk.jpg

To perform inference with the model trained for indoors, change the color_filename parameter in the freespace_dnn_inference_image.app.json file to ./external/path_segmentation_images/indoor1.png and run the command below:

bob@desktop:~/isaac$ bazel run packages/freespace_dnn/apps:freespace_dnn_inference_image -- --config inference:packages/freespace_dnn/apps/freespace_dnn_inference_indoor_tensorflow.config.json

This time, free space and other space are represented with black and red colors respectively.

../../../_images/inference_indoor.jpg

If we look at the application file used above, located at packages/freespace_dnn/apps/freespace_dnn_inference_image.app.json, it has a very simple structure. The inference subgraph is fed an image that is read from the path specified with the color_filename parameter. There are four applications located at packages/freespace_dnn/apps/. You can use them by simply changing the image source as listed in the table below:

Application Name                    Image source
freespace_dnn_inference_image       Image on disk
freespace_dnn_inference_replay      Video on disk
freespace_dnn_inference_v4l2        Camera
freespace_dnn_inference_unity3d     Simulation with Unity3D

Various other applications can be created by employing the inference subgraph and adding desired components like robot wheel drivers and obstacle avoidance.

To use the inference apps with a Unity scene, you also have to provide the config file that defines the camera teleportation parameters. You can do this by passing the path to the config file when you run the application. For example, the command to run the inference app with the warehouse scene would be:

bob@desktop:~/isaac$ bazel run packages/freespace_dnn/apps:freespace_dnn_inference_unity3d -- --config inference:packages/freespace_dnn/apps/freespace_dnn_inference_medium_warehouse_tensorrt.config.json,packages/freespace_dnn/apps/freespace_dnn_inference_unity3d_medium_warehouse.config.json

Training

To get started with training, type the command below:

bob@desktop:~/isaac$ bazel run packages/freespace_dnn/apps:freespace_dnn_training

This will launch a TensorFlow instance and train it with the labeled images received over TCP.

As explained later, labeled images can be sourced from real data captured and labeled using Isaac SDK or public data sets, but we encourage training in simulation for “unlimited” labeled data.

Isaac currently supports the Unity 3D engine for training the freespace DNN. To train with simulation in Unity 3D, launch scenario 3 of the medium warehouse scene by running the following command:

bob@desktop:/<isaac_sim_unity3d_binary_dir>$ ./sample.x86_64 --scene medium_warehouse --scenario 3

After the scene has loaded, you can press the “C” key to disable the main camera. This increases the simulation framerate, allowing for faster data generation.

Once training starts, TensorFlow periodically writes logs and checkpoints to /tmp/path_segmentation by default. Each checkpoint consists of the following files:

  • .meta file: Denotes the graph structure of the model.

  • .data file: Stores the values of all saved variables.

  • .index file: Stores the list of variable names and shapes.

To view the training progress on TensorBoard, run the following command and open http://localhost:6006 in a browser.

tensorboard --logdir=/tmp/path_segmentation

Once the training is complete, serialize the most recent checkpoint as a protobuf file with the following command:

bob@desktop:~/isaac$ python3 packages/freespace_dnn/apps/freespace_dnn_training_freeze_model.py --checkpoint_dir /tmp/path_segmentation --output_nodename prediction/truediv --output_filename model.pb --output_onnx_filename model.onnx

Using the model.onnx file that is generated, create a config file that looks like the packages/freespace_dnn/apps/freespace_dnn_inference_medium_warehouse_tensorrt.config.json file. You are now ready to perform inference.

Data

Simulated Data

Being able to generate unlimited data points through simulation is a powerful asset, bridging the “reality gap” that separates simulated robotics from real experiments. Simulators offer a variety of features that make this possible, namely domain randomization and teleportation.

Domain randomization attempts to bridge the reality gap through improved availability of unbiased data. Domain-randomized training data makes the model more robust in responding to different lighting conditions, floor textures, and random objects in the field of view during inference.

Domain randomization can be achieved in several ways:

  • Light randomization: Change the color and intensity of lights

  • Material randomization: Apply different substance materials over desired surfaces

  • Texture randomization: Apply different textures to the materials

  • Color randomization: Apply different colors to the materials

  • Material properties: Vary material properties such as roughness, metallicity, and specularity. This can change the friction, reflective and refractive properties, and other characteristics of the surface.

Teleportation allows you to randomly sample camera poses within a certain range (translation and rotation) to capture data from different heights and angles.
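
The idea behind teleportation can be illustrated with a short Python sketch that draws a camera pose uniformly within user-defined ranges. The function name, the dictionary layout, and the example ranges are assumptions made for illustration. They are not the parameters of the Teleportation codelet.

```python
import math
import random

def sample_camera_pose(translation_range, rotation_range_deg, rng=random):
    """Draw one random camera pose uniformly within the given ranges.

    translation_range: dict mapping an axis name to (min, max) in meters.
    rotation_range_deg: dict mapping an angle name to (min, max) in degrees.
    Returns translations in meters and rotations in radians.
    """
    translation = {
        axis: rng.uniform(low, high)
        for axis, (low, high) in translation_range.items()
    }
    rotation = {
        name: math.radians(rng.uniform(low, high))
        for name, (low, high) in rotation_range_deg.items()
    }
    return {"translation": translation, "rotation": rotation}

# Example: vary the camera height and yaw between captured frames.
pose = sample_camera_pose(
    {"x": (-1.0, 1.0), "z": (0.5, 1.5)},
    {"yaw": (-180.0, 180.0)},
)
```

Sampling a new pose for every frame exposes the network to many heights and viewing angles without moving a physical camera.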

Setting Up Communication With the Simulator

The Isaac SDK and simulator communicate using a pub/sub architecture: Data is passed back and forth between the two processes by setting up TCP publishers on the side where the data is created and TCP subscribers on the side where the data is ingested.

For Unity 3D simulation, the application that publishes the ground truth data is packages/navsim/apps/navsim.app.json. This is directly loaded by the medium_warehouse scene in NavSim.

The application publishes the sensor data to a user-defined port using a TcpPublisher. This data is used by the training application. The training application in turn sends teleportation commands to the NavSim application, which are received through a TcpSubscriber node.

Domain randomization using Substance is available by default in the Unity scene. This allows you to apply different materials to the meshes in the scene, apply random poses to the actors, etc. The ground is labeled as “floor” with a per-pixel index of 1.

Real data from public datasets

Datasets like MSCoco and ADE20K provide per-pixel segmentation data for multiple classes. Relevant classes for path segmentation include pavement, road, carpet, earth, ground, and multiple types of floors. The images are captured in various environments with multiple camera angles and degrees of occlusions, making them good candidates for training a model with better generalization. Both MSCoco and ADE20K were used in the training of the binary segmentation model for indoor freespace.

Real data for freespace segmentation with autonomous data collection

Isaac provides the means to autonomously create data from reality to train models for detecting traversable ground spaces.

Autonomous Data Collection

Autonomous data collection using a robot is a great way to introduce training data under real conditions. This can be broadly split into 2 workflows:

  • Path planning through the map

  • Monitoring robot displacement

Path Planning Through the Map

  • TravellingSalesman: This codelet plots waypoints over the freely traversable space in the map and calculates the shortest path. Each waypoint denotes a 2D point on the map.

Note

The travelling salesman path reflects only the graphed path through the waypoints. It does not take into account the reachability of space and may draw paths over unreachable areas when visualized.

  • MoveAndScan: This takes a list of 2D waypoints as input and expands them to include multiple orientations. The number of orientations included for each 2D location is user-defined. Hence, if there are N waypoints in the map and M orientations, the output is a list of NxM poses.

  • FollowPath: This takes a list of poses as input and publishes each pose (or waypoint) to the GoTo codelet as a goal. This enables the robot to move to each of the waypoints in order.
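
The expansion performed by MoveAndScan, from N waypoints to NxM poses, can be sketched as follows. This is an illustration only: the function name is hypothetical, and spacing the M orientations evenly over a full turn is an assumption, not a documented property of the codelet.

```python
import math

def expand_waypoints(waypoints, num_orientations):
    """Expand N 2D waypoints into N x M poses (x, y, heading in radians)."""
    if num_orientations < 1:
        raise ValueError("num_orientations must be at least 1")
    # Assumption: headings are spread evenly over one full rotation.
    step = 2.0 * math.pi / num_orientations
    return [
        (x, y, k * step)
        for (x, y) in waypoints
        for k in range(num_orientations)
    ]

# Two waypoints with three orientations each give six poses.
poses = expand_waypoints([(0.0, 0.0), (2.0, 1.0)], 3)
```

Each waypoint keeps its position while the heading steps through the orientations, so a path follower visits all headings at one location before moving on.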

../../../_images/autonomous_navigation.jpg

Monitoring Robot Displacement

  • NavigationMonitor: Continuously monitors the linear and angular displacement of the robot. If the displacement is greater than a user-defined threshold, it publishes a RobotStateProto message, which contains the current pose, current speed, and displacement since last update. In this context, the NavigationMonitor codelet mainly acts as a signal to regulate when a pair of proto messages can be logged by the Recorder functionality.

  • Throttle: Regulates one signal with respect to another. In this case, it regulates the camera input with respect to the RobotStateProto output from NavigationMonitor. The main purpose of the Throttle component is to make sure that data is collected at intervals to prevent the log size from getting inflated too quickly.
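
The displacement check described above can be sketched as a small Python class. It is a simplified stand-in and not the NavigationMonitor codelet: the class name, the thresholds, and the choice to trigger on the very first pose are assumptions.

```python
import math

class DisplacementGate:
    """Signals when the robot has moved or turned more than a threshold."""

    def __init__(self, linear_threshold, angular_threshold):
        self.linear_threshold = linear_threshold    # meters
        self.angular_threshold = angular_threshold  # radians
        self._last_pose = None

    def update(self, x, y, heading):
        """Return True when the pose has changed enough since the last trigger."""
        if self._last_pose is None:
            # Assumption: the first pose always triggers.
            self._last_pose = (x, y, heading)
            return True
        last_x, last_y, last_heading = self._last_pose
        linear = math.hypot(x - last_x, y - last_y)
        # Wrap the heading difference into [-pi, pi] before comparing.
        delta = heading - last_heading
        angular = abs(math.atan2(math.sin(delta), math.cos(delta)))
        if linear > self.linear_threshold or angular > self.angular_threshold:
            self._last_pose = (x, y, heading)
            return True
        return False
```

A recorder that stores a camera frame only when update() returns True collects images at spatial intervals instead of at the camera frame rate, which is the role the Throttle component plays here.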

../../../_images/navigation_monitor.jpg

Data Annotation

Data annotation is a time-consuming task when performed manually, especially in the case of semantic segmentation, where per-pixel labels are required. Automating this process for data collected in reality can save a significant amount of time.

Estimation

  • RgbdSuperpixels: Computes superpixel clustering for an RGB-D image using a single-pass clustering algorithm that assigns every pixel to a local cluster based on similarity in color and depth.

  • RgbdSuperpixelFreespace: Labels every superpixel as either free space or an obstacle. The superpixels are transformed into the ground-coordinate frame assuming that the ground plane conforms to the equation Z = 0.

  • SuperpixelImageLabelling: Creates a pixel-wise segmentation of the original camera image based on the superpixel labeling.
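
The last two steps can be illustrated with NumPy. The sketch assumes that each superpixel already has a height above the ground plane Z = 0 and that a hypothetical height tolerance decides whether it counts as free space. The actual criteria used by the codelets may differ.

```python
import numpy as np

def label_superpixels(superpixel_heights, height_tolerance):
    """Label a superpixel as free space (1) when it lies close to the plane Z = 0."""
    return (np.abs(superpixel_heights) <= height_tolerance).astype(np.uint8)

def labels_to_image(superpixel_index_image, superpixel_labels):
    """Give every pixel the label of the superpixel it belongs to."""
    return superpixel_labels[superpixel_index_image]

# Three superpixels: two near the ground, one 0.4 m above it.
heights = np.array([0.01, 0.4, -0.02])
labels = label_superpixels(heights, height_tolerance=0.05)

# Each entry holds the index of the superpixel that owns the pixel.
index_image = np.array([[0, 1], [2, 1]])
segmentation = labels_to_image(index_image, labels)
```

Indexing the label array with the index image assigns all pixel labels in one vectorized step.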

../../../_images/ground_truth_generation.jpg

Running the application

The application located at apps/carter/autonomous_data_collection/carter_data_collection.app.json can be used to collect color and depth images from a mapped environment using a Carter robot equipped with an Intel RealSense camera. This reference application can be used as the basis for other applications. To collect the data, deploy the application to the Carter robot by running the following command:

bob@desktop:~/isaac$ ./engine/build/deploy.sh -h <robot_ip> -p //apps/carter/autonomous_data_collection:carter_data_collection-pkg -d jetpack44 --remote_user <username_on_robot>

where <robot_ip> is the IP address of the robot and <username_on_robot> is your username on the robot. If a username is not specified with the --remote_user option, the default username, “nvidia”, is used.

After deployment, run the application with the following steps:

  1. Log in to the robot (via SSH).

  2. Navigate to the directory where the app is deployed, which is ~/deploy/<user> by default.

  3. Run the application with the following command, adjusting the robot and map configuration files to match your setup:

    ./apps/carter/autonomous_data_collection --config "apps/carter/robots/carter_1.config.json,apps/assets/maps/nvidia_R_180306.config.json" --graph="apps/assets/maps/nvidia_R_180306.graph.json"
    

The robot should plot a waypoint graph over the map, navigate to each point, and turn a complete circle. The NavigationMonitor codelet monitors the displacement and enables logging only at certain intervals.

Open the Sight web interface at http://<robot_ip>:3000/ and click Record on the Record Control Panel. This will save the color and depth images to file.

You can then replay this log using the packages/freespace_dnn/apps/freespace_dnn_data_annotation.subgraph.json file to label the traversable space. This subgraph produces the ground truth data, and can be connected to the training subgraph directly.

../../../_images/annotated_data_in_training_subgraph.jpg

Network Architecture

For binary segmentation, Isaac SDK uses U-Net because it satisfies the following criteria:

  • It is easily trainable on a small dataset.

  • It trains quickly and has a short inference time.

  • It supports TensorRT inference.

  • It has a compatible license, so it can be fully integrated into Isaac SDK.

U-Net is an end-to-end fully convolutional network (FCN) (i.e. it only contains convolutional layers and no dense layers).

U-Net can support both binary and multiclass segmentation. The only difference is in the activation of the last layer, which is Sigmoid for binary segmentation and Softmax for multiclass segmentation.
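
The difference between the two activations can be shown with NumPy. This is a generic illustration of Sigmoid and Softmax applied to per-pixel logits, not code taken from the training script. The tensor shapes are example values.

```python
import numpy as np

def sigmoid(logits):
    """Map each logit independently to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

def softmax(logits, axis=-1):
    """Map logits to probabilities that sum to 1 along the class axis."""
    # Subtracting the maximum keeps the exponentials numerically stable.
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exponentials = np.exp(shifted)
    return exponentials / np.sum(exponentials, axis=axis, keepdims=True)

# Binary segmentation: one output channel per pixel.
binary_probabilities = sigmoid(np.zeros((4, 4, 1)))

# Multiclass segmentation: one output channel per class (here 4 classes).
class_probabilities = softmax(np.random.default_rng(0).normal(size=(4, 4, 4)))
```

With Sigmoid, each pixel gets a single free-space probability. With Softmax, each pixel gets a distribution over all classes.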

Training the Network

../../../_images/training.jpg

Multi-GPU training

On multi-GPU host systems, parallelizing the workload on all GPUs can be a powerful asset. Parallelism in TensorFlow can be divided into two types:

  • Data parallelism: Data is distributed across multiple GPUs or host machines.

  • Model parallelism: The model itself is split across multiple machines or GPUs. For example, a single layer can be fit into the memory of a single machine (or GPU), and forward and back propagation involves communication of output from one host (or GPU) to another.

TensorFlow supports data parallelism through the MirroredStrategy class (tf.distribute.MirroredStrategy), which mirrors the model graph on each of the GPUs and hence can accept independent sets of data on each GPU for training.
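
The principle of data parallelism can be demonstrated without any GPU. The NumPy sketch below is a conceptual illustration and does not use the TensorFlow API: every "replica" holds the same weights, computes a gradient on its own shard of the batch, and the gradients are averaged. The linear model and all function names are assumptions chosen to keep the example small.

```python
import numpy as np

def mse_gradient(weights, inputs, targets):
    """Gradient of the mean squared error for a linear model inputs @ weights."""
    errors = inputs @ weights - targets
    return 2.0 * inputs.T @ errors / len(inputs)

def data_parallel_gradient(weights, inputs, targets, num_replicas):
    """Split a batch evenly across replicas and average the per-replica gradients."""
    # np.split raises ValueError if the batch does not divide evenly.
    input_shards = np.split(inputs, num_replicas)
    target_shards = np.split(targets, num_replicas)
    gradients = [
        mse_gradient(weights, shard_inputs, shard_targets)
        for shard_inputs, shard_targets in zip(input_shards, target_shards)
    ]
    return np.mean(gradients, axis=0)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(8, 3))
targets = rng.normal(size=8)
weights = rng.normal(size=3)

parallel = data_parallel_gradient(weights, inputs, targets, num_replicas=4)
single = mse_gradient(weights, inputs, targets)
```

With equally sized shards, the averaged gradient matches the gradient computed on the whole batch, so the mirrored replicas stay in step after every update.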

Message Types

Messages are of the following types:

  • TensorProto: Defines an n-dimensional tensor that forms the basis of images and tensor pairs for training.

  • TensorListProto: Defines a list of TensorProto messages, which are mainly used to pass around tensors.

  • ColorCameraProto: Holds a color image and camera-intrinsic information.

  • SegmentationCameraProto: Holds an image containing the class label for every pixel in the image. It also contains camera-intrinsic information, similar to ColorCameraProto.

Codelets

  • TcpSubscriber: Used by the training application to receive data from the simulator. Two TcpSubscribers are used in this example, receiving the color image and the segmentation label from the simulation, respectively.

  • ColorCameraEncoderCpu: Takes in a ColorCameraProto and outputs the image as a 3D tensor (WxHx3), published as a TensorListProto containing only one tensor. The codelet also supports downsampling, which reduces the image to a smaller, user-defined size.

  • SegmentationEncoder: Takes in a SegmentationCameraProto and outputs a 3D tensor (WxHx1). This codelet is responsible for encoding the labeled data for semantic segmentation by assigning the probability of 1.0 to the channel index of the class in consideration and 0.0 to all other channel indices. The tensor is published as a TensorListProto containing only one tensor.

  • TensorSynchronization: Takes in two TensorListProto inputs and synchronizes them according to their acquisition time. This codelet ensures that the training code gets synchronized color-image and segmentation-label data.

  • SampleAccumulator: Takes in the training pairs (image tensor and segmentation-label tensor) as a TensorListProto and stores them in a buffer. This codelet is bound to the Python script such that the training script can directly sample from this buffer using the acquire_samples() function. The acquire_samples() function converts the TensorListProto into a list of numpy arrays with corresponding dimensions and passes that to the Python script.

  • Teleportation: Publishes RigidBody3GroupProto in a pre-defined way to randomly change the spawn location. It includes an option for providing spline parameters to perform uniform random sampling along the tangent of the spline.
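
The encoding performed by SegmentationEncoder can be approximated in NumPy as shown below. This is a sketch of the idea for a single class of interest and not the codelet itself. The function name is hypothetical, and the class index 1 in the example matches the "floor" label of the Unity scene.

```python
import numpy as np

def encode_segmentation(label_image, class_index):
    """Encode per-pixel class labels as a one-channel probability tensor.

    Pixels that carry class_index get 1.0 and all other pixels get 0.0.
    The result has shape (rows, cols, 1) and dtype float32.
    """
    mask = (label_image == class_index).astype(np.float32)
    return mask[..., np.newaxis]

# Per-pixel class indices as received from the simulator (1 = floor).
label_image = np.array([[0, 1], [1, 2]])
encoded = encode_segmentation(label_image, class_index=1)
```

The trailing channel axis gives the label tensor the same rank as the encoded color image, so both can be passed to the training script as a pair.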

Inference

../../../_images/inference.jpg

Message Types

Messages are of the following types:

  • TensorProto: Defines an n-dimensional tensor that forms the basis of images and tensor pairs for training.

  • TensorListProto: Defines a list of TensorProto. TensorProto messages are mainly used to pass around tensors.

  • ColorCameraProto: Holds a color image, including the camera-intrinsic information.

Codelets

  • CropAndDownsample: Takes a ColorCameraProto as input, along with parameters for the pixel position to start cropping, size of the crop, and size to downsample the image to. This codelet helps downsample the input image to the size required by the network while still maintaining the aspect ratio.

  • ColorCameraEncoderCpu: Takes a ColorCameraProto as input and outputs the image as a 3D tensor (WxHx3), published as a TensorListProto containing only one tensor. The codelet also supports downsampling, which reduces the image to a smaller, user-defined size.

  • TensorReshape: Takes a TensorListProto as input and reshapes each TensorProto according to the user-defined size. In this context, this codelet is mainly used to add an extra dimension to the input tensor depicting the batch size since TensorFlow accepts input in the NHWC format. Consequently, the codelet is also used to remove the first dimension from the output of the neural network.

  • TensorflowInference: Loads the frozen neural network into memory and takes a TensorListProto as input to pass to the network. Uses TensorFlow to perform inference on the input image. Publishes the network output in the form of a TensorListProto.

  • TensorRTInference: Loads the frozen neural network in the form of a UFF or ONNX file into memory. Takes in a TensorListProto as input to pass to the network. Creates a TensorRT runtime engine based on the provided model and uses it to perform inference on the input image. Publishes the network output in the form of a TensorListProto.

  • TensorArgMax: Reduces the output of the neural network, which is a tensor of dimensions (W x H x C), to a tensor of dimensions (W x H). It discretizes the tensor along the channel dimension based on a user-defined threshold.
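
The tensor manipulations done by TensorReshape and TensorArgMax can be sketched with NumPy. The functions below are illustrations with hypothetical names. In particular, how TensorArgMax applies its threshold is not specified here, so assigning a fallback class to low-confidence pixels is an assumption.

```python
import numpy as np

def add_batch_dimension(image_tensor):
    """Turn an (H, W, C) tensor into (1, H, W, C), the NHWC layout."""
    return image_tensor[np.newaxis, ...]

def remove_batch_dimension(batched_tensor):
    """Drop the leading batch axis from a (1, H, W, C) network output."""
    return batched_tensor[0]

def arg_max_with_threshold(probabilities, threshold, fallback_class=0):
    """Reduce (H, W, C) class probabilities to an (H, W) class-index map.

    Assumption: pixels whose highest probability is below the threshold
    are assigned fallback_class.
    """
    best_class = np.argmax(probabilities, axis=-1)
    best_probability = np.max(probabilities, axis=-1)
    return np.where(best_probability >= threshold, best_class, fallback_class)

network_input = add_batch_dimension(np.zeros((4, 5, 3), dtype=np.float32))
probabilities = np.array([[[0.1, 0.9], [0.45, 0.55]]])
class_map = arg_max_with_threshold(probabilities, threshold=0.7)
```

The batch axis is only needed while the tensor passes through the network, which is why it is added before inference and removed again afterward.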