Stereo Depth DNN


NVIDIA recommends always running this application in Max Power mode.

Isaac provides StereoDNN, a depth estimation algorithm that uses a deep neural network (DNN). The algorithm is based on a deep learning model designed to calculate per-pixel depths from stereo camera footage. StereoDNN estimates disparities (depth maps) from pairs of left and right stereo images, end-to-end.
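The relationship between disparity and depth can be illustrated with the standard rectified-stereo pinhole relation, depth = focal length x baseline / disparity. The following is a minimal sketch, not code from the Isaac SDK or from StereoDNN: the function name, the focal length, and the baseline are hypothetical values chosen for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters) for a rectified stereo pair.

    Pixels with zero or negative disparity have no finite depth and are set to inf.
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    # Pinhole stereo relation: Z = f * B / d
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth


if __name__ == "__main__":
    # Hypothetical calibration: 700 px focal length, 0.12 m baseline.
    disparities = np.array([[8.4, 42.0, 0.0]])
    print(disparity_to_depth(disparities, focal_length_px=700.0, baseline_m=0.12))
```

With these assumed values, a disparity of 8.4 pixels corresponds to roughly 10 m and 42 pixels to roughly 2 m, which shows why small disparities (distant objects) are the hardest to estimate precisely.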

StereoDNN is trained with the TensorFlow framework. The network can be trained in supervised (Lidar ground truth), unsupervised (photometric loss), and semi-supervised (photometric loss combined with Lidar/depth ground truth) modes. The trained model is then converted to run in the TensorRT inference runtime, using custom plug-ins for improved efficiency.

The largest model, which is also the most accurate one trained in testing, scores 3.12% D1 error on the KITTI benchmark (#12 out of all DNNs in KITTI and #4 out of results published in December of 2017). The Isaac SDK also includes a custom inference runtime based on TensorRT and several lightweight models that run at 10-30 FPS on GPUs, at the cost of reduced accuracy. The fastest model runs at 10 FPS on Jetson TX2 and at more than 30 FPS on Titan X.
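For context on the D1 figure, the KITTI 2015 stereo benchmark counts a pixel as an outlier when its disparity error exceeds both 3 pixels and 5% of the ground-truth disparity, and D1 is the percentage of such outliers among pixels that have ground truth. The sketch below illustrates that definition. It is not the official KITTI evaluation code, the function name is hypothetical, and it assumes that a ground-truth value of 0 marks a pixel without a Lidar measurement.

```python
import numpy as np


def d1_error(pred_disp, gt_disp, abs_thresh_px=3.0, rel_thresh=0.05):
    """Return the percentage of valid pixels whose disparity error is an outlier.

    A pixel is an outlier when the absolute error is greater than abs_thresh_px
    AND greater than rel_thresh times the ground-truth disparity.
    """
    pred = np.asarray(pred_disp, dtype=np.float64)
    gt = np.asarray(gt_disp, dtype=np.float64)
    valid = gt > 0  # assumption: 0 means "no ground truth" (sparse Lidar)
    if not valid.any():
        raise ValueError("ground truth contains no valid pixels")
    err = np.abs(pred - gt)[valid]
    outliers = (err > abs_thresh_px) & (err > rel_thresh * gt[valid])
    return 100.0 * outliers.mean()


if __name__ == "__main__":
    gt = np.array([[10.0, 100.0, 0.0, 50.0]])
    pred = np.array([[14.0, 104.0, 7.0, 50.5]])
    print(f"D1 error: {d1_error(pred, gt):.2f}%")
```

In this toy input, only the first pixel is an outlier: the second has a 4-pixel error that is still within 5% of its ground-truth disparity, and the third has no ground truth and is ignored.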

The DNN is described in the following white paper:

The images below show StereoDNN depth/disparity estimates on the KITTI 2015 stereo benchmark. The DNN is trained in a semi-supervised way by combining Lidar ground truth with photometric loss.


The following images compare a monocular depth DNN approach (Godard et al.) with StereoDNN results on the same scene. Point clouds computed or estimated by the DNNs are in white, and Lidar ground truth is in green. The StereoDNN model shown, which is the best model trained on KITTI, has ~10 cm error at 10 m distance and ~30-40 cm error at 30-50 m. StereoDNN captures building and street geometry well, whereas the monocular DNN misses that information completely.

../../../_images/sddnn-2.jpg ../../../_images/sddnn-3.jpg

The following images are examples of StereoDNN disparity maps computed on stereo RGB frames from the KITTI dataset:


Source Code

The standalone code for the DNN is available on GitHub at the following link:

The goal of this project is to enable inference for NVIDIA Stereo DNN TensorFlow models on Jetson, as well as on other platforms supported by the NVIDIA TensorRT library.

A demo of inference on KITTI dataset can be viewed on YouTube at the following link:

The stereo DNN is wrapped as an Isaac codelet, and is available in the Isaac repository.

Isaac Codelet

The Isaac codelet wrapping StereoDNN takes a left rectified image, a right rectified image, and both the intrinsic and extrinsic calibration of the stereo camera, and generates a depth frame of size 513x257, using the nvstereonet library from GitHub.

Running the Sample Application

The stereo_depth_dnn sample application uses a ZED stereo camera. First, connect the ZED camera to the host system or the Jetson Xavier platform you are using, and then use one of the following procedures to run the application.

Note that this application only runs on the host system or the Jetson Xavier platform.

To Run the Sample Application on the Host System

Run the sample application with the following command:

bazel run //apps/samples/stereo_depth_dnn -- -d x64

To Run the Application on Jetson

To run the sample application on Jetson, first build the package on the host and then deploy it to the Jetson system. Deploy //apps/samples/stereo_depth_dnn/stereo_depth_dnn-pkg to the robot as explained in Deploying and Running on Jetson.

Then log on to the Jetson system and execute the following commands:

bob@jetson:~/$ cd deploy/bob/stereo_depth
bob@jetson:~/deploy/bob/stereo_depth$ ./apps/samples/stereo_depth_dnn/stereo_depth_dnn -d xavier

Here, “bob” is your user name on the host system.

To View Output from the Application in Websight

While the application is running, open Isaac Sight in a browser by navigating to http://localhost:3000. If running the application on a Jetson platform, make sure to use the IP address of the Jetson system instead of localhost.

In Websight, a window called “color_left” shows the left input image and a window called “depth” shows the depth estimated by the algorithm.
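If the page does not load, it can help to confirm that port 3000 is reachable from the machine running the browser. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical and the IP address is a placeholder; substitute the address of your Jetson system.

```python
import socket


def sight_url(host="localhost", port=3000):
    """Build the address to open in the browser."""
    return f"http://{host}:{port}"


def is_port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "192.168.1.42"  # placeholder: use "localhost" or your Jetson IP address
    if is_port_open(host, 3000):
        print("Open in a browser:", sight_url(host))
    else:
        print("Port 3000 is not reachable on", host)
```

A successful TCP connection only shows that something is listening on the port, so it is a quick first check rather than proof that the application itself is healthy.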