cuVSLAM (CUDA Stereo Visual SLAM)

cuVSLAM is based on two core technologies: Visual Odometry and Simultaneous Localization and Mapping (SLAM).

Visual Odometry is a method for estimating the camera position relative to its starting position. The method is iterative: at each iteration, it evaluates two consecutive input frames (stereo pairs). In both frames, it finds a set of keypoints, then matches the two sets of keypoints to estimate the translation and relative rotation of the camera between frames.
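
The following minimal sketch illustrates this frame-to-frame idea with OpenCV rather than with cuVSLAM itself; the intrinsic matrix K and the two grayscale frames are assumed inputs, and in the monocular form shown here the translation is recovered only up to scale (in the stereo case, scale comes from triangulated depth).

    import cv2

    def relative_motion(prev_gray, curr_gray, K):
        # 1. Select keypoints that are good to track (Shi-Tomasi corners).
        pts0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                       qualityLevel=0.01, minDistance=8)
        # 2. Match them into the next frame with pyramidal Lucas-Kanade tracking.
        pts1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts0, None)
        good0 = pts0[status.ravel() == 1]
        good1 = pts1[status.ravel() == 1]
        # 3. Estimate relative rotation and (up-to-scale) translation direction.
        E, inliers = cv2.findEssentialMat(good0, good1, K, method=cv2.RANSAC)
        _, R, t, _ = cv2.recoverPose(E, good0, good1, K, mask=inliers)
        return R, t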

Simultaneous Localization and Mapping (SLAM) is a method built on top of Visual Odometry. It aims to improve the quality of Visual Odometry estimates by leveraging knowledge of previously seen parts of the trajectory. It detects whether the current scene has been encountered before (i.e., a loop in the camera movement) and runs an additional optimization procedure to refine previously obtained poses.

Along with visual data, cuVSLAM can use Inertial Measurement Unit (IMU) measurements. It performs IMU fusion, combining IMU measurements with Visual Odometry, which is useful when pure Visual Odometry cannot estimate a pose: for example, when there is poor lighting or a long uniform surface in front of the camera. Using an IMU usually leads to a significant improvement under such poor visual conditions.

cuVSLAM delivers real-time tracking performance: for example, VGA-resolution input yields more than 60 FPS. On the KITTI benchmark, the algorithm achieves a localization drift of ~1% and an orientation error of 0.003 degrees per meter of motion.

cuVSLAM allows for robust tracking in various environments and use cases: indoor, outdoor, aerial, HMD, automotive, and robotics.

Figure: elbrus_visual_slam_zed_image2.jpg

In Isaac 2.0, cuVSLAM is used for Visual-Inertial Odometry (VIO), which is a combination of Visual Odometry with optional IMU fusion.

Figure: elbrus_visual_slam_pipe.jpg

cuVSLAM is implemented as a two-contour engine: Visual Odometry and SLAM (Simultaneous Localization and Mapping). This section describes Visual Odometry only.

Visual Odometry recovers camera poses along a smooth trajectory (odometry). Due to drift, this trajectory can, over time, differ substantially from the real trajectory in world coordinates, but it remains smooth and consistent over every time span.

Visual Odometry builds a 3D map of the world as a transitory entity: a set of 3D Landmarks that exists in the frames of the latest camera pose estimates and is discarded once it falls outside the visibility limits of the camera's latest positions.

Visual Odometry is built on an architecture of two “contours”: one on the front end and one on the back end.

  • The front end runs at the frequency of the incoming images and includes 2D and 3D parts. The 2D part selects Observations: 2D locations that are good for 2D tracking (Visual Odometry uses the Good Features to Track criterion from the Shi-Tomasi paper) and that are well spread across the image plane of every image. For 2D tracking, Visual Odometry uses the Lucas-Kanade algorithm (the compositional version, as defined by S. Baker) with 3D prediction. All 2D operations are implemented on the GPU to achieve good performance. The 3D part of the front end triangulates Observations into Landmarks (between the left and right camera images) using P. Lindstrom's triangulation method and recovers the camera pose with PnP; a simplified sketch of this 3D step follows the list below.

  • The back end works asynchronously on a separate thread, running the Sparse Bundle Adjustment algorithm as described in “Multiple View Geometry” by R. Hartley and A. Zisserman.
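
The sketch below illustrates the 3D step of the front end described above, with OpenCV routines standing in for cuVSLAM's internal GPU implementation; the projection matrices, matched pixel arrays, and intrinsic matrix K are assumed inputs.

    import cv2
    import numpy as np

    def triangulate_landmarks(P_left, P_right, pts_left, pts_right):
        # Triangulate matched left/right Observations into 3D Landmarks.
        X_h = cv2.triangulatePoints(P_left, P_right, pts_left.T, pts_right.T)
        return (X_h[:3] / X_h[3]).T   # Nx3 points in the projection matrices' frame

    def pose_from_landmarks(landmarks, pts_next_left, K):
        # Recover the next left-camera pose from 2D-3D matches with PnP + RANSAC.
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            landmarks.astype(np.float64), pts_next_left.astype(np.float64), K, None)
        R, _ = cv2.Rodrigues(rvec)    # rotation matrix from the Rodrigues vector
        return R, tvec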

cuVSLAM delivers its best tracking accuracy when stereo images are recorded at 30 or 60 FPS and IMU angular velocity and linear acceleration measurements (if available) are recorded at 200-300 Hz. The inaccuracy of Stereo VIO and SLAM is less than 1% translation drift and ~0.03 degrees/meter of angular motion error, as measured on the KITTI benchmark, which records at 10 FPS with each frame at 1382x512 resolution.

In the case of severe degradation of image input (for example, if the lights turn off, or a bump while driving causes dramatic motion blur), IMU fusion provides acceptable pose tracking quality for up to about one second.

Embedded Undistortion

Note

We currently recommend using rectified input images.

As all cameras have lenses, lens distortion is always present, skewing the objects in the frame. For visual-odometry tracking, the cuVSLAM library comes with a built-in undistortion algorithm, which provides a specialized, efficient way to process raw (distorted) camera images. cuVSLAM can track 2D features on distorted images and limit undistortion to the selected features in floating-point coordinates. This is considerably faster and more accurate than undistorting all image pixels before tracking.
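
As an illustration of the idea (not of the cuVSLAM API), the following sketch undistorts only a small set of floating-point feature coordinates with OpenCV's Brown-model routine instead of remapping the whole image; K, dist, and keypoints are assumed inputs.

    import cv2
    import numpy as np

    def undistort_features_only(keypoints, K, dist):
        # Undistort a handful of floating-point feature locations (cheap) instead
        # of remapping every pixel with cv2.undistort (expensive).
        pts = keypoints.reshape(-1, 1, 2).astype(np.float64)
        return cv2.undistortPoints(pts, K, dist, P=K).reshape(-1, 2)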

The following distortion model options are supported:

  • Pinhole distortion model: Presumes that images come undistorted (this is the case, for example, with the Intel RealSense 435/455 cameras). The distortion coefficients array must be empty.

    It is equivalent to the Brown model with all coefficients equal to zero.

  • Brown distortion model: With three radial and two tangential distortion coefficients \((k_1, k_2, k_3, p_1, p_2)\)

    \[radial = 1 + k_1 * r^2 + k_2 * r^4 + k_3 * r^6\]
    \[tangential = (2 * p_1 * x_n * y_n + p_2 * (r^2 + 2 * x_n^2), p_1 * (r^2 + 2 * y_n^2) + 2 * p_2 * x_n * y_n)\]

  • Rational Polynomial distortion model: With six radial and two tangential distortion coefficients \((k_1, k_2, k_3, k_4, k_5, k_6, p_1, p_2)\)

    \[radial = \frac{1 + k_1 * r^2 + k_2 * r^4 + k_3 * r^6}{1 + k_4 * r^2 + k_5 * r^4 + k_6 * r^6}\]
    \[tangential = (2 * p_1 * x_n * y_n + p_2 * (r^2 + 2 * x_n^2), p_1 * (r^2 + 2 * y_n^2) + 2 * p_2 * x_n * y_n)\]

  • Fisheye (wide-angle) distortion model: With four radial distortion coefficients \((k_1, k_2, k_3, k_4)\)

    \[radial = \arctan(r) * (1 + k_1 * \arctan^2(r) + k_2 * \arctan^4(r) + k_3 * \arctan^6(r) + k_4 * \arctan^8(r))\]
    \[tangential = (0, 0)\]

For all distortion models, 3D points are projected as follows:

\[(u, v) = (c_x, c_y) + diag(f_x, f_y) * (radial * (x_n, y_n) + tangential)\]

Where:

\((u, v)\) - projected point

\((x, y, z)\) - 3D point

\((c_x, c_y)\) - principal (central) point

\((f_x, f_y)\) - focal length

\[x_n = \frac{x}{z}\]
\[y_n = \frac{y}{z}\]
\[r^2 = (x_n)^2 + (y_n)^2\]
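
The following NumPy sketch is a direct transcription of the projection equations above, which can be handy for sanity-checking calibration parameters; the function name and the model argument are illustrative and not part of the cuVSLAM API.

    import numpy as np

    def project(point_3d, fx, fy, cx, cy, coeffs, model="brown"):
        x, y, z = point_3d
        xn, yn = x / z, y / z
        r2 = xn ** 2 + yn ** 2
        if model == "pinhole":                      # empty coefficient array
            radial, tangential = 1.0, (0.0, 0.0)
        elif model == "brown":                      # (k1, k2, k3, p1, p2)
            k1, k2, k3, p1, p2 = coeffs
            radial = 1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
            tangential = (2 * p1 * xn * yn + p2 * (r2 + 2 * xn ** 2),
                          p1 * (r2 + 2 * yn ** 2) + 2 * p2 * xn * yn)
        elif model == "rational":                   # (k1..k6, p1, p2)
            k1, k2, k3, k4, k5, k6, p1, p2 = coeffs
            radial = ((1 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3) /
                      (1 + k4 * r2 + k5 * r2 ** 2 + k6 * r2 ** 3))
            tangential = (2 * p1 * xn * yn + p2 * (r2 + 2 * xn ** 2),
                          p1 * (r2 + 2 * yn ** 2) + 2 * p2 * xn * yn)
        elif model == "fisheye":                    # (k1, k2, k3, k4)
            k1, k2, k3, k4 = coeffs
            t = np.arctan(np.sqrt(r2))
            radial = t * (1 + k1 * t ** 2 + k2 * t ** 4 + k3 * t ** 6 + k4 * t ** 8)
            tangential = (0.0, 0.0)
        else:
            raise ValueError("unknown distortion model: " + model)
        u = cx + fx * (radial * xn + tangential[0])
        v = cy + fy * (radial * yn + tangential[1])
        return u, v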

For rectified images, we recommend setting the VSLAM horizontal parameter to true. For unrectified images, the horizontal parameter must be false.

Embedded Denoising

If your application or environment produces noisy images due to low-light conditions, cuVSLAM may select too many incorrect feature points. In this case, set the denoising parameter of the VSLAM component to true to enable image denoising, which improves tracking speed and accuracy under these conditions.

Inertial Measurement Unit (IMU) Fusion

Stereo VIO can use measurements obtained from an IMU that is rigidly mounted on a camera rig or the robot base frame. If Visual Odometry fails due to severe degradation of image input, positional tracking will proceed on the IMU input for a duration of up to one second. IMU Fusion ensures seamless pose updates as long as video input interruptions last for less than one second.
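
The following conceptual sketch shows why IMU measurements can bridge such a short visual outage: orientation and position are propagated by dead reckoning from gyroscope and accelerometer samples alone, which is accurate for a brief interval but drifts quickly. It is not cuVSLAM's fusion filter; the state layout, gravity convention, and use of SciPy's Rotation class are assumptions.

    import numpy as np
    from scipy.spatial.transform import Rotation as R

    GRAVITY = np.array([0.0, 0.0, -9.81])   # assumed world z-up convention

    def propagate(rot, pos, vel, gyro, accel, dt):
        # Integrate body-frame angular velocity (rad/s) into the orientation.
        rot = rot * R.from_rotvec(gyro * dt)
        # Rotate the measured specific force into the world frame, remove gravity.
        acc_world = rot.apply(accel) + GRAVITY
        # Integrate acceleration into position and velocity (this drifts quickly,
        # which is why the IMU-only fallback is limited to about one second).
        pos = pos + vel * dt + 0.5 * acc_world * dt ** 2
        vel = vel + acc_world * dt
        return rot, pos, vel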

To disable IMU fusion, set the imu_fusion parameter of the VSLAM component to false.

Planar Constraint

If the camera path is limited to a 2D plane (for example, the robot is limited to an office with a flat floor), and the camera optical axis is aligned horizontally, the quality of cuVSLAM may benefit from constraining the allowed camera movement. To enable planar constraint, set the planar parameter of the VSLAM component to true.
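
For convenience, the sketch below gathers the VSLAM component switches mentioned in this section into a plain Python dictionary; the actual configuration format of a GXF application is not shown here, so treat this only as a checklist of the parameters and their meaning.

    # Parameters described in this section (values shown are only examples).
    vslam_params = {
        "horizontal": True,    # True for rectified input, False for unrectified
        "denoising": False,    # enable for noisy, low-light input
        "imu_fusion": True,    # disable if no IMU is mounted on the rig
        "planar": False,       # enable when motion is constrained to a flat floor
    }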

Isaac 2.0 includes the cuVSLAM stereo tracker as a dynamic library wrapped with a GXF component. The VSLAM component receives a pair of input images, camera calibration parameters, and IMU measurements (if available). If visual tracking is successful, the component publishes the visual odometry pose of the left camera relative to the starting pose frame as a Pose3d message with a timestamp equal to the timestamp of the left frame.

If visual tracking is lost, publication of the left camera pose is interrupted until tracking is recovered. After recovery of visual tracking, publication of the left camera pose is resumed, but there’s no guarantee that the estimated camera pose will correspond to the actual pose of the left camera in the world frame. The application using Visual Odometry must detect the interruption in camera pose updates and launch an external re-localization algorithm.
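
One possible way for an application to detect such an interruption is a simple watchdog on the pose stream, as sketched below; the callback names and the timeout value are hypothetical and not part of the VSLAM component interface.

    import time

    POSE_TIMEOUT_S = 0.5                      # assumed threshold, tune per frame rate

    class PoseWatchdog:
        def __init__(self, relocalize):
            self.last_pose_time = time.monotonic()
            self.relocalize = relocalize      # external re-localization hook

        def on_pose(self, pose):
            # Called whenever a new visual odometry pose message arrives.
            self.last_pose_time = time.monotonic()

        def poll(self):
            # Called periodically; triggers re-localization if poses stop arriving.
            if time.monotonic() - self.last_pose_time > POSE_TIMEOUT_S:
                self.relocalize()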

The VSLAM component publishes Visual Odometry poses of the left camera via the tx_odom_pose transmitter. Visual Odometry messages include the Pose3d and a 6x6 covariance matrix stored in a gxf::Tensor.

For visualization, it publishes left camera 2D keypoints to tx_keypoints and 3D landmarks to tx_landmarks.
