Compatibility


CUDA Compatibility
The CUDA Compatibility document describes the use of new CUDA toolkit components on systems with older base installations.

Monitoring & Management


NVML API Reference Guide
Reference guide for the NVIDIA Management Library (NVML) API.
Multi-Process Service
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs.
Driver Persistence
Any interactions with NVIDIA GPUs require that an instance of the kernel mode driver be running. This driver may be persistent in some environments and transient in others. This document describes the default driver behavior and options for modifying that behavior.

Health & Diagnostics


Healthmon User Guide
Nvidia-healthmon is the system administrator and cluster manager's tool for detecting and troubleshooting common problems affecting NVIDIA Tesla GPUs in high-performance computing environments. Nvidia-healthmon focuses on software and system configuration issues, with only limited hardware diagnostic capabilities.
NVIDIA Validation Suite User Guide
NVVS is the system administrator and cluster manager's tool for detecting and troubleshooting common problems affecting NVIDIA Tesla GPUs in high-performance computing environments. NVVS focuses on software and system configuration issues, diagnostics, topological concerns, and relative performance.
HW Field Diag
The HW field diag is the comprehensive tool for verifying GPU HW integrity in the field, and a required piece of the RMA process.
RMA Process
A standardized process must be followed to identify products that qualify for RMA. This document provides an overview of that process.
Dynamic Page Retirement
The NVIDIA driver supports "retiring" framebuffer pages that contain bad memory cells. This is called "dynamic page retirement" and is done automatically for cells that are degrading in quality. This feature can improve the longevity of an otherwise good board and is thus an important resiliency feature on supported products, especially in HPC and enterprise environments.
NVIDIA GPU Memory Error Management
This document describes the new memory error recovery features introduced in the NVIDIA® 100 GPU and NVIDIA 800 GPU.
XID Errors
This document explains what Xid messages are, and is intended to assist system administrators, developers, and FAEs in understanding the meaning behind these messages as an aid in analyzing and resolving GPU-related problems.
NVIDIA GPU Debug Guidelines
This document provides GPU error debug and diagnosis guidelines, and is intended to help system administrators, developers, and FAEs get servers back up and running as quickly as possible.
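
As a small illustration of the Xid messages described under XID Errors, the sketch below pulls the device address and Xid number out of a kernel log line so that events can be counted or looked up. It assumes the commonly seen line shape "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."; the exact format can vary between driver versions, and the names XID_PATTERN, XidEvent, and parse_xid_line are hypothetical helpers, not part of any NVIDIA tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed log shape (may differ between driver versions):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<device>[0-9A-Fa-f:.]+)\): "
    r"(?P<code>\d+)(?:, (?P<details>.*))?"
)


class XidEvent(NamedTuple):
    """One parsed Xid message (hypothetical container for this sketch)."""
    device: str   # PCI address as printed in the log line
    code: int     # numeric Xid identifier
    details: str  # remaining free-form text, possibly empty


def parse_xid_line(line: str) -> Optional[XidEvent]:
    """Return an XidEvent if the line contains an Xid message, else None."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    details = (match.group("details") or "").strip()
    return XidEvent(match.group("device"), int(match.group("code")), details)
```

A typical use would be to feed each line of the kernel log through parse_xid_line and group the resulting events by device and code before consulting the Xid documentation for their meaning.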