Monitoring & Management

NVML API Reference Guide
Reference for the NVIDIA Management Library (NVML) API.
Multi-Process Service
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the latest NVIDIA (Kepler-based) Tesla and Quadro GPUs.
Driver Persistence
Any interactions with NVIDIA GPUs require that an instance of the kernel mode driver be running. This driver may be persistent in some environments and transient in others. This document describes the default driver behavior and options for modifying that behavior.

Health & Diagnostics

NVIDIA Validation Suite User Guide
NVVS is the system administrator and cluster manager's tool for detecting and troubleshooting common problems affecting NVIDIA Tesla GPUs in high performance computing environments. NVVS focuses on software and system configuration issues, diagnostics, topological concerns, and relative performance.
HW Field Diag
The HW field diag is the comprehensive tool for verifying GPU HW integrity in the field, and a required piece of the RMA process.
RMA Process
A standardized process must be followed to identify products that qualify for RMA. This document provides an overview of that process.
Dynamic Page Retirement
The NVIDIA driver supports "retiring" bad framebuffer memory cells by retiring the page the cell belongs to. This is called "dynamic page retirement" and is done automatically for cells that are degrading in quality. This feature can improve the longevity of an otherwise good board and is thus an important resiliency feature on supported products, especially in HPC and enterprise environments.
XID Errors
This document explains what Xid messages are. It is intended to help system administrators, developers, and FAEs understand the meaning behind these messages when analyzing and resolving GPU-related problems.