Contents
- 1. Overview
- 2. Getting Started with Fabric Manager
- 2.1. Basic Components
- 2.2. NVSwitch and NVLink Initialization
- 2.3. Supported Platforms
- 2.4. Supported Deployment Models
- 2.5. Other NVIDIA Software Packages
- 2.6. Installation
- 2.7. Managing the Fabric Manager Service
- 2.8. Fabric Manager Startup Options
- 2.9. Fabric Manager Service File
- 2.10. Running Fabric Manager as Non-Root
- 2.11. Fabric Manager Config Options
- 3. Bare Metal Mode
- 4. Virtualization Models
- 5. Fabric Manager SDK
- 5.1. Data Structures
- 5.2. Initializing the Fabric Manager API Interface
- 5.3. Shutting Down the Fabric Manager API Interface
- 5.4. Connect to a Running Fabric Manager Instance
- 5.5. Disconnect from a Running Fabric Manager Instance
- 5.6. Getting Supported Partitions
- 5.7. Activate a GPU Partition
- 5.8. Activate a GPU Partition with Virtual Functions
- 5.9. Deactivate a GPU Partition
- 5.10. Set Activated Partition List after a Fabric Manager Restart
- 5.11. Get the NVLink Failed Devices
- 5.12. Get Unsupported Partitions
- 6. Full Passthrough Virtualization Model
- 6.1. Supported Virtual Machine Configurations
- 6.2. Virtual Machines with 16 GPUs
- 6.3. Virtual Machines with Eight GPUs
- 6.4. Virtual Machines with Four GPUs
- 6.5. Virtual Machines with Two GPUs
- 6.6. Virtual Machine with One GPU
- 6.7. Other Requirements
- 6.8. Hypervisor Sequences
- 6.9. Monitoring Errors
- 6.10. Limitations
- 7. Shared NVSwitch Virtualization Model
- 7.1. Software Stack
- 7.2. Guest VM to Service VM Interaction
- 7.3. Preparing the Service Virtual Machine
- 7.4. FM Shared Library and APIs
- 7.5. Fabric Manager Resiliency
- 7.6. Service Virtual Machine Life Cycle Management
- 7.7. Guest Virtual Machine Life Cycle Management
- 7.8. Error Handling
- 7.9. Interoperability With a Multi-Instance GPU
- 8. vGPU Virtualization Model
- 9. Supported High Availability Modes
- 9.1. Common Terms
- 9.2. GPU Access NVLink Failure
- 9.3. Trunk NVLink Failure
- 9.4. NVSwitch Failure
- 9.5. GPU Failure
- 9.6. Manual Degradation
- 9.6.1. GPU Exclusion
- 9.6.1.1. GPU Exclusion Flow
- 9.6.1.2. Running Application Error Handling
- 9.6.1.3. Diagnosing GPU Failures
- 9.6.1.4. In-Band GPU Exclude Mechanism
- 9.6.1.5. Kernel Module Parameters
- 9.6.1.6. Adding/Removing a GPU from the Exclude Candidate List
- 9.6.1.7. Listing Excluded GPUs
- 9.6.1.8. nvidia-smi
- 9.6.1.9. Procfs
- 9.6.1.10. Out-of-Band Query
- 9.6.1.11. Running GPU Exclusion Scripts
- 9.6.1.12. Bare Metal and vGPU Configurations
- 9.6.1.13. Full Passthrough Virtualized Configurations
- 9.6.1.14. Shared NVSwitch Virtualization Configurations
- 9.6.2. NVSwitch Exclusion
- 10. NVLink Topology
- 11. GPU Partitions
- 12. Resiliency
- 13. Error Handling
- 13.1. FM Initialization Errors
- 13.2. Partition Life Cycle Errors
- 13.3. Runtime NVSwitch Errors
- 13.4. Non-Fatal NVSwitch SXid Errors
- 13.5. Fatal NVSwitch SXid Errors
- 13.6. Always Fatal NVSwitch SXid Errors
- 13.7. Other Notable NVSwitch SXid Errors
- 13.8. High Availability Mode Comparison
- 13.9. GPU/VM/System Reset Capabilities and Limitations
- 14. Notices