NVIDIA Mission Control Software Systems Administration Guide
NVIDIA Mission Control
- Overview
- Mission Control Software Stack
- Node and Category Management
- Slurm Workload Management
- NVIDIA Run:ai Installation
- Adding and Removing Nodes from Run:ai or Slurm
- Observability Software
- Connecting to NVIDIA Mission Control Autonomous Hardware Recovery
- GB300 (Limited Feature Support)
- GB200
- Dashboard
- Cluster Validation
- Health Checks & Alerts
- Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA
- NVIDIA Mission Control autonomous hardware recovery Domain Triage
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA
- B200
- Dashboard
- Cluster Validation
- Automated Health Checks with NVIDIA Mission Control autonomous hardware recovery
- Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA
- NVIDIA Mission Control autonomous hardware recovery Domain Triage
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA
- Out-of-Band Management
- NVLink Partition Management
- NVLink Management Software (NMX + NetQ)
- Leak Detection
- Backups
Power Reservation Steering
- Introduction
- Concepts and Components
- Installation
- Advanced Configuration
- Metrics
- Troubleshooting
- FAQ
- Static vs. dynamic power
- Who should configure the PDN?
- Why can’t I submit one large job across all nodes?
- Example 1: Configuring a PD for a GB200/GB300 NVL72 rack
- Example 2: Power budget adjustment and verification for two GB200 NVL72 racks
- Will PRS PDs update when a node is removed from BCM or a category?
- Will PRS PDs update when a new node is added to a category?
- What is the lifecycle of updates to the PRS config server?
- Does PRS reflect autoscaler changes automatically?
- When is the CPU included in PRS-managed devices?
- What does it mean to stop or start a PD?
Autonomous Job Recovery
- Introduction
- Installation Process
- Configuration
- Accessing Clusters
- Accessing Dashboards
- Monitoring and Logs
- Grafana Cloud Setup
- Installing and Upgrading AJR
- Example Commands
- Viewing Job Details
- Accessing the Cockpit
- AJR Job Monitoring
- Confirming AJR is Operational
- How-to: Toggle Dry-Run Mode
- Debugging Common Issues
Workload Power Profile Solution (WPPS)