NVIDIA Mission Control Software with GB200 NVL72 Systems Administration Guide
NVIDIA Mission Control
- Overview
- Mission Control Software Stack
- Node and Category Management
- Slurm Workload Management
- Observability Software
- Connecting to NVIDIA Mission Control autonomous hardware recovery
- Cluster Validation
- Health Checks & Alerts
- Firmware Upgrades with NVIDIA Mission Control autonomous hardware recovery
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Workflow
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Post RMA
- NVIDIA Mission Control autonomous hardware recovery Domain Triage
- NVIDIA Mission Control autonomous hardware recovery Break/Fix Switch Post RMA
- Out-of-Band Management
- High-Speed Fabric Management
- Leak Detection
- Backups
Power Reservation Steering
- Introduction
- Concepts and Components
- Installation
- Advanced Configuration
- Metrics
- Troubleshooting
- FAQ
- What is node static power?
- Who should configure the PDN?
- Why can’t I submit one large job across all nodes?
- Example: Configuring a PD for a GB200 NVL72 rack
- Will PRS PDs update when a node is removed from BCM or a category?
- Will PRS PDs update when a new node is added to a category?
- What is the lifecycle of updates to the PRS config server?
- Does PRS reflect autoscaler changes automatically?
- When is the CPU included in PRS-managed devices?
- What does it mean to stop or start a PD?
Autonomous Recovery Engine
- Introduction
- Installation Process
- Configuration
- Accessing Clusters
- Accessing Dashboards
- Monitoring and Logs
- Grafana Cloud Setup
- Installing and Upgrading ARE
- Example Commands
- Viewing Job Details
- Accessing the Cockpit
- ARE Job Monitoring
- Confirming ARE is Operational
- How-to: Toggle Dry-Run Mode
- Debugging Common Issues
Workload Power Profile Solution (WPPS)