NVIDIA DGX Cloud Slurm Documentation
Slurm on DGX Cloud
- 1. Onboarding Quick Start Guide
- 2. Cluster Administration Guide
- 2.1. NVIDIA DGX Cloud Overview
- 2.2. Overview of Your Cluster
- 2.3. Overview of Cluster Administration
- 2.4. Security of DGX Cloud
- 2.5. DGX Cloud Cluster Onboarding
- 2.6. Accessing Your DGX Cloud Cluster as Admins
- 2.7. Administering Your DGX Cloud Cluster
- 2.7.1. Managing Users and Admins
- 2.7.1.1. Adding Cluster Admins Via cmsh (Cluster Owner Only)
- 2.7.1.2. Adding Cluster Users Via cmsh (Cluster Owner Only)
- 2.7.1.3. (Optional) Elevating a User to a Cluster Admin (Cluster Owner Only)
- 2.7.1.4. Enable an Alternative Cluster Owner Account (Cluster Owner Only)
- 2.7.1.5. Adding Users Via Base View
- 2.7.1.6. Creating and Configuring Lustre Shared Storage for Cluster Admins and Users
- 2.7.1.7. Updating Password
- 2.7.2. Managing Home Directories
- 2.7.3. Managing Lustre Storage
- 2.7.4. Managing the Slurm Cluster
- 2.7.5. Managing Users on NGC
- 2.8. Troubleshooting
- 2.9. Requesting Modifications To Your DGX Cloud Cluster
- 2.10. Resolving Security Bulletins
- 3. Cluster User Guide
- 3.1. NVIDIA DGX Cloud Overview
- 3.2. Overview of Your Cluster
- 3.3. Accessing Your DGX Cloud Cluster
- 3.4. Overview of Working in Your DGX Cloud Cluster
- 3.5. Setting Up to Run Jobs
- 3.6. Running Example Jobs
- 3.7. Example Single-Node Interactive Bash Job
- 3.8. Moving Data Into Your DGX Cloud Cluster
- 3.9. Managing Jobs
- 4. Workload Examples
- 4.1. PyTorch and Hugging Face Accelerate with DeepSpeed on DGX Cloud
- 4.1.1. Overview
- 4.1.2. Prerequisites
- 4.1.3. Preparing a Customized Container Image
- 4.1.4. Running on Slurm
- 4.1.5. Enabling Slurm Commands
- 4.1.6. Pulling Code From the ALMA Repository
- 4.1.7. Running the LoRA Batch Script
- 4.1.8. Monitoring the Job
- 4.1.9. Checking Multi-Node Function
- 4.1.10. Reference Results
- 4.2. NeMo Framework on DGX Cloud
- 4.3. Video Classification and ASR with Hugging Face Accelerate on DGX Cloud
Appendix
Notices
Documentation Feedback