NVIDIA DGX Cloud Workload Examples
Workload Examples
1. PyTorch and HuggingFace Accelerate with DeepSpeed on DGX Cloud
1.1. Overview
1.2. Prerequisites
1.2.1. Preparing a Customized Container Image
1.2.1.1. Creating a Container Image on a Local Linux Machine
1.2.1.2. Setting Up Your Cluster Workspace
1.2.1.2.1. Option 1: Push to an Accessible Container Registry
1.2.1.2.2. Option 2: Convert to a SquashFS File and Upload to the Slurm Cluster
1.3. Running on Slurm
1.3.1. Enabling Slurm Commands
1.3.2. Pulling Code From the ALMA Repository
1.3.3. Running the LoRA Batch Script
1.3.4. Monitoring the Job
1.3.5. Checking Multi-Node Function
1.3.6. Reference Results
2. NeMo Framework on DGX Cloud
2.1. Overview
2.2. Prerequisites
2.3. Setting Up Your Cluster Workspace
2.3.1. Authenticating with NGC
2.3.2. Pulling the NeMo Framework Repository
2.3.3. Configuring NeMo Framework
2.4. Running a Training Job on Slurm Using Synthetic Data
2.4.1. Configuring the Training Job
2.4.2. Running the Training Job
2.4.3. Monitoring the Training Job
2.4.4. Interpreting Training Performance
3. Video Classification and ASR with HuggingFace Accelerate on DGX Cloud
3.1. Overview
3.2. Prerequisites
3.2.1. Preparing a Customized Container Image
3.2.1.1. Creating a Container Image on a Local Linux Machine
3.2.1.2. Setting Up Your Cluster Workspace
3.2.1.2.1. Option 1: Push to an Accessible Container Registry
3.2.1.2.2. Option 2: Convert to a SquashFS File and Upload to the Slurm Cluster
3.3. Running on Slurm
3.3.1. Enabling Slurm Commands
3.3.2. Use Case 1: Fine-Tuning a Video Classification Model with Slurm
3.3.2.1. Workspace and Video Dataset Preparation
3.3.2.2. Training Script
3.3.2.3. Batch Submission Script
3.3.2.4. Training Steps, Epochs, and Time
3.3.3. Use Case 2: QLoRA Fine-Tuning of ASR with Slurm
3.3.3.1. Workspace and ASR Dataset Preparation
3.3.3.2. Training Script
3.3.3.3. Batch Submission Script
3.3.3.4. Training Steps, Epochs, and Time
Notices
Notice
Trademarks
Copyright
Documentation Feedback
Email Us!