Nemotron 3 Nano
This tutorial walks through the complete setup for distributed training of Nemotron 3 Nano 30B across multiple nodes using Slurm and Ray.
Goal: Train Nemotron 3 Nano 30B on 2 nodes using GRPO with proper multi-node Ray cluster coordination.
In this section, you will:
- Set up the Nemotron 3 Nano 30B training environment
- Download and prepare the training dataset
- Configure the launch script for multi-node coordination
- Submit and monitor the multi-node training job
Prerequisites
Before starting, complete the NeMo RL GRPO tutorial to understand the NeMo RL training workflow and GRPO fundamentals.
You’ll also need:
- ✅ Access to Slurm cluster with enroot/pyxis container support
- ✅ Access to the NeMo RL container: `nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano`
- ✅ Understanding of the Ray distributed computing framework
- ✅ Sufficient storage space (~110GB for model, data, and cache; checkpoints and logs accumulate with each run)
1. Initial Setup
1.1 Set Workspace Directory
Choose a location with sufficient space (~110GB minimum; 200GB recommended, since checkpoints and logs accumulate):
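A minimal sketch; the workspace path is a placeholder — use shared storage that every compute node can reach:

```bash
# Placeholder path -- replace with a shared filesystem location.
export WORKSPACE=/shared/path/to/nemotron-workspace
mkdir -p ${WORKSPACE}
cd ${WORKSPACE}
df -h ${WORKSPACE}   # confirm available space
```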
✅ Success Check: Directory has at least 200GB available space.
1.2 Clone the Repository
Clone the Nemotron 3 Nano v3 branch of NeMo RL:
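A minimal sketch; the repository URL is an assumption, and the branch name follows the success check below:

```bash
cd ${WORKSPACE}
# Repository URL is illustrative -- point this at your NeMo RL source.
git clone -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git nemo-rl
cd nemo-rl
git branch --show-current   # should print: nano-v3
```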
✅ Success Check: Repository cloned with nano-v3 branch checked out.
1.3 Prepare Container Image
Option A: Use Registry Path Directly (Recommended for First Run)
Use the container directly from NVIDIA Container Registry:
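```bash
# Referenced later by the launch script; pyxis pulls it at job start.
# Note: pyxis image specs separate registry and path with '#', i.e.
# nvcr.io#nvidia/nemo-rl:v0.4.0.nemotron_3_nano
export CONTAINER=nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
```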
This is the simplest approach, but it adds ~5-10 minutes to job startup time on first use.
Option B: Pre-Pull Container (Optional - For Faster Job Startup)
For faster job startup on subsequent runs, pre-pull and convert to .sqsh format:
Step 1: Get NGC API Key
- Go to https://org.ngc.nvidia.com/setup/api-keys
- Generate an API key
- Configure enroot credentials:
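A sketch of the enroot credential setup, assuming the standard netrc-style credentials file:

```bash
mkdir -p ~/.config/enroot
# The quoted heredoc keeps $oauthtoken literal; replace only the key.
cat >> ~/.config/enroot/.credentials <<'EOF'
machine nvcr.io login $oauthtoken password <YOUR_NGC_API_KEY>
EOF
```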
Step 2: Pull Container Using Sbatch
Due to head node restrictions, pull the container from a compute node:
Create pull_container.sh:
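A sketch, assuming enroot is available on compute nodes; account, time limit, and output path are placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=pull_container
#SBATCH --nodes=1
#SBATCH --time=1:00:00
#SBATCH --account=<your_account>
#SBATCH --output=pull_container_%j.log

# Import from NGC and write the squashed image into the workspace.
# enroot URIs separate registry and image path with '#'.
enroot import --output ${WORKSPACE}/nemo-rl.sqsh \
    docker://nvcr.io#nvidia/nemo-rl:v0.4.0.nemotron_3_nano
```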
Submit the job:
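```bash
sbatch pull_container.sh
```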
Step 3: Use Local Container
Update your launch script to use the local .sqsh file:
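```bash
# In launch_nemotron_training.sh (created in Section 2):
export CONTAINER=${WORKSPACE}/nemo-rl.sqsh
```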
✅ Success Check: Container file exists (~15GB) or registry path configured.
1.4 Install uv Tool
Install uv (which includes uvx) for downloading HuggingFace models and datasets:
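```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
# Restart your shell or make sure ~/.local/bin is on PATH, then verify:
uv --version
```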
✅ Success Check: Command shows uv version number.
1.5 Download and Process Training Data
Download and process the dataset on a compute node (head nodes have limited memory):
Create prepare_data.sh:
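A sketch; the location of `create_nanov3_jsonl.py` inside the repository and its flags may differ in your checkout — adjust before submitting:

```bash
#!/bin/bash
#SBATCH --job-name=prepare_data
#SBATCH --nodes=1
#SBATCH --time=2:00:00
#SBATCH --account=<your_account>
#SBATCH --output=prepare_data_%j.log

cd ${WORKSPACE}/nemo-rl
# Script path and output flag are illustrative -- check your checkout.
uv run python create_nanov3_jsonl.py --output-dir ${WORKSPACE}/data
```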
Submit the job:
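```bash
sbatch prepare_data.sh
```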
Why use a compute node? The create_nanov3_jsonl.py script is memory-intensive and may fail on head nodes, which have resource limits. Running on a compute node ensures sufficient memory.
✅ Success Check: Job completes and creates train-split.jsonl and val-split.jsonl.
1.6 Download Model
Download the Nemotron 3 Nano 30B model:
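A sketch using the Hugging Face CLI via uvx; the model repository ID is a placeholder — take the actual ID from the model card:

```bash
cd ${WORKSPACE}
# Replace <model_repo_id> with the Nemotron 3 Nano 30B repository ID.
uvx --from huggingface_hub huggingface-cli download <model_repo_id> \
    --local-dir ${WORKSPACE}/model
```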
✅ Success Check: Model files downloaded (~59GB total) to model/ directory.
1.7 Verify Setup
Confirm all components are in place:
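```bash
ls ${WORKSPACE}                     # repo, model/, data/, container image
du -sh ${WORKSPACE}/model           # expect roughly 59GB
ls -lh ${WORKSPACE}/data/*.jsonl    # train-split.jsonl, val-split.jsonl
```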
✅ Success Check: All directories and files present with correct sizes.
2. Create Launch Script
Create a launcher script that properly handles multi-node Ray coordination:
Create launch_nemotron_training.sh:
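A sketch following the "Key Configuration Points" below; the environment variables ray.sub consumes, the GRPO entry point, and the config name all depend on your NeMo RL checkout — treat this as a template, not a drop-in script:

```bash
#!/bin/bash
# Launcher sketch -- adjust paths, mounts, and the training command
# to match your NeMo RL checkout and cluster.

export WORKSPACE=/shared/path/to/nemotron-workspace   # placeholder
NUM_NODES=2                                           # 2 nodes x 8 GPUs = 16 GPUs
GPUS_PER_NODE=8

export CONTAINER=${WORKSPACE}/nemo-rl.sqsh            # or the registry path
export BASE_LOG_DIR=${WORKSPACE}/logs                 # Ray cluster logs
mkdir -p ${BASE_LOG_DIR}

# Shared caches: every node must see the same converted model.
export HF_HOME=${WORKSPACE}/hf_cache
export TRANSFORMERS_CACHE=${WORKSPACE}/hf_cache

# Command the Ray driver runs once the cluster is up.
# Entry point and config name are illustrative.
export TRAINING_CMD="uv run python examples/run_grpo.py --config <your_grpo_config>.yaml"

sbatch \
    --nodes=${NUM_NODES} \
    --gres=gpu:${GPUS_PER_NODE} \
    --account=<your_account> \
    --time=8:00:00 \
    --chdir=/tmp \
    ${WORKSPACE}/nemo-rl/ray.sub
```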
Make it executable:
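```bash
chmod +x launch_nemotron_training.sh
```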
Key Configuration Points:
- `NUM_NODES=2`: 2 nodes × 8 GPUs = 16 GPUs total. For large-scale training, change to `NUM_NODES=32` (256 GPUs total)
- `--chdir=/tmp`: Sets a neutral working directory for the job
- `HF_HOME` and `TRANSFORMERS_CACHE`: Set to shared storage so all nodes can access model conversions
- `BASE_LOG_DIR`: Specifies where Ray cluster logs will be written
- `--account`: Replace `<your_account>` with your Slurm account name
- `--time=8:00:00`: Adjust based on your cluster’s limits
✅ Success Check: Script created and executable.
3. Submit Training Job
Run the launch script from a neutral directory like /tmp to ensure consistent container working directory behavior across different cluster configurations.
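```bash
cd /tmp
bash ${WORKSPACE}/launch_nemotron_training.sh
```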
Expected output:
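```
Submitted batch job 1234567
```

The job ID will differ on your cluster.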
✅ Success Check: Job submitted successfully with job ID returned.
4. Monitor Job Status
Monitor your submitted job:
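```bash
squeue -u $USER
# or refresh automatically every 30 seconds:
watch -n 30 squeue -u $USER
```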
Note: For job state codes (PD, R, CD, etc.), see Slurm documentation.
✅ Success Check: Job transitions from PD to R state.
5. Monitor Training Progress
5.1 Check Ray Cluster Logs
Wait 1-2 minutes for Ray cluster to initialize, then check logs:
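Assuming ray.sub writes per-job logs under `BASE_LOG_DIR` (the subdirectory and file names may differ in your checkout):

```bash
ls ${BASE_LOG_DIR}/<job_id>/          # look for the STARTED_RAY_HEAD marker
tail -f ${BASE_LOG_DIR}/<job_id>/ray-head.log
```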
5.2 Verify Ray Cluster Formation
Check that all Ray actors are online:
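A rough check, assuming the driver log reports actor counts (the exact message format may vary by version):

```bash
grep -i "actor" ${BASE_LOG_DIR}/<job_id>/ray-driver.log | tail
```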
✅ Success Check: All actors online (16/16 for 2 nodes, or 256/256 for 32 nodes).
5.3 Watch Training Metrics
Monitor rollout collection progress:
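```bash
# Assumes rollout progress lines appear in the driver log.
tail -f ${BASE_LOG_DIR}/<job_id>/ray-driver.log | grep -i rollout
```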
Check TensorBoard logs:
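The TensorBoard directory is set by your training config; the path below is a placeholder:

```bash
ls ${WORKSPACE}/logs/tensorboard/     # event files should be accumulating
uvx tensorboard --logdir ${WORKSPACE}/logs/tensorboard --port 6006
```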
✅ Success Check: Rollout percentage increasing steadily, TensorBoard events being written.
6. Troubleshooting
Issue: Job Stays in Pending (PD) State
Check the reason:
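```bash
squeue -j <job_id> --format="%.18i %.9P %.8T %r"   # %r prints the pending reason
```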
Common reasons:
- `(Priority)`: Waiting in queue for resources
- `(Resources)`: Not enough nodes available
- `(QOSMaxNodePerUserLimit)`: Exceeds the per-user node limit
Solution: Wait for resources, or adjust job parameters.
Issue: Ray Head Doesn’t Start
Symptom: No STARTED_RAY_HEAD file in logs directory.
Check Ray head log:
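```bash
# Log file name is an assumption -- check the job's log directory.
tail -n 50 ${BASE_LOG_DIR}/<job_id>/ray-head.log
```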
Solution: Check logs for errors related to container startup or resource allocation.
Issue: Training Crashes with Cache Errors
Symptom: FileNotFoundError mentioning run_config.yaml in ray-driver.log.
Check logs:
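```bash
grep -n "run_config.yaml" ${BASE_LOG_DIR}/<job_id>/ray-driver.log
```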
Root cause: Model conversion saved to local node cache, inaccessible to other nodes.
Solution: Verify shared cache directories are set in TRAINING_CMD:
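A sketch mirroring the launcher in Section 2; the entry point and config name are illustrative:

```bash
# Prefix TRAINING_CMD so the shared cache paths are set on every node
# that runs the command, not just in the submitting shell.
export TRAINING_CMD="HF_HOME=${WORKSPACE}/hf_cache \
TRANSFORMERS_CACHE=${WORKSPACE}/hf_cache \
uv run python examples/run_grpo.py --config <your_grpo_config>.yaml"
```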
7. Key Technical Details
Why ray.sub?
Without ray.sub, each node would start its own independent Ray cluster. The ray.sub script from NeMo RL:
- Starts a Ray head on the first node
- Connects all worker nodes to that head
- Creates a unified distributed cluster
- Manages placement groups for GPU actors
Why Shared Cache?
The HuggingFace checkpoint is converted to Megatron format on first use:
- Without shared cache: Each node converts independently → race conditions
- With shared cache: Rank 0 converts once, all nodes share the result
8. File Structure Reference
After setup, your directory structure should look like:
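```
<workspace>/
├── nemo-rl/                      # cloned repository (nano-v3 branch)
├── nemo-rl.sqsh                  # pre-pulled container (Option B only)
├── model/                        # downloaded model weights (~59GB)
├── data/
│   ├── train-split.jsonl
│   └── val-split.jsonl
├── hf_cache/                     # shared HF_HOME / TRANSFORMERS_CACHE
├── logs/                         # BASE_LOG_DIR (Ray cluster logs)
└── launch_nemotron_training.sh
```

Exact names depend on the paths you chose in earlier steps.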
Next Steps
Congratulations! You’ve successfully set up and launched Nemotron 3 Nano 30B multi-node training using Ray and Slurm.