> For clean Markdown of any page, append .md to the page URL. > For a complete documentation index, see https://docs.nvidia.com/nemo/gym/llms.txt. > For full documentation content, see https://docs.nvidia.com/nemo/gym/llms-full.txt. # Nemotron 3 Nano This tutorial walks through the complete setup for distributed training of Nemotron 3 Nano 30B across multiple nodes using Slurm and Ray. **Goal**: Train Nemotron 3 Nano 30B on 2 nodes using GRPO with proper multi-node Ray cluster coordination. **In this section, you will**: 1. Set up the Nemotron 3 Nano 30B training environment 2. Download and prepare the training dataset 3. Configure the launch script for multi-node coordination 4. Submit and monitor the multi-node training job *** ## Prerequisites Before starting, complete the [NeMo RL GRPO tutorial](/latest/training-tutorials/nemo-rl-grpo) to understand the NeMo RL training workflow and GRPO fundamentals. You'll also need: * ✅ Access to Slurm cluster with enroot/pyxis container support * ✅ Access to NeMo RL container: `nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano` * ✅ Understanding of Ray distributed computing framework * ✅ Sufficient storage space (\~110GB for model, data, and cache; checkpoints and logs accumulate with each run) *** ## 1. Initial Setup ### 1.1 Set Workspace Directory Choose a location with sufficient space (\~110GB minimum): ```bash # Set workspace directory (adjust to your cluster's large storage) # Examples: /scratch/$USER, /work/$USER, /data/$USER, /lustre/.../users/$USER WORKSPACE=/path/to/large/storage/$USER # Verify space available df -h $WORKSPACE ``` **✅ Success Check**: Directory has at least 200GB available space. *** ### 1.2 Clone the Repository Clone the Nemotron 3 Nano v3 branch of NeMo RL: ```bash cd $WORKSPACE git clone --recurse-submodules -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git RL-nano-v3 cd RL-nano-v3 ``` **✅ Success Check**: Repository cloned with nano-v3 branch checked out. *** ### 1.3 Prepare Container Image **Option A: Use Registry Path Directly (Recommended for First Run)** Use the container directly from NVIDIA Container Registry: ```bash # No preparation needed - will be pulled automatically during job execution CONTAINER=docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano ``` This is the simplest approach but adds \~5-10 minutes to job startup time for first use. *** **Option B: Pre-Pull Container (Optional - For Faster Job Startup)** For faster job startup on subsequent runs, pre-pull and convert to .sqsh format: **Step 1: Get NGC API Key** 1. Go to [https://org.ngc.nvidia.com/setup/api-keys](https://org.ngc.nvidia.com/setup/api-keys) 2. Generate an API key 3. Configure enroot credentials: ```bash mkdir -p ~/.config/enroot echo "machine nvcr.io login \$oauthtoken password " >> ~/.config/enroot/.credentials ``` **Step 2: Pull Container Using Sbatch** Due to head node restrictions, pull the container from a compute node: Create `pull_container.sh`: ```bash #!/bin/bash #SBATCH --job-name=enroot-import #SBATCH --account= #SBATCH --partition= #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=01:00:00 #SBATCH --output=enroot-import-%j.out # Set workspace directory (adjust to your cluster's large storage) WORKSPACE=/path/to/large/storage/$USER ENROOT_CACHE_PATH=$WORKSPACE/.cache/enroot enroot import -o "$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh" \ "docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano" ``` Submit the job: ```bash sbatch pull_container.sh ``` **Step 3: Use Local Container** Update your launch script to use the local .sqsh file: ```bash CONTAINER=$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh ``` **✅ Success Check**: Container file exists (\~15GB) or registry path configured. ### 1.4 Install uv Tool Install uv (which includes uvx) for downloading HuggingFace models and datasets: ```bash # Install uv curl -LsSf https://astral.sh/uv/install.sh | sh # Add to PATH (uv installs to ~/.local/bin) export PATH="$HOME/.local/bin:$PATH" # Verify installation uvx --version ``` **✅ Success Check**: Command shows uv version number. *** ### 1.5 Download and Process Training Data Download and process the dataset on a compute node (head nodes have limited memory): Create `prepare_data.sh`: ```bash #!/bin/bash #SBATCH --job-name=prepare-data #SBATCH --account= #SBATCH --partition= #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=00:20:00 #SBATCH --output=prepare-data-%j.out # Set workspace directory (adjust to your cluster's large storage) WORKSPACE=/path/to/large/storage/$USER # Data directory DATA_DIR=${WORKSPACE}/RL-nano-v3/data/ # Download dataset uvx --from huggingface-hub hf download nvidia/Nemotron-3-Nano-RL-Training-Blend \ --repo-type dataset \ --local-dir ${DATA_DIR} # Fill in placeholders chmod +x ${DATA_DIR}/create_nanov3_jsonl.py ${DATA_DIR}/create_nanov3_jsonl.py --input ${DATA_DIR}/train.jsonl --output ${DATA_DIR}/train-full.jsonl # Split: reserve last 1000 rows for validation head -n -1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/train-split.jsonl tail -n 1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/val-split.jsonl # Verify split wc -l ${DATA_DIR}/train-split.jsonl ${DATA_DIR}/val-split.jsonl ``` Submit the job: ```bash # If dataset requires authentication, export HF_TOKEN: # sbatch --export=HF_TOKEN prepare_data.sh # Otherwise: sbatch prepare_data.sh ``` **Why use a compute node?** The `create_nanov3_jsonl.py` script is memory-intensive and may fail on head nodes which have resource limits. Running on a compute node ensures sufficient memory. **✅ Success Check**: Job completes and creates `train-split.jsonl` and `val-split.jsonl`. *** ### 1.6 Download Model Download the Nemotron 3 Nano 30B model: ```bash cd $WORKSPACE/RL-nano-v3 uvx --from huggingface-hub hf download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \ --repo-type model \ --local-dir model ``` **✅ Success Check**: Model files downloaded (\~59GB total) to `model/` directory. *** ### 1.7 Verify Setup Confirm all components are in place: ```bash cd $WORKSPACE/RL-nano-v3 # Check directory structure ls -lh # Expected: data/, model/, examples/, nemo_rl/, etc. # Check data files ls -lh data/train-split.jsonl data/val-split.jsonl # Check model size and key files du -sh model/ # Expected: ~59GB # Verify essential model files exist ls model/config.json model/*.safetensors # Should show config.json and 13 safetensors files ``` **✅ Success Check**: All directories and files present with correct sizes. *** ## 2. Create Launch Script Create a launcher script that properly handles multi-node Ray coordination: ```bash cd $WORKSPACE/RL-nano-v3 ``` Create `launch_nemotron_training.sh`: ```bash #!/bin/bash # Nemotron 3 Nano 30B Multi-Node Training Launcher # Configuration HOST_BASE=$WORKSPACE # Or your preferred base directory NUM_NODES=2 # Change to 32 for large-scale training # Paths DATA_DIR=${HOST_BASE}/RL-nano-v3/data MODEL_CHECKPOINT=${HOST_BASE}/RL-nano-v3/model CONFIG_PATH=${HOST_BASE}/RL-nano-v3/examples/nemo_gym/grpo_nanov3.yaml LOG_DIR=${HOST_BASE}/RL-nano-v3/logs CKPT_DIR=${HOST_BASE}/RL-nano-v3/checkpoints CACHE_DIR=${HOST_BASE}/RL-nano-v3/.cache # Training command with shared cache directory TRAINING_CMD="cd ${HOST_BASE}/RL-nano-v3 && \ mkdir -p ${LOG_DIR} ${CKPT_DIR} ${CACHE_DIR} && \ export HF_HOME=${CACHE_DIR}/huggingface && \ export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface && \ uv run examples/nemo_gym/run_grpo_nemo_gym.py \ --config ${CONFIG_PATH} \ policy.model_name=${MODEL_CHECKPOINT} \ data.train_jsonl_fpath=${DATA_DIR}/train-split.jsonl \ data.validation_jsonl_fpath=${DATA_DIR}/val-split.jsonl \ ++logger.log_dir=${LOG_DIR} \ logger.wandb_enabled=False \ logger.tensorboard_enabled=True \ ++checkpointing.enabled=True \ ++checkpointing.checkpoint_dir=${CKPT_DIR} \ cluster.num_nodes=${NUM_NODES} \ cluster.gpus_per_node=8" echo "Submitting ${NUM_NODES}-node training job..." echo "Using BASE_LOG_DIR: ${HOST_BASE}/nemoRL/nemo-rl" # Submit job BASE_LOG_DIR=${HOST_BASE}/nemoRL/nemo-rl \ COMMAND="$TRAINING_CMD" \ CONTAINER="docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano" \ MOUNTS="${HOST_BASE}:${HOST_BASE}" \ sbatch \ --nodes=${NUM_NODES} \ --account= \ --job-name=nemotron-nano-30b \ --time=8:00:00 \ --gres=gpu:8 \ --chdir=/tmp \ --output=${LOG_DIR}/slurm-%j.out \ --error=${LOG_DIR}/slurm-%j.err \ ${HOST_BASE}/nemoRL/nemo-rl/ray.sub ``` Make it executable: ```bash chmod +x launch_nemotron_training.sh ``` **Key Configuration Points:** * `NUM_NODES=2`: 2 nodes × 8 GPUs = **16 GPUs total**. For large-scale training, change to `NUM_NODES=32` (256 GPUs total) * `--chdir=/tmp`: Sets a neutral working directory for the job * `HF_HOME` and `TRANSFORMERS_CACHE`: Set to shared storage so all nodes can access model conversions * `BASE_LOG_DIR`: Specifies where Ray cluster logs will be written * `--account`: Replace `` with your Slurm account name * `--time=8:00:00`: Adjust based on your cluster's limits **✅ Success Check**: Script created and executable. *** ## 3. Submit Training Job **Run the launch script from a neutral directory like `/tmp`** to ensure consistent container working directory behavior across different cluster configurations. ```bash # Run from /tmp for best compatibility cd /tmp bash $WORKSPACE/RL-nano-v3/launch_nemotron_training.sh ``` **Expected output:** ``` Submitting 2-node training job... Using BASE_LOG_DIR: /home/user/nemoRL/nemo-rl Submitted batch job 9453356 ``` **✅ Success Check**: Job submitted successfully with job ID returned. *** ## 4. Monitor Job Status Monitor your submitted job: ```bash # Check if job is running (replace JOBID with your job number) squeue --job=JOBID # Detailed status squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R" ``` **Note:** For job state codes (PD, R, CD, etc.), see [Slurm documentation](https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES). **✅ Success Check**: Job transitions from `PD` to `R` state. *** ## 5. Monitor Training Progress ### 5.1 Check Ray Cluster Logs Wait 1-2 minutes for Ray cluster to initialize, then check logs: ```bash # Set your job ID JOBID=your_job_id # Check if Ray head started ls $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ # Verify Ray head is ready cat $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/STARTED_RAY_HEAD # File should exist if Ray initialized successfully # View training execution (most important) tail -100 $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log # Follow training progress live tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log ``` *** ### 5.2 Verify Ray Cluster Formation Check that all Ray actors are online: ```bash # For 2-node job (2 nodes × 8 GPUs = 16 actors) grep "Number of actors online: 16/16" $WORKSPACE/RL-nano-v3/logs/slurm-${JOBID}.out # For 32-node job, look for: 256/256 ``` **✅ Success Check**: All actors online (16/16 for 2 nodes, or 256/256 for 32 nodes). *** ### 5.3 Watch Training Metrics Monitor rollout collection progress: ```bash # Watch rollout collection tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log | grep "Collecting rollouts" # Example output: # Collecting rollouts: 21%|██ | 428/2048 [02:01<05:42, 4.73it/s] # Collecting rollouts: 25%|██▌ | 512/2048 [03:15<08:12, 3.12it/s] ``` Check TensorBoard logs: ```bash # List experiment directories ls -ltr $WORKSPACE/RL-nano-v3/logs/ # Check TensorBoard events find $WORKSPACE/RL-nano-v3/logs/exp_*/tensorboard/ -name "*.tfevents.*" -mmin -5 ``` **✅ Success Check**: Rollout percentage increasing steadily, TensorBoard events being written. *** ## 6. Troubleshooting ### Issue: Job Stays in Pending (PD) State **Check the reason:** ```bash squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R" ``` **Common reasons:** * `(Priority)`: Waiting in queue for resources * `(Resources)`: Not enough nodes available * `(QOSMaxNodePerUserLimit)`: Exceeds node limit **Solution:** Wait for resources, or adjust job parameters. *** ### Issue: Ray Head Doesn't Start **Symptom:** No `STARTED_RAY_HEAD` file in logs directory. **Check Ray head log:** ```bash cat $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-head.log ``` **Solution:** Check logs for errors related to container startup or resource allocation. *** ### Issue: Training Crashes with Cache Errors **Symptom:** `FileNotFoundError` mentioning `run_config.yaml` in ray-driver.log. **Check logs:** ```bash grep "FileNotFoundError.*run_config.yaml" $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-driver.log ``` **Root cause:** Model conversion saved to local node cache, inaccessible to other nodes. **Solution:** Verify shared cache directories are set in `TRAINING_CMD`: ```bash export HF_HOME=${CACHE_DIR}/huggingface export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface ``` *** ## 7. Key Technical Details ### Why Ray.sub? Without `ray.sub`, each node would start its own independent Ray cluster. The `ray.sub` script from NeMo RL: 1. Starts a Ray head on the first node 2. Connects all worker nodes to that head 3. Creates a unified distributed cluster 4. Manages placement groups for GPU actors ### Why Shared Cache? HuggingFace Transformers converts models to Megatron format on first use: * Without shared cache: Each node converts independently → race conditions * With shared cache: Rank 0 converts once, all nodes share the result *** ## 8. File Structure Reference After setup, your directory structure should look like: ``` $HOME/ ├── RL-nano-v3/ # Project root │ ├── data/ │ │ ├── train-split.jsonl # Training data │ │ └── val-split.jsonl # Validation data │ ├── model/ # Nemotron 3 Nano 30B model (~63GB) │ ├── examples/nemo_gym/ │ │ ├── grpo_nanov3.yaml # Training config │ │ └── run_grpo_nemo_gym.py # Training script │ ├── logs/ # Training outputs │ │ ├── exp_*/ # Experiment directories │ │ ├── slurm-*.out # Slurm stdout │ │ └── slurm-*.err # Slurm stderr │ ├── checkpoints/ # Saved model checkpoints │ ├── .cache/ # Shared HuggingFace cache │ └── launch_nemotron_training.sh # Launch script └── nemoRL/nemo-rl/ ├── ray.sub # Ray orchestration script └── JOBID-logs/ # Ray cluster logs ├── STARTED_RAY_HEAD # Ray ready sentinel ├── ray-head.log # Head node log ├── ray-worker-*.log # Worker node logs └── ray-driver.log # Training execution log ``` *** ## Next Steps Congratulations! You've successfully set up and launched Nemotron 3 Nano 30B multi-node training using Ray and Slurm. Browse available environments on GitHub to find other training options. Create your own resources server with custom tools and verification logic.