Nemotron 3 Nano 30B#
This tutorial walks through the complete setup for distributed training of Nemotron 3 Nano 30B across multiple nodes using Slurm and Ray.
Goal: Train Nemotron 3 Nano 30B on 2 nodes using GRPO with proper multi-node Ray cluster coordination.
In this section, you will:
Set up the Nemotron 3 Nano 30B training environment
Download and prepare the training dataset
Configure the launch script for multi-node coordination
Submit and monitor the multi-node training job
Prerequisites#
Before starting, complete the NeMo RL GRPO tutorial to understand the NeMo RL training workflow and GRPO fundamentals.
You’ll also need:
✅ Access to Slurm cluster with enroot/pyxis container support
✅ Access to NeMo RL container: nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
✅ Understanding of the Ray distributed computing framework
✅ Sufficient storage space (~110GB for model, data, and cache; checkpoints and logs accumulate with each run)
1. Initial Setup#
1.1 Set Workspace Directory#
Choose a location with sufficient space (~110GB minimum):
# Set workspace directory (adjust to your cluster's large storage)
# Examples: /scratch/$USER, /work/$USER, /data/$USER, /lustre/.../users/$USER
WORKSPACE=/path/to/large/storage/$USER
# Verify space available
df -h $WORKSPACE
✅ Success Check: Directory has at least 200GB of available space (the ~110GB baseline plus headroom for accumulating checkpoints and logs).
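If you prefer to check the space programmatically rather than reading the `df` output by hand, a small sketch (assuming GNU coreutils, whose `df` supports `-BG` and `--output=avail`):

```shell
# Sketch: warn if the workspace has less than the required free space.
# Assumes GNU df (-BG reports sizes in whole gigabytes).
WORKSPACE=${WORKSPACE:-.}          # fall back to the current directory
REQUIRED_GB=${REQUIRED_GB:-200}

avail_gb=$(df -BG --output=avail "$WORKSPACE" | tail -n 1 | tr -dc '0-9')
if [ "$avail_gb" -ge "$REQUIRED_GB" ]; then
  echo "OK: ${avail_gb}GB free in $WORKSPACE"
else
  echo "WARNING: only ${avail_gb}GB free in $WORKSPACE (need ${REQUIRED_GB}GB)"
fi
```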
1.2 Clone the Repository#
Clone the Nemotron 3 Nano v3 branch of NeMo RL:
cd $WORKSPACE
git clone --recurse-submodules -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git RL-nano-v3
cd RL-nano-v3
✅ Success Check: Repository cloned with nano-v3 branch checked out.
1.3 Prepare Container Image#
Option A: Use Registry Path Directly (Recommended for First Run)
Use the container directly from NVIDIA Container Registry:
# No preparation needed - will be pulled automatically during job execution
CONTAINER=docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
This is the simplest approach, but pulling the image adds ~5-10 minutes to job startup on first use.
Option B: Pre-Pull Container (Optional - For Faster Job Startup)
For faster job startup on subsequent runs, pre-pull and convert to .sqsh format:
Step 1: Get NGC API Key
Go to https://org.ngc.nvidia.com/setup/api-keys
Generate an API key
Configure enroot credentials:
mkdir -p ~/.config/enroot
echo "machine nvcr.io login \$oauthtoken password <YOUR_API_KEY>" >> ~/.config/enroot/.credentials
Step 2: Pull Container Using Sbatch
Due to head node restrictions, pull the container from a compute node:
Create pull_container.sh:
#!/bin/bash
#SBATCH --job-name=enroot-import
#SBATCH --account=<your_account>
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=enroot-import-%j.out
# Set workspace directory (adjust to your cluster's large storage)
WORKSPACE=/path/to/large/storage/$USER
ENROOT_CACHE_PATH=$WORKSPACE/.cache/enroot
mkdir -p "$ENROOT_CACHE_PATH"
export ENROOT_CACHE_PATH   # enroot reads this from the environment
enroot import -o "$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh" \
"docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano"
Submit the job:
sbatch pull_container.sh
Step 3: Use Local Container
Update your launch script to use the local .sqsh file:
CONTAINER=$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh
✅ Success Check: Container file exists (~15GB) or registry path configured.
1.4 Install uv Tool#
Install uv (which includes uvx) for downloading HuggingFace models and datasets:
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Add to PATH (uv installs to ~/.local/bin)
export PATH="$HOME/.local/bin:$PATH"
# Verify installation
uvx --version
✅ Success Check: Command shows uv version number.
1.5 Download and Process Training Data#
Download and process the dataset on a compute node (head nodes have limited memory):
Create prepare_data.sh:
#!/bin/bash
#SBATCH --job-name=prepare-data
#SBATCH --account=<your_account>
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --output=prepare-data-%j.out
# Set workspace directory (adjust to your cluster's large storage)
WORKSPACE=/path/to/large/storage/$USER
# Data directory
DATA_DIR=${WORKSPACE}/RL-nano-v3/data/
# Download dataset
uvx --from huggingface-hub hf download nvidia/Nemotron-3-Nano-RL-Training-Blend \
--repo-type dataset \
--local-dir ${DATA_DIR}
# Fill in placeholders
chmod +x ${DATA_DIR}/create_nanov3_jsonl.py
${DATA_DIR}/create_nanov3_jsonl.py --input ${DATA_DIR}/train.jsonl --output ${DATA_DIR}/train-full.jsonl
# Split: reserve last 1000 rows for validation
head -n -1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/train-split.jsonl
tail -n 1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/val-split.jsonl
# Verify split
wc -l ${DATA_DIR}/train-split.jsonl ${DATA_DIR}/val-split.jsonl
Submit the job:
# If dataset requires authentication, export HF_TOKEN:
# sbatch --export=HF_TOKEN prepare_data.sh
# Otherwise:
sbatch prepare_data.sh
Note
Why use a compute node? The create_nanov3_jsonl.py script is memory-intensive and may fail on head nodes, which typically have resource limits. Running it on a compute node ensures sufficient memory.
✅ Success Check: Job completes and creates train-split.jsonl and val-split.jsonl.
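The head/tail split used in prepare_data.sh can be sanity-checked on a small synthetic file, here reserving the last 3 of 10 lines instead of 1000 (`head -n -K` is GNU coreutils behavior: print all but the last K lines):

```shell
# Sketch: verify the split logic on a toy file.
seq 10 > toy.jsonl
head -n -3 toy.jsonl > toy-train.jsonl   # lines 1-7
tail -n  3 toy.jsonl > toy-val.jsonl     # lines 8-10
wc -l toy-train.jsonl toy-val.jsonl
# Expected: 7 toy-train.jsonl, 3 toy-val.jsonl (plus a "10 total" line)
rm toy.jsonl toy-train.jsonl toy-val.jsonl
```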
1.6 Download Model#
Download the Nemotron 3 Nano 30B model:
cd $WORKSPACE/RL-nano-v3
uvx --from huggingface-hub hf download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--repo-type model \
--local-dir model
✅ Success Check: Model files downloaded (~59GB total) to model/ directory.
1.7 Verify Setup#
Confirm all components are in place:
cd $WORKSPACE/RL-nano-v3
# Check directory structure
ls -lh
# Expected: data/, model/, examples/, nemo_rl/, etc.
# Check data files
ls -lh data/train-split.jsonl data/val-split.jsonl
# Check model size and key files
du -sh model/
# Expected: ~59GB
# Verify essential model files exist
ls model/config.json model/*.safetensors
# Should show config.json and 13 safetensors files
✅ Success Check: All directories and files present with correct sizes.
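If you would rather have a single pass/fail summary than eyeball each file, a small sketch that mirrors the checks above (run it from $WORKSPACE/RL-nano-v3):

```shell
# Sketch: one-shot check of the expected layout.
fail=0
check() {
  if [ -e "$1" ]; then echo "OK   $1"; else echo "MISS $1"; fail=1; fi
}
check data/train-split.jsonl
check data/val-split.jsonl
check model/config.json
check examples/nemo_gym/grpo_nanov3.yaml
[ "$fail" -eq 0 ] && echo "All checks passed" || echo "Some files missing"
```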
2. Create Launch Script#
Create a launcher script that properly handles multi-node Ray coordination:
cd $WORKSPACE/RL-nano-v3
Create launch_nemotron_training.sh:
#!/bin/bash
# Nemotron 3 Nano 30B Multi-Node Training Launcher
# Configuration
HOST_BASE=$WORKSPACE # Or your preferred base directory; WORKSPACE must be set in your shell (see Section 1.1)
NUM_NODES=2 # Change to 32 for large-scale training
# Paths
DATA_DIR=${HOST_BASE}/RL-nano-v3/data
MODEL_CHECKPOINT=${HOST_BASE}/RL-nano-v3/model
CONFIG_PATH=${HOST_BASE}/RL-nano-v3/examples/nemo_gym/grpo_nanov3.yaml
LOG_DIR=${HOST_BASE}/RL-nano-v3/logs
CKPT_DIR=${HOST_BASE}/RL-nano-v3/checkpoints
CACHE_DIR=${HOST_BASE}/RL-nano-v3/.cache
# Training command with shared cache directory
TRAINING_CMD="cd ${HOST_BASE}/RL-nano-v3 && \
mkdir -p ${LOG_DIR} ${CKPT_DIR} ${CACHE_DIR} && \
export HF_HOME=${CACHE_DIR}/huggingface && \
export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface && \
uv run examples/nemo_gym/run_grpo_nemo_gym.py \
--config ${CONFIG_PATH} \
policy.model_name=${MODEL_CHECKPOINT} \
data.train_jsonl_fpath=${DATA_DIR}/train-split.jsonl \
data.validation_jsonl_fpath=${DATA_DIR}/val-split.jsonl \
++logger.log_dir=${LOG_DIR} \
logger.wandb_enabled=False \
logger.tensorboard_enabled=True \
++checkpointing.enabled=True \
++checkpointing.checkpoint_dir=${CKPT_DIR} \
cluster.num_nodes=${NUM_NODES} \
cluster.gpus_per_node=8"
echo "Submitting ${NUM_NODES}-node training job..."
echo "Using BASE_LOG_DIR: ${HOST_BASE}/nemoRL/nemo-rl"
# Submit job
BASE_LOG_DIR=${HOST_BASE}/nemoRL/nemo-rl \
COMMAND="$TRAINING_CMD" \
CONTAINER="docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano" \
MOUNTS="${HOST_BASE}:${HOST_BASE}" \
sbatch \
--nodes=${NUM_NODES} \
--account=<your_account> \
--job-name=nemotron-nano-30b \
--time=8:00:00 \
--gres=gpu:8 \
--chdir=/tmp \
--output=${LOG_DIR}/slurm-%j.out \
--error=${LOG_DIR}/slurm-%j.err \
${HOST_BASE}/nemoRL/nemo-rl/ray.sub
Make it executable:
chmod +x launch_nemotron_training.sh
Tip
Key Configuration Points:
NUM_NODES=2: 2 nodes × 8 GPUs = 16 GPUs total. For large-scale training, change to NUM_NODES=32 (256 GPUs total)
--chdir=/tmp: Sets a neutral working directory for the job
HF_HOME and TRANSFORMERS_CACHE: Set to shared storage so all nodes can access model conversions
BASE_LOG_DIR: Specifies where Ray cluster logs will be written
--account: Replace <your_account> with your Slurm account name
--time=8:00:00: Adjust based on your cluster's limits
✅ Success Check: Script created and executable.
3. Submit Training Job#
Important
Run the launch script from a neutral directory like /tmp to ensure consistent container working directory behavior across different cluster configurations.
# Run from /tmp for best compatibility
cd /tmp
bash $WORKSPACE/RL-nano-v3/launch_nemotron_training.sh
Expected output:
Submitting 2-node training job...
Using BASE_LOG_DIR: /path/to/large/storage/user/nemoRL/nemo-rl
Submitted batch job 9453356
✅ Success Check: Job submitted successfully with job ID returned.
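The job ID is the fourth word of sbatch's output line, so you can capture it in a variable for the monitoring commands in the next sections (alternatively, sbatch's --parsable flag prints the bare ID):

```shell
# Sketch: extract the job ID from sbatch's standard output.
submit_output="Submitted batch job 9453356"   # sample line from above
JOBID=$(echo "$submit_output" | awk '{print $4}')
echo "JOBID=$JOBID"
# → JOBID=9453356
```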
4. Monitor Job Status#
Monitor your submitted job:
# Check if job is running (replace JOBID with your job number)
squeue --job=JOBID
# Detailed status
squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
Note: For job state codes (PD, R, CD, etc.), see Slurm documentation.
✅ Success Check: Job transitions from PD to R state.
5. Monitor Training Progress#
5.1 Check Ray Cluster Logs#
Wait 1-2 minutes for Ray cluster to initialize, then check logs:
# Set your job ID
JOBID=your_job_id
# Check if Ray head started
ls $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/
# Verify Ray head is ready
cat $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/STARTED_RAY_HEAD
# File should exist if Ray initialized successfully
# View training execution (most important)
tail -100 $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log
# Follow training progress live
tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log
5.2 Verify Ray Cluster Formation#
Check that all Ray actors are online:
# For 2-node job (2 nodes × 8 GPUs = 16 actors)
grep "Number of actors online: 16/16" $WORKSPACE/RL-nano-v3/logs/slurm-${JOBID}.out
# For 32-node job, look for: 256/256
✅ Success Check: All actors online (16/16 for 2 nodes, or 256/256 for 32 nodes).
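The number to grep for is simply nodes × GPUs per node, so you can derive it from the launch configuration instead of hard-coding it:

```shell
# Sketch: derive the expected actor count from the cluster shape.
NUM_NODES=2        # match the value in launch_nemotron_training.sh
GPUS_PER_NODE=8
EXPECTED=$((NUM_NODES * GPUS_PER_NODE))
echo "Number of actors online: ${EXPECTED}/${EXPECTED}"
# → Number of actors online: 16/16   (the string to grep for)
```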
5.3 Watch Training Metrics#
Monitor rollout collection progress:
# Watch rollout collection
tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log | grep "Collecting rollouts"
# Example output:
# Collecting rollouts: 21%|██ | 428/2048 [02:01<05:42, 4.73it/s]
# Collecting rollouts: 25%|██▌ | 512/2048 [03:15<08:12, 3.12it/s]
Check TensorBoard logs:
# List experiment directories
ls -ltr $WORKSPACE/RL-nano-v3/logs/
# Check TensorBoard events
find $WORKSPACE/RL-nano-v3/logs/exp_*/tensorboard/ -name "*.tfevents.*" -mmin -5
✅ Success Check: Rollout percentage increasing steadily, TensorBoard events being written.
6. Troubleshooting#
Issue: Job Stays in Pending (PD) State#
Check the reason:
squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
Common reasons:
(Priority): Waiting in queue for resources
(Resources): Not enough nodes available
(QOSMaxNodePerUserLimit): Exceeds node limit
Solution: Wait for resources, or adjust job parameters.
Issue: Ray Head Doesn’t Start#
Symptom: No STARTED_RAY_HEAD file in logs directory.
Check Ray head log:
cat $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-head.log
Solution: Check logs for errors related to container startup or resource allocation.
Issue: Training Crashes with Cache Errors#
Symptom: FileNotFoundError mentioning run_config.yaml in ray-driver.log.
Check logs:
grep "FileNotFoundError.*run_config.yaml" $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-driver.log
Root cause: Model conversion saved to local node cache, inaccessible to other nodes.
Solution: Verify shared cache directories are set in TRAINING_CMD:
export HF_HOME=${CACHE_DIR}/huggingface
export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface
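A quick sanity check can catch this before a long run. This is a sketch; the node-local path prefixes (/tmp, /raid, /local) are assumptions, so adjust them to whatever node-local mounts your cluster uses:

```shell
# Sketch: warn if the HF cache resolves to a node-local path rather
# than shared storage visible to every node.
CACHE_DIR=${CACHE_DIR:-/path/to/large/storage/$USER/RL-nano-v3/.cache}
HF_HOME=${CACHE_DIR}/huggingface
case "$HF_HOME" in
  /tmp/*|/raid/*|/local/*) echo "WARNING: HF_HOME looks node-local: $HF_HOME" ;;
  *)                       echo "HF_HOME OK: $HF_HOME" ;;
esac
```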
7. Key Technical Details#
Why Ray.sub?#
Without ray.sub, each node would start its own independent Ray cluster. The ray.sub script from NeMo RL:
Starts a Ray head on the first node
Connects all worker nodes to that head
Creates a unified distributed cluster
Manages placement groups for GPU actors
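Conceptually, the coordination looks like this (a simplified sketch, not the actual ray.sub contents): the first node in the allocation runs the head, and every other node joins it by address.

```shell
# Conceptual sketch only. On a real job the node list would come from:
#   scontrol show hostnames "$SLURM_JOB_NODELIST"
nodes="node001 node002"
head_node=$(echo "$nodes" | awk '{print $1}')
head_port=6379

for node in $nodes; do
  if [ "$node" = "$head_node" ]; then
    echo "[$node] ray start --head --port=${head_port}"
  else
    echo "[$node] ray start --address=${head_node}:${head_port}"
  fi
done
# → [node001] ray start --head --port=6379
# → [node002] ray start --address=node001:6379
```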
8. File Structure Reference#
After setup, your directory structure should look like:
$WORKSPACE/
├── RL-nano-v3/ # Project root
│ ├── data/
│ │ ├── train-split.jsonl # Training data
│ │ └── val-split.jsonl # Validation data
│ ├── model/ # Nemotron 3 Nano 30B model (~59GB)
│ ├── examples/nemo_gym/
│ │ ├── grpo_nanov3.yaml # Training config
│ │ └── run_grpo_nemo_gym.py # Training script
│ ├── logs/ # Training outputs
│ │ ├── exp_*/ # Experiment directories
│ │ ├── slurm-*.out # Slurm stdout
│ │ └── slurm-*.err # Slurm stderr
│ ├── checkpoints/ # Saved model checkpoints
│ ├── .cache/ # Shared HuggingFace cache
│ └── launch_nemotron_training.sh # Launch script
└── nemoRL/nemo-rl/
├── ray.sub # Ray orchestration script
└── JOBID-logs/ # Ray cluster logs
├── STARTED_RAY_HEAD # Ray ready sentinel
├── ray-head.log # Head node log
├── ray-worker-*.log # Worker node logs
└── ray-driver.log # Training execution log
Next Steps#
Congratulations! You’ve successfully set up and launched Nemotron 3 Nano 30B multi-node training using Ray and Slurm.
Browse available environments on GitHub to find other training options.
Create your own resources server with custom tools and verification logic.