Nemotron 3 Nano

This tutorial walks through the complete setup for distributed training of Nemotron 3 Nano 30B across multiple nodes using Slurm and Ray.

Goal: Train Nemotron 3 Nano 30B on 2 nodes using GRPO with proper multi-node Ray cluster coordination.

In this section, you will:

  1. Set up the Nemotron 3 Nano 30B training environment
  2. Download and prepare the training dataset
  3. Configure the launch script for multi-node coordination
  4. Submit and monitor the multi-node training job

Prerequisites

Before starting, complete the NeMo RL GRPO tutorial to understand the NeMo RL training workflow and GRPO fundamentals.

You’ll also need:

  • ✅ Access to a Slurm cluster with enroot/pyxis container support (a quick check is shown after this list)
  • ✅ Access to NeMo RL container: nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
  • ✅ Understanding of Ray distributed computing framework
  • ✅ Sufficient storage space (~110GB for model, data, and cache; checkpoints and logs accumulate with each run)
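
If you are unsure whether your cluster meets the container prerequisite, a quick sanity check from a login node looks something like this (assuming enroot and the pyxis Slurm plugin are installed in the usual way):

$# Check that enroot is installed
$enroot version
$
$# Check that the pyxis plugin exposes container flags on srun
$srun --help | grep -- --container-image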

1. Initial Setup

1.1 Set Workspace Directory

Choose a location with sufficient space (~110GB minimum; 200GB or more recommended, since checkpoints and logs accumulate):

$# Set workspace directory (adjust to your cluster's large storage)
># Examples: /scratch/$USER, /work/$USER, /data/$USER, /lustre/.../users/$USER
>export WORKSPACE=/path/to/large/storage/$USER   # exported so later scripts can read it
>
># Verify space available
>df -h $WORKSPACE

✅ Success Check: Directory has at least 200GB available space.


1.2 Clone the Repository

Clone the Nemotron 3 Nano v3 branch of NeMo RL:

$cd $WORKSPACE
$git clone --recurse-submodules -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git RL-nano-v3
$cd RL-nano-v3

✅ Success Check: Repository cloned with nano-v3 branch checked out.
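
A quick way to confirm the checkout (assuming a reasonably recent git; `--show-current` requires git 2.22+):

$git branch --show-current
$# Expected: nano-v3
$
$# Submodules should be initialized (no leading '-' in the status output)
$git submodule status | head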


1.3 Prepare Container Image

Option A: Use Registry Path Directly (Recommended for First Run)

Use the container directly from NVIDIA Container Registry:

$# No preparation needed - will be pulled automatically during job execution
$CONTAINER=docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano

This is the simplest approach but adds ~5-10 minutes to job startup time for first use.


Option B: Pre-Pull Container (Optional - For Faster Job Startup)

For faster job startup on subsequent runs, pre-pull and convert to .sqsh format:

Step 1: Get NGC API Key

  1. Go to https://org.ngc.nvidia.com/setup/api-keys
  2. Generate an API key
  3. Configure enroot credentials:
$mkdir -p ~/.config/enroot
$echo "machine nvcr.io login \$oauthtoken password <YOUR_API_KEY>" >> ~/.config/enroot/.credentials

Step 2: Pull Container Using Sbatch

Due to head node restrictions, pull the container from a compute node:

Create pull_container.sh:

$#!/bin/bash
$#SBATCH --job-name=enroot-import
$#SBATCH --account=<your_account>
$#SBATCH --partition=<partition_name>
$#SBATCH --nodes=1
$#SBATCH --ntasks=1
$#SBATCH --time=01:00:00
$#SBATCH --output=enroot-import-%j.out
$
$# Set workspace directory (adjust to your cluster's large storage)
>WORKSPACE=/path/to/large/storage/$USER
>
># Cache enroot layers on large storage (export so enroot picks it up)
>export ENROOT_CACHE_PATH=$WORKSPACE/.cache/enroot
>mkdir -p "$ENROOT_CACHE_PATH"
>
>enroot import -o "$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh" \
> "docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano"

Submit the job:

$sbatch pull_container.sh

Step 3: Use Local Container

Update your launch script to use the local .sqsh file:

$CONTAINER=$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh

✅ Success Check: Container file exists (~15GB) or registry path configured.
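
If you chose Option B, one way to verify the import once the job finishes:

$ls -lh $WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh
$# Expected: a single .sqsh file of roughly 15GB
$
$# The sbatch output file should end without errors
$tail enroot-import-*.out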

1.4 Install uv Tool

Install uv (which includes uvx) for downloading HuggingFace models and datasets:

$# Install uv
$curl -LsSf https://astral.sh/uv/install.sh | sh
$
$# Add to PATH (uv installs to ~/.local/bin)
$export PATH="$HOME/.local/bin:$PATH"
$
$# Verify installation
$uvx --version

✅ Success Check: Command shows uv version number.


1.5 Download and Process Training Data

Download and process the dataset on a compute node (head nodes have limited memory):

Create prepare_data.sh:

$#!/bin/bash
$#SBATCH --job-name=prepare-data
$#SBATCH --account=<your_account>
$#SBATCH --partition=<partition_name>
$#SBATCH --nodes=1
$#SBATCH --ntasks=1
$#SBATCH --time=00:20:00
$#SBATCH --output=prepare-data-%j.out
$
$# Set workspace directory (adjust to your cluster's large storage)
>WORKSPACE=/path/to/large/storage/$USER
>
># Make sure uv/uvx (installed to ~/.local/bin in step 1.4) is on PATH
>export PATH="$HOME/.local/bin:$PATH"
>
># Data directory
>DATA_DIR=${WORKSPACE}/RL-nano-v3/data
>
># Download dataset
>uvx --from huggingface-hub hf download nvidia/Nemotron-3-Nano-RL-Training-Blend \
> --repo-type dataset \
> --local-dir ${DATA_DIR}
>
># Fill in placeholders
>chmod +x ${DATA_DIR}/create_nanov3_jsonl.py
>${DATA_DIR}/create_nanov3_jsonl.py --input ${DATA_DIR}/train.jsonl --output ${DATA_DIR}/train-full.jsonl
>
># Split: reserve last 1000 rows for validation
>head -n -1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/train-split.jsonl
>tail -n 1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/val-split.jsonl
>
># Verify split
>wc -l ${DATA_DIR}/train-split.jsonl ${DATA_DIR}/val-split.jsonl

Submit the job:

$# If dataset requires authentication, export HF_TOKEN:
$# sbatch --export=HF_TOKEN prepare_data.sh
$
$# Otherwise:
$sbatch prepare_data.sh

Why use a compute node? The create_nanov3_jsonl.py script is memory-intensive and may fail on head nodes, which typically enforce resource limits. Running it on a compute node ensures sufficient memory.

✅ Success Check: Job completes and creates train-split.jsonl and val-split.jsonl.
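
To spot-check the processed data beyond line counts, you can pretty-print the first record (assuming python3 is available on the login node):

$head -n 1 $WORKSPACE/RL-nano-v3/data/train-split.jsonl | python3 -m json.tool | head -n 20
$
$wc -l $WORKSPACE/RL-nano-v3/data/train-split.jsonl $WORKSPACE/RL-nano-v3/data/val-split.jsonl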


1.6 Download Model

Download the Nemotron 3 Nano 30B model:

$cd $WORKSPACE/RL-nano-v3
$
$uvx --from huggingface-hub hf download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
> --repo-type model \
> --local-dir model

✅ Success Check: Model files downloaded (~59GB total) to model/ directory.


1.7 Verify Setup

Confirm all components are in place:

$cd $WORKSPACE/RL-nano-v3
$
$# Check directory structure
$ls -lh
$# Expected: data/, model/, examples/, nemo_rl/, etc.
$
$# Check data files
$ls -lh data/train-split.jsonl data/val-split.jsonl
$
$# Check model size and key files
$du -sh model/
$# Expected: ~59GB
$
$# Verify essential model files exist
$ls model/config.json model/*.safetensors
$# Should show config.json and 13 safetensors files

✅ Success Check: All directories and files present with correct sizes.


2. Create Launch Script

Create a launcher script that properly handles multi-node Ray coordination:

$cd $WORKSPACE/RL-nano-v3

Create launch_nemotron_training.sh:

$#!/bin/bash
$# Nemotron 3 Nano 30B Multi-Node Training Launcher
$
$# Configuration
$HOST_BASE=$WORKSPACE # Or your preferred base directory
$NUM_NODES=2 # Change to 32 for large-scale training
$
$# Paths
$DATA_DIR=${HOST_BASE}/RL-nano-v3/data
$MODEL_CHECKPOINT=${HOST_BASE}/RL-nano-v3/model
$CONFIG_PATH=${HOST_BASE}/RL-nano-v3/examples/nemo_gym/grpo_nanov3.yaml
$LOG_DIR=${HOST_BASE}/RL-nano-v3/logs
$CKPT_DIR=${HOST_BASE}/RL-nano-v3/checkpoints
$CACHE_DIR=${HOST_BASE}/RL-nano-v3/.cache
$
$# Training command with shared cache directory
$TRAINING_CMD="cd ${HOST_BASE}/RL-nano-v3 && \
>mkdir -p ${LOG_DIR} ${CKPT_DIR} ${CACHE_DIR} && \
>export HF_HOME=${CACHE_DIR}/huggingface && \
>export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface && \
>uv run examples/nemo_gym/run_grpo_nemo_gym.py \
> --config ${CONFIG_PATH} \
> policy.model_name=${MODEL_CHECKPOINT} \
> data.train_jsonl_fpath=${DATA_DIR}/train-split.jsonl \
> data.validation_jsonl_fpath=${DATA_DIR}/val-split.jsonl \
> ++logger.log_dir=${LOG_DIR} \
> logger.wandb_enabled=False \
> logger.tensorboard_enabled=True \
> ++checkpointing.enabled=True \
> ++checkpointing.checkpoint_dir=${CKPT_DIR} \
> cluster.num_nodes=${NUM_NODES} \
> cluster.gpus_per_node=8"
$
$echo "Submitting ${NUM_NODES}-node training job..."
$echo "Using BASE_LOG_DIR: ${HOST_BASE}/nemoRL/nemo-rl"
$
$# Submit job
$BASE_LOG_DIR=${HOST_BASE}/nemoRL/nemo-rl \
>COMMAND="$TRAINING_CMD" \
>CONTAINER="docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano" \
>MOUNTS="${HOST_BASE}:${HOST_BASE}" \
>sbatch \
> --nodes=${NUM_NODES} \
> --account=<your_account> \
> --job-name=nemotron-nano-30b \
> --time=8:00:00 \
> --gres=gpu:8 \
> --chdir=/tmp \
> --output=${LOG_DIR}/slurm-%j.out \
> --error=${LOG_DIR}/slurm-%j.err \
> ${HOST_BASE}/nemoRL/nemo-rl/ray.sub

Make it executable:

$chmod +x launch_nemotron_training.sh

Key Configuration Points:

  • NUM_NODES=2: 2 nodes × 8 GPUs = 16 GPUs total. For large-scale training, change to NUM_NODES=32 (256 GPUs total)
  • --chdir=/tmp: Sets a neutral working directory for the job
  • HF_HOME and TRANSFORMERS_CACHE: Set to shared storage so all nodes can access model conversions
  • BASE_LOG_DIR: Specifies where Ray cluster logs will be written
  • --account: Replace <your_account> with your Slurm account name
  • --time=8:00:00: Adjust based on your cluster’s limits

✅ Success Check: Script created and executable.
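
Before submitting, a no-execute parse catches quoting or line-continuation mistakes in the launcher (a quick sanity check, not a full validation):

$bash -n launch_nemotron_training.sh && echo "Syntax OK"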


3. Submit Training Job

Run the launch script from a neutral directory like /tmp to ensure consistent container working directory behavior across different cluster configurations.

$# Run from /tmp for best compatibility
$cd /tmp
$bash $WORKSPACE/RL-nano-v3/launch_nemotron_training.sh

Expected output:

Submitting 2-node training job...
Using BASE_LOG_DIR: /path/to/large/storage/user/nemoRL/nemo-rl
Submitted batch job 9453356

✅ Success Check: Job submitted successfully with job ID returned.
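
For the monitoring steps below, it is convenient to capture the job ID in a variable. One way, assuming this is your only queued job with this name:

$JOBID=$(squeue -u $USER --name=nemotron-nano-30b -h -o "%i" | tail -n 1)
$echo $JOBID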


4. Monitor Job Status

Monitor your submitted job:

$# Check if job is running (replace JOBID with your job number)
$squeue --job=JOBID
$
$# Detailed status
$squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Note: For job state codes (PD, R, CD, etc.), see Slurm documentation.

✅ Success Check: Job transitions from PD to R state.
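
If the job leaves the queue unexpectedly, sacct shows its final state (assuming Slurm accounting is enabled on your cluster):

$sacct -j $JOBID --format=JobID,JobName%30,State,Elapsed,NNodes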


5. Monitor Training Progress

5.1 Check Ray Cluster Logs

Wait 1-2 minutes for Ray cluster to initialize, then check logs:

$# Set your job ID
$JOBID=your_job_id
$
$# Check if Ray head started
$ls $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/
$
$# Verify Ray head is ready
$cat $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/STARTED_RAY_HEAD
$# File should exist if Ray initialized successfully
$
$# View training execution (most important)
$tail -100 $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log
$
$# Follow training progress live
$tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log

5.2 Verify Ray Cluster Formation

Check that all Ray actors are online:

$# For 2-node job (2 nodes × 8 GPUs = 16 actors)
$grep "Number of actors online: 16/16" $WORKSPACE/RL-nano-v3/logs/slurm-${JOBID}.out
$
$# For 32-node job, look for: 256/256

✅ Success Check: All actors online (16/16 for 2 nodes, or 256/256 for 32 nodes).
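
While the cluster is still forming, the most recent count shows how many actors have joined so far:

$grep "Number of actors online" $WORKSPACE/RL-nano-v3/logs/slurm-${JOBID}.out | tail -n 1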


5.3 Watch Training Metrics

Monitor rollout collection progress:

$# Watch rollout collection
$tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log | grep "Collecting rollouts"
$
$# Example output:
$# Collecting rollouts: 21%|██ | 428/2048 [02:01<05:42, 4.73it/s]
$# Collecting rollouts: 25%|██▌ | 512/2048 [03:15<08:12, 3.12it/s]

Check TensorBoard logs:

$# List experiment directories
$ls -ltr $WORKSPACE/RL-nano-v3/logs/
$
$# Check TensorBoard events
$find $WORKSPACE/RL-nano-v3/logs/exp_*/tensorboard/ -name "*.tfevents.*" -mmin -5

✅ Success Check: Rollout percentage increasing steadily, TensorBoard events being written.
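
One way to view the TensorBoard events is to run TensorBoard on the login node and forward the port to your machine (assuming SSH port forwarding is allowed; user@login-node is a placeholder):

$# On the login node
$uvx tensorboard --logdir $WORKSPACE/RL-nano-v3/logs --port 6006
$
$# On your local machine, then open http://localhost:6006
$# ssh -L 6006:localhost:6006 user@login-node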


6. Troubleshooting

Issue: Job Stays in Pending (PD) State

Check the reason:

$squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Common reasons:

  • (Priority): Waiting in queue for resources
  • (Resources): Not enough nodes available
  • (QOSMaxNodePerUserLimit): Exceeds node limit

Solution: Wait for resources, or adjust job parameters.


Issue: Ray Head Doesn’t Start

Symptom: No STARTED_RAY_HEAD file in logs directory.

Check Ray head log:

$cat $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-head.log

Solution: Check logs for errors related to container startup or resource allocation.
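
A quick scan of the head and worker logs often points at the failing step:

$grep -iE "error|failed" $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-head.log | head
$ls $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-worker-*.log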


Issue: Training Crashes with Cache Errors

Symptom: FileNotFoundError mentioning run_config.yaml in ray-driver.log.

Check logs:

$grep "FileNotFoundError.*run_config.yaml" $WORKSPACE/nemoRL/nemo-rl/JOBID-logs/ray-driver.log

Root cause: Model conversion saved to local node cache, inaccessible to other nodes.

Solution: Verify shared cache directories are set in TRAINING_CMD:

$export HF_HOME=${CACHE_DIR}/huggingface
$export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface
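
It can also help to confirm that the cache directory lives on a shared filesystem (visible from every node) rather than node-local disk:

$df -h $WORKSPACE/RL-nano-v3/.cache
$# The filesystem shown should be your shared storage (e.g. Lustre/NFS), not a node-local disk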

7. Key Technical Details

Why Ray.sub?

Without ray.sub, each node would start its own independent Ray cluster. The ray.sub script from NeMo RL:

  1. Starts a Ray head on the first node
  2. Connects all worker nodes to that head
  3. Creates a unified distributed cluster
  4. Manages placement groups for GPU actors
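
Conceptually, the coordination that ray.sub automates looks roughly like the standard Ray CLI bootstrap below (a simplified sketch, not the actual script contents):

$# On the first node allocated by Slurm (the head)
$ray start --head --port=6379
$
$# On every other node, pointing at the head node's address (placeholder below)
$ray start --address="<head_node_ip>:6379"
$
$# The training driver then attaches to the existing cluster, e.g. ray.init(address="auto")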

Why Shared Cache?

On first use, NeMo RL converts the HuggingFace checkpoint to Megatron format and caches the result under HF_HOME:

  • Without shared cache: Each node converts independently → race conditions
  • With shared cache: Rank 0 converts once, all nodes share the result

8. File Structure Reference

After setup, your directory structure should look like:

$WORKSPACE/
├── RL-nano-v3/                        # Project root
│   ├── data/
│   │   ├── train-split.jsonl          # Training data
│   │   └── val-split.jsonl            # Validation data
│   ├── model/                         # Nemotron 3 Nano 30B model (~59GB)
│   ├── examples/nemo_gym/
│   │   ├── grpo_nanov3.yaml           # Training config
│   │   └── run_grpo_nemo_gym.py       # Training script
│   ├── logs/                          # Training outputs
│   │   ├── exp_*/                     # Experiment directories
│   │   ├── slurm-*.out                # Slurm stdout
│   │   └── slurm-*.err                # Slurm stderr
│   ├── checkpoints/                   # Saved model checkpoints
│   ├── .cache/                        # Shared HuggingFace cache
│   └── launch_nemotron_training.sh    # Launch script
└── nemoRL/nemo-rl/
    ├── ray.sub                        # Ray orchestration script
    └── JOBID-logs/                    # Ray cluster logs
        ├── STARTED_RAY_HEAD           # Ray ready sentinel
        ├── ray-head.log               # Head node log
        ├── ray-worker-*.log           # Worker node logs
        └── ray-driver.log             # Training execution log

Next Steps

Congratulations! You’ve successfully set up and launched Nemotron 3 Nano 30B multi-node training using Ray and Slurm.