Nemotron 3 Nano 30B#

This tutorial walks through the complete setup for distributed training of Nemotron 3 Nano 30B across multiple nodes using Slurm and Ray.

Goal: Train Nemotron 3 Nano 30B on 2 nodes using GRPO with proper multi-node Ray cluster coordination.

In this section, you will:

Set up the Nemotron 3 Nano 30B training environment
Download and prepare the training dataset
Configure the launch script for multi-node coordination
Submit and monitor the multi-node training job

Prerequisites#

Before starting, complete the NeMo RL GRPO tutorial to understand the NeMo RL training workflow and GRPO fundamentals.

You’ll also need:

✅ Access to a Slurm cluster with enroot/pyxis container support
✅ Access to the NeMo RL container: nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano
✅ Understanding of the Ray distributed computing framework
✅ Sufficient storage space (~110GB for model, data, and cache; checkpoints and logs accumulate with each run)

1. Initial Setup#

1.1 Set Workspace Directory#

Choose a location with sufficient space (~110GB minimum):

# Set workspace directory (adjust to your cluster's large storage)
# Examples: /scratch/$USER, /work/$USER, /data/$USER, /lustre/.../users/$USER
# Export it so that scripts started later with `bash` inherit the value
export WORKSPACE=/path/to/large/storage/$USER

# Verify space available
df -h "$WORKSPACE"

✅ Success Check: Directory has at least ~110GB available; leave extra headroom, since checkpoints and logs accumulate with each run.
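The free-space check can also be scripted so that it fails loudly instead of relying on reading the df output by eye. The following is a minimal sketch, assuming a POSIX-compatible df; the helper name has_free_gb is hypothetical and not part of NeMo RL.

```shell
# has_free_gb DIR MIN_GB
# Hypothetical helper: succeeds if the filesystem holding DIR has at
# least MIN_GB GiB available to the current user.
has_free_gb() {
    local dir=$1 min_gb=$2 avail_kb
    # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Example usage (threshold taken from the ~110GB minimum above):
# has_free_gb "$WORKSPACE" 110 || echo "Not enough free space in $WORKSPACE"
```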

1.2 Clone the Repository#

Clone the nano-v3 branch of NeMo RL, which contains the Nemotron 3 Nano training recipe:

cd $WORKSPACE
git clone --recurse-submodules -b nano-v3 https://github.com/NVIDIA-NeMo/RL.git RL-nano-v3
cd RL-nano-v3

✅ Success Check: Repository cloned with nano-v3 branch checked out.

1.3 Prepare Container Image#

Option A: Use Registry Path Directly (Recommended for First Run)

Use the container directly from NVIDIA Container Registry:

# No preparation needed - will be pulled automatically during job execution
CONTAINER=docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano

This is the simplest approach, but it adds ~5-10 minutes to job startup time on first use.

Option B: Pre-Pull Container (Optional - For Faster Job Startup)

For faster job startup on subsequent runs, pre-pull and convert to .sqsh format:

Step 1: Get NGC API Key

Go to https://org.ngc.nvidia.com/setup/api-keys
Generate an API key
Configure enroot credentials:

mkdir -p ~/.config/enroot
echo "machine nvcr.io login \$oauthtoken password <YOUR_API_KEY>" >> ~/.config/enroot/.credentials
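Because the echo command appends with >>, running it twice leaves duplicate entries in the credentials file. A sketch of an idempotent variant follows; the helper name add_nvcr_credentials is hypothetical, and the entry format is the same one used in the command above.

```shell
# add_nvcr_credentials FILE API_KEY
# Hypothetical helper: appends an nvcr.io entry to FILE only if none exists yet.
add_nvcr_credentials() {
    local file=$1 api_key=$2
    mkdir -p "$(dirname "$file")"
    touch "$file"
    chmod 600 "$file"   # the file holds a secret, so keep it private
    if grep -q '^machine nvcr\.io ' "$file"; then
        echo "nvcr.io entry already present in $file"
    else
        # Single quotes keep $oauthtoken literal, matching the entry shown above.
        printf 'machine nvcr.io login $oauthtoken password %s\n' "$api_key" >> "$file"
    fi
}

# Example usage:
# add_nvcr_credentials ~/.config/enroot/.credentials "<YOUR_API_KEY>"
```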

Step 2: Pull Container Using Sbatch

Due to head node restrictions, pull the container from a compute node:

Create pull_container.sh:

#!/bin/bash
#SBATCH --job-name=enroot-import
#SBATCH --account=<your_account>
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --output=enroot-import-%j.out

# Set workspace directory (adjust to your cluster's large storage)
WORKSPACE=/path/to/large/storage/$USER

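# Export the variable so enroot sees it, and keep the layer cache on large storage
export ENROOT_CACHE_PATH=$WORKSPACE/.cache/enroot
mkdir -p "$ENROOT_CACHE_PATH"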

enroot import -o "$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh" \
    "docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano"

Submit the job:

sbatch pull_container.sh

Step 3: Use Local Container

Update your launch script to use the local .sqsh file:

CONTAINER=$WORKSPACE/nemo-rl.v0.4.0.nemotron_3_nano.sqsh

✅ Success Check: Container file exists (~15GB) or registry path configured.

1.4 Install uv Tool#

Install uv (which includes uvx) for downloading HuggingFace models and datasets:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Add to PATH (uv installs to ~/.local/bin)
export PATH="$HOME/.local/bin:$PATH"

# Verify installation
uvx --version

✅ Success Check: The command prints the uvx version number.

1.5 Download and Process Training Data#

Download and process the dataset on a compute node (head nodes have limited memory):

Create prepare_data.sh:

#!/bin/bash
#SBATCH --job-name=prepare-data
#SBATCH --account=<your_account>
#SBATCH --partition=<partition_name>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --output=prepare-data-%j.out

# Set workspace directory (adjust to your cluster's large storage)
WORKSPACE=/path/to/large/storage/$USER

# Data directory
DATA_DIR=${WORKSPACE}/RL-nano-v3/data/

# Download dataset
uvx --from huggingface-hub hf download nvidia/Nemotron-3-Nano-RL-Training-Blend \
    --repo-type dataset \
    --local-dir ${DATA_DIR}

# Fill in placeholders
chmod +x ${DATA_DIR}/create_nanov3_jsonl.py
${DATA_DIR}/create_nanov3_jsonl.py --input ${DATA_DIR}/train.jsonl --output ${DATA_DIR}/train-full.jsonl

# Split: reserve last 1000 rows for validation
head -n -1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/train-split.jsonl
tail -n 1000 ${DATA_DIR}/train-full.jsonl > ${DATA_DIR}/val-split.jsonl

# Verify split
wc -l ${DATA_DIR}/train-split.jsonl ${DATA_DIR}/val-split.jsonl

Submit the job:

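# If the dataset requires authentication, export HF_TOKEN in your shell first.
# ALL keeps the rest of your environment (including the PATH entry for uvx):
# sbatch --export=ALL,HF_TOKEN prepare_data.sh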

# Otherwise:
sbatch prepare_data.sh

Note

Why use a compute node? The create_nanov3_jsonl.py script is memory-intensive and may fail on head nodes, which typically enforce resource limits. Running it on a compute node ensures sufficient memory.

✅ Success Check: Job completes and creates train-split.jsonl and val-split.jsonl.
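The split step relies on head -n -1000, which is a GNU extension. The sketch below shows the same tail-split logic in a portable form that also checks that no rows were lost; the function name split_jsonl is illustrative only.

```shell
# split_jsonl INPUT TRAIN_OUT VAL_OUT N_VAL
# Hypothetical helper: the last N_VAL lines go to VAL_OUT, the rest to TRAIN_OUT.
split_jsonl() {
    local input=$1 train_out=$2 val_out=$3 n_val=$4 total
    total=$(wc -l < "$input")
    if [ "$total" -le "$n_val" ]; then
        echo "error: $input has $total lines, need more than $n_val" >&2
        return 1
    fi
    head -n "$(( total - n_val ))" "$input" > "$train_out"
    tail -n "$n_val" "$input" > "$val_out"
    # The two parts must add back up to the original line count.
    [ "$(( $(wc -l < "$train_out") + $(wc -l < "$val_out") ))" -eq "$total" ]
}

# Example usage, mirroring prepare_data.sh:
# split_jsonl "$DATA_DIR/train-full.jsonl" "$DATA_DIR/train-split.jsonl" "$DATA_DIR/val-split.jsonl" 1000
```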

1.6 Download Model#

Download the Nemotron 3 Nano 30B model:

cd $WORKSPACE/RL-nano-v3

uvx --from huggingface-hub hf download nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --repo-type model \
    --local-dir model

✅ Success Check: Model files downloaded (~59GB total) to model/ directory.

1.7 Verify Setup#

Confirm all components are in place:

cd $WORKSPACE/RL-nano-v3

# Check directory structure
ls -lh
# Expected: data/, model/, examples/, nemo_rl/, etc.

# Check data files
ls -lh data/train-split.jsonl data/val-split.jsonl

# Check model size and key files
du -sh model/
# Expected: ~59GB

# Verify essential model files exist
ls model/config.json model/*.safetensors
# Should show config.json and 13 safetensors files

✅ Success Check: All directories and files present with correct sizes.
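The manual checks above can be folded into one pass/fail function that is convenient to run before every submission. This is a sketch only: verify_setup is a hypothetical name, and the path list simply mirrors the files this tutorial refers to.

```shell
# verify_setup ROOT
# Hypothetical helper: prints every expected path missing under ROOT and
# returns 0 only if nothing is missing.
verify_setup() {
    local root=$1 missing=0 path
    for path in data/train-split.jsonl data/val-split.jsonl model/config.json \
                examples/nemo_gym/grpo_nanov3.yaml \
                examples/nemo_gym/run_grpo_nemo_gym.py; do
        if [ ! -e "$root/$path" ]; then
            echo "MISSING: $root/$path"
            missing=$(( missing + 1 ))
        fi
    done
    # At least one safetensors shard must be present.
    if ! ls "$root"/model/*.safetensors >/dev/null 2>&1; then
        echo "MISSING: $root/model/*.safetensors"
        missing=$(( missing + 1 ))
    fi
    [ "$missing" -eq 0 ]
}

# Example usage:
# verify_setup "$WORKSPACE/RL-nano-v3" && echo "Setup looks complete"
```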

2. Create Launch Script#

Create a launcher script that properly handles multi-node Ray coordination:

cd $WORKSPACE/RL-nano-v3

Create launch_nemotron_training.sh:

#!/bin/bash
# Nemotron 3 Nano 30B Multi-Node Training Launcher

# Configuration
HOST_BASE=$WORKSPACE  # WORKSPACE must be exported in the calling shell; or set your preferred base directory here
NUM_NODES=2  # Change to 32 for large-scale training

# Paths
DATA_DIR=${HOST_BASE}/RL-nano-v3/data
MODEL_CHECKPOINT=${HOST_BASE}/RL-nano-v3/model
CONFIG_PATH=${HOST_BASE}/RL-nano-v3/examples/nemo_gym/grpo_nanov3.yaml
LOG_DIR=${HOST_BASE}/RL-nano-v3/logs
CKPT_DIR=${HOST_BASE}/RL-nano-v3/checkpoints
CACHE_DIR=${HOST_BASE}/RL-nano-v3/.cache

# Training command with shared cache directory
TRAINING_CMD="cd ${HOST_BASE}/RL-nano-v3 && \
mkdir -p ${LOG_DIR} ${CKPT_DIR} ${CACHE_DIR} && \
export HF_HOME=${CACHE_DIR}/huggingface && \
export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface && \
uv run examples/nemo_gym/run_grpo_nemo_gym.py \
    --config ${CONFIG_PATH} \
    policy.model_name=${MODEL_CHECKPOINT} \
    data.train_jsonl_fpath=${DATA_DIR}/train-split.jsonl \
    data.validation_jsonl_fpath=${DATA_DIR}/val-split.jsonl \
    ++logger.log_dir=${LOG_DIR} \
    logger.wandb_enabled=False \
    logger.tensorboard_enabled=True \
    ++checkpointing.enabled=True \
    ++checkpointing.checkpoint_dir=${CKPT_DIR} \
    cluster.num_nodes=${NUM_NODES} \
    cluster.gpus_per_node=8"

echo "Submitting ${NUM_NODES}-node training job..."
echo "Using BASE_LOG_DIR: ${HOST_BASE}/nemoRL/nemo-rl"

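# Create the log directory up front: Slurm does not create missing
# directories for --output/--error
mkdir -p "${LOG_DIR}"

# Submit job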
BASE_LOG_DIR=${HOST_BASE}/nemoRL/nemo-rl \
COMMAND="$TRAINING_CMD" \
CONTAINER="docker://nvcr.io/nvidia/nemo-rl:v0.4.0.nemotron_3_nano" \
MOUNTS="${HOST_BASE}:${HOST_BASE}" \
sbatch \
    --nodes=${NUM_NODES} \
    --account=<your_account> \
    --job-name=nemotron-nano-30b \
    --time=8:00:00 \
    --gres=gpu:8 \
    --chdir=/tmp \
    --output=${LOG_DIR}/slurm-%j.out \
    --error=${LOG_DIR}/slurm-%j.err \
    ${HOST_BASE}/nemoRL/nemo-rl/ray.sub

Make it executable:

chmod +x launch_nemotron_training.sh

Tip

Key Configuration Points:

NUM_NODES=2: 2 nodes × 8 GPUs = 16 GPUs total. For large-scale training, change to NUM_NODES=32 (256 GPUs total)
--chdir=/tmp: Sets a neutral working directory for the job
HF_HOME and TRANSFORMERS_CACHE: Set to shared storage so all nodes can access model conversions
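BASE_LOG_DIR: Specifies where Ray cluster logs will be written
ray.sub: The final sbatch argument expects ray.sub at ${HOST_BASE}/nemoRL/nemo-rl/ray.sub; adjust that path if your copy of ray.sub lives elsewhere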
--account: Replace <your_account> with your Slurm account name
--time=8:00:00: Adjust based on your cluster’s limits

✅ Success Check: Script created and executable.
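The launcher is started with bash, so it runs in a child process and sees WORKSPACE only if that variable was exported. If it is empty, every path in the script silently collapses to a location under /. A small guard, sketched below with the hypothetical name resolve_host_base, makes that failure explicit.

```shell
# resolve_host_base
# Hypothetical helper: prints the base directory, or fails with a clear
# message when WORKSPACE is unset or empty.
resolve_host_base() {
    # ${VAR:?message} aborts the enclosing subshell with the message
    # when VAR is unset or empty.
    ( printf '%s\n' "${WORKSPACE:?WORKSPACE is not set; export it before launching}" )
}

# Example usage inside the launch script:
# HOST_BASE=$(resolve_host_base) || exit 1
```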

3. Submit Training Job#

Important

Run the launch script from a neutral directory like /tmp to ensure consistent container working directory behavior across different cluster configurations.

# Run from /tmp for best compatibility
cd /tmp
bash $WORKSPACE/RL-nano-v3/launch_nemotron_training.sh

Expected output:

Submitting 2-node training job...
Using BASE_LOG_DIR: /home/user/nemoRL/nemo-rl
Submitted batch job 9453356

✅ Success Check: Job submitted successfully with job ID returned.
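Later steps need the job ID, so it helps to capture it at submission time rather than copy it by hand. The sketch below parses the "Submitted batch job" line shown in the expected output; the function name extract_job_id is illustrative.

```shell
# extract_job_id
# Hypothetical helper: reads launcher output on stdin and prints the
# Slurm job ID from the "Submitted batch job <id>" line.
extract_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p' | tail -n 1
}

# Example usage (tee keeps the launcher output visible on the terminal):
# JOBID=$(bash "$WORKSPACE/RL-nano-v3/launch_nemotron_training.sh" | tee /dev/stderr | extract_job_id)
```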

4. Monitor Job Status#

Monitor your submitted job:

# Check if job is running (replace JOBID with your job number)
squeue --job=JOBID

# Detailed status
squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Note: For job state codes (PD, R, CD, etc.), see Slurm documentation.

✅ Success Check: Job transitions from PD to R state.
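Instead of re-running squeue by hand, the PD to R transition can be polled. The following is a sketch under stated assumptions: job_state and wait_until_running are hypothetical helper names, the polling interval is arbitrary, and any state other than PD, CF, or R is treated as a failure.

```shell
# job_state JOBID
# Prints the compact Slurm state code (PD, CF, R, ...) or nothing if the
# job is no longer in the queue.
job_state() {
    squeue --jobs="$1" --noheader --format="%t" 2>/dev/null
}

# wait_until_running JOBID [INTERVAL_SECONDS]
# Polls until the job is running; fails if it leaves the queue or enters
# any other state first.
wait_until_running() {
    local jobid=$1 interval=${2:-30} state
    while true; do
        state=$(job_state "$jobid")
        case "$state" in
            R)     echo "Job $jobid is running"; return 0 ;;
            PD|CF) sleep "$interval" ;;
            *)     echo "Job $jobid is in state '${state:-not queued}'" >&2; return 1 ;;
        esac
    done
}

# Example usage:
# wait_until_running "$JOBID" 30 && echo "Ray cluster should start shortly"
```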

5. Monitor Training Progress#

5.1 Check Ray Cluster Logs#

Wait 1-2 minutes for the Ray cluster to initialize, then check the logs:

# Set your job ID
JOBID=your_job_id

# Check if Ray head started
ls $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/

# Verify Ray head is ready
cat $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/STARTED_RAY_HEAD
# File should exist if Ray initialized successfully

# View training execution (most important)
tail -100 $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log

# Follow training progress live
tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log

5.2 Verify Ray Cluster Formation#

Check that all Ray actors are online:

# For 2-node job (2 nodes × 8 GPUs = 16 actors)
grep "Number of actors online: 16/16" $WORKSPACE/RL-nano-v3/logs/slurm-${JOBID}.out

# For 32-node job, look for: 256/256

✅ Success Check: All actors online (16/16 for 2 nodes, or 256/256 for 32 nodes).
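The expected actor count is just nodes × GPUs per node, so the grep pattern can be derived from the same two numbers used in the launch script. A sketch, with the hypothetical helper name actors_online:

```shell
# actors_online LOGFILE NUM_NODES GPUS_PER_NODE
# Hypothetical helper: succeeds once LOGFILE reports that all
# NUM_NODES * GPUS_PER_NODE actors are online.
actors_online() {
    local logfile=$1 total=$(( $2 * $3 ))
    # The trailing group stops 16/16 from matching a longer number such as 16/160.
    grep -qE "Number of actors online: ${total}/${total}([^0-9]|\$)" "$logfile"
}

# Example usage for the 2-node job:
# actors_online "$WORKSPACE/RL-nano-v3/logs/slurm-${JOBID}.out" 2 8 && echo "All 16 actors online"
```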

5.3 Watch Training Metrics#

Monitor rollout collection progress:

# Watch rollout collection
tail -f $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log | grep "Collecting rollouts"

# Example output:
# Collecting rollouts:  21%|██        | 428/2048 [02:01<05:42, 4.73it/s]
# Collecting rollouts:  25%|██▌       | 512/2048 [03:15<08:12, 3.12it/s]

Check TensorBoard logs:

# List experiment directories
ls -ltr $WORKSPACE/RL-nano-v3/logs/

# Check TensorBoard events
find $WORKSPACE/RL-nano-v3/logs/exp_*/tensorboard/ -name "*.tfevents.*" -mmin -5

✅ Success Check: Rollout percentage increasing steadily, TensorBoard events being written.
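For a one-shot progress check rather than a live tail, the most recent completed/total pair can be pulled out of the driver log. This sketch assumes the progress lines look like the example output above; latest_rollout_count is a hypothetical helper name.

```shell
# latest_rollout_count LOGFILE
# Hypothetical helper: prints the most recent "done/total" pair from the
# "Collecting rollouts" progress lines, e.g. 512/2048.
latest_rollout_count() {
    # Progress bars often redraw with carriage returns, so split on them first.
    # -a makes grep treat the log as text even if it contains unusual bytes.
    tr '\r' '\n' < "$1" \
        | grep -a "Collecting rollouts" \
        | grep -aoE '[0-9]+/[0-9]+' \
        | tail -n 1
}

# Example usage:
# latest_rollout_count "$WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log"
```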

6. Troubleshooting#

Issue: Job Stays in Pending (PD) State#

Check the reason:

squeue --job=JOBID -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"

Common reasons:

(Priority): Waiting in queue for resources
(Resources): Not enough nodes available
(QOSMaxNodePerUserLimit): Exceeds node limit

Solution: Wait for resources, or adjust job parameters.

Issue: Ray Head Doesn’t Start#

Symptom: No STARTED_RAY_HEAD file in logs directory.

Check Ray head log:

cat $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-head.log

Solution: Check logs for errors related to container startup or resource allocation.

Issue: Training Crashes with Cache Errors#

Symptom: FileNotFoundError mentioning run_config.yaml in ray-driver.log.

Check logs:

grep "FileNotFoundError.*run_config.yaml" $WORKSPACE/nemoRL/nemo-rl/${JOBID}-logs/ray-driver.log

Root cause: Model conversion saved to local node cache, inaccessible to other nodes.

Solution: Verify shared cache directories are set in TRAINING_CMD:

export HF_HOME=${CACHE_DIR}/huggingface
export TRANSFORMERS_CACHE=${CACHE_DIR}/huggingface

7. Key Technical Details#

Why ray.sub?#

Without ray.sub, each node would start its own independent Ray cluster. The ray.sub script from NeMo RL:

Starts a Ray head on the first node
Connects all worker nodes to that head
Creates a unified distributed cluster
Manages placement groups for GPU actors

Why Shared Cache?#

On first use, the HuggingFace checkpoint is converted to Megatron format, and the converted model is stored in the cache directory:

Without shared cache: Each node converts independently → race conditions
With shared cache: Rank 0 converts once, all nodes share the result

8. File Structure Reference#

After setup, your directory structure should look like:

$WORKSPACE/
├── RL-nano-v3/                      # Project root
│   ├── data/
│   │   ├── train-split.jsonl       # Training data
│   │   └── val-split.jsonl         # Validation data
│   ├── model/                       # Nemotron 3 Nano 30B model (~59GB)
│   ├── examples/nemo_gym/
│   │   ├── grpo_nanov3.yaml        # Training config
│   │   └── run_grpo_nemo_gym.py    # Training script
│   ├── logs/                        # Training outputs
│   │   ├── exp_*/                  # Experiment directories
│   │   ├── slurm-*.out            # Slurm stdout
│   │   └── slurm-*.err            # Slurm stderr
│   ├── checkpoints/                # Saved model checkpoints
│   ├── .cache/                     # Shared HuggingFace cache
│   └── launch_nemotron_training.sh # Launch script
└── nemoRL/nemo-rl/
    ├── ray.sub                      # Ray orchestration script
    └── JOBID-logs/                  # Ray cluster logs
        ├── STARTED_RAY_HEAD        # Ray ready sentinel
        ├── ray-head.log            # Head node log
        ├── ray-worker-*.log        # Worker node logs
        └── ray-driver.log          # Training execution log

Next Steps#

Congratulations! You’ve successfully set up and launched Nemotron 3 Nano 30B multi-node training using Ray and Slurm.

Use Other Training Environments

Browse available environments on GitHub to find other training options.

https://github.com/NVIDIA-NeMo/Gym#-available-environments

Build a Custom Training Environment

Create your own resources server with custom tools and verification logic.

Building Environments