# NVIDIA NeMo‑RL Comprehensive Guide: Algorithms, Distributed Training, Data Pipelines, Logging, and Evaluation  

This collection consolidates the core NeMo‑RL documentation, covering everything from algorithm APIs (GRPO, DPO, SFT, loss functions) and model integration (Hugging Face, Megatron, VLM) to distributed training setup, micro‑batch handling, and sequence packing. It also details utilities for logging, profiling, environment creation, dataset handling, containerization, and evaluation—including answer parsing and benchmark-specific guides. The material serves as a one‑stop reference for building, debugging, scaling, and deploying reinforcement‑learning pipelines with NeMo‑RL.

## Setup & Installation
- [If you’re contributing to the NeMo‑RL documentation and need to build the docs locally, run doctest blocks, or update the version switcher before publishing, read this page. It also explains how to live‑build the docs and run tests in Python docstrings.](https://docs.nvidia.com/nemo/rl/latest/documentation.html.md)
- [The LLM should consult this page whenever it needs to compute a pass‑rate metric for a batch of prompts, such as during evaluation of a reinforcement‑learning policy on a prompt‑based task or when reporting success rates in a training or benchmarking pipeline.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.metrics.html.md)
- [The LLM should read this page when you’re building or debugging NeMo‑RL unit or functional tests, to confirm the required GPU count, Docker setup, and the correct test‑run scripts (e.g., `run_unit.sh` or `sft.sh`). It is also useful when interpreting the resulting JSON metric files, tracking performance trends, or performing static type checks with mypy.](https://docs.nvidia.com/nemo/rl/latest/testing.html.md)

## Local Development
- [If you want to run NeMo RL examples locally and have Ray automatically start a local cluster, simply execute `uv run examples/run_grpo_math.py`. To control which GPUs Ray uses or to run multiple clusters side‑by‑side, set `CUDA_VISIBLE_DEVICES` (e.g., `CUDA_VISIBLE_DEVICES=0,3 uv run …` or `CUDA_VISIBLE_DEVICES=1,2 uv run …`).](https://docs.nvidia.com/nemo/rl/latest/local-workstation.html.md)

## Cluster Management & Execution
- [Use this guide when you need to launch NeMo‑RL training on a Slurm‑managed HPC cluster or a Kubernetes cluster, for example to submit batched or interactive jobs, configure GPU/CPU allocation, or customize Ray environment variables before running a reinforcement‑learning experiment. It’s also useful when debugging or monitoring a head node, or when you must transfer training scripts and checkpoints into a containerized Ray cluster.](https://docs.nvidia.com/nemo/rl/latest/cluster.html.md)
- [The LLM should consult this page whenever you need to set up a consistent Python environment across Ray worker nodes for NeMo‑RL jobs, such as creating a shared virtual environment with uv, caching it, or forcing a rebuild on all nodes. Use `create_local_venv` for a single node or `create_local_venv_on_each_node` to generate the `py_executable` path that Ray workers can use.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.venvs.html.md)
- [When you are about to launch a NeMo‑RL training job on multiple GPU nodes and need to create a virtual Ray cluster with custom bundle sizes, GPU allocations, or placement strategies, consult this page to set up and configure the cluster. Also refer to it when you encounter a `ResourceInsufficientError` or need to re‑attach to an existing local Ray cluster before starting a new training session.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.distributed.virtual_cluster.html.md)

## Distributed Training & Parallelism
- [Read this page when you need to configure or run distributed training with NeMo‑RL – for example, to decide between DTensor (FSDP2) and Megatron backends, set up the required Hugging Face checkpoint paths, and define environment variables for shared checkpoint storage.](https://docs.nvidia.com/nemo/rl/latest/design-docs/training-backends.html.md)
- [Use this page when you’re building or troubleshooting distributed RL pipelines in NeMo and need to split, pad, or pack batches across data‑parallel ranks—e.g., preparing micro‑batches for dynamic batching, performing sequence packing with custom padding, or reordering shards to control training order across GPUs.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.distributed.batched_data_dict.html.md)
- [Read this page when building or debugging a distributed RL training job that uses data, pipeline, or tensor‑parallel axes, or when you need to map worker IDs to their coordinate positions or filter ranks by specific named axes. It explains how to construct, query, and slice a `NamedSharding` layout in such workloads.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.distributed.named_sharding.html.md)
- [Read this page when you need to compute stable log‑probabilities from vocab logits that are sharded across tensor‑parallel (TP) and context‑parallel (CP) workers, such as during distributed language‑model training or inference with NeMo‑RL, or when handling packed sequences in a multi‑GPU setup.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.distributed.model_utils.html.md)
- [Use this page when you need to measure and profile specific sections of a NeMo‑RL training or evaluation loop—e.g., timing data‑loading, model forward passes, or iteration steps—to analyze latency, compare different runs, or enforce time‑out limits during long experiments.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.timer.html.md)
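
The stable log‑probability computation over sharded vocab logits mentioned in the `model_utils` entry above can be illustrated with a small, single‑process sketch. It is not the NeMo‑RL implementation: `sharded_log_prob` is a hypothetical helper, and the list of shards stands in for tensor‑parallel ranks, with plain `max` and `sum` standing in for the all‑reduce collectives a real multi‑GPU job would use.

```python
import math


def sharded_log_prob(logit_shards, target_index):
    """Log-probability of one token when vocab logits are split into shards.

    Hypothetical helper: `logit_shards` is a list of lists, one list per
    simulated tensor-parallel rank, concatenated in rank order.
    """
    if target_index < 0:
        raise IndexError("target_index must be non-negative")

    # Step 1: global maximum (stand-in for an all-reduce with MAX).
    global_max = max(max(shard) for shard in logit_shards)

    # Step 2: each rank sums exp(logit - max) locally; adding the partial
    # sums is the stand-in for an all-reduce with SUM.
    global_sum_exp = sum(
        sum(math.exp(x - global_max) for x in shard) for shard in logit_shards
    )

    # Step 3: only the rank that owns the target token contributes its logit.
    offset = 0
    for shard in logit_shards:
        if target_index < offset + len(shard):
            target_logit = shard[target_index - offset]
            break
        offset += len(shard)
    else:
        raise IndexError("target_index is outside the vocabulary")

    # log softmax = (logit - max) - log(sum(exp(logit - max)))
    return (target_logit - global_max) - math.log(global_sum_exp)


# Large logits would overflow a naive exp(); subtracting the max avoids that.
shards = [[1000.0, 1001.0], [999.0, 1002.0]]
log_p = sharded_log_prob(shards, target_index=3)
```

Subtracting the global maximum before exponentiating is what keeps the result finite for large logits; the per‑shard partial sums are the only values that would need to cross ranks.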

## Configuration & Utilities
- [Use this page whenever you are building or extending a NeMo‑RL training pipeline and need to load a YAML configuration that relies on Hydra‑style defaults, multiple or nested inheritance, or variable interpolation. It also explains how to resolve relative paths and apply Hydra overrides programmatically when tweaking hyper‑parameters or paths on the fly.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.config.html.md)
- [When configuring a reinforcement‑learning training run that needs to log metrics, hyperparameters, or plots to external tools (WandB, TensorBoard, MLflow), capture GPU utilisation across Ray nodes, or format nested logs for debugging, you should review this page.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.logger.html.md)
- [Read this page when you need to query a GPU’s free memory or UUID, or convert a logical device ID to a physical one, while initializing NVML in a NeMo RL training pipeline.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.nvml.html.md)
- [When you need to estimate the theoretical compute requirements of a Hugging‑Face model on a specific GPU before training or deployment, or when you need to monitor the actual FLOPs spent during inference or batch processing to validate performance claims, this module offers utilities to convert model configs, compute theoretical TFLOPs, and track real FLOPs usage.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.flops_tracker.html.md)

## Logging & Monitoring
- [When configuring a reinforcement‑learning training run that needs to log metrics, hyperparameters, or plots to external tools (WandB, TensorBoard, MLflow), capture GPU utilisation across Ray nodes, or format nested logs for debugging, you should review this page.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.utils.logger.html.md)
- [When you need to log metrics to WandB, TensorBoard, or MLflow, or when you need to enable pretty‑printed validation output or GPU usage tracking, consult this page to understand the `LoggerInterface`, backend implementations, and the relevant configuration options. It’s also the go‑to reference for troubleshooting distributed metric reductions and ensuring consistent logging across all enabled backends.](https://docs.nvidia.com/nemo/rl/latest/design-docs/logger.html.md)

## Profiling & Performance
- [Read this page whenever you need to enable or troubleshoot Nsight GPU profiling for NeMo‑RL Ray workers, such as turning on profiling for policy or vLLM workers, restricting the capture to specific training steps, or applying the correct `python` path patch on Slurm or local clusters. This guide also covers how to locate the generated `.nsys-rep` files and handle model‑parallel workers.](https://docs.nvidia.com/nemo/rl/latest/nsys-profiling.html.md)

## Model Integration & Conversion
- [The LLM should read this page when you’re integrating Hugging Face transformer models into a NeMo‑RL training workflow—e.g., customizing the policy network with the `nemo_rl.models.huggingface.common` submodule, troubleshooting API usage, or converting a checkpoint to the NeMo‑RL format.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.models.huggingface.html.md)
- [Read this page when you need to export a NeMo‑RL agent to a Hugging Face‑compatible checkpoint—such as when integrating the model into a Hugging Face inference pipeline or deploying it with vLLM for fast serving on a GPU‑enabled cloud instance.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.converters.huggingface.html.md)
- [The LLM should read this page whenever you’re integrating a new model into NeMo‑RL—right after the checkpoint is converted, before you run training or inference—to validate log‑probability consistency, run the diagnostic scripts, and confirm that error metrics stay below the 1.05 threshold across Hugging Face, Megatron, and vLLM backends.](https://docs.nvidia.com/nemo/rl/latest/adding-new-models.html.md)
- [Read this page when you’re building or debugging a NeMo‑RL reinforcement‑learning environment and need to import or use utilities such as metrics, rewards, or the standard Code, Math, or VLM environments. It lists the available submodules and their APIs, and is also useful when debugging or customizing environment interfaces to ensure you employ the correct functions and parameters.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.html.md)
- [When you need to convert DeepSeek‑V3 FP8 weights to BF16 for fine‑tuning on NVIDIA GPUs or set up the model for inference in a NeMo‑RL pipeline, the page explains the exact cloning, conversion, and configuration steps you must follow. It also covers how to adjust the `config.json` to disable unsupported features before launching a training or evaluation job.](https://docs.nvidia.com/nemo/rl/latest/guides/deepseek.html.md)
- [Read this page when you are about to train or fine‑tune a policy with Direct Preference Optimization in NeMo‑RL, or when you need to set up the DPO pipeline, validate the model, and manage checkpoints and logging for a custom preference‑based dataset.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.algorithms.dpo.html.md)
- [When you’re building or debugging the text‑generation part of a NeMo‑RL agent—e.g., configuring a custom `GenerationConfig`, selecting a `TokenizerType`, or tuning vLLM worker settings for training or evaluation—this page is the go‑to reference.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.models.generation.html.md)
- [When you are writing or refactoring a NeMo‑RL policy training script—especially if you need to set up distributed Megatron or DTensor parallelism, enable reward modeling, tweak sequence packing, or customize the optimizer/scheduler—you should read this page to see the TypedDict fields and defaults that control those behaviors. It is also handy when debugging training stalls or memory spikes caused by mis‑configured batch sizes, precision settings, or gradient clipping.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.models.policy.html.md)
- [Read this page when you’re building or debugging a VLM‑based RL pipeline in NeMo‑RL—specifically while setting up the `VLMEnvConfig`, configuring workers with `VLMVerifyWorker`, or inspecting the `step` and `global_post_process_and_metrics` logic for custom reward functions and metrics.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.vlm_environment.html.md)
- [Read this page when you’re building or debugging a Code‑environment RL pipeline—such as for debugging, algorithmic reasoning, or code‑generation tasks. It explains how to set up `CodeEnvConfig`, `CodeExecutionWorker`, and how `CodeEnvironment.step` returns execution results and metrics.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.code_environment.html.md)

## Training Guides
- [Use this guide when you’re preparing to run a Direct Preference Optimization experiment in NeMo RL—whether you’re converting a dataset to the required `prompt / chosen_response / rejected_response` format, overriding training hyper‑parameters (e.g., `dpo.sft_loss_weight` or `dpo.preference_average_log_probs`), or launching a job via `uv run examples/run_dpo.py` on a local or Slurm cluster. It’s also the go‑to reference for evaluating a finished DPO model and for troubleshooting common configuration or dataset‑formatting issues.](https://docs.nvidia.com/nemo/rl/latest/guides/dpo.html.md)
- [Use this guide when you need to launch NeMo‑RL training on a Slurm‑managed HPC cluster or a Kubernetes cluster, for example to submit batched or interactive jobs, configure GPU/CPU allocation, or customize Ray environment variables before running a reinforcement‑learning experiment. It’s also useful when debugging or monitoring a head node, or when you must transfer training scripts and checkpoints into a containerized Ray cluster.](https://docs.nvidia.com/nemo/rl/latest/cluster.html.md)
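
The `prompt / chosen_response / rejected_response` record format named in the DPO guide entry above can be illustrated with a short conversion sketch. Only the three output keys come from that entry; the input layout (a question plus a list of scored answers) and the helper `to_dpo_record` are hypothetical, chosen purely for illustration.

```python
def to_dpo_record(question, scored_answers):
    """Build one preference record from (answer_text, score) pairs.

    Hypothetical helper: the highest-scoring answer becomes the chosen
    response and the lowest-scoring answer becomes the rejected one.
    """
    if len(scored_answers) < 2:
        raise ValueError("need at least two answers to form a preference pair")

    ranked = sorted(scored_answers, key=lambda pair: pair[1], reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best[1] == worst[1]:
        raise ValueError("all answers have the same score; no preference exists")

    return {
        "prompt": question,
        "chosen_response": best[0],
        "rejected_response": worst[0],
    }


record = to_dpo_record("What is 2 + 2?", [("4", 1.0), ("5", 0.0), ("22", 0.2)])
```

Rejecting ties up front keeps records without a real preference signal out of the training set; whether to pair best against worst or to emit every ordered pair is a design choice this sketch does not settle.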

## Algorithms Overview
- [Read this page when you are building, extending, or debugging NeMo‑RL reinforcement‑learning code, or when you need to understand the available algorithmic submodules (GRPO, DPO, SFT, RM, loss functions, utilities, interfaces) and how to integrate or customize them in your project.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.algorithms.html.md)

## Specific Algorithms
- [Read this page whenever you need to build, launch, or troubleshoot a GRPO training run in NeMo‑RL—e.g., when configuring `GRPOConfig` or `MasterConfig`, calling `setup` to initialize the policy and cluster, or tuning async rollouts, loss functions, or validation logic for a new task.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.algorithms.grpo.html.md)

## Loss Functions & Interfaces
- [Read this page when you’re building or debugging a custom loss function in NeMo RL that must normalize across micro‑batches (e.g., token‑level or sequence‑level losses), or when troubleshooting training discrepancies that arise from micro‑batching.](https://docs.nvidia.com/nemo/rl/latest/design-docs/loss-functions.html.md)
- [Read this page when you’re implementing or extending a loss function in a NeMo‑RL algorithm and need to know the exact signature, available `LossType` values, and how token‑ or sequence‑level losses must be normalized with `global_valid_seqs` and `global_valid_toks`.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.algorithms.interfaces.html.md)
- [Use this page when you’re building or debugging a NeMo‑RL training job that relies on policy‑gradient, PPO, DPO, or preference‑based losses—for example, to set clip ranges, KL penalties, or to decide whether to use token‑level loss and sequence‑packing wrappers. It is also the reference to consult whenever you need to interpret or log the individual loss components (e.g., preference loss, SFT loss, KL divergence) that appear in the training metrics.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.algorithms.loss_functions.html.md)
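
The micro‑batch normalization issue raised in the entries above can be shown with plain arithmetic. The sketch below is not NeMo‑RL code: both functions are hypothetical, each micro‑batch is a non‑empty list of per‑token losses for valid tokens only, and the variable `global_valid_toks` borrows its name from the interface entry above without claiming to match its exact semantics.

```python
def token_loss_per_microbatch_mean(microbatches):
    """Naive reduction: mean of each micro-batch, then mean of those means.

    Tokens in short micro-batches end up weighted more heavily.
    """
    means = [sum(mb) / len(mb) for mb in microbatches]
    return sum(means) / len(means)


def token_loss_global_norm(microbatches):
    """Token-level reduction normalized by the global valid-token count.

    Each micro-batch contributes sum(losses) / global_valid_toks, so the
    total equals the mean over all tokens however the batch was split.
    """
    global_valid_toks = sum(len(mb) for mb in microbatches)
    return sum(sum(mb) / global_valid_toks for mb in microbatches)


# Four cheap tokens in one micro-batch, one expensive token in another.
microbatches = [[1.0, 1.0, 1.0, 1.0], [5.0]]
naive = token_loss_per_microbatch_mean(microbatches)
normalized = token_loss_global_norm(microbatches)

# The same tokens processed as a single batch give the reference value.
single_batch = token_loss_global_norm([[1.0, 1.0, 1.0, 1.0, 5.0]])
```

Here the naive reduction gives 3.0 while the globally normalized one gives about 1.8, which matches the single‑batch reference. That gap is the kind of discrepancy that appears when a loss is averaged per micro‑batch instead of over the whole batch.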

## Experience Replay & Buffers
- [Read this page whenever you’re building or debugging a NeMo‑RL training pipeline that relies on experience replay, rollouts, or custom experience buffers—for example, when you need to configure the `ExperienceReplay` class, inspect the `rollouts` submodule, or troubleshoot a failure in sampling transitions during agent training.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.experience.html.md)

## Metrics & Evaluation
- [When you need to extend or debug NeMo‑RL’s training loops—such as adding custom episode‑reward, success‑rate, or other evaluation metrics—you should read this page. It explains how metrics are stored, updated, and logged, enabling you to tailor metric calculations or fix reporting bugs.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.metrics.metrics_utils.html.md)
- [Read this page when you need to compute a pass‑rate metric for a batch of prompts, such as during evaluation of a reinforcement‑learning policy on a prompt‑based task or when reporting success rates in a training or benchmarking pipeline.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.metrics.html.md)
- [**When the LLM (or the developer building an evaluation pipeline) needs to reliably extract and compare the model’s answer from its raw text output.** ...](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.evals.answer_parsing.html.md)
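
As a rough illustration of the pass‑rate metric mentioned above, the sketch below counts a prompt as passed when at least one of its sampled responses earns a positive reward. That definition is an assumption made for this example, and `pass_rate_per_prompt` is a hypothetical helper rather than the function documented on the linked page.

```python
from collections import defaultdict


def pass_rate_per_prompt(prompts, rewards):
    """Fraction of unique prompts with at least one positively rewarded sample.

    Hypothetical helper: `prompts` and `rewards` are parallel sequences, and
    a prompt may appear several times (once per sampled response).
    """
    best_reward = defaultdict(float)  # unseen prompts start at 0.0
    for prompt, reward in zip(prompts, rewards):
        best_reward[prompt] = max(best_reward[prompt], reward)

    if not best_reward:
        return 0.0  # no prompts, nothing passed

    passed = sum(1 for reward in best_reward.values() if reward > 0)
    return passed / len(best_reward)


# Prompt "a" passes on its second sample, "b" never passes, "c" passes.
prompts = ["a", "a", "b", "b", "c"]
rewards = [0.0, 1.0, 0.0, 0.0, 1.0]
rate = pass_rate_per_prompt(prompts, rewards)
```

Grouping by prompt before averaging keeps prompts with many samples from dominating the metric.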

## Datasets & Data Processing
- [The NeMo‑RL `hf_datasets` page is useful when you’re building or extending an RL training pipeline that uses Hugging Face datasets—for example, when you need to import the `DPODataset` or `OasstDataset` into a training script, or when you’re debugging why a dataset fails to load. It’s also handy for adding a new custom dataset wrapper by inspecting the existing submodules and their public API.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.data.hf_datasets.html.md)
- [When you need to load an evaluation dataset for a NeMo‑RL experiment—such as setting up a data pipeline for a new RL task, debugging dataset loading, or preparing a benchmark evaluation script—you should consult this page.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.data.eval_datasets.html.md)
- [When you are building or debugging a NeMo‑RL training pipeline that needs to convert raw conversational data into tokenized prompts, especially for multi‑task or preference (DPO) training, the LLM should read this page to learn how to instantiate `AllTaskProcessedDataset`, use `encode_single` for prompt encoding, and apply `rl_collate_fn`, `eval_collate_fn`, `preference_collate_fn`, or `dpo_collate_fn` to batch data correctly. It is also useful when troubleshooting tokenization errors, such as verifying that the input does not contain double BOS tokens with `assert_no_double_bos`.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.data.datasets.html.md)
- [Use this page whenever you are building or extending a NeMo‑RL training pipeline that needs to convert raw dataset entries into the framework’s `DatumSpec` objects. For example, if you’re writing a custom `TaskDataProcessFnCallable` to load prompt files and tokenise them for a reinforcement‑learning task, the interface definitions and type annotations on this page will guide how you structure the datum dictionary and the resulting `DatumSpec`.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.data.interfaces.html.md)

## Padding & Sequence Handling
- [When preparing NeMo RL training data or inference inputs, refer to this page to enforce right‑padding, for example when building `BatchedDataDict` objects or validating generation outputs with `verify_right_padding`. Also consult it when debugging padding errors that affect loss calculations or token masking.](https://docs.nvidia.com/nemo/rl/latest/design-docs/padding.html.md)
- [Use this page when you’re preparing or debugging a NeMo‑RL training pipeline that involves variable‑length sequences, so you can decide whether to enable sequence packing (with Megatron/DTensor + FlashAttention‑2) or dynamic batching, adjust the target tokens, alignment factors, and load‑balancing settings, and correctly integrate the loss wrapper for packed data.](https://docs.nvidia.com/nemo/rl/latest/design-docs/sequence-packing-and-dynamic-batching.html.md)
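
The right‑padding convention referenced above can be sketched in a few lines. The helpers `right_pad` and `is_right_padded` are hypothetical and written for illustration only; they are not the library's `verify_right_padding`, and they assume a non‑empty batch of token‑ID lists plus a known pad ID.

```python
def right_pad(sequences, pad_id):
    """Pad every sequence on the right to the longest length in the batch.

    Returns the padded rows together with each row's original length, which
    downstream code needs for masking.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


def is_right_padded(row, length, pad_id):
    """True when every position at or beyond `length` holds the pad ID."""
    return all(token == pad_id for token in row[length:])


batch, lengths = right_pad([[5, 6, 7], [8]], pad_id=0)
all_ok = all(
    is_right_padded(row, length, pad_id=0) for row, length in zip(batch, lengths)
)
```

Carrying the original lengths alongside the padded rows means the mask can be rebuilt without guessing which trailing tokens are padding, which matters when the pad ID can also occur as a real token.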

## Environment Interfaces & Custom Envs
- [When you’re creating a custom NeMo‑RL environment that communicates with an LLM through OpenAI‑style message logs—such as a math problem solver or a code‑debugging task—you should consult this page to implement the `EnvironmentInterface.step` method and the `global_post_process_and_metrics` hook, ensuring that your return values match the `EnvironmentReturn` named‑tuple and that metadata is correctly propagated. It’s also the go‑to reference for debugging or extending environment logic, where you need to understand the batching conventions and the expected fields (`observations`, `rewards`, `terminateds`, etc.).](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.interfaces.html.md)
- [When you are building or debugging a NeMo‑RL reinforcement‑learning environment and need to import or use utilities such as metrics, rewards, or the standard Code, Math, or VLM environments, consult this page to understand the available submodules and their APIs. It is also useful when debugging or customizing environment interfaces to ensure you employ the correct functions and parameters.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.environments.html.md)
- [Read this page when you are building or debugging a NeMo‑RL training pipeline that relies on experience replay, rollouts, or custom experience buffers, for example when you need to configure the `ExperienceReplay` class, inspect the `rollouts` submodule, or troubleshoot a failure in sampling transitions during agent training.](https://docs.nvidia.com/nemo/rl/latest/apidocs/nemo_rl/nemo_rl.experience.html.md)

## Docker & Production
- [When you need to build a reproducible NeMo‑RL container for a production or CI/CD pipeline—such as pre‑fetching worker virtual environments for isolated training jobs or ensuring all Python dependencies are cached for offline deployment—then consult this guide. Use the “release” target for full source and worker environments, or the “hermetic” target to avoid runtime package downloads.](https://docs.nvidia.com/nemo/rl/latest/docker.html.md)