Key Terminology
Essential vocabulary for model training, RL workflows, and NeMo Gym. This glossary defines terms you’ll encounter throughout the tutorials and documentation.
Rollout & Data Collection Terms
Rollout / Trajectory
Rollout (verb) refers to the process of executing a policy in an environment to generate data: stepping through the environment, taking actions, and recording what happens. Rollout (noun) is also used synonymously with trajectory, the resulting sequence of states, actions, and rewards that forms the ordered record of what happened. In practice, many people use “rollout” and “trajectory” interchangeably since a rollout produces exactly one trajectory.
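The verb and noun senses can be seen side by side in a minimal sketch. Everything here is hypothetical and for illustration only: the toy counter environment, the `Step` and `Trajectory` types, and the `rollout` function are assumptions, not NeMo Gym APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    state: int
    action: int
    reward: float


@dataclass
class Trajectory:
    """Rollout (noun): the ordered record of what happened."""
    steps: List[Step] = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        return sum(step.reward for step in self.steps)


def toy_env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical environment: a counter that finishes once it reaches 3."""
    next_state = state + action
    done = next_state >= 3
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def rollout(policy: Callable[[int], int], start_state: int = 0,
            max_steps: int = 10) -> Trajectory:
    """Rollout (verb): execute the policy and record each step."""
    trajectory = Trajectory()
    state = start_state
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = toy_env_step(state, action)
        trajectory.steps.append(Step(state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


# A trivial policy that always takes action 1.
trajectory = rollout(lambda state: 1)
```

One call to `rollout` yields exactly one `Trajectory`, which is why the two terms are often used interchangeably.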
Rollout Batch
A collection of multiple rollouts generated together, typically for the same task. Used for efficient parallel processing.
Environment
The conditions in which your model operates. Functionally, this typically refers to tools the model has access to.
Task
An input prompt paired with environment setup (tools + verification). What you want models to learn to do.
Task Instance
A single rollout attempt for a specific task. Multiple instances per task capture different approaches.
Training Environment
A set of tasks that share the same environment setup, compiled into a single prompt dataset.
Trace
Detailed log of a rollout including metadata for debugging or interpretability.
Data Generation Process
The complete pipeline from input prompt to scored rollout, involving rollout orchestration, model inference, tool usage, and verification.
Rollout Collection
The process of applying your data generation pipeline to input prompts at scale.
Demonstration Data
Training data format for SFT consisting of input prompts paired with successful rollouts. Shows models examples of correct behavior.
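As a rough illustration, demonstration data can be derived by keeping only the rollouts that scored well. The record fields (`prompt`, `response`, `reward`), the output keys, and the `min_reward` threshold are assumptions made for this sketch, not a NeMo Gym schema.

```python
from typing import Dict, List

# Hypothetical scored rollouts; field names are illustrative only.
scored_rollouts: List[Dict] = [
    {"prompt": "What is 2 + 2?", "response": "4", "reward": 1.0},
    {"prompt": "What is 2 + 2?", "response": "5", "reward": 0.0},
    {"prompt": "Capital of France?", "response": "Paris", "reward": 1.0},
]


def to_demonstrations(rollouts: List[Dict], min_reward: float = 1.0) -> List[Dict]:
    """Keep successful rollouts and pair each prompt with its response."""
    return [
        {"prompt": r["prompt"], "completion": r["response"]}
        for r in rollouts
        if r["reward"] >= min_reward
    ]


demonstrations = to_demonstrations(scored_rollouts)
```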
Preference Pairs
Training data format for DPO consisting of the same prompt with two different responses, where one is preferred over the other.
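One simple way to build such pairs from scored rollouts is to group them by prompt and pair the highest-scoring response with the lowest-scoring one. This is only a sketch under assumed field names (`prompt`, `response`, `reward`, `chosen`, `rejected`). Real pipelines may select pairs differently.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical scored rollouts; field names are illustrative only.
scored_rollouts: List[Dict] = [
    {"prompt": "What is 2 + 2?", "response": "4", "reward": 1.0},
    {"prompt": "What is 2 + 2?", "response": "5", "reward": 0.0},
    {"prompt": "Capital of France?", "response": "Paris", "reward": 1.0},
]


def to_preference_pairs(rollouts: List[Dict]) -> List[Dict]:
    """Pair the best and worst response for each prompt when their rewards differ."""
    by_prompt = defaultdict(list)
    for r in rollouts:
        by_prompt[r["prompt"]].append(r)

    pairs = []
    for prompt, group in by_prompt.items():
        best = max(group, key=lambda r: r["reward"])
        worst = min(group, key=lambda r: r["reward"])
        # Skip prompts with no reward gap: there is nothing to prefer.
        if best["reward"] > worst["reward"]:
            pairs.append({
                "prompt": prompt,
                "chosen": best["response"],
                "rejected": worst["response"],
            })
    return pairs


pairs = to_preference_pairs(scored_rollouts)
```

Prompts with a single rollout, or with identical rewards across rollouts, produce no pair because there is no preference to express.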
Architecture Terms
Policy Model
The primary LLM being trained or evaluated: the “decision-making brain” you want to improve.
Orchestration
Coordination logic that manages when to call models, which tools to use, and how to sequence multi-step operations.
Verifier
Component that scores rollouts, producing reward signals. Colloquially, “verifier” may also refer to a training environment with verifiable rewards.
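In its simplest form, a verifier is a function from a rollout's final answer to a score. The exact-match check below is a hypothetical example (the function name and normalization rules are assumptions). Actual verifiers may run tests, call other models, or inspect tool outputs.

```python
def exact_match_verifier(response: str, expected: str) -> float:
    """Return 1.0 if the response matches the expected answer, else 0.0.

    Comparison ignores surrounding whitespace and letter case.
    """
    return 1.0 if response.strip().lower() == expected.strip().lower() else 0.0


reward_correct = exact_match_verifier("  Paris ", "paris")
reward_wrong = exact_match_verifier("Lyon", "paris")
```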
Service Discovery
Mechanism by which distributed NeMo Gym components find and communicate with each other across machines.
Reward / Reward Signal
Numerical score (typically 0.0-1.0) indicating how well a task was accomplished.
Training Approaches
SFT (Supervised Fine-Tuning)
Training approach using examples of good model behavior. Shows successful rollouts as training data.
RL (Reinforcement Learning)
Training approach where models learn through trial-and-error interaction with environments using reward signals.
Online vs Offline Training
- Online: Model learns while interacting with environment in real-time
- Offline: Model learns from pre-collected rollout data
DPO (Direct Preference Optimization)
An offline RL training approach using pairs of rollouts for the same prompt, where one is preferred over the other. Teaches the model to favor the better response over the worse one.
GRPO (Group Relative Policy Optimization)
Reinforcement learning algorithm that optimizes policies by comparing groups of rollouts relative to each other. Used for online RL training with language models.
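The "relative to each other" part is commonly implemented by normalizing each rollout's reward against the mean and standard deviation of its group. The sketch below shows only that advantage computation. The function name, the use of population standard deviation, and the `eps` value are assumptions, and the policy-update step of the algorithm is not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each rollout relative to the other rollouts in its group.

    A rollout with an above-average reward gets a positive advantage,
    and a below-average one gets a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout has the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for the same task: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all rollouts score the same, no rollout is favored.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because advantages are computed within a group, a task where every rollout succeeds (or every rollout fails) contributes no learning signal under this scheme.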
Interaction Patterns
Multi-turn
Conversations spanning multiple exchanges where context and state persist across turns.
Multi-step
Complex tasks requiring agents to break problems into sequential steps, often using tools and intermediate reasoning.
Tool Use / Function Calling
Models invoking external capabilities (APIs, calculators, databases) to accomplish tasks beyond text generation.
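On the application side, tool use usually comes down to looking up the function the model named and calling it with the arguments the model supplied. The sketch below assumes a call shaped as a name plus a JSON-encoded argument string, which mirrors a common function-calling convention. The `TOOLS` registry, the `add` tool, and `dispatch_tool_call` are hypothetical and not part of any specific API.

```python
import json
from typing import Callable, Dict


def add(a: float, b: float) -> float:
    """Example tool: a tiny calculator operation."""
    return a + b


# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable] = {"add": add}


def dispatch_tool_call(call: dict) -> str:
    """Run the tool named in the call and return its result as a JSON string."""
    tool = TOOLS[call["name"]]
    arguments = json.loads(call["arguments"])
    return json.dumps({"result": tool(**arguments)})


# A call as a model might emit it (shape assumed for illustration).
call = {"name": "add", "arguments": '{"a": 2, "b": 3}'}
output = dispatch_tool_call(call)
```

The returned string would then be passed back to the model as the tool's output so it can continue the task.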
Technical Infrastructure
Responses API
OpenAI’s standard interface for rollouts, including function calls and multi-turn conversations. NeMo Gym’s native format.
Chat Completions API
OpenAI’s simpler interface for basic LLM interactions. NeMo Gym includes middleware to convert formats.
vLLM
High-performance inference server for running open-source language models locally. Alternative to commercial APIs.