CLI Commands
This page documents the NeMo Gym command-line interface.
All functionality is exposed through a single gym entry point, organized into command groups (gym <group> <command>). ng is a drop-in alias for gym. Every group and command supports -h/--help.
The legacy ng_* / nemo_gym_* commands (such as ng_run or nemo_gym_collect_rollouts) still work but are deprecated. Each one prints a notice pointing at its gym replacement and then runs it. See Migrating from the legacy commands at the bottom of this page.
Quick Reference
Common Options
These options are shared across many commands.
The --benchmark, --resources-server, and --model-type selectors resolve a component name to its config file for you, so you do not need to know the project’s directory layout. If you mistype a name, the CLI suggests the closest match. To point at a config file directly, use --config <path> instead.
Selecting the model server
Commands that need a model (gym env start, gym eval run) configure it through four flags:
Hydra overrides (escape hatch)
The gym CLI is a thin wrapper over Gym’s Hydra config system. Standard flags (--flag value) cover the common inputs. Anything not covered by a flag can still be passed as a raw Hydra override using +key=value (add a new key) or ++key=value (add or override an existing key). Unknown overrides are forwarded to Hydra untouched, so advanced config composition is fully functional.
When a command needs overrides for keys that have no dedicated flag, you can keep the whole command in Hydra form (config paths included) or mix --flag and +key=value styles in one invocation:
General
gym --help
List all command groups and their descriptions.
gym --version
Print the NeMo Gym version along with Python, key dependency, and system information.
Discovery
Commands for discovering what you can evaluate against.
gym list benchmarks
List the benchmarks available in NeMo Gym, with their domain, agent, and configured number of repeats.
gym list environments
List the environments available under environments/, by short name (with their domain and description). The names map to environments/<name>/config.yaml and can be passed to --environment on commands like gym env start / gym eval run.
gym list agents
List the agent harnesses under responses_api_agents/, with each one’s composition pattern — composable (Pattern A: references a separate resources server, so it can be wired into a matching environment) vs self-contained (Pattern B: ships its own framework/environment) — plus its config variants.
gym search
List benchmarks whose name, agent, resources server, dataset, or domain matches a query.
Datasets
Commands for preparing, previewing, and managing datasets. By default dataset transfer commands use HuggingFace; pass --storage gitlab to target the GitLab Registry.
gym dataset upload
Upload a prepared local JSONL dataset to HuggingFace (default) or GitLab.
gym dataset download
Download a dataset from HuggingFace (default) or GitLab.
gym dataset rm
Delete a dataset from the GitLab Registry. Prompts for confirmation.
gym dataset migrate
Migrate a JSONL dataset to HuggingFace from GitLab. Use gym dataset upload if you do not want automatic GitLab deletion.
gym dataset collate
Validate and collate a dataset, generating metrics and statistics.
gym dataset render
Generate a dataset preview by materializing prompts from a raw input file and a prompt template, producing JSONL with populated responses_create_params.input for RL training.
Each input row must not already have a populated responses_create_params.input; the command applies the prompt template from --prompt-config to each row, fills in the input, and preserves the row’s other fields.
Which data-preparation command should I use?
gym dataset render— a focused, standalone step that applies a prompt template to raw rows to populateresponses_create_params.input. No servers are started. Use it when you have raw data and just need to turn it into prompt-ready rows.gym dataset collate— the full preparation pipeline for training: it can download missing datasets, validate data, and compute dataset metrics, writing train/validation splits and metrics artifacts. Use it to prepare and validate datasets for training or PR submission.
Environments
Commands for developing, running, and inspecting environments (a dataset + agent harness + resources server + model server).
gym env init
Scaffold a new resources server with template files (config, app, tests, README, data directory).
gym env resolve
Resolve the configs, flags, and overrides into a final merged config and print it. Useful for debugging configuration. Secrets are hidden.
gym env validate
Validate a config without starting Ray or any server subprocess — a fast pre-flight check that catches config mistakes (missing/malformed config_paths, unknown server cross-references, unset mandatory ??? values, schema errors) in well under a second instead of after a Ray bootstrap. Exits 0 when valid, or 1 with a clean message (no traceback) when not. A model config is not required — model interpolations resolve against a dummy model; pass one (or --model-type) if you want it validated too.
gym env packages
Each server has its own isolated virtual environment. List the packages installed in a server’s environment.
gym env test
Test resource server(s) by running their pytest suite. If no resources server is given, all of them are tested.
gym env start
Start the NeMo Gym servers (agents, models, resources) defined by the provided configs. Reads configuration from YAML files and runs each configured server in its own environment.
gym env status
Show all currently running NeMo Gym servers and their health.
Evaluation
Commands for running evaluations end to end: prepare data, collect rollouts, aggregate for sharded runs, and profile.
gym eval prepare
Prepare a benchmark’s data by running its prepare.py script and dump the result to disk.
gym eval run
Collate data, start the servers, and collect rollouts. This is the main evaluation command. By default it spins up all required servers. Pass --no-serve to collect against servers you already started with gym env start.
Generation parameters
The most common sampling parameters have dedicated flags on gym eval run: --temperature, --top-p, and --max-output-tokens. These map onto responses_create_params.temperature, responses_create_params.top_p, and responses_create_params.max_output_tokens.
Any other responses_create_params field that has no dedicated flag can be set with a raw Hydra override using the ++responses_create_params.<field> syntax. Overrides are merged into each input row’s existing responses_create_params with a shallow merge (top-level keys only):
Because the merge is shallow, setting a field inside a nested object, such as ++responses_create_params.reasoning.effort=low, replaces the row’s entire nested dictionary at that key. Other fields under the same nested object are not preserved.
Resume interrupted runs
Pass --resume to restart the same command after a crash or interruption and pick up only the rows that have not finished yet.
How it works:
- Materialized inputs. On the first run, the fully expanded input rows (after
--num-repeats,--limit,--prompt-config, and any overrides) are written to a sidecar file next to your output. The path is derived from--outputby appending_materialized_inputsto the stem — sorollouts.jsonlproducesrollouts_materialized_inputs.jsonl. - Incremental output. Successful rollouts are flushed to the main output JSONL after each completion; retriable failures go to a
<stem>_failures.jsonlsidecar, so partial progress survives a crash. - Matching. On resume, completed work is matched by
(task_index, rollout_index)against the materialized inputs, and already-completed rows are skipped. The run prints a summary such as the number of original input rows, rows already done, and rows that still need to be run. - Fallback. If either the materialized inputs or the output file is missing, resume is skipped and the run starts fresh. Without
--resume, existing output is cleared before the run.
If you change the config, schema, or data between runs, the materialized inputs become stale and resume will diff against the old expansion. Delete the *_materialized_inputs.jsonl file (and the output file) to start fresh.
gym eval aggregate
Merge sharded rollout results into a single rollouts file with aggregate metrics. Reads every JSONL file matching --input-glob, recomputes aggregate metrics over the global union of records, and writes a <output stem>_aggregate_metrics.json next to the merged rollouts. Use this to combine shards produced by gym eval run --no-serve +disable_aggregation=true.
gym eval profile
Compute a reward profile from collected rollouts. Outputs per-task statistics such as average reward, standard deviation, min/max, and pass rate, useful for filtering tasks before training by difficulty or variance. Requires rollouts collected with --num-repeats greater than 1.
Contributor Helpers
gym dev test
Run NeMo Gym’s core unit tests with coverage reporting.
Migrating from the legacy commands
The legacy ng_* and nemo_gym_* are deprecated and will be removed in the future release.
Use the tables below to find their gym replacement and update your scripts and workflows.
Command mapping
Replacing Hydra overrides with flags
The most common Hydra overrides now have dedicated flags:
Common workflows, before and after
Run any command with --help to see its full set of flags, and gym --help to list every group.