# nemo_gym.train_data_utils

## Module Contents

### Classes

| Name                                                                              | Description                                                                         |
| --------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| [`Accumulator`](#nemo_gym-train_data_utils-Accumulator)                           | -                                                                                   |
| [`AvgMinMax`](#nemo_gym-train_data_utils-AvgMinMax)                               | -                                                                                   |
| [`DatasetMetrics`](#nemo_gym-train_data_utils-DatasetMetrics)                     | -                                                                                   |
| [`DatasetValidatorState`](#nemo_gym-train_data_utils-DatasetValidatorState)       | -                                                                                   |
| [`StringMetrics`](#nemo_gym-train_data_utils-StringMetrics)                       | -                                                                                   |
| [`TrainDataProcessor`](#nemo_gym-train_data_utils-TrainDataProcessor)             | -                                                                                   |
| [`TrainDataProcessorConfig`](#nemo_gym-train_data_utils-TrainDataProcessorConfig) | Prepare and validate training data, generating metrics and statistics for datasets. |

### Functions

| Name                                                                                      | Description                                                                                  |
| ----------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
| [`aggregate_other_metrics`](#nemo_gym-train_data_utils-aggregate_other_metrics)           | Combines misc items (those other than response/response create params) into current metrics  |
| [`compute_sample_metrics`](#nemo_gym-train_data_utils-compute_sample_metrics)             | -                                                                                            |
| [`postprocess_other_metrics`](#nemo_gym-train_data_utils-postprocess_other_metrics)       | Aggregates metrics and merges current metrics (containing only AvgMinMax) with StringMetrics |
| [`prepare_data`](#nemo_gym-train_data_utils-prepare_data)                                 | -                                                                                            |
| [`validate_backend_credentials`](#nemo_gym-train_data_utils-validate_backend_credentials) | Check if required env variables are present for the chosen backend                           |

### API

```python
class nemo_gym.train_data_utils.Accumulator()
```

**Bases:** `BaseModel`

```python
nemo_gym.train_data_utils.Accumulator._add(
    other: typing.Self
) -> None
```

Abstract method. Subclasses must implement it.

```python
nemo_gym.train_data_utils.Accumulator._aggregate() -> typing.Self
```

Abstract method. Subclasses must implement it.

```python
nemo_gym.train_data_utils.Accumulator.add(
    other: typing.Self
) -> None
```

```python
nemo_gym.train_data_utils.Accumulator.aggregate() -> typing.Self
```

```python
class nemo_gym.train_data_utils.AvgMinMax()
```

**Bases:** [Accumulator](#nemo_gym-train_data_utils-Accumulator)

```python
nemo_gym.train_data_utils.AvgMinMax._add(
    other: typing.Self
) -> None
```

```python
nemo_gym.train_data_utils.AvgMinMax._aggregate() -> typing.Self
```

```python
nemo_gym.train_data_utils.AvgMinMax.observe(
    x: float
) -> None
```
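The reference does not document fields or behavior for `AvgMinMax`, so the following is only a minimal, standalone sketch of how an observe/add/aggregate accumulator of this kind might work. `RunningStats` and all of its field names are hypothetical, and it uses a plain dataclass rather than the `BaseModel` base of the real classes. It is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RunningStats:
    """Hypothetical stand-in for an AvgMinMax-style accumulator (not the nemo_gym class)."""

    total: float = 0.0
    count: int = 0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def observe(self, x: float) -> None:
        # Fold a single value into the running totals.
        self.total += x
        self.count += 1
        self.minimum = x if self.minimum is None else min(self.minimum, x)
        self.maximum = x if self.maximum is None else max(self.maximum, x)

    def add(self, other: "RunningStats") -> None:
        # Merge another accumulator, e.g. one built from a different sample.
        self.total += other.total
        self.count += other.count
        if other.minimum is not None:
            self.minimum = other.minimum if self.minimum is None else min(self.minimum, other.minimum)
        if other.maximum is not None:
            self.maximum = other.maximum if self.maximum is None else max(self.maximum, other.maximum)

    def aggregate(self) -> Dict[str, Optional[float]]:
        # Reduce the running totals to final summary statistics.
        average = self.total / self.count if self.count else None
        return {"avg": average, "min": self.minimum, "max": self.maximum}


# Usage: accumulate per-sample values separately, then merge and summarize.
first = RunningStats()
first.observe(2.0)
first.observe(4.0)
second = RunningStats()
second.observe(10.0)
first.add(second)
summary = first.aggregate()
```

Keeping sums and counts instead of a precomputed average is what makes merging two accumulators exact. The final division happens only once, in `aggregate`.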

```python
class nemo_gym.train_data_utils.DatasetMetrics()
```

**Bases:** [Accumulator](#nemo_gym-train_data_utils-Accumulator)

```python
nemo_gym.train_data_utils.DatasetMetrics._add(
    other: typing.Self
) -> None
```

```python
nemo_gym.train_data_utils.DatasetMetrics._aggregate() -> typing.Self
```

```python
class nemo_gym.train_data_utils.DatasetValidatorState()
```

**Bases:** `BaseModel`

```python
class nemo_gym.train_data_utils.StringMetrics()
```

**Bases:** `BaseModel`

```python
class nemo_gym.train_data_utils.TrainDataProcessor()
```

**Bases:** `BaseModel`

```python
nemo_gym.train_data_utils.TrainDataProcessor._collate_samples_single_type(
    type: nemo_gym.config_types.DatasetType,
    server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig]
) -> typing.List[pathlib.Path]
```

```python
nemo_gym.train_data_utils.TrainDataProcessor._iter_dataset_lines(
    dataset_config: nemo_gym.config_types.DatasetConfig
)
```

```python
nemo_gym.train_data_utils.TrainDataProcessor._print_title(
    title: str
) -> None
```

```python
nemo_gym.train_data_utils.TrainDataProcessor._validate_aggregate_metrics(
    aggregate_metrics_dict: typing.Dict,
    metrics_fpath: pathlib.Path
) -> typing.Optional[pathlib.Path]
```

Returns the path of the conflicting metrics file if validation fails; otherwise returns `None`.

```python
nemo_gym.train_data_utils.TrainDataProcessor._validate_samples_and_aggregate_metrics_single_dataset(
    dataset_config: nemo_gym.config_types.DatasetConfig
) -> nemo_gym.train_data_utils.DatasetValidatorState
```

```python
nemo_gym.train_data_utils.TrainDataProcessor._validate_samples_and_aggregate_metrics_single_sample(
    state: nemo_gym.train_data_utils.DatasetValidatorState,
    sample_idx: int,
    sample_dict_str: str
) -> None
```

```python
nemo_gym.train_data_utils.TrainDataProcessor.collate_samples(
    config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
    server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig],
    dataset_type_to_aggregate_metrics: typing.Dict[str, nemo_gym.train_data_utils.DatasetMetrics]
) -> None
```

```python
nemo_gym.train_data_utils.TrainDataProcessor.load_and_validate_server_instance_configs(
    config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
    global_config_dict: omegaconf.DictConfig
) -> typing.List[nemo_gym.config_types.ServerInstanceConfig]
```

```python
nemo_gym.train_data_utils.TrainDataProcessor.load_datasets(
    config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
    server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig]
) -> None
```

```python
nemo_gym.train_data_utils.TrainDataProcessor.run(
    global_config_dict: omegaconf.DictConfig
)
```

See the README section "How To: Prepare and validate data for PR submission or RL training".

```python
nemo_gym.train_data_utils.TrainDataProcessor.validate_samples_and_aggregate_metrics(
    server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig],
    overwrite_metrics_conflicts: bool
) -> typing.Dict[str, nemo_gym.train_data_utils.DatasetMetrics]
```

```python
class nemo_gym.train_data_utils.TrainDataProcessorConfig()
```

**Bases:** [BaseNeMoGymCLIConfig](/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-BaseNeMoGymCLIConfig)

Prepare and validate training data, generating metrics and statistics for datasets.

Examples:

```bash
config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml"
ng_prepare_data "+config_paths=[${config_paths}]" \
    +output_dirpath=data/example_multi_step \
    +mode=example_validation
```

```python
nemo_gym.train_data_utils.aggregate_other_metrics(
    metrics: typing.Dict[str, typing.Any],
    sample: typing.Dict[str, typing.Any]
) -> None
```

Combines miscellaneous items (those other than the response and the response create params) into the current metrics.

```python
nemo_gym.train_data_utils.compute_sample_metrics(
    sample_dict_str: str
) -> typing.Tuple[nemo_gym.train_data_utils.DatasetMetrics, bool]
```

```python
nemo_gym.train_data_utils.postprocess_other_metrics(
    metrics: nemo_gym.train_data_utils.DatasetMetrics,
    other_metrics: typing.Dict[str, typing.Any]
) -> None
```

Aggregates metrics and merges the current metrics (containing only `AvgMinMax` values) with `StringMetrics`.

```python
nemo_gym.train_data_utils.prepare_data()
```

```python
nemo_gym.train_data_utils.validate_backend_credentials(
    backend: str
) -> tuple[bool, str]
```

Checks whether the environment variables required by the chosen backend are present.
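A check with this `(bool, str)` return shape could look roughly like the sketch below. The function name `check_backend_env`, the backend name `example_backend`, and the variable names in `REQUIRED_ENV_VARS` are all hypothetical placeholders. The reference does not list the backends or variables that the real function inspects, nor what its message string contains.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple

# Hypothetical backend-to-variable mapping, for illustration only.
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "example_backend": ["EXAMPLE_BACKEND_TOKEN", "EXAMPLE_BACKEND_URL"],
}


def check_backend_env(
    backend: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[bool, str]:
    """Return (ok, message); message is empty when all variables are set."""
    # Default to the real process environment; tests can pass a plain dict.
    env = os.environ if env is None else env
    if backend not in REQUIRED_ENV_VARS:
        return False, f"Unknown backend: {backend}"
    # Treat unset and empty-string variables as missing.
    missing = [name for name in REQUIRED_ENV_VARS[backend] if not env.get(name)]
    if missing:
        return False, f"Missing environment variables for {backend}: {', '.join(missing)}"
    return True, ""


# Usage: fail early with a readable message before doing any backend work.
ok, message = check_backend_env("example_backend", env={"EXAMPLE_BACKEND_TOKEN": "t"})
```

Returning a message alongside the boolean lets the caller decide whether to print, log, or raise.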