
# nemo_gym.train_data_utils

## Module Contents

### Classes

| Name                                                                              | Description                                                                         |
| --------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| [`Accumulator`](#nemo_gym-train_data_utils-Accumulator)                           | -                                                                                   |
| [`AvgMinMax`](#nemo_gym-train_data_utils-AvgMinMax)                               | -                                                                                   |
| [`DatasetMetrics`](#nemo_gym-train_data_utils-DatasetMetrics)                     | -                                                                                   |
| [`DatasetValidatorState`](#nemo_gym-train_data_utils-DatasetValidatorState)       | -                                                                                   |
| [`StringMetrics`](#nemo_gym-train_data_utils-StringMetrics)                       | -                                                                                   |
| [`TrainDataProcessor`](#nemo_gym-train_data_utils-TrainDataProcessor)             | -                                                                                   |
| [`TrainDataProcessorConfig`](#nemo_gym-train_data_utils-TrainDataProcessorConfig) | Prepare and validate training data, generating metrics and statistics for datasets. |

### Functions

| Name                                                                                      | Description                                                                                                              |
| ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| [`aggregate_other_metrics`](#nemo_gym-train_data_utils-aggregate_other_metrics)           | Combines miscellaneous sample items (those other than the response and response create params) into the current metrics. |
| [`compute_sample_metrics`](#nemo_gym-train_data_utils-compute_sample_metrics)             | -                                                                                                                        |
| [`postprocess_other_metrics`](#nemo_gym-train_data_utils-postprocess_other_metrics)       | Aggregates metrics and merges the current metrics (containing only `AvgMinMax`) with `StringMetrics`.                    |
| [`prepare_data`](#nemo_gym-train_data_utils-prepare_data)                                 | -                                                                                                                        |
| [`validate_backend_credentials`](#nemo_gym-train_data_utils-validate_backend_credentials) | Checks whether the required environment variables are present for the chosen backend.                                    |

### API

<Anchor id="nemo_gym-train_data_utils-Accumulator">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.Accumulator()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** `BaseModel`

  <ParamField path="is_aggregated" type="bool = Field(default=False, exclude=True)" />

  <Anchor id="nemo_gym-train_data_utils-Accumulator-_add">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.Accumulator._add(
          other: typing.Self
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      abstract
    </Badge>
  </Indent>

  <Anchor id="nemo_gym-train_data_utils-Accumulator-_aggregate">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.Accumulator._aggregate() -> typing.Self
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      abstract
    </Badge>
  </Indent>

  <Anchor id="nemo_gym-train_data_utils-Accumulator-add">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.Accumulator.add(
          other: typing.Self
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-Accumulator-aggregate">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.Accumulator.aggregate() -> typing.Self
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
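
The public `add` and `aggregate` methods sit next to abstract `_add` and `_aggregate` hooks, which reads like a template-method pattern. The sketch below illustrates that pattern in plain Python rather than pydantic. `SketchAccumulator` and `SumAccumulator` are hypothetical names, and the guard on `is_aggregated` is an assumption about how the flag might be used, not confirmed behavior of `Accumulator`.

```python
from abc import ABC, abstractmethod
from typing import List


class SketchAccumulator(ABC):
    """Hypothetical stand-in for Accumulator (plain Python, not pydantic)."""

    def __init__(self) -> None:
        self.is_aggregated = False

    @abstractmethod
    def _add(self, other: "SketchAccumulator") -> None:
        """Fold another partial result into this one."""

    @abstractmethod
    def _aggregate(self) -> "SketchAccumulator":
        """Return a finalized copy of this accumulator."""

    def add(self, other: "SketchAccumulator") -> None:
        # Assumed guard: finalized results are not merged any further.
        if self.is_aggregated or other.is_aggregated:
            raise ValueError("cannot add to or from an aggregated accumulator")
        self._add(other)

    def aggregate(self) -> "SketchAccumulator":
        # Delegate to the subclass hook, then mark the result as final.
        result = self._aggregate()
        result.is_aggregated = True
        return result


class SumAccumulator(SketchAccumulator):
    """Toy subclass that collects integers and finalizes them into a sum."""

    def __init__(self, values: List[int]) -> None:
        super().__init__()
        self.values = list(values)

    def _add(self, other: "SumAccumulator") -> None:
        self.values.extend(other.values)

    def _aggregate(self) -> "SumAccumulator":
        return SumAccumulator([sum(self.values)])


acc = SumAccumulator([1, 2])
acc.add(SumAccumulator([3]))
final = acc.aggregate()
print(final.values, final.is_aggregated)  # [6] True
```

Keeping the bookkeeping in the public methods means each subclass only has to describe how its own fields combine.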

<Anchor id="nemo_gym-train_data_utils-AvgMinMax">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.AvgMinMax()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [Accumulator](#nemo_gym-train_data_utils-Accumulator)

  <ParamField path="M2" type="float = Field(default=0, exclude=True)" />

  <ParamField path="average" type="float = Field(serialization_alias='Average', default=0)" />

  <ParamField path="max" type="float" />

  <ParamField path="mean" type="float = Field(default=0, exclude=True)" />

  <ParamField path="min" type="float" />

  <ParamField path="model_config" type="= ConfigDict(arbitrary_types_allowed=True)" />

  <ParamField path="stddev" type="float" />

  <ParamField path="total" type="int" />

  <Anchor id="nemo_gym-train_data_utils-AvgMinMax-_add">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.AvgMinMax._add(
          other: typing.Self
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-AvgMinMax-_aggregate">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.AvgMinMax._aggregate() -> typing.Self
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-AvgMinMax-observe">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.AvgMinMax.observe(
          x: float
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
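
The `mean`, `M2`, and `total` fields, together with `observe`, resemble the state kept by Welford's online algorithm for mean and variance. The following is a minimal sketch of that technique, assuming this is what the fields are for; it is not the NeMo Gym implementation. `RunningStats`, `merge`, `min_value`, and `max_value` are hypothetical names, and the choice of population standard deviation is also an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical stand-in for AvgMinMax; names are illustrative only."""

    total: int = 0
    mean: float = 0.0
    M2: float = 0.0  # running sum of squared deviations from the mean
    min_value: float = math.inf
    max_value: float = -math.inf

    def observe(self, x: float) -> None:
        # Welford's update: one pass, no stored samples.
        self.total += 1
        delta = x - self.mean
        self.mean += delta / self.total
        self.M2 += delta * (x - self.mean)
        self.min_value = min(self.min_value, x)
        self.max_value = max(self.max_value, x)

    def merge(self, other: "RunningStats") -> None:
        # Pairwise combination of two partial results (Chan et al.).
        if other.total == 0:
            return
        n = self.total + other.total
        delta = other.mean - self.mean
        self.M2 += other.M2 + delta * delta * self.total * other.total / n
        self.mean += delta * other.total / n
        self.total = n
        self.min_value = min(self.min_value, other.min_value)
        self.max_value = max(self.max_value, other.max_value)

    def stddev(self) -> float:
        # Population standard deviation; the sample variant divides by total - 1.
        return math.sqrt(self.M2 / self.total) if self.total else 0.0


left, right = RunningStats(), RunningStats()
for x in (2, 4, 4, 4):
    left.observe(x)
for x in (5, 5, 7, 9):
    right.observe(x)
left.merge(right)
print(left.total, left.mean, left.stddev())
```

Merging two partial results this way gives the same statistics as observing all eight values in a single pass, which is what makes the approach suitable for combining per-dataset metrics.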

<Anchor id="nemo_gym-train_data_utils-DatasetMetrics">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.DatasetMetrics()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [Accumulator](#nemo_gym-train_data_utils-Accumulator)

  <ParamField path="json_dumped_number_of_words" type="AvgMinMax" />

  <ParamField path="model_config" type="= ConfigDict(extra='allow')" />

  <ParamField path="number_of_examples" type="int" />

  <ParamField path="number_of_tools" type="AvgMinMax" />

  <ParamField path="number_of_turns" type="AvgMinMax" />

  <ParamField path="temperature" type="AvgMinMax" />

  <Anchor id="nemo_gym-train_data_utils-DatasetMetrics-_add">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.DatasetMetrics._add(
          other: typing.Self
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-DatasetMetrics-_aggregate">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.DatasetMetrics._aggregate() -> typing.Self
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>

<Anchor id="nemo_gym-train_data_utils-DatasetValidatorState">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.DatasetValidatorState()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** `BaseModel`

  <ParamField path="key_counts" type="Counter = Field(default_factory=Counter)" />

  <ParamField path="metrics" type="DatasetMetrics = Field(default_factory=DatasetMetrics)" />

  <ParamField path="model_config" type="= ConfigDict(arbitrary_types_allowed=True)" />

  <ParamField path="offending_example_idxs" type="List[int] = Field(default_factory=list)" />

  <ParamField path="other_metrics" type="Dict[str, Any] = Field(default_factory=dict)" />
</Indent>

<Anchor id="nemo_gym-train_data_utils-StringMetrics">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.StringMetrics()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** `BaseModel`

  <ParamField path="total_count" type="int" />

  <ParamField path="unique_count" type="int" />
</Indent>
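
Judging by the field names alone, `total_count` and `unique_count` would hold the number of string values seen and the number of distinct values. The sketch below computes both under that assumption; `summarize_strings` is a hypothetical helper, not part of `nemo_gym`.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_strings(values: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: count all values and the distinct ones."""
    counts = Counter(values)
    return {
        "total_count": sum(counts.values()),
        "unique_count": len(counts),
    }


print(summarize_strings(["math", "code", "math"]))
# {'total_count': 3, 'unique_count': 2}
```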

<Anchor id="nemo_gym-train_data_utils-TrainDataProcessor">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.TrainDataProcessor()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** `BaseModel`

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-_collate_samples_single_type">
    <CodeBlock links={{"nemo_gym.config_types.DatasetType":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-DatasetType","nemo_gym.config_types.ServerInstanceConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-ServerInstanceConfig"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor._collate_samples_single_type(
          type: nemo_gym.config_types.DatasetType,
          server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig]
      ) -> typing.List[pathlib.Path]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-_iter_dataset_lines">
    <CodeBlock links={{"nemo_gym.config_types.DatasetConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-DatasetConfig"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor._iter_dataset_lines(
          dataset_config: nemo_gym.config_types.DatasetConfig
      )
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-_print_title">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor._print_title(
          title: str
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-_validate_aggregate_metrics">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor._validate_aggregate_metrics(
          aggregate_metrics_dict: typing.Dict,
          metrics_fpath: pathlib.Path
      ) -> typing.Optional[pathlib.Path]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Returns the path of the conflicting metrics file if the aggregate metrics are invalid; otherwise returns `None`.
  </Indent>

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-_validate_samples_and_aggregate_metrics_single_dataset">
    <CodeBlock links={{"nemo_gym.config_types.DatasetConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-DatasetConfig","nemo_gym.train_data_utils.DatasetValidatorState":"#nemo_gym-train_data_utils-DatasetValidatorState"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor._validate_samples_and_aggregate_metrics_single_dataset(
          dataset_config: nemo_gym.config_types.DatasetConfig
      ) -> nemo_gym.train_data_utils.DatasetValidatorState
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-_validate_samples_and_aggregate_metrics_single_sample">
    <CodeBlock links={{"nemo_gym.train_data_utils.DatasetValidatorState":"#nemo_gym-train_data_utils-DatasetValidatorState"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor._validate_samples_and_aggregate_metrics_single_sample(
          state: nemo_gym.train_data_utils.DatasetValidatorState,
          sample_idx: int,
          sample_dict_str: str
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-collate_samples">
    <CodeBlock links={{"nemo_gym.train_data_utils.TrainDataProcessorConfig":"#nemo_gym-train_data_utils-TrainDataProcessorConfig","nemo_gym.config_types.ServerInstanceConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-ServerInstanceConfig","nemo_gym.train_data_utils.DatasetMetrics":"#nemo_gym-train_data_utils-DatasetMetrics"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor.collate_samples(
          config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
          server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig],
          dataset_type_to_aggregate_metrics: typing.Dict[str, nemo_gym.train_data_utils.DatasetMetrics]
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-load_and_validate_server_instance_configs">
    <CodeBlock links={{"nemo_gym.train_data_utils.TrainDataProcessorConfig":"#nemo_gym-train_data_utils-TrainDataProcessorConfig","nemo_gym.config_types.ServerInstanceConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-ServerInstanceConfig"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor.load_and_validate_server_instance_configs(
          config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
          global_config_dict: omegaconf.DictConfig
      ) -> typing.List[nemo_gym.config_types.ServerInstanceConfig]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-load_datasets">
    <CodeBlock links={{"nemo_gym.train_data_utils.TrainDataProcessorConfig":"#nemo_gym-train_data_utils-TrainDataProcessorConfig","nemo_gym.config_types.ServerInstanceConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-ServerInstanceConfig"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor.load_datasets(
          config: nemo_gym.train_data_utils.TrainDataProcessorConfig,
          server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig]
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-run">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor.run(
          global_config_dict: omegaconf.DictConfig
      )
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    See the README section "How To: Prepare and validate data for PR submission or RL training".
  </Indent>

  <Anchor id="nemo_gym-train_data_utils-TrainDataProcessor-validate_samples_and_aggregate_metrics">
    <CodeBlock links={{"nemo_gym.config_types.ServerInstanceConfig":"/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-ServerInstanceConfig","nemo_gym.train_data_utils.DatasetMetrics":"#nemo_gym-train_data_utils-DatasetMetrics"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_gym.train_data_utils.TrainDataProcessor.validate_samples_and_aggregate_metrics(
          server_instance_configs: typing.List[nemo_gym.config_types.ServerInstanceConfig],
          overwrite_metrics_conflicts: bool
      ) -> typing.Dict[str, nemo_gym.train_data_utils.DatasetMetrics]
      ```
    </CodeBlock>
  </Anchor>

  <Indent />
</Indent>
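
`_validate_aggregate_metrics` is documented as returning a conflicting metrics path when validation fails and `None` otherwise. One way such a check could work is sketched below, under loudly stated assumptions: metrics are stored as JSON, a conflict means the stored and freshly computed dictionaries differ, and the new values are written to a sibling file. The function name `find_metrics_conflict` and the `_conflict.json` suffix are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional


def find_metrics_conflict(new_metrics: Dict, metrics_fpath: Path) -> Optional[Path]:
    """Hypothetical check: return a conflict file path, or None if consistent."""
    if not metrics_fpath.exists():
        return None  # nothing stored yet, so nothing to conflict with
    existing = json.loads(metrics_fpath.read_text())
    if existing == new_metrics:
        return None
    # Assumed convention: keep the stored file and write the new values beside it.
    conflict_fpath = metrics_fpath.with_name(metrics_fpath.stem + "_conflict.json")
    conflict_fpath.write_text(json.dumps(new_metrics, indent=2))
    return conflict_fpath


with tempfile.TemporaryDirectory() as tmp:
    fpath = Path(tmp) / "train_metrics.json"
    fpath.write_text(json.dumps({"number_of_examples": 5}))
    print(find_metrics_conflict({"number_of_examples": 5}, fpath))  # None
    conflict = find_metrics_conflict({"number_of_examples": 6}, fpath)
    print(conflict.name)  # train_metrics_conflict.json
```

Returning the path rather than raising lets the caller decide what to do with a mismatch, for example by consulting a flag such as `overwrite_metrics_conflicts`.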

<Anchor id="nemo_gym-train_data_utils-TrainDataProcessorConfig">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_gym.train_data_utils.TrainDataProcessorConfig()
    ```
  </CodeBlock>
</Anchor>

<Indent>
  **Bases:** [BaseNeMoGymCLIConfig](/nemo-gym/nemo_gym/config_types#nemo_gym-config_types-BaseNeMoGymCLIConfig)

  Prepare and validate training data, generating metrics and statistics for datasets.

  Examples:

  <CodeBlock showLineNumbers={false}>
    ```bash
    config_paths="resources_servers/example_multi_step/configs/example_multi_step.yaml,\
    responses_api_models/openai_model/configs/openai_model.yaml"
    ng_prepare_data "+config_paths=[${config_paths}]" \
        +output_dirpath=data/example_multi_step \
        +mode=example_validation
    ```
  </CodeBlock>

  <ParamField path="data_source" type="Literal['gitlab', 'huggingface']" />

  <ParamField path="in_scope_dataset_types" type="List[DatasetType]" />

  <ParamField path="mode" type="Union[Literal['train_preparation'], Literal['example_validation']]" />

  <ParamField path="output_dirpath" type="str" />

  <ParamField path="overwrite_metrics_conflicts" type="bool" />

  <ParamField path="should_download" type="bool" />
</Indent>

<Anchor id="nemo_gym-train_data_utils-aggregate_other_metrics">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_gym.train_data_utils.aggregate_other_metrics(
        metrics: typing.Dict[str, typing.Any],
        sample: typing.Dict[str, typing.Any]
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Combines miscellaneous sample items (those other than the response and response create params) into the current metrics.
</Indent>

<Anchor id="nemo_gym-train_data_utils-compute_sample_metrics">
  <CodeBlock links={{"nemo_gym.train_data_utils.DatasetMetrics":"#nemo_gym-train_data_utils-DatasetMetrics"}} showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_gym.train_data_utils.compute_sample_metrics(
        sample_dict_str: str
    ) -> typing.Tuple[nemo_gym.train_data_utils.DatasetMetrics, bool]
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_gym-train_data_utils-postprocess_other_metrics">
  <CodeBlock links={{"nemo_gym.train_data_utils.DatasetMetrics":"#nemo_gym-train_data_utils-DatasetMetrics"}} showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_gym.train_data_utils.postprocess_other_metrics(
        metrics: nemo_gym.train_data_utils.DatasetMetrics,
        other_metrics: typing.Dict[str, typing.Any]
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Aggregates metrics and merges the current metrics (containing only `AvgMinMax`) with `StringMetrics`.
</Indent>

<Anchor id="nemo_gym-train_data_utils-prepare_data">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_gym.train_data_utils.prepare_data()
    ```
  </CodeBlock>
</Anchor>

<Indent />

<Anchor id="nemo_gym-train_data_utils-validate_backend_credentials">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_gym.train_data_utils.validate_backend_credentials(
        backend: str
    ) -> tuple[bool, str]
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Checks whether the required environment variables are present for the chosen backend.
</Indent>
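
A credential check with a `tuple[bool, str]` return value can be sketched as follows. This is an illustration only: `check_backend_credentials`, the `env` parameter, and every variable name in `REQUIRED_ENV_VARS` are hypothetical placeholders, since this page does not list the variables that `validate_backend_credentials` looks for.

```python
import os
from typing import Dict, List, Mapping, Optional, Tuple

# Placeholder names only; the real required variables are not documented here.
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "gitlab": ["EXAMPLE_GITLAB_URI", "EXAMPLE_GITLAB_TOKEN"],
    "huggingface": ["EXAMPLE_HF_TOKEN"],
}


def check_backend_credentials(
    backend: str, env: Optional[Mapping[str, str]] = None
) -> Tuple[bool, str]:
    """Return (is_valid, error_message) for the chosen backend."""
    env = os.environ if env is None else env
    if backend not in REQUIRED_ENV_VARS:
        return False, f"Unknown backend: {backend!r}"
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_ENV_VARS[backend] if not env.get(name)]
    if missing:
        return False, f"Missing environment variables for {backend}: {', '.join(missing)}"
    return True, ""


print(check_backend_credentials("huggingface", env={}))
# (False, 'Missing environment variables for huggingface: EXAMPLE_HF_TOKEN')
```

Passing the environment as a mapping keeps the sketch easy to test without touching the real process environment.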