# nemo_automodel.components.datasets.llm.megatron.helpers

## Module Contents

### Functions

| Name                                                                                            | Description                                                                                 |
| ----------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| [`build_sample_idx`](#nemo_automodel-components-datasets-llm-megatron-helpers-build_sample_idx) | Build the 2-D sample index using the properly typed templated C++ function from helpers.cpp |

### API

```python
nemo_automodel.components.datasets.llm.megatron.helpers.build_sample_idx(
    sizes: numpy.ndarray,
    document_indices: numpy.ndarray,
    sequence_length: int,
    num_epochs: int,
    tokens_per_epoch: int,
    drop_last_partial_sequence: bool = True,
    add_extra_token_to_sequence: bool = True
)
```

Build the 2-D sample index using the properly typed, templated C++ function from `helpers.cpp`.

**Parameters:**

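`sizes` (`numpy.ndarray`): The 1-D array of document lengths.

`document_indices` (`numpy.ndarray`): The 1-D array of document indices.

`sequence_length` (`int`): The sequence length.

`num_epochs` (`int`): The number of epochs.

`tokens_per_epoch` (`int`): The number of tokens per epoch.

`drop_last_partial_sequence` (`bool`): Whether to omit the last partial sequence in the sample index should it exist. Defaults to True.

`add_extra_token_to_sequence` (`bool`): Whether to build samples with sequence length `sequence_length + 1`. Defaults to True.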

**Returns:**

`numpy.ndarray`: The 2-D sample index.
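
To make the idea of a 2-D sample index concrete, here is a minimal pure-Python sketch. It is not the compiled `helpers.cpp` routine. It assumes the layout commonly used by Megatron-style datasets, which this page does not confirm: one row per sample boundary, holding a position in `document_indices` and a token offset within that document. The function name `sketch_sample_idx` is hypothetical, and the sketch only covers the case where the last partial sequence is dropped.

```python
import numpy as np


def sketch_sample_idx(sizes, document_indices, sequence_length,
                      num_epochs, tokens_per_epoch,
                      add_extra_token_to_sequence=True):
    """Illustrative only: hypothetical stand-in, not the library function.

    Assumes row i is (position in document_indices, token offset) marking
    where sample i starts; the last partial sequence is always dropped.
    """
    extra = 1 if add_extra_token_to_sequence else 0
    # Number of full samples that fit in the available tokens.
    num_samples = (num_epochs * tokens_per_epoch - extra) // sequence_length
    sample_idx = np.zeros((num_samples + 1, 2), dtype=np.int64)

    doc_pos, offset = 0, 0  # row 0 stays (0, 0): the first sample's start
    for i in range(1, num_samples + 1):
        remaining = sequence_length + extra
        while remaining > 0:
            available = int(sizes[document_indices[doc_pos]]) - offset
            if available >= remaining:
                # The sample ends inside this document. Step back by `extra`
                # so the next sample re-reads the overlapping token.
                offset += remaining - extra
                remaining = 0
            else:
                # Consume the rest of this document and move to the next one.
                remaining -= available
                doc_pos += 1
                offset = 0
        sample_idx[i] = (doc_pos, offset)
    return sample_idx


# Three documents of 5, 3, and 4 tokens, visited in order for one epoch.
sizes = np.array([5, 3, 4], dtype=np.int32)
document_indices = np.array([0, 1, 2], dtype=np.int32)
idx = sketch_sample_idx(sizes, document_indices, sequence_length=4,
                        num_epochs=1, tokens_per_epoch=12)
print(idx.tolist())  # [[0, 0], [0, 4], [2, 0]]
```

Under these assumptions, rows `i` and `i + 1` bracket sample `i`: the first sample spans all five tokens of document 0, and the second starts at the last token of document 0 and ends on the first token of document 2. The real function may differ in dtype and edge-case handling.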