
# nemo_automodel.components.training.step_scheduler

## Module Contents

### Classes

| Name                                                                                            | Description                                                           |
| ----------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`StepScheduler`](#nemo_automodel-components-training-step_scheduler-StepScheduler)             | Scheduler for managing gradient accumulation and checkpointing steps. |
| [`StepSchedulerConfig`](#nemo_automodel-components-training-step_scheduler-StepSchedulerConfig) | User-facing step scheduler configuration.                             |

### Functions

| Name                                                                                                | Description                                                    |
| --------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- |
| [`_calculate_max_steps`](#nemo_automodel-components-training-step_scheduler-_calculate_max_steps)   | Calculate the maximum number of steps.                         |
| [`_calculate_num_epochs`](#nemo_automodel-components-training-step_scheduler-_calculate_num_epochs) | Calculate the number of epochs from the maximum number of steps. |

### Data

[`logger`](#nemo_automodel-components-training-step_scheduler-logger)

### API

```python
class nemo_automodel.components.training.step_scheduler.StepScheduler(
    global_batch_size: int,
    local_batch_size: int,
    dp_size: int,
    dataloader: typing.Optional[int],
    ckpt_every_steps: typing.Optional[int] = None,
    save_checkpoint_every_epoch: bool = True,
    val_every_steps: typing.Optional[int] = None,
    log_remote_every_steps: int = 1,
    loss_average_window_steps: int = 50,
    gc_every_steps: typing.Optional[int] = None,
    start_step: int = 0,
    start_epoch: int = 0,
    num_epochs: typing.Optional[int] = None,
    max_steps: typing.Optional[int] = None
)
```

**Bases:** `Stateful`

Scheduler for managing gradient accumulation and checkpointing steps.
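As a rough illustration of the gradient-accumulation bookkeeping, the sketch below assumes the usual relation between the three batch-size arguments: each optimizer step consumes `global_batch_size` samples, and every micro-batch contributes `local_batch_size` samples on each of `dp_size` data-parallel ranks. The helper name `grad_acc_steps` is hypothetical and not part of this module, and the actual implementation may differ.

```python
def grad_acc_steps(global_batch_size: int, local_batch_size: int, dp_size: int) -> int:
    """Micro-batches per optimizer step under the assumed relation (hypothetical helper)."""
    samples_per_micro_step = local_batch_size * dp_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by local_batch_size * dp_size"
        )
    return global_batch_size // samples_per_micro_step


# 32 samples per optimizer step, 2 samples per rank, 4 ranks -> 4 micro-batches
print(grad_acc_steps(global_batch_size=32, local_batch_size=2, dp_size=4))
```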

Epoch iterator.

Returns whether a checkpoint should be saved at this step.

Returns whether this step needs to run manual garbage collection.

Returns whether this is the last batch for this epoch.

Returns whether the current step is the final training step.

Training stops at whichever comes first: reaching `max_steps` or
exhausting the configured number of epochs (see `__iter__` and
`epochs`). Checking `max_steps` alone is therefore not enough to detect
the end: a small dataset can run out of epochs long before `max_steps`
is reached (e.g. `max_steps=100` with only 60 steps' worth of data). In
that case the last batch of the last epoch is the final step. This case
is detected so that the final checkpoint and consolidated export, which
key off this flag (see `is_ckpt_step` and the recipes'
`is_final_checkpoint`), are still written.
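The two stopping conditions described above can be sketched as a small standalone predicate. This is a minimal illustration, not the scheduler's code: the function name `is_final_step`, its arguments, and the zero-based counter convention are all assumptions made for the example.

```python
def is_final_step(
    step: int,
    max_steps: int,
    epoch: int,
    num_epochs: int,
    is_last_batch: bool,
) -> bool:
    """Hypothetical stand-in: True when training ends after this step.

    Assumes `step` and `epoch` are zero-based counters.
    """
    reached_max_steps = step + 1 >= max_steps
    exhausted_epochs = is_last_batch and epoch + 1 >= num_epochs
    return reached_max_steps or exhausted_epochs


# max_steps=100, but one epoch holds only 60 steps of data:
# the 60th step (index 59) is the last batch of the last epoch.
print(is_final_step(step=59, max_steps=100, epoch=0, num_epochs=1, is_last_batch=True))   # True
print(is_final_step(step=58, max_steps=100, epoch=0, num_epochs=1, is_last_batch=False))  # False
```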

Returns whether this step should log to remote services (WandB, MLflow, etc.).

Returns whether validation should run at this step.

Returns whether SIGTERM was received.

```python
nemo_automodel.components.training.step_scheduler.StepScheduler.__iter__()
```

Iterates over the dataloader while keeping track of counters.

**Raises:**

* `StopIteration`: If the dataloader is exhausted or `max_steps` is reached.
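To make the iteration behavior concrete, here is a minimal generator sketch that groups micro-batches into optimizer steps and stops when either the data runs out or a step limit is reached. The function `iterate_steps` and its parameters are hypothetical, and yielding a trailing partial group is an assumption of this sketch rather than documented behavior.

```python
from typing import Iterable, Iterator, List


def iterate_steps(batches: Iterable, grad_acc_steps: int, max_steps: int) -> Iterator[List]:
    """Yield one list of micro-batches per optimizer step (hypothetical sketch)."""
    step = 0
    group: List = []
    for batch in batches:
        group.append(batch)
        if len(group) == grad_acc_steps:
            yield group
            group = []
            step += 1
            if step >= max_steps:
                return  # step limit reached; the generator ends here
    if group:
        yield group  # assumption: a trailing partial group is still yielded


print(list(iterate_steps(range(10), grad_acc_steps=2, max_steps=3)))  # [[0, 1], [2, 3], [4, 5]]
print(list(iterate_steps(range(5), grad_acc_steps=2, max_steps=10)))  # [[0, 1], [2, 3], [4]]
```

Ending a generator with `return` is what surfaces as `StopIteration` to a caller driving it with `next()`.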

```python
nemo_automodel.components.training.step_scheduler.StepScheduler.load_state_dict(
    s
)
```

Load the scheduler state from a dictionary.

**Parameters:**

* `s`: Dictionary containing the 'step' and 'epoch' keys.

```python
nemo_automodel.components.training.step_scheduler.StepScheduler.set_epoch(
    epoch: int
)
```

Set the epoch for the sampler.

```python
nemo_automodel.components.training.step_scheduler.StepScheduler.state_dict()
```

Get the current state of the scheduler.

**Returns:**

Current state with 'step' and 'epoch' keys.
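The save and restore round trip can be illustrated with a small stand-in class that tracks only the two counters named above. `TinyScheduler` is hypothetical and exists only for this example; it mirrors the `state_dict` / `load_state_dict` pair without any of the real scheduler's other state.

```python
class TinyScheduler:
    """Hypothetical stand-in that tracks only the two checkpointed counters."""

    def __init__(self, step: int = 0, epoch: int = 0):
        self.step = step
        self.epoch = epoch

    def state_dict(self) -> dict:
        # Same keys as documented for StepScheduler.state_dict().
        return {"step": self.step, "epoch": self.epoch}

    def load_state_dict(self, s: dict) -> None:
        self.step = s["step"]
        self.epoch = s["epoch"]


saved = TinyScheduler(step=120, epoch=2).state_dict()
resumed = TinyScheduler()
resumed.load_state_dict(saved)
print(resumed.step, resumed.epoch)  # 120 2
```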

```python
class nemo_automodel.components.training.step_scheduler.StepSchedulerConfig(
    global_batch_size: int = 32,
    num_epochs: int | None = 10,
    max_steps: int | None = None,
    ckpt_every_steps: int | None = 100,
    save_checkpoint_every_epoch: bool = True,
    val_every_steps: int | None = None,
    log_remote_every_steps: int = 1,
    loss_average_window_steps: int = 50,
    gc_every_steps: int | None = None,
    start_step: int = 0,
    start_epoch: int = 0
)
```

Dataclass

User-facing step scheduler configuration.

These fields correspond to the YAML-configurable parameters of the
training loop.  Runtime-only values (`dataloader`, `dp_size`,
`local_batch_size`) are passed separately to `build_step_scheduler`.

```python
nemo_automodel.components.training.step_scheduler.StepSchedulerConfig.build(
    dataloader: torch.utils.data.DataLoader,
    dp_group_size: int,
    local_batch_size: int
) -> nemo_automodel.components.training.step_scheduler.StepScheduler
```

Build the step scheduler.

**Parameters:**

* `dataloader`: The training dataloader.
* `dp_group_size`: The size of the data parallel group.
* `local_batch_size`: The size of the local batch.

**Returns:** `StepScheduler`

Configured StepScheduler.

```python
nemo_automodel.components.training.step_scheduler._calculate_max_steps(
    num_epochs: int,
    epoch_len: typing.Optional[int],
    default_max_steps: int = 9223372036854775807
) -> int
```

Calculate the maximum number of steps.
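The function body is not shown on this page. One plausible reading of the signature, offered purely as an assumption, is that the step budget is `num_epochs * epoch_len` when the epoch length is known and the default otherwise. The name `calculate_max_steps_sketch` is hypothetical.

```python
from typing import Optional

# Same value as the documented default (2**63 - 1).
DEFAULT_MAX_STEPS = 9223372036854775807


def calculate_max_steps_sketch(
    num_epochs: int,
    epoch_len: Optional[int],
    default_max_steps: int = DEFAULT_MAX_STEPS,
) -> int:
    """Assumed behavior only: fall back to the default when epoch_len is unknown."""
    if epoch_len is None:
        return default_max_steps
    return num_epochs * epoch_len


print(calculate_max_steps_sketch(num_epochs=3, epoch_len=60))    # 180
print(calculate_max_steps_sketch(num_epochs=3, epoch_len=None))  # 9223372036854775807
```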

```python
nemo_automodel.components.training.step_scheduler._calculate_num_epochs(
    max_steps: typing.Optional[int],
    epoch_len: typing.Optional[int],
    default_num_epochs: int = 10
) -> int
```

Calculate the number of epochs from the maximum number of steps.
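As with the previous helper, the body is not shown here. A minimal sketch, assuming the epoch count is the step budget divided by the epoch length and rounded up, with the default used when either value is unknown, could look like this. The name `calculate_num_epochs_sketch` and the rounding choice are assumptions.

```python
import math
from typing import Optional


def calculate_num_epochs_sketch(
    max_steps: Optional[int],
    epoch_len: Optional[int],
    default_num_epochs: int = 10,
) -> int:
    """Assumed behavior only: enough whole epochs to cover max_steps."""
    if max_steps is None or epoch_len is None:
        return default_num_epochs
    return math.ceil(max_steps / epoch_len)


print(calculate_num_epochs_sketch(max_steps=100, epoch_len=60))   # 2
print(calculate_num_epochs_sketch(max_steps=None, epoch_len=60))  # 10
```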

```python
nemo_automodel.components.training.step_scheduler.logger = logging.getLogger(__name__)
```