nemo_automodel.components.datasets.llm.eagle3

Data helpers for minimal EAGLE-3 training.

Module Contents

Functions

_stack_batch

Stack a batch of pre-padded unshifted chat samples.

build_eagle3_dataloader

Build a dataloader backed by the repo’s chat formatting utilities.

build_eagle3_token_mapping

Build draft-vocab mapping tensors from supervised token frequency.

API

nemo_automodel.components.datasets.llm.eagle3._stack_batch(
features: list[dict[str, Any]],
) → dict[str, torch.Tensor]

Stack a batch of pre-padded unshifted chat samples.
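
As a rough illustration of what stacking pre-padded samples involves, the sketch below turns a list of per-sample dictionaries into a dictionary of batched tensors. It is not the module's implementation: the function name `stack_batch_sketch` and the field names `input_ids` and `attention_mask` are hypothetical, and the sketch assumes every sample carries the same keys with fields already padded to one common length.

```python
from typing import Any

import torch


def stack_batch_sketch(features: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
    """Stack per-sample fields into [batch_size, seq_length] tensors.

    Hypothetical stand-in for a collate function. Assumes all samples share
    the same keys and that each field is already padded to the same length.
    """
    keys = features[0].keys()
    return {
        key: torch.stack([torch.as_tensor(sample[key]) for sample in features])
        for key in keys
    }


# Two samples, both already padded to length 4 (field names are illustrative).
samples = [
    {"input_ids": [5, 6, 7, 0], "attention_mask": [1, 1, 1, 0]},
    {"input_ids": [8, 9, 0, 0], "attention_mask": [1, 1, 0, 0]},
]
batch = stack_batch_sketch(samples)
```

Because the samples are padded ahead of time, the collate step needs no padding logic of its own; `torch.stack` raises an error if the lengths differ.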

nemo_automodel.components.datasets.llm.eagle3.build_eagle3_dataloader(
*,
data_path: str,
tokenizer,
seq_length: int,
batch_size: int,
shuffle: bool,
num_workers: int = 0,
split: str | None = None,
distributed: bool = False,
shuffle_seed: int | None = 42,
) → torch.utils.data.DataLoader

Build a dataloader backed by the repo’s chat formatting utilities.

nemo_automodel.components.datasets.llm.eagle3.build_eagle3_token_mapping(
dataloader: torch.utils.data.DataLoader,
*,
target_vocab_size: int,
draft_vocab_size: int | None,
special_token_ids: list[int] | None = None,
) → tuple[torch.Tensor, torch.Tensor]

Build draft-vocab mapping tensors from supervised token frequency.

Counts are accumulated in a dense tensor of shape [target_vocab_size] and, when torch.distributed is initialized, summed across ranks with all_reduce, so every rank ends up with the same draft vocabulary.

Returns:

Tuple (selected_token_ids, selected_token_mask) where

  • selected_token_ids has shape [draft_vocab_size]

  • selected_token_mask has shape [target_vocab_size]

Return type:

tuple[torch.Tensor, torch.Tensor]
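
The counting and selection described for build_eagle3_token_mapping can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the library's code: the name `token_mapping_sketch`, the use of `-100` as the ignore index for unsupervised positions, the top-k selection rule, and the ascending order of the returned ids are all assumptions. It also leaves out `special_token_ids` and the `draft_vocab_size=None` case, whose handling the documentation above does not specify.

```python
import torch
import torch.distributed as dist


def token_mapping_sketch(label_batches, target_vocab_size, draft_vocab_size, ignore_index=-100):
    """Pick the most frequent supervised tokens as a draft vocabulary (illustrative only)."""
    # Dense per-token frequency counts over the full target vocabulary.
    counts = torch.zeros(target_vocab_size, dtype=torch.long)
    for labels in label_batches:
        # Keep only supervised positions (assumption: ignore_index marks the rest).
        supervised = labels[labels != ignore_index]
        counts += torch.bincount(supervised, minlength=target_vocab_size)

    # Sum counts across ranks so every rank selects the same tokens.
    # Note: with the NCCL backend the tensor would need to live on the GPU.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)

    # Select the draft_vocab_size most frequent token ids, sorted ascending.
    top_ids = torch.topk(counts, k=draft_vocab_size).indices
    selected_token_ids = torch.sort(top_ids).values

    # Boolean mask over the target vocabulary marking the selected ids.
    selected_token_mask = torch.zeros(target_vocab_size, dtype=torch.bool)
    selected_token_mask[selected_token_ids] = True
    return selected_token_ids, selected_token_mask


# Toy label batches: token 3 appears three times, token 1 twice, token 2 once.
label_batches = [
    torch.tensor([[-100, 3, 3, 1]]),
    torch.tensor([[-100, 1, 3, 2]]),
]
ids, mask = token_mapping_sketch(label_batches, target_vocab_size=6, draft_vocab_size=2)
```

In this toy run, `ids` has shape `[2]` and `mask` has shape `[6]`, matching the documented `[draft_vocab_size]` and `[target_vocab_size]` shapes. Reducing the counts before selection is what keeps the result identical on every rank, since each rank then runs the same top-k over the same global totals.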