nemo_automodel.components.datasets.llm.eagle3

Data helpers for minimal EAGLE-3 training.

Module Contents

Functions

| Name | Description |
| --- | --- |
| _broadcast_cached_ids | Rank 0 loads (and validates) the cached ids; broadcast the result to all ranks. |
| _expected_draft_vocab_size | Return how many ids build_eagle3_token_mapping yields for this config. |
| _pack_collate | Collate packed rows; ragged seq_lens is 0-padded to [B, max_docs]. |
| _stack_batch | Stack a batch of pre-padded unshifted chat samples. |
| build_eagle3_dataloader | Build a dataloader backed by the repo’s chat formatting utilities. |
| build_eagle3_token_mapping | Build draft-vocab mapping tensors from supervised token frequency. |
| build_packed_eagle3_dataset | Greedily pack variable-length chat samples into rows of packed_sequence_size. |
| load_eagle3_token_mapping | Load a cached draft-vocab mapping, or None if absent / incompatible. |
| load_or_build_eagle3_token_mapping | Build the draft-vocab mapping, reusing a cached copy at cache_path. |
| save_eagle3_token_mapping | Persist the draft-vocab selection so future runs skip the frequency scan. |

Data

logger

API

nemo_automodel.components.datasets.llm.eagle3._broadcast_cached_ids(
cache_path: str,
target_vocab_size: int,
draft_vocab_size: int | None
) -> torch.Tensor | None

Rank 0 loads (and validates) the cached ids; broadcast the result to all ranks.

Only rank 0 touches the filesystem, so the load-vs-build decision is identical on every rank even when cache_path lives on a node-local (non-shared) filesystem. This matters because build_eagle3_token_mapping issues a collective all_reduce: if some ranks loaded a cache while others rebuilt, that collective would mismatch and hang. Returns the ids as a CPU tensor of dtype long, or None, in which case every rank rebuilds.

nemo_automodel.components.datasets.llm.eagle3._expected_draft_vocab_size(
target_vocab_size: int,
draft_vocab_size: int | None
) -> int

Return how many ids build_eagle3_token_mapping yields for this config.

Mirrors its selection branch: a None or too-large draft_vocab_size falls back to the full target vocab.
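The fallback rule above can be sketched in a few lines. This is a minimal illustration, not the module's code: the function name is a hypothetical stand-in, and treating "too large" as "at or above target_vocab_size" is an assumption (both readings return the same value at equality).

```python
from typing import Optional


def expected_draft_vocab_size(target_vocab_size: int, draft_vocab_size: Optional[int]) -> int:
    """Hypothetical stand-in for _expected_draft_vocab_size."""
    # None means no draft-vocab reduction was requested; a request at or above
    # the target size cannot shrink anything. Both cases use the full vocab.
    if draft_vocab_size is None or draft_vocab_size >= target_vocab_size:
        return target_vocab_size
    return draft_vocab_size
```

For example, a target vocab of 32000 with draft_vocab_size=8000 gives 8000, while None or 64000 both give 32000.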

nemo_automodel.components.datasets.llm.eagle3._pack_collate(
features: list[dict[str, typing.Any]]
) -> dict[str, torch.Tensor]

Collate packed rows; ragged seq_lens is 0-padded to [B, max_docs].
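To make the 0-padding concrete, here is a small sketch using plain Python lists. The real function returns torch tensors; the helper name pad_seq_lens is hypothetical and only illustrates how ragged per-row document counts become a rectangular [B, max_docs] layout.

```python
from typing import List


def pad_seq_lens(seq_lens_per_row: List[List[int]]) -> List[List[int]]:
    """Right-pad each row's document lengths with zeros to a common width."""
    # max_docs is the largest number of documents packed into any row.
    max_docs = max(len(lens) for lens in seq_lens_per_row)
    return [lens + [0] * (max_docs - len(lens)) for lens in seq_lens_per_row]
```

A zero entry then simply means "no document in this slot", which is unambiguous as long as every real document contributes at least one token.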

nemo_automodel.components.datasets.llm.eagle3._stack_batch(
features: list[dict[str, typing.Any]]
) -> dict[str, torch.Tensor]

Stack a batch of pre-padded unshifted chat samples.

nemo_automodel.components.datasets.llm.eagle3.build_eagle3_dataloader(
data_path: str,
tokenizer,
seq_length: int,
batch_size: int,
shuffle: bool,
num_workers: int = 0,
split: str | None = None,
distributed: bool = False,
shuffle_seed: int | None = 42,
mask_reasoning_content: bool = False,
packed_sequence_size: int = 0,
dp_mesh = None
) -> torch.utils.data.DataLoader

Build a dataloader backed by the repo’s chat formatting utilities.

packed_sequence_size > 0 (EAGLE-3 only) enables sequence packing (see build_packed_eagle3_dataset), which removes the padding waste of the default padding="max_length" path. packed_sequence_size == 0 keeps the original behavior.

dp_mesh (the “dp” device submesh) is required for context parallelism: the sampler then distributes by data-parallel rank so the cp_size ranks within a dp group receive the identical sample (CP shards its sequence across them). When None the sampler falls back to the full-world default (pure DP).
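The effect of sampling by data-parallel rank can be illustrated without any distributed setup. The sketch below assumes a hypothetical rank layout (global_rank = dp_rank * cp_size + cp_rank) and a strided split of sample indices similar to what torch.utils.data.DistributedSampler does; neither detail is confirmed by this page.

```python
from typing import Dict, List


def samples_per_rank(num_samples: int, dp_size: int, cp_size: int) -> Dict[int, List[int]]:
    """Map each global rank to the sample indices it would draw."""
    assignment: Dict[int, List[int]] = {}
    for rank in range(dp_size * cp_size):
        # Assumed layout: consecutive global ranks form one CP group.
        dp_rank = rank // cp_size
        # Shard by dp_rank, not by global rank, so CP peers see the same samples.
        assignment[rank] = list(range(dp_rank, num_samples, dp_size))
    return assignment
```

With 8 samples, dp_size=2 and cp_size=2, ranks 0 and 1 both draw [0, 2, 4, 6] and ranks 2 and 3 both draw [1, 3, 5, 7]. Setting cp_size=1 reproduces the pure-DP case where every rank draws a distinct shard.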

nemo_automodel.components.datasets.llm.eagle3.build_eagle3_token_mapping(
dataloader: torch.utils.data.DataLoader,
target_vocab_size: int,
draft_vocab_size: int | None,
special_token_ids: list[int] | None = None
) -> tuple[torch.Tensor, torch.Tensor]

Build draft-vocab mapping tensors from supervised token frequency.

Counts are accumulated as a dense [target_vocab_size] tensor and all_reduce summed across ranks when torch.distributed is initialized, so every rank ends up with the same draft vocabulary.

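Returns: tuple[torch.Tensor, torch.Tensor]

Tuple (selected_token_ids, selected_token_mask). selected_token_mask is derivable from selected_token_ids plus target_vocab_size (see save_eagle3_token_mapping).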
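A rough single-process illustration of frequency-based selection follows, written with plain lists instead of tensors. Several things here are assumptions rather than facts about the module: that "supervised" means positions where loss_mask is nonzero, that ties are broken by token id, and that the selected ids are returned sorted. special_token_ids and the cross-rank all_reduce are not modeled.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def build_token_mapping(
    batches: Iterable[Dict[str, List[List[int]]]],
    target_vocab_size: int,
    draft_vocab_size: Optional[int],
) -> Tuple[List[int], List[bool]]:
    """Hypothetical sketch: pick the most frequent supervised token ids."""
    counts: Counter = Counter()
    for batch in batches:
        for ids, mask in zip(batch["input_ids"], batch["loss_mask"]):
            # Assumption: only positions with a nonzero loss_mask are counted.
            counts.update(tok for tok, keep in zip(ids, mask) if keep)

    if draft_vocab_size is None or draft_vocab_size >= target_vocab_size:
        # Fallback branch: keep the full target vocabulary.
        selected = list(range(target_vocab_size))
    else:
        # Rank all ids by descending count, breaking ties by id (assumption).
        ranked = sorted(range(target_vocab_size), key=lambda t: (-counts[t], t))
        selected = sorted(ranked[:draft_vocab_size])

    chosen = set(selected)
    mask = [t in chosen for t in range(target_vocab_size)]
    return selected, mask
```

In a multi-rank run the per-rank counts would be summed before the ranking step, which is what makes the resulting vocabulary identical everywhere.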

nemo_automodel.components.datasets.llm.eagle3.build_packed_eagle3_dataset(
source_dataset,
packed_sequence_size: int,
pad_token_id: int
) -> list[dict[str, list[int]]]

Greedily pack variable-length chat samples into rows of packed_sequence_size.

Each source sample is one document; documents are concatenated into a fixed-width row with position_ids reset per document and trailing pad folded into the final document (so seq_lens sums to the row width).

Cross-document leakage at training-time-test (TTT) boundaries is handled by doc_remaining[t], the number of real tokens after slot t within its document. The trainer supervises slot t at step k to predict the token k+1 positions ahead, which is valid iff k < doc_remaining[t]. This masks every supervision target that would cross into another document or into padding; packing creates many such boundaries per row.

Returns a list of packed-row dicts. The keys input_ids, loss_mask, attention_mask, position_ids, and doc_remaining each hold packed_sequence_size entries; seq_lens holds the per-document padded lengths.
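The packing scheme described above can be sketched in plain Python. This is an illustration under stated assumptions, not the module's implementation: documents are packed in arrival order, over-long documents are truncated to the row width, padding slots get attention_mask 0, and padding position_ids continue counting from the final document. The names pack_samples, _emit_row, and ttt_valid are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, List[int]]


def _emit_row(docs: List[Row], width: int, pad_token_id: int) -> Row:
    """Concatenate docs into one fixed-width row and pad the tail."""
    row: Row = {key: [] for key in (
        "input_ids", "loss_mask", "attention_mask",
        "position_ids", "doc_remaining", "seq_lens",
    )}
    for doc in docs:
        n = len(doc["input_ids"])
        row["input_ids"] += doc["input_ids"]
        row["loss_mask"] += doc["loss_mask"]
        row["attention_mask"] += [1] * n
        row["position_ids"] += list(range(n))  # reset per document
        row["doc_remaining"] += list(range(n - 1, -1, -1))  # real tokens after slot
        row["seq_lens"].append(n)
    pad = width - len(row["input_ids"])
    last_len = row["seq_lens"][-1]
    row["input_ids"] += [pad_token_id] * pad
    row["loss_mask"] += [0] * pad
    row["attention_mask"] += [0] * pad  # assumption: padding is not attended
    row["position_ids"] += list(range(last_len, last_len + pad))  # assumption
    row["doc_remaining"] += [0] * pad  # nothing real follows a pad slot
    row["seq_lens"][-1] += pad  # fold trailing pad into the final document
    return row


def pack_samples(samples: List[Row], width: int, pad_token_id: int) -> List[Row]:
    """Greedily fill rows; start a new row when the next document does not fit."""
    rows: List[Row] = []
    current: List[Row] = []
    used = 0
    for sample in samples:
        # Assumption: documents longer than the row are truncated.
        doc = {"input_ids": sample["input_ids"][:width],
               "loss_mask": sample["loss_mask"][:width]}
        n = len(doc["input_ids"])
        if n == 0:
            continue
        if used + n > width:
            rows.append(_emit_row(current, width, pad_token_id))
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        rows.append(_emit_row(current, width, pad_token_id))
    return rows


def ttt_valid(row: Row, t: int, k: int) -> bool:
    """Slot t may be supervised at step k only if k < doc_remaining[t]."""
    return k < row["doc_remaining"][t]
```

Packing documents of lengths 3, 2, and 4 into rows of width 6 yields two rows. The first has seq_lens [3, 3] (one pad slot folded into the second document) and doc_remaining [2, 1, 0, 1, 0, 0], so slot 0 is valid at steps 0 and 1 but not at step 2, which would reach into the next document.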

nemo_automodel.components.datasets.llm.eagle3.load_eagle3_token_mapping(
path: str,
target_vocab_size: int,
draft_vocab_size: int | None
) -> tuple[torch.Tensor, torch.Tensor] | None

Load a cached draft-vocab mapping, or None if absent / incompatible.

The cache is keyed only on target_vocab_size and the resulting draft vocab size — it does NOT fingerprint the dataset or tokenizer. A cache built from a different dataset still loads cleanly, so a caller that changes the training data must point selected_token_ids_path at a fresh location (or delete the file). Returns None — so the caller rebuilds — when the file is missing, unreadable, or its stored vocab sizes do not match the config.
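The compatibility check can be illustrated with a small stand-in that stores the cache as JSON. The on-disk format, the key names, and the function name load_selected_ids are all assumptions made for this sketch; the page does not say how the real cache is serialized.

```python
import json
from typing import List, Optional


def load_selected_ids(
    path: str, target_vocab_size: int, expected_draft_size: int
) -> Optional[List[int]]:
    """Return cached ids, or None if the cache is absent or incompatible."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            payload = json.load(f)
        ids = payload["selected_token_ids"]
        stored_target = payload["target_vocab_size"]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, unreadable file, bad JSON, or unexpected structure.
        return None
    if not isinstance(ids, list):
        return None
    # Only the two vocab sizes are checked. Nothing ties the cache to a
    # particular dataset or tokenizer, which is the caveat described above.
    if stored_target != target_vocab_size or len(ids) != expected_draft_size:
        return None
    return ids
```

Because the check sees only sizes, a cache produced from different training data with the same two sizes would pass it, so pointing at a fresh path after changing the data remains the caller's job.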

nemo_automodel.components.datasets.llm.eagle3.load_or_build_eagle3_token_mapping(
dataloader: torch.utils.data.DataLoader,
target_vocab_size: int,
draft_vocab_size: int | None,
special_token_ids: list[int] | None = None,
cache_path: str | None = None
) -> tuple[torch.Tensor, torch.Tensor]

Build the draft-vocab mapping, reusing a cached copy at cache_path.

With cache_path set, present, and compatible, loads the mapping and skips the full-dataset frequency scan build_eagle3_token_mapping performs. Otherwise builds the mapping and — on rank 0 — writes it to cache_path for next time. With cache_path=None this is exactly build_eagle3_token_mapping.
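The control flow reads roughly as follows. The sketch injects build, load, and save as callables so it stays self-contained; those parameters and the explicit rank argument are hypothetical, and the real function additionally has to keep the load-vs-build decision consistent across ranks.

```python
from typing import Callable, List, Optional


def load_or_build(
    build: Callable[[], List[int]],
    load: Callable[[str], Optional[List[int]]],
    save: Callable[[str, List[int]], None],
    cache_path: Optional[str],
    rank: int = 0,
) -> List[int]:
    """Hypothetical sketch of the load-or-build decision."""
    if cache_path is None:
        # No cache configured: behave exactly like a plain build.
        return build()
    cached = load(cache_path)
    if cached is not None:
        # Compatible cache found: skip the full-dataset frequency scan.
        return cached
    ids = build()
    if rank == 0:
        # Only rank 0 writes, so ranks never race on the same file.
        save(cache_path, ids)
    return ids
```

A dict makes a convenient fake cache when trying this out: pass store.get as load and store.__setitem__ as save.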

nemo_automodel.components.datasets.llm.eagle3.save_eagle3_token_mapping(
path: str,
selected_token_ids: torch.Tensor,
target_vocab_size: int
) -> None

Persist the draft-vocab selection so future runs skip the frequency scan.

Written atomically (.tmp + os.replace) so a crash mid-write never leaves a half-written file a later run would load. Only selected_token_ids is stored — selected_token_mask is fully derivable from it plus target_vocab_size.
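The write-then-rename pattern and the mask derivation can be sketched with the standard library. JSON serialization, the key names, and the helper names are assumptions for illustration; only the .tmp + os.replace approach and the "store ids only" choice come from the description above.

```python
import json
import os
from typing import List


def save_selected_ids(path: str, selected_token_ids: List[int], target_vocab_size: int) -> None:
    """Write to a sibling .tmp file, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(
            {"selected_token_ids": selected_token_ids,
             "target_vocab_size": target_vocab_size},
            f,
        )
        f.flush()
        os.fsync(f.fileno())  # push bytes to disk before the rename
    # os.replace overwrites the destination in one step, so readers see
    # either the old complete file or the new complete file.
    os.replace(tmp_path, path)


def mask_from_ids(selected_token_ids: List[int], target_vocab_size: int) -> List[bool]:
    """Rebuild the boolean mask from the stored ids."""
    chosen = set(selected_token_ids)
    return [t in chosen for t in range(target_vocab_size)]
```

Storing only the ids keeps the file small and avoids the possibility of ids and mask disagreeing after a partial edit.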

nemo_automodel.components.datasets.llm.eagle3.logger = logging.getLogger(__name__)