nemo_automodel.components.datasets.llm.eagle3
nemo_automodel.components.datasets.llm.eagle3
Data helpers for minimal EAGLE-3 training.
Module Contents
Functions
Data
API
Rank 0 loads (and validates) the cached ids; broadcast the result to all ranks.
Only rank 0 touches the filesystem, so the load-vs-build decision is identical
on every rank even when cache_path lives on a node-local (non-shared)
filesystem. This matters because build_eagle3_token_mapping issues a
collective all_reduce: if some ranks loaded a cache while others rebuilt,
that collective would mismatch and hang. Returns the ids (cpu, long) or
None (rebuild on every rank).
Return how many ids build_eagle3_token_mapping yields for this config.
Mirrors its selection branch: a None or too-large draft_vocab_size
falls back to the full target vocab.
Collate packed rows; ragged seq_lens is 0-padded to [B, max_docs].
Stack a batch of pre-padded unshifted chat samples.
Build a dataloader backed by the repo’s chat formatting utilities.
packed_sequence_size > 0 (EAGLE-3 only) enables sequence packing (see
:func:build_packed_eagle3_dataset), removing the padding waste of the
default padding="max_length" path; == 0 keeps the original behavior.
dp_mesh (the “dp” device submesh) is required for context parallelism: the
sampler then distributes by data-parallel rank so the cp_size ranks within
a dp group receive the identical sample (CP shards its sequence across them).
When None the sampler falls back to the full-world default (pure DP).
Build draft-vocab mapping tensors from supervised token frequency.
Counts are accumulated as a dense [target_vocab_size] tensor and
all_reduce summed across ranks when torch.distributed is
initialized, so every rank ends up with the same draft vocabulary.
Returns: torch.Tensor
Tuple (selected_token_ids, selected_token_mask) where:
Greedily pack variable-length chat samples into rows of packed_sequence_size.
Each source sample is one document; documents are concatenated into a
fixed-width row with position_ids reset per document and trailing pad
folded into the final document (so seq_lens sums to the row width).
Cross-document leakage at TTT boundaries is handled by doc_remaining[t]
(real tokens after slot t within its document): the trainer supervises
slot t at step k to predict k+1 ahead, valid iff
k < doc_remaining[t]. This masks every cross-document / into-padding
supervision — packing creates many such boundaries per row.
Returns a list of packed-row dicts with keys input_ids, loss_mask,
attention_mask, position_ids, doc_remaining (length
packed_sequence_size) and seq_lens (per-document padded lengths).
Load a cached draft-vocab mapping, or None if absent / incompatible.
The cache is keyed only on target_vocab_size and the resulting draft
vocab size — it does NOT fingerprint the dataset or tokenizer. A cache built
from a different dataset still loads cleanly, so a caller that changes the
training data must point selected_token_ids_path at a fresh location (or
delete the file). Returns None — so the caller rebuilds — when the file
is missing, unreadable, or its stored vocab sizes do not match the config.
Build the draft-vocab mapping, reusing a cached copy at cache_path.
With cache_path set, present, and compatible, loads the mapping and skips
the full-dataset frequency scan build_eagle3_token_mapping performs.
Otherwise builds the mapping and — on rank 0 — writes it to cache_path
for next time. With cache_path=None this is exactly
build_eagle3_token_mapping.
Persist the draft-vocab selection so future runs skip the frequency scan.
Written atomically (.tmp + os.replace) so a crash mid-write never
leaves a half-written file a later run would load. Only selected_token_ids
is stored — selected_token_mask is fully derivable from it plus
target_vocab_size.