nemo_automodel.components.datasets.llm.eagle3
Data helpers for minimal EAGLE-3 training.
Module Contents
Functions
_stack_batch | Stack a batch of pre-padded unshifted chat samples. |
build_eagle3_dataloader | Build a dataloader backed by the repo’s chat formatting utilities. |
build_eagle3_token_mapping | Build draft-vocab mapping tensors from supervised token frequency. |
API
- nemo_automodel.components.datasets.llm.eagle3._stack_batch(
- features: list[dict[str, Any]],
- )
Stack a batch of pre-padded unshifted chat samples.
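Because the samples are already padded to a common length, stacking can be as simple as one `torch.stack` per field. The following is a minimal sketch of that idea, not the module's actual implementation: the function name `stack_batch_sketch` and the field names `input_ids` and `loss_mask` are hypothetical, and the sketch assumes every feature dict carries the same keys with equal-length integer sequences.

```python
from typing import Any

import torch


def stack_batch_sketch(features: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
    """Stack equal-length per-sample fields into [batch, seq_length] tensors.

    Hypothetical stand-in for a collate function; assumes all samples share
    the same keys and were padded to the same length beforehand.
    """
    keys = features[0].keys()
    return {
        # torch.as_tensor accepts both Python lists and existing tensors.
        key: torch.stack([torch.as_tensor(f[key], dtype=torch.long) for f in features])
        for key in keys
    }


# Two pre-padded samples of length 3 (field names are illustrative only).
example_features = [
    {"input_ids": [5, 6, 0], "loss_mask": [0, 1, 0]},
    {"input_ids": [7, 8, 9], "loss_mask": [0, 1, 1]},
]
example_batch = stack_batch_sketch(example_features)
```

A function of this shape can be passed as `collate_fn` to `torch.utils.data.DataLoader`; no shifting of labels happens here, which matches the "unshifted" wording above.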
- nemo_automodel.components.datasets.llm.eagle3.build_eagle3_dataloader(
- *,
- data_path: str,
- tokenizer,
- seq_length: int,
- batch_size: int,
- shuffle: bool,
- num_workers: int = 0,
- split: str | None = None,
- distributed: bool = False,
- shuffle_seed: int | None = 42,
- )
Build a dataloader backed by the repo’s chat formatting utilities.
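The `shuffle`, `distributed`, and `shuffle_seed` parameters suggest the usual PyTorch pattern of a seeded sampler, optionally sharded across ranks. The sketch below illustrates that pattern only; it is an assumption about how such options are commonly wired, not a description of this function's internals. `ToyDataset`, `make_loader_sketch`, and the explicit `rank` and `world_size` arguments are hypothetical names introduced for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler


class ToyDataset(Dataset):
    """Hypothetical dataset that returns its own indices."""

    def __init__(self, n: int) -> None:
        self.items = list(range(n))

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> int:
        return self.items[idx]


def make_loader_sketch(dataset, *, batch_size, shuffle, num_workers=0,
                       distributed=False, shuffle_seed=42, rank=0, world_size=1):
    """Assumed wiring: seeded shuffle, sharded by rank when distributed."""
    if distributed:
        # Passing num_replicas and rank explicitly avoids needing an
        # initialized process group in this sketch.
        sampler = DistributedSampler(
            dataset,
            num_replicas=world_size,
            rank=rank,
            shuffle=shuffle,
            seed=0 if shuffle_seed is None else shuffle_seed,
        )
        return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                          num_workers=num_workers)
    generator = None
    if shuffle and shuffle_seed is not None:
        # A dedicated generator makes the shuffle order reproducible.
        generator = torch.Generator()
        generator.manual_seed(shuffle_seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      generator=generator, num_workers=num_workers)


def epoch_indices(loader) -> list[int]:
    """Flatten one pass over the loader into a list of sample indices."""
    return [int(i) for batch in loader for i in batch]
```

With this wiring, two loaders built with the same seed visit samples in the same order, and in the distributed case each rank sees a disjoint shard of the dataset.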
- nemo_automodel.components.datasets.llm.eagle3.build_eagle3_token_mapping(
- dataloader: torch.utils.data.DataLoader,
- *,
- target_vocab_size: int,
- draft_vocab_size: int | None,
- special_token_ids: list[int] | None = None,
- )
Build draft-vocab mapping tensors from supervised token frequency.
Counts are accumulated as a dense [target_vocab_size] tensor and all_reduce-summed across ranks when torch.distributed is initialized, so every rank ends up with the same draft vocabulary.
- Returns:
(selected_token_ids, selected_token_mask), where selected_token_ids has shape [draft_vocab_size] and selected_token_mask has shape [target_vocab_size].
- Return type:
Tuple
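The frequency-based selection described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the library's code: the function name `token_mapping_sketch`, the batch keys `input_ids` and `loss_mask`, the treatment of `draft_vocab_size=None` as "keep the full vocabulary", and the way special tokens are forced into the selection are all assumptions made for the example.

```python
import torch
import torch.distributed as dist


def token_mapping_sketch(batches, *, target_vocab_size, draft_vocab_size,
                         special_token_ids=None):
    """Pick the most frequent supervised tokens as the draft vocabulary."""
    counts = torch.zeros(target_vocab_size, dtype=torch.long)
    for batch in batches:
        # Assumption: loss_mask marks the supervised positions of input_ids.
        supervised = batch["input_ids"][batch["loss_mask"].bool()]
        counts += torch.bincount(supervised, minlength=target_vocab_size)

    # Sum counts across ranks so every rank selects the same tokens.
    # (With a GPU-only backend the tensor would have to live on the GPU.)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)

    if special_token_ids:
        # Assumption: special tokens are always kept, here by ranking them first.
        counts[torch.tensor(special_token_ids)] = counts.max() + 1

    # Assumption: None means no reduction of the vocabulary.
    k = target_vocab_size if draft_vocab_size is None else draft_vocab_size
    top = torch.topk(counts, k=k).indices
    selected_token_ids = torch.sort(top).values            # shape [k]
    selected_token_mask = torch.zeros(target_vocab_size, dtype=torch.bool)
    selected_token_mask[selected_token_ids] = True         # shape [target_vocab_size]
    return selected_token_ids, selected_token_mask


# Toy data: tokens 1, 2, 3 appear 3, 2, 1 times at supervised positions.
toy_batches = [{
    "input_ids": torch.tensor([[1, 1, 2, 4], [3, 1, 2, 4]]),
    "loss_mask": torch.tensor([[1, 1, 1, 0], [1, 1, 1, 0]]),
}]
toy_ids, toy_mask = token_mapping_sketch(
    toy_batches, target_vocab_size=6, draft_vocab_size=2
)
```

In this toy run the two most frequent supervised tokens are 1 and 2, so `toy_ids` holds those two ids and `toy_mask` is true only at positions 1 and 2. Token 4 is never counted because it only appears at masked-out positions.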