bridge.data.datasets.sft
#
Module Contents#
Classes#
Functions#
get_dataset_root – Returns the root directory for NeMo datasets, creating it if it doesn't exist.
create_sft_dataset – Creates and returns a supervised fine-tuning (SFT) dataset instance.
Data#
API#
- bridge.data.datasets.sft.DEFAULT_NEMO_CACHE_HOME#
None
- bridge.data.datasets.sft.NEMO_CACHE_HOME#
‘Path(…)’
- bridge.data.datasets.sft.DEFAULT_NEMO_DATASETS_CACHE#
None
- bridge.data.datasets.sft.NEMO_DATASETS_CACHE#
‘Path(…)’
- bridge.data.datasets.sft.DEFAULT_NEMO_MODELS_CACHE#
None
- bridge.data.datasets.sft.NEMO_MODELS_CACHE#
‘Path(…)’
- bridge.data.datasets.sft.logger#
‘getLogger(…)’
- bridge.data.datasets.sft.PREFIX_STR#
‘\x00’
- bridge.data.datasets.sft.__idx_version__#
‘0.2’
- bridge.data.datasets.sft.__idx_suffix__#
‘idx’
- bridge.data.datasets.sft.get_dataset_root(name: str) → pathlib.Path#
Returns the root directory for NeMo datasets, creating it if it doesn’t exist.
- Parameters:
name (str) – The name of the dataset, used to create a subdirectory within the NeMo datasets cache.
- Returns:
The path to the dataset’s root directory.
- Return type:
Path
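A minimal usage sketch (hedged: the import path below is taken from this page's module name and may differ in your installation):

```python
from pathlib import Path

from bridge.data.datasets.sft import get_dataset_root

# Resolve (and create, if missing) the cache directory for a dataset
# named "squad" under the NeMo datasets cache.
squad_root: Path = get_dataset_root("squad")
print(squad_root)  # <NEMO_DATASETS_CACHE>/squad
```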
- bridge.data.datasets.sft.create_sft_dataset(
- path: pathlib.Path,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- seq_length: int = 2048,
- add_bos: bool = False,
- add_eos: bool = True,
- add_sep: bool = False,
- seed: int = 1234,
- label_key: str = 'output',
- answer_only_loss: bool = True,
- truncation_field: str = 'input',
- pad_to_max_length: bool = False,
- index_mapping_dir: Optional[str] = None,
- prompt_template: str = '{input} {output}',
- truncation_method: str = 'right',
- memmap_workers: int = 2,
- hf_dataset: bool = False,
- global_sample_mapping: bool = False,
- get_attention_mask_from_fusion: bool = True,
- pack_metadata_file_path: pathlib.Path = None,
- pad_cu_seqlens: bool = False,
- chat: bool = False,
- **kwargs,
Creates and returns a supervised fine-tuning (SFT) dataset instance.
This function acts as a factory for different types of SFT datasets based on the input parameters. It can create standard SFT datasets, chat-specific datasets, or packed sequence datasets.
- Parameters:
path (Path) – Path to the dataset file. For packed datasets, this should be a .npy file.
tokenizer (MegatronTokenizer) – The tokenizer to use for tokenizing the data.
seq_length (int, optional) – Maximum sequence length for each example. Defaults to 2048.
add_bos (bool, optional) – Whether to add a beginning-of-sentence token. Defaults to False.
add_eos (bool, optional) – Whether to add an end-of-sentence token. Defaults to True.
add_sep (bool, optional) – Whether to add a separation token between prompt and answer. Defaults to False.
seed (int, optional) – Random seed for data shuffling. Defaults to 1234.
label_key (str, optional) – The key in the dataset corresponding to the label/output. Defaults to “output”.
answer_only_loss (bool, optional) – If True, compute loss only on the answer part. Defaults to True.
truncation_field (str, optional) – Field(s) to truncate if the combined length exceeds seq_length. Comma-separated if multiple. Defaults to "input".
pad_to_max_length (bool, optional) – Whether to pad all samples to max_seq_length. Defaults to False.
index_mapping_dir (Optional[str], optional) – Directory to store/load index mapping files. Defaults to None.
prompt_template (str, optional) – F-string template for combining input fields. Example: “{input} {output}”. Defaults to “{input} {output}”.
truncation_method (str, optional) – Method for truncation (‘left’ or ‘right’). Defaults to “right”.
memmap_workers (int, optional) – Number of workers for memory-mapped dataset loading. Defaults to 2.
hf_dataset (bool, optional) – Whether to load the dataset using HuggingFace's datasets library. Defaults to False.
global_sample_mapping (bool, optional) – Whether to use a global sample mapping for shuffling across all data, or shuffle within each epoch. Defaults to False.
get_attention_mask_from_fusion (bool) – If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.
pack_metadata_file_path (Path, optional) – Path to the metadata file for packed datasets. Required if pad_cu_seqlens is True. Defaults to None.
pad_cu_seqlens (bool, optional) – Whether to pad cu_seqlens for packed datasets, required for CUDA graphs. Defaults to False.
chat (bool, optional) – If True, creates a GPTSFTChatDataset. Defaults to False.
**kwargs – Additional keyword arguments passed to the specific dataset class constructor.
- Returns:
An instance of the appropriate SFT dataset class.
- Return type:
GPTSFTDataset (or a subclass such as GPTSFTPackedDataset or GPTSFTChatDataset, depending on the arguments)
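A hedged usage sketch of the factory. The file path is a placeholder, and tokenizer is assumed to be an already-constructed MegatronTokenizer instance:

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

dataset = create_sft_dataset(
    path=Path("/data/sft/train.jsonl"),  # hypothetical JSONL file
    tokenizer=tokenizer,                 # assumed MegatronTokenizer instance
    seq_length=4096,
    prompt_template="{input} {output}",
    answer_only_loss=True,
)

# Passing chat=True would instead return a GPTSFTChatDataset; pointing `path`
# at a packed .npy file targets the packed-sequence dataset instead.
```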
- class bridge.data.datasets.sft.GPTSFTDataset(
- file_path: str,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- max_seq_length: int = 1024,
- min_seq_length: int = 1,
- pad_seq_length_to_mult: int = 16,
- add_bos: bool = False,
- add_eos: bool = True,
- add_sep: bool = False,
- sep_id: int = None,
- max_num_samples: int = None,
- seed: int = 1234,
- label_key: str = 'answer',
- answer_only_loss: bool = True,
- truncation_field: str = 'text',
- pad_to_max_length: bool = False,
- index_mapping_dir: str = None,
- prompt_template: str = None,
- virtual_tokens: int = 0,
- tokens_to_generate: int = 0,
- memmap_workers: Optional[int] = None,
- hf_dataset: bool = False,
- global_sample_mapping: bool = False,
- truncation_method: str = 'right',
- special_tokens: Optional[Mapping[str, str]] = None,
- is_test: bool = False,
- output_original_text: bool = False,
- ceil_to_power_2: bool = False,
- get_attention_mask_from_fusion: bool = True,
- sanity_check_dist_workers: bool = True,
Bases:
torch.utils.data.Dataset
Initialization
file_path: Path to a JSONL GPT supervised fine-tuning dataset. Data is formatted as multiple JSON lines, each of the form {'input': 'John von Neumann \n Von Neumann made fundamental contributions … Q: What did the math of artificial viscosity do?', 'output': 'smoothed the shock transition without sacrificing basic physics'}.
tokenizer: Tokenizer for the dataset. Instance of a class that inherits MegatronTokenizer (e.g. SentencePiece).
max_seq_length (int): Maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
min_seq_length (int): Minimum length of each data example in the dataset. Data examples will be dropped if they do not meet the minimum length requirement.
add_bos (bool): Whether to add a beginning-of-sentence token to each data example.
add_eos (bool): Whether to add an end-of-sentence token to each data example.
add_sep (bool): Whether to add a separation token to each data example (goes between prompt and answer).
tokens_to_generate (int): (Inference only) Number of tokens to generate during inference.
seed: Random seed for data shuffling.
max_num_samples: Maximum number of samples to load. This can be greater than the dataset length if you want to oversample data. If None, all samples will be loaded.
label_key: Key to use for the label in your JSONL file.
answer_only_loss: If True, compute the loss only on the answer part of the input. If False, compute the loss on the entire input.
truncation_field: Field to use for truncation (options: keys in prompt_template). This field is truncated if the combined length exceeds the max sequence length.
pad_to_max_length: Whether to pad the input to the max sequence length. If False, pads to the max length of the current batch.
index_mapping_dir: Directory to save the index mapping to. If None, writes to the same folder as the dataset.
prompt_template: Prompt template to inject via an f-string, formatted like 'Q: {context_key} \n A: {label_key}'.
hf_dataset: Whether to load the JSON file with the HuggingFace datasets library. Otherwise, loads the JSONL file with JSONLMemMapDataset.
global_sample_mapping: Whether to shuffle all data together, or shuffle the dataset within each epoch.
truncation_method: Position to truncate from. Options: ['left', 'right'].
special_tokens: Special tokens for the chat prompts, a dictionary of {token_type: token}. Default: {'system_turn_start': '<extra_id_0>', 'turn_start': '<extra_id_1>', 'label_start': '<extra_id_2>', 'end_of_turn': '\n', 'end_of_name': '\n'}.
is_test: Whether this dataset is the test split.
output_original_text (bool): If True, keeps the original text in the output alongside the tokenized ids.
get_attention_mask_from_fusion (bool): If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.
sanity_check_dist_workers (bool): If True, runs a sanity check across workers when building the mapping.
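For reference, a hedged sketch of producing such a JSONL file and constructing the dataset directly (create_sft_dataset above is the usual entry point; the tokenizer instance is assumed to exist):

```python
import json

from bridge.data.datasets.sft import GPTSFTDataset

# One JSON object per line; keys must match prompt_template and label_key.
rows = [
    {
        "input": "Q: What did the math of artificial viscosity do?",
        "output": "smoothed the shock transition without sacrificing basic physics",
    },
]
with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

ds = GPTSFTDataset(
    file_path="train.jsonl",
    tokenizer=tokenizer,                 # assumed MegatronTokenizer instance
    max_seq_length=1024,
    prompt_template="{input} {output}",
    label_key="output",
    truncation_field="input",
)
```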
- _load_dataset()#
- _maybe_validate_prompt_template()#
- _build_samples_mapping()#
- __len__()#
Return the total number of samples in this dataset.
- __getitem__(idx)#
- _separate_template(prompt_template_values: List[str])#
Combine contexts and label based on prompt_template into a list of strings and a list of keys.
- Parameters:
prompt_template_values (List[str]) – the list of context and label strings extracted from the jsonl file with prompt_template_keys.
- Returns:
template_strings (List[str]): the separated prompt_template with the context/label placeholders filled with the corresponding strings; template_strings_keys (List[str]): strings that point to placeholder keys or mark literal template segments.
Examples:
prompt_template = 'Context: {context} Question: {question} Answer: {label}'
prompt_template_values = ['xxx', 'yyy', 'zzz']
With tokenizer.space_sensitive = True:
template_strings = ['Context:', ' xxx', ' Question:', ' yyy', ' Answer:', ' zzz']
With tokenizer.space_sensitive = False:
template_strings = ['Context:', ' xxx', 'Question:', 'yyy', 'Answer:', 'zzz']
template_strings_keys = ['', 'context', '', 'question', '', 'label']
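A rough, hypothetical sketch of the splitting behaviour illustrated above, using a regular expression over the {key} placeholders purely for illustration (this is not the library's implementation, and the exact spacing depends on tokenizer.space_sensitive):

```python
import re
from typing import Dict, List, Tuple

def split_template(prompt_template: str, values: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split a prompt template into literal chunks and filled placeholders."""
    strings: List[str] = []
    keys: List[str] = []
    last = 0
    for m in re.finditer(r"\{(\w+)\}", prompt_template):
        literal = prompt_template[last:m.start()]
        if literal.strip():
            strings.append(literal.strip())       # literal template segment
            keys.append("")
        strings.append(" " + values[m.group(1)])  # filled placeholder
        keys.append(m.group(1))
        last = m.end()
    return strings, keys

strings, keys = split_template(
    "Context: {context} Question: {question} Answer: {label}",
    {"context": "xxx", "question": "yyy", "label": "zzz"},
)
# strings == ['Context:', ' xxx', 'Question:', ' yyy', 'Answer:', ' zzz']
# keys    == ['', 'context', '', 'question', '', 'label']
```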
- _multiple_truncation(
- template_ids: List[List[int]],
- template_ids_keys: List[str],
Calculate total tokens and truncate multiple contexts in truncation_fields.
- Parameters:
template_ids (List[List[int]]) – the list of separate prompt_template ids.
template_ids_keys (List[str]) – the list of placeholder keys or literal template markers (used to check keys against truncation_fields).
- Returns:
context_ids (List[int]): all context ids; label_ids (List[int]): all label ids.
- _truncation(ids, expect_length)#
- _process_example(example)#
Create an example by concatenating text and answer. Truncation is carried out when needed, but it is performed only on the prompt side. BOS, EOS, and SEP are added if specified.
- _maybe_cast_to_list(x)#
- _ceil_to_nearest(n, m)#
- _collate_item(item, max_length, pad_id)#
- _build_loss_mask(processed_example)#
Pad input_ids in batch to max batch length while building loss mask
- _create_attention_mask(max_length)#
Creates an upper-triangular causal attention mask.
- Parameters:
max_length – The sequence length of the (square) causal mask to create.
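For intuition, a standalone sketch of a causal mask of this shape built with torch.tril (the exact tensor layout and dtype this class produces may differ):

```python
import torch

max_length = 6

# Lower-triangular ones: position i may attend to positions <= i.
causal = torch.tril(torch.ones(max_length, max_length))

# Many Megatron-style code paths use a boolean mask where True marks the
# *disallowed* positions, i.e. the strict upper triangle.
attention_mask = causal < 0.5
print(attention_mask.int())
```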
- collate_fn(batch)#
Collate a list of samples into a batch dictionary for model training or evaluation.
This function takes a list of individual processed samples (from __getitem__) and groups them into a batch. It handles padding of sequences to the maximum length found in the batch (or self.max_seq_length if pad_to_max_length is True), and prepares all necessary tensors for the model.
- Parameters:
batch (List[dict]) – A list of dictionaries, where each dictionary is a sample processed by _process_example.
- Returns:
A dictionary of batched tensors ready for model input. Key tensors include 'tokens', 'labels', 'loss_mask', 'position_ids', and 'attention_mask'.
- Return type:
dict
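A hedged sketch of wiring the dataset into a torch DataLoader with this collate_fn (the dataset ds is assumed to be constructed as in the earlier example):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    ds,                       # assumed GPTSFTDataset instance (see above)
    batch_size=8,
    shuffle=False,            # ordering is typically driven by the samples mapping
    num_workers=2,
    collate_fn=ds.collate_fn,
)

batch = next(iter(loader))
# Per the docs above, the batch dict includes 'tokens', 'labels',
# 'loss_mask', 'position_ids', and 'attention_mask'.
print({k: getattr(v, "shape", v) for k, v in batch.items()})
```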
- class bridge.data.datasets.sft.GPTSFTPackedDataset(
- file_path: str,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- return_cu_seqlen: bool = True,
- pad_cu_seqlens: bool = False,
- pack_metadata_file_path: Optional[str] = None,
- **kwargs,
Bases:
bridge.data.datasets.sft.GPTSFTDataset
Initialization
file_path: See file_path in the parent class.
tokenizer: See tokenizer in the parent class.
return_cu_seqlen: Whether to return cu_seqlen to pass to the model. Having cu_seqlen in the model input enables the THD attention kernel, which is the correct format for training with packed sequences to prevent cross-sequence attention. This flag should be True unless you have a specific use case.
- __getitem__(idx)#
- _load_dataset()#
- _build_samples_mapping()#
- _build_loss_mask(processed_example)#
- _maybe_cast_to_list(x)#
- collate_fn(batch)#
Collates a list of packed sequence samples into a batch for the model.
This method is specifically designed for GPTSFTPackedDataset. It takes a list of packed sequence items (as returned by __getitem__) and prepares a batch of tensors. This includes handling cu_seqlens, which are crucial for the efficient processing of packed sequences with kernels like THD attention.
- Parameters:
batch (List[dict]) – A list of packed sequence samples.
- Returns:
A dictionary of batched tensors, including 'tokens', 'labels', 'loss_mask', 'position_ids', and potentially 'cu_seqlens', 'cu_seqlens_argmin', and 'max_seqlen' if return_cu_seqlen is True.
- Return type:
dict
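A hedged sketch of creating a packed-sequence dataset through the create_sft_dataset factory (file names are placeholders; the .npy file and metadata are assumed to come from a separate sequence-packing preprocessing step):

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

packed_ds = create_sft_dataset(
    path=Path("/data/sft/train_packed.npy"),                   # hypothetical packed file
    tokenizer=tokenizer,                                       # assumed MegatronTokenizer
    seq_length=4096,
    pack_metadata_file_path=Path("/data/sft/metadata.jsonl"),  # hypothetical metadata file
    pad_cu_seqlens=True,  # requires the metadata file, per the docs above
)

# Batches collated from this dataset additionally carry 'cu_seqlens' and
# related fields when return_cu_seqlen is True.
```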
- class bridge.data.datasets.sft.GPTSFTChatDataset(
- file_path: str,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- max_seq_length: int = 1024,
- min_seq_length: int = 1,
- pad_seq_length_to_mult: int = 16,
- add_bos: bool = False,
- add_eos: bool = True,
- add_sep: bool = False,
- sep_id: int = None,
- max_num_samples: int = None,
- seed: int = 1234,
- label_key: str = 'answer',
- answer_only_loss: bool = True,
- truncation_field: str = 'text',
- pad_to_max_length: bool = False,
- index_mapping_dir: str = None,
- prompt_template: str = None,
- virtual_tokens: int = 0,
- tokens_to_generate: int = 0,
- memmap_workers: Optional[int] = None,
- hf_dataset: bool = False,
- global_sample_mapping: bool = False,
- truncation_method: str = 'right',
- special_tokens: Optional[Mapping[str, str]] = None,
- is_test: bool = False,
- output_original_text: bool = False,
- ceil_to_power_2: bool = False,
- get_attention_mask_from_fusion: bool = True,
- sanity_check_dist_workers: bool = True,
Bases:
bridge.data.datasets.sft.GPTSFTDataset
Initialization
file_path: Path to a JSONL GPT supervised fine-tuning dataset. Data is formatted as multiple JSON lines, each of the form {'input': 'John von Neumann \n Von Neumann made fundamental contributions … Q: What did the math of artificial viscosity do?', 'output': 'smoothed the shock transition without sacrificing basic physics'}.
tokenizer: Tokenizer for the dataset. Instance of a class that inherits MegatronTokenizer (e.g. SentencePiece).
max_seq_length (int): Maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
min_seq_length (int): Minimum length of each data example in the dataset. Data examples will be dropped if they do not meet the minimum length requirement.
add_bos (bool): Whether to add a beginning-of-sentence token to each data example.
add_eos (bool): Whether to add an end-of-sentence token to each data example.
add_sep (bool): Whether to add a separation token to each data example (goes between prompt and answer).
tokens_to_generate (int): (Inference only) Number of tokens to generate during inference.
seed: Random seed for data shuffling.
max_num_samples: Maximum number of samples to load. This can be greater than the dataset length if you want to oversample data. If None, all samples will be loaded.
label_key: Key to use for the label in your JSONL file.
answer_only_loss: If True, compute the loss only on the answer part of the input. If False, compute the loss on the entire input.
truncation_field: Field to use for truncation (options: keys in prompt_template). This field is truncated if the combined length exceeds the max sequence length.
pad_to_max_length: Whether to pad the input to the max sequence length. If False, pads to the max length of the current batch.
index_mapping_dir: Directory to save the index mapping to. If None, writes to the same folder as the dataset.
prompt_template: Prompt template to inject via an f-string, formatted like 'Q: {context_key} \n A: {label_key}'.
hf_dataset: Whether to load the JSON file with the HuggingFace datasets library. Otherwise, loads the JSONL file with JSONLMemMapDataset.
global_sample_mapping: Whether to shuffle all data together, or shuffle the dataset within each epoch.
truncation_method: Position to truncate from. Options: ['left', 'right'].
special_tokens: Special tokens for the chat prompts, a dictionary of {token_type: token}. Default: {'system_turn_start': '<extra_id_0>', 'turn_start': '<extra_id_1>', 'label_start': '<extra_id_2>', 'end_of_turn': '\n', 'end_of_name': '\n'}.
is_test: Whether this dataset is the test split.
output_original_text (bool): If True, keeps the original text in the output alongside the tokenized ids.
get_attention_mask_from_fusion (bool): If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.
sanity_check_dist_workers (bool): If True, runs a sanity check across workers when building the mapping.
- _maybe_validate_prompt_template()#
- _build_samples_mapping()#
- _process_example(example)#
Create an example by concatenating text and answer. Truncation is carried out when needed, but it is performed only on the prompt side. BOS, EOS, and SEP are added if specified.
- collate_fn(batch)#
Collates a list of processed chat examples into a batch for model input.
This function takes a list of individual processed chat samples (from __getitem__, which internally uses _process_example) and groups them into a batch. It handles padding of sequences to the maximum length in the batch (or self.max_seq_length if pad_to_max_length is True), and prepares all necessary tensors for the model, similar to the base class collate_fn but specific to the chat data structure.
- Parameters:
batch (List[dict]) – A list of dictionaries, where each dictionary is a sample processed by _process_example.
- Returns:
A dictionary of batched tensors ready for model input. Key tensors include 'tokens', 'labels', 'loss_mask', 'position_ids', and 'attention_mask'.
- Return type:
dict
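A hedged sketch of creating a chat dataset via the factory (the chat JSONL schema itself is not documented on this page, so the file path is just a placeholder):

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

chat_ds = create_sft_dataset(
    path=Path("/data/sft/chat_train.jsonl"),  # hypothetical chat-format JSONL
    tokenizer=tokenizer,                      # assumed MegatronTokenizer instance
    seq_length=4096,
    chat=True,                                # returns a GPTSFTChatDataset
)
```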