bridge.data.datasets.sft#

Module Contents#

Classes#

GPTSFTDataset

Dataset class for GPT supervised fine-tuning (SFT) on JSONL data.

Functions#

get_dataset_root

Returns the root directory for NeMo datasets, creating it if it doesn’t exist.

create_sft_dataset

Creates and returns a supervised fine-tuning (SFT) dataset instance.

Data#

API#

bridge.data.datasets.sft.DEFAULT_NEMO_CACHE_HOME#

None

bridge.data.datasets.sft.NEMO_CACHE_HOME#

'Path(...)'

bridge.data.datasets.sft.DEFAULT_NEMO_DATASETS_CACHE#

None

bridge.data.datasets.sft.NEMO_DATASETS_CACHE#

'Path(...)'

bridge.data.datasets.sft.DEFAULT_NEMO_MODELS_CACHE#

None

bridge.data.datasets.sft.NEMO_MODELS_CACHE#

'Path(...)'

bridge.data.datasets.sft.logger#

'getLogger(...)'

bridge.data.datasets.sft.PREFIX_STR#

'\x00'

bridge.data.datasets.sft.__idx_version__#

'0.2'

bridge.data.datasets.sft.__idx_suffix__#

'idx'
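These cache constants are typically derived from environment variables with home-directory fallbacks. A minimal sketch of plausible definitions follows; the environment variable names (NEMO_HOME, NEMO_DATASETS_CACHE, NEMO_MODELS_CACHE) and default locations are assumptions, not confirmed by this page:

```python
import logging
import os
from pathlib import Path

# Assumed defaults -- the actual module may use different
# environment variable names or fallback locations.
DEFAULT_NEMO_CACHE_HOME = Path.home() / ".cache" / "nemo"
NEMO_CACHE_HOME = Path(os.getenv("NEMO_HOME", DEFAULT_NEMO_CACHE_HOME))

DEFAULT_NEMO_DATASETS_CACHE = NEMO_CACHE_HOME / "datasets"
NEMO_DATASETS_CACHE = Path(os.getenv("NEMO_DATASETS_CACHE", DEFAULT_NEMO_DATASETS_CACHE))

DEFAULT_NEMO_MODELS_CACHE = NEMO_CACHE_HOME / "models"
NEMO_MODELS_CACHE = Path(os.getenv("NEMO_MODELS_CACHE", DEFAULT_NEMO_MODELS_CACHE))

logger = logging.getLogger(__name__)
```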

bridge.data.datasets.sft.get_dataset_root(name: str) → pathlib.Path#

Returns the root directory for NeMo datasets, creating it if it doesn’t exist.

Parameters:

name (str) – The name of the dataset, used to create a subdirectory within the NeMo datasets cache.

Returns:

The path to the dataset’s root directory.

Return type:

Path
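For example, resolving (and creating, if needed) the cache directory for a hypothetical dataset named "squad":

```python
from bridge.data.datasets.sft import get_dataset_root

# Resolves to <NEMO_DATASETS_CACHE>/squad, creating the directory if missing.
root = get_dataset_root("squad")
print(root)  # e.g. ~/.cache/nemo/datasets/squad
```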

bridge.data.datasets.sft.create_sft_dataset(
path: pathlib.Path,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
seq_length: int = 2048,
add_bos: bool = False,
add_eos: bool = True,
add_sep: bool = False,
seed: int = 1234,
label_key: str = 'output',
answer_only_loss: bool = True,
truncation_field: str = 'input',
pad_to_max_length: bool = False,
index_mapping_dir: Optional[str] = None,
prompt_template: str = '{input} {output}',
truncation_method: str = 'right',
memmap_workers: int = 2,
hf_dataset: bool = False,
global_sample_mapping: bool = False,
get_attention_mask_from_fusion: bool = True,
pack_metadata_file_path: Optional[pathlib.Path] = None,
pad_cu_seqlens: bool = False,
chat: bool = False,
**kwargs,
) → GPTSFTDataset#

Creates and returns a supervised fine-tuning (SFT) dataset instance.

This function acts as a factory for different types of SFT datasets based on the input parameters. It can create standard SFT datasets, chat-specific datasets, or packed sequence datasets.

Parameters:
  • path (Path) – Path to the dataset file. For packed datasets, this should be a .npy file.

  • tokenizer (MegatronTokenizer) – The tokenizer to use for tokenizing the data.

  • seq_length (int, optional) – Maximum sequence length for each example. Defaults to 2048.

  • add_bos (bool, optional) – Whether to add a beginning-of-sentence token. Defaults to False.

  • add_eos (bool, optional) – Whether to add an end-of-sentence token. Defaults to True.

  • add_sep (bool, optional) – Whether to add a separation token between prompt and answer. Defaults to False.

  • seed (int, optional) – Random seed for data shuffling. Defaults to 1234.

  • label_key (str, optional) – The key in the dataset corresponding to the label/output. Defaults to “output”.

  • answer_only_loss (bool, optional) – If True, compute loss only on the answer part. Defaults to True.

  • truncation_field (str, optional) – Field(s) to truncate if the combined length exceeds seq_length. Comma-separated if multiple. Defaults to “input”.

  • pad_to_max_length (bool, optional) – Whether to pad all samples to max_seq_length. Defaults to False.

  • index_mapping_dir (Optional[str], optional) – Directory to store/load index mapping files. Defaults to None.

  • prompt_template (str, optional) – F-string template for combining input fields. Example: “{input} {output}”. Defaults to “{input} {output}”.

  • truncation_method (str, optional) – Method for truncation (‘left’ or ‘right’). Defaults to “right”.

  • memmap_workers (int, optional) – Number of workers for memory-mapped dataset loading. Defaults to 2.

  • hf_dataset (bool, optional) – Whether to load the dataset using HuggingFace’s datasets library. Defaults to False.

  • global_sample_mapping (bool, optional) – Whether to use a global sample mapping for shuffling across all data, or shuffle within each epoch. Defaults to False.

  • get_attention_mask_from_fusion (bool, optional) – If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict. Defaults to True.

  • pack_metadata_file_path (Path, optional) – Path to the metadata file for packed datasets. Required if pad_cu_seqlens is True. Defaults to None.

  • pad_cu_seqlens (bool, optional) – Whether to pad cu_seqlens for packed datasets, required for cudagraphs. Defaults to False.

  • chat (bool, optional) – If True, creates a GPTSFTChatDataset. Defaults to False.

  • **kwargs – Additional keyword arguments passed to the specific dataset class constructor.

Returns:

An instance of the appropriate SFT dataset class.

Return type:

GPTSFTDataset | GPTSFTChatDataset | GPTSFTPackedDataset
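A minimal usage sketch, assuming a MegatronTokenizer instance has already been constructed (tokenizer setup is model-specific and omitted here; the file name is illustrative):

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

# `tokenizer` is assumed to be a pre-built MegatronTokenizer instance.
dataset = create_sft_dataset(
    path=Path("train.jsonl"),           # JSONL with "input"/"output" fields
    tokenizer=tokenizer,
    seq_length=2048,
    prompt_template="{input} {output}",
    answer_only_loss=True,              # compute loss only on the answer span
)

# Passing chat=True returns a GPTSFTChatDataset instead, and a .npy `path`
# selects the packed-sequence variant (GPTSFTPackedDataset).
```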

class bridge.data.datasets.sft.GPTSFTDataset(
file_path: str,
tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
max_seq_length: int = 1024,
min_seq_length: int = 1,
pad_seq_length_to_mult: int = 16,
add_bos: bool = False,
add_eos: bool = True,
add_sep: bool = False,
sep_id: Optional[int] = None,
max_num_samples: Optional[int] = None,
seed: int = 1234,
label_key: str = 'answer',
answer_only_loss: bool = True,
truncation_field: str = 'text',
pad_to_max_length: bool = False,
index_mapping_dir: Optional[str] = None,
prompt_template: Optional[str] = None,
virtual_tokens: int = 0,
tokens_to_generate: int = 0,
memmap_workers: Optional[int] = None,
hf_dataset: bool = False,
global_sample_mapping: bool = False,
truncation_method: str = 'right',
special_tokens: Optional[Mapping[str, str]] = None,
is_test: bool = False,
output_original_text: bool = False,
ceil_to_power_2: bool = False,
get_attention_mask_from_fusion: bool = True,
sanity_check_dist_workers: bool = True,
)#

Bases: torch.utils.data.Dataset

Initialization

Parameters:
  • file_path (str) – Path to a JSONL GPT supervised fine-tuning dataset. Data is formatted as multiple JSON lines, each like: {'input': 'John von Neumann\nVon Neumann made fundamental contributions … Q: What did the math of artificial viscosity do?', 'output': 'smoothed the shock transition without sacrificing basic physics'}

  • tokenizer (MegatronTokenizer) – Tokenizer for the dataset. An instance of a class that inherits MegatronTokenizer (e.g., SentencePiece).

  • max_seq_length (int) – Maximum sequence length for each dataset example. Examples are either truncated to fit this length or dropped if they cannot be truncated.

  • min_seq_length (int) – Minimum length of each data example. Examples that do not meet the minimum length are dropped.

  • add_bos (bool) – Whether to add a beginning-of-sentence token to each data example.

  • add_eos (bool) – Whether to add an end-of-sentence token to each data example.

  • add_sep (bool) – Whether to add a separation token to each data example (goes between prompt and answer).

  • tokens_to_generate (int) – (Inference only) Number of tokens to generate during inference.

  • seed (int) – Random seed for data shuffling.

  • max_num_samples (Optional[int]) – Maximum number of samples to load. Can be greater than the dataset length if you want to oversample data. If None, all samples are loaded.

  • label_key (str) – Key to use for the label in your JSONL file.

  • answer_only_loss (bool) – If True, compute the loss only on the answer part of the input. If False, compute the loss on the entire input.

  • truncation_field (str) – Field to be used for truncation if the combined length exceeds the max sequence length (options: keys in prompt_template).

  • pad_to_max_length (bool) – Whether to pad the input to the max sequence length. If False, pads to the max length of the current batch.

  • index_mapping_dir (Optional[str]) – Directory to save the index mapping to. If None, writes to the same folder as the dataset.

  • prompt_template (Optional[str]) – Prompt template to inject via an f-string, e.g. 'Q: {context_key}\n\nA: {label_key}'.

  • hf_dataset (bool) – Whether to load the JSON file with the HuggingFace datasets library. Otherwise, loads the JSONL file with JSONLMemMapDataset.

  • global_sample_mapping (bool) – Whether to shuffle all data together, or shuffle the dataset within each epoch.

  • truncation_method (str) – Position to truncate from. Options: ['left', 'right'].

  • special_tokens (Optional[Mapping[str, str]]) – Special tokens for chat prompts, a dictionary of {token_type: token}. Default: {'system_turn_start': '<extra_id_0>', 'turn_start': '<extra_id_1>', 'label_start': '<extra_id_2>', 'end_of_turn': '\n', 'end_of_name': '\n'}.

  • is_test (bool) – Whether this dataset is the test split.

  • output_original_text (bool) – If True, keeps the original text in the output alongside the tokenized ids.

  • get_attention_mask_from_fusion (bool) – If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.

  • sanity_check_dist_workers (bool) – If True, runs a sanity check across workers when building the index mapping.
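As a usage sketch, a single JSONL record and a matching instantiation might look like the following (the file name is hypothetical; tokenizer construction is omitted as above):

```python
from bridge.data.datasets.sft import GPTSFTDataset

# One line of train.jsonl (shown as a comment for readability):
# {"input": "Q: What did the math of artificial viscosity do?",
#  "output": "smoothed the shock transition without sacrificing basic physics"}

dataset = GPTSFTDataset(
    file_path="train.jsonl",
    tokenizer=tokenizer,                 # assumed MegatronTokenizer instance
    max_seq_length=1024,
    label_key="output",                  # this file uses "output", not the default "answer"
    prompt_template="{input} {output}",
    truncation_field="input",
)
sample = dataset[0]                      # tokenized example via __getitem__
```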

_load_dataset()#
_maybe_validate_prompt_template()#
_build_samples_mapping()#
__len__()#

Return the total number of samples in this dataset.

__getitem__(idx)#
_separate_template(prompt_template_values: List[str])#

Combine contexts and label based on prompt_template into a list of strings and a list of keys.

Parameters:

prompt_template_values (List[str]) – the list of context and label strings extracted from the JSONL file using prompt_template_keys.

Returns:

  • template_strings (List[str]) – the separated prompt_template with context/label placeholders filled in with the corresponding strings.

  • template_strings_keys (List[str]) – strings that point to placeholder keys, or <template> for literal template segments.
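As a rough, self-contained sketch of the idea (not the actual implementation), separating a template into literal segments and filled placeholders might look like this:

```python
import re
from typing import Dict, List, Tuple


def separate_template_sketch(
    prompt_template: str, values: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Illustrative only: split a template into literal chunks and
    filled-in placeholder chunks, recording where each chunk came from."""
    strings: List[str] = []
    keys: List[str] = []
    pos = 0
    for match in re.finditer(r"\{(\w+)\}", prompt_template):
        literal = prompt_template[pos : match.start()]
        if literal:
            strings.append(literal)
            keys.append("<template>")      # literal template text
        strings.append(values[match.group(1)])
        keys.append(match.group(1))        # placeholder key for this chunk
        pos = match.end()
    if pos < len(prompt_template):
        strings.append(prompt_template[pos:])
        keys.append("<template>")
    return strings, keys


# separate_template_sketch("Context: {context} Answer: {label}",
#                          {"context": "some passage", "label": "some answer"})
# -> (['Context: ', 'some passage', ' Answer: ', 'some answer'],
#     ['<template>', 'context', '<template>', 'label'])
```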