bridge.data.datasets.sft
#
Module Contents#
Classes#
Functions#
get_dataset_root – Returns the root directory for NeMo datasets, creating it if it doesn't exist.
create_sft_dataset – Creates and returns a supervised fine-tuning (SFT) dataset instance.
Data#
API#
- bridge.data.datasets.sft.DEFAULT_NEMO_CACHE_HOME#
None
- bridge.data.datasets.sft.NEMO_CACHE_HOME#
‘Path(…)’
- bridge.data.datasets.sft.DEFAULT_NEMO_DATASETS_CACHE#
None
- bridge.data.datasets.sft.NEMO_DATASETS_CACHE#
‘Path(…)’
- bridge.data.datasets.sft.DEFAULT_NEMO_MODELS_CACHE#
None
- bridge.data.datasets.sft.NEMO_MODELS_CACHE#
‘Path(…)’
- bridge.data.datasets.sft.logger#
‘getLogger(…)’
- bridge.data.datasets.sft.PREFIX_STR#
‘\x00’
- bridge.data.datasets.sft.__idx_version__#
‘0.2’
- bridge.data.datasets.sft.__idx_suffix__#
‘idx’
- bridge.data.datasets.sft.get_dataset_root(name: str) → pathlib.Path#
Returns the root directory for NeMo datasets, creating it if it doesn’t exist.
- Parameters:
name (str) – The name of the dataset, used to create a subdirectory within the NeMo datasets cache.
- Returns:
The path to the dataset’s root directory.
- Return type:
Path
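A minimal usage sketch (hedged: the import path below is taken from this page's module name and may differ in your installation):

```python
from pathlib import Path

from bridge.data.datasets.sft import get_dataset_root

# Resolve (and create, if missing) the cache directory for a dataset
# named "squad" under the NeMo datasets cache.
squad_root: Path = get_dataset_root("squad")
print(squad_root)  # <NEMO_DATASETS_CACHE>/squad
```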
- bridge.data.datasets.sft.create_sft_dataset(
- path: pathlib.Path,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- seq_length: int = 2048,
- add_bos: bool = False,
- add_eos: bool = True,
- add_sep: bool = False,
- seed: int = 1234,
- label_key: str = 'output',
- answer_only_loss: bool = True,
- truncation_field: str = 'input',
- pad_to_max_length: bool = False,
- index_mapping_dir: Optional[str] = None,
- prompt_template: str = '{input} {output}',
- truncation_method: str = 'right',
- memmap_workers: int = 2,
- hf_dataset: bool = False,
- global_sample_mapping: bool = False,
- get_attention_mask_from_fusion: bool = True,
- pack_metadata_file_path: pathlib.Path = None,
- pad_cu_seqlens: bool = False,
- chat: bool = False,
- **kwargs,
Creates and returns a supervised fine-tuning (SFT) dataset instance.
This function acts as a factory for different types of SFT datasets based on the input parameters. It can create standard SFT datasets, chat-specific datasets, or packed sequence datasets.
- Parameters:
path (Path) – Path to the dataset file. For packed datasets, this should be a .npy file.
tokenizer (MegatronTokenizer) – The tokenizer to use for tokenizing the data.
seq_length (int, optional) – Maximum sequence length for each example. Defaults to 2048.
add_bos (bool, optional) – Whether to add a beginning-of-sentence token. Defaults to False.
add_eos (bool, optional) – Whether to add an end-of-sentence token. Defaults to True.
add_sep (bool, optional) – Whether to add a separation token between prompt and answer. Defaults to False.
seed (int, optional) – Random seed for data shuffling. Defaults to 1234.
label_key (str, optional) – The key in the dataset corresponding to the label/output. Defaults to “output”.
answer_only_loss (bool, optional) – If True, compute loss only on the answer part. Defaults to True.
truncation_field (str, optional) – Field(s) to truncate if the combined length exceeds seq_length. Comma-separated if multiple. Defaults to "input".
pad_to_max_length (bool, optional) – Whether to pad all samples to max_seq_length. Defaults to False.
index_mapping_dir (Optional[str], optional) – Directory to store/load index mapping files. Defaults to None.
prompt_template (str, optional) – F-string template for combining input fields. Example: “{input} {output}”. Defaults to “{input} {output}”.
truncation_method (str, optional) – Method for truncation (‘left’ or ‘right’). Defaults to “right”.
memmap_workers (int, optional) – Number of workers for memory-mapped dataset loading. Defaults to 2.
hf_dataset (bool, optional) – Whether to load the dataset using HuggingFace's datasets library. Defaults to False.
global_sample_mapping (bool, optional) – Whether to use a global sample mapping for shuffling across all data, or shuffle within each epoch. Defaults to False.
get_attention_mask_from_fusion (bool) – If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.
pack_metadata_file_path (Path, optional) – Path to the metadata file for packed datasets. Required if pad_cu_seqlens is True. Defaults to None.
pad_cu_seqlens (bool, optional) – Whether to pad cu_seqlens for packed datasets, required for CUDA graphs. Defaults to False.
chat (bool, optional) – If True, creates a GPTSFTChatDataset. Defaults to False.
**kwargs – Additional keyword arguments passed to the specific dataset class constructor.
- Returns:
An instance of the appropriate SFT dataset class.
- Return type:
GPTSFTDataset (or a subclass such as GPTSFTPackedDataset or GPTSFTChatDataset, depending on the arguments)
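A hedged usage sketch of the factory. The file path is a placeholder, and tokenizer is assumed to be an already-constructed MegatronTokenizer instance:

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

dataset = create_sft_dataset(
    path=Path("/data/sft/train.jsonl"),  # hypothetical JSONL file
    tokenizer=tokenizer,                 # assumed MegatronTokenizer instance
    seq_length=4096,
    prompt_template="{input} {output}",
    answer_only_loss=True,
)

# Passing chat=True would instead return a GPTSFTChatDataset; pointing `path`
# at a packed .npy file targets the packed-sequence dataset instead.
```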
- class bridge.data.datasets.sft.GPTSFTDataset(
- file_path: str,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- max_seq_length: int = 1024,
- min_seq_length: int = 1,
- pad_seq_length_to_mult: int = 16,
- add_bos: bool = False,
- add_eos: bool = True,
- add_sep: bool = False,
- sep_id: int = None,
- max_num_samples: int = None,
- seed: int = 1234,
- label_key: str = 'answer',
- answer_only_loss: bool = True,
- truncation_field: str = 'text',
- pad_to_max_length: bool = False,
- index_mapping_dir: str = None,
- prompt_template: str = None,
- virtual_tokens: int = 0,
- tokens_to_generate: int = 0,
- memmap_workers: Optional[int] = None,
- hf_dataset: bool = False,
- global_sample_mapping: bool = False,
- truncation_method: str = 'right',
- special_tokens: Optional[Mapping[str, str]] = None,
- is_test: bool = False,
- output_original_text: bool = False,
- ceil_to_power_2: bool = False,
- get_attention_mask_from_fusion: bool = True,
- sanity_check_dist_workers: bool = True,
Bases:
torch.utils.data.Dataset
Initialization
file_path: Path to a JSONL GPT supervised fine-tuning dataset. Data is formatted as multiple JSON lines, each of the form {'input': 'John von Neumann \n Von Neumann made fundamental contributions … Q: What did the math of artificial viscosity do?', 'output': 'smoothed the shock transition without sacrificing basic physics'}.
tokenizer: Tokenizer for the dataset. Instance of a class that inherits MegatronTokenizer (e.g. SentencePiece).
max_seq_length (int): Maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
min_seq_length (int): Minimum length of each data example in the dataset. Data examples will be dropped if they do not meet the minimum length requirement.
add_bos (bool): Whether to add a beginning-of-sentence token to each data example.
add_eos (bool): Whether to add an end-of-sentence token to each data example.
add_sep (bool): Whether to add a separation token to each data example (goes between prompt and answer).
tokens_to_generate (int): (Inference only) Number of tokens to generate during inference.
seed: Random seed for data shuffling.
max_num_samples: Maximum number of samples to load. This can be greater than the dataset length if you want to oversample data. If None, all samples will be loaded.
label_key: Key to use for the label in your JSONL file.
answer_only_loss: If True, compute the loss only on the answer part of the input. If False, compute the loss on the entire input.
truncation_field: Field to use for truncation (options: keys in prompt_template). This field is truncated if the combined length exceeds the max sequence length.
pad_to_max_length: Whether to pad the input to the max sequence length. If False, pads to the max length of the current batch.
index_mapping_dir: Directory to save the index mapping to. If None, writes to the same folder as the dataset.
prompt_template: Prompt template to inject via an f-string, formatted like 'Q: {context_key} \n A: {label_key}'.
hf_dataset: Whether to load the JSON file with the HuggingFace datasets library. Otherwise, loads the JSONL file with JSONLMemMapDataset.
global_sample_mapping: Whether to shuffle all data together, or shuffle the dataset within each epoch.
truncation_method: Position to truncate from. Options: ['left', 'right'].
special_tokens: Special tokens for the chat prompts, a dictionary of {token_type: token}. Default: {'system_turn_start': '<extra_id_0>', 'turn_start': '<extra_id_1>', 'label_start': '<extra_id_2>', 'end_of_turn': '\n', 'end_of_name': '\n'}.
is_test: Whether this dataset is the test split.
output_original_text (bool): If True, keeps the original text in the output alongside the tokenized ids.
get_attention_mask_from_fusion (bool): If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.
sanity_check_dist_workers (bool): If True, runs a sanity check across workers when building the mapping.
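For reference, a hedged sketch of producing such a JSONL file and constructing the dataset directly (create_sft_dataset above is the usual entry point; the tokenizer instance is assumed to exist):

```python
import json

from bridge.data.datasets.sft import GPTSFTDataset

# One JSON object per line; keys must match prompt_template and label_key.
rows = [
    {
        "input": "Q: What did the math of artificial viscosity do?",
        "output": "smoothed the shock transition without sacrificing basic physics",
    },
]
with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

ds = GPTSFTDataset(
    file_path="train.jsonl",
    tokenizer=tokenizer,                 # assumed MegatronTokenizer instance
    max_seq_length=1024,
    prompt_template="{input} {output}",
    label_key="output",
    truncation_field="input",
)
```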
- _load_dataset()#
- _maybe_validate_prompt_template()#
- _build_samples_mapping()#
- __len__()#
Return the total number of samples in this dataset.
- __getitem__(idx)#
- _separate_template(prompt_template_values: List[str])#
Combine contexts and label based on prompt_template into a list of strings and a list of keys.
- Parameters:
prompt_template_values (List[str]) – the list of context and label strings extracted from the jsonl file with prompt_template_keys.
- Returns:
template_strings (List[str]): the separated prompt_template with the context/label placeholders filled with the corresponding strings; template_strings_keys (List[str]): strings that point to placeholder keys or mark literal template segments.
Examples:
prompt_template = 'Context: {context} Question: {question} Answer: {label}'
prompt_template_values = ['xxx', 'yyy', 'zzz']
With tokenizer.space_sensitive = True:
template_strings = ['Context:', ' xxx', ' Question:', ' yyy', ' Answer:', ' zzz']
With tokenizer.space_sensitive = False:
template_strings = ['Context:', ' xxx', 'Question:', 'yyy', 'Answer:', 'zzz']
template_strings_keys = ['', 'context', '', 'question', '', 'label']
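A rough, hypothetical sketch of the splitting behaviour illustrated above, using a regular expression over the {key} placeholders purely for illustration (this is not the library's implementation, and the exact spacing depends on tokenizer.space_sensitive):

```python
import re
from typing import Dict, List, Tuple

def split_template(prompt_template: str, values: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split a prompt template into literal chunks and filled placeholders."""
    strings: List[str] = []
    keys: List[str] = []
    last = 0
    for m in re.finditer(r"\{(\w+)\}", prompt_template):
        literal = prompt_template[last:m.start()]
        if literal.strip():
            strings.append(literal.strip())       # literal template segment
            keys.append("")
        strings.append(" " + values[m.group(1)])  # filled placeholder
        keys.append(m.group(1))
        last = m.end()
    return strings, keys

strings, keys = split_template(
    "Context: {context} Question: {question} Answer: {label}",
    {"context": "xxx", "question": "yyy", "label": "zzz"},
)
# strings == ['Context:', ' xxx', 'Question:', ' yyy', 'Answer:', ' zzz']
# keys    == ['', 'context', '', 'question', '', 'label']
```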
- _multiple_truncation(
- template_ids: List[List[int]],
- template_ids_keys: List[str],
Calculate total tokens and truncate multiple contexts in truncation_fields.
- Parameters:
template_ids (List[List[int]]) – the list of separate prompt_template ids.
template_ids_keys (List[str]) – the list of placeholder keys or literal template markers (used to check keys against truncation_fields).
- Returns:
context_ids (List[int]): all context ids; label_ids (List[int]): all label ids.
- _truncation(ids, expect_length)#
- _process_example(example)#
Create an example by concatenating text and answer. Truncation is carried out when needed, but it is performed only on the prompt side. BOS, EOS, and SEP are added if specified.
- _maybe_cast_to_list(x)#
- _ceil_to_nearest(n, m)#
- _collate_item(item, max_length, pad_id)#
- _build_loss_mask(processed_example)#
Pad input_ids in batch to max batch length while building loss mask
- _create_attention_mask(max_length)#
Creates an upper-triangular causal attention mask.
- Parameters:
max_length – The sequence length of the (square) causal mask to create.
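For intuition, a standalone sketch of a causal mask of this shape built with torch.tril (the exact tensor layout and dtype this class produces may differ):

```python
import torch

max_length = 6

# Lower-triangular ones: position i may attend to positions <= i.
causal = torch.tril(torch.ones(max_length, max_length))

# Many Megatron-style code paths use a boolean mask where True marks the
# *disallowed* positions, i.e. the strict upper triangle.
attention_mask = causal < 0.5
print(attention_mask.int())
```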
- collate_fn(batch)#
Collate a list of samples into a batch dictionary for model training or evaluation.
This function takes a list of individual processed samples (from __getitem__) and groups them into a batch. It handles padding of sequences to the maximum length found in the batch (or self.max_seq_length if pad_to_max_length is True), and prepares all necessary tensors for the model.
- Parameters:
batch (List[dict]) – A list of dictionaries, where each dictionary is a sample processed by _process_example.
- Returns:
A dictionary of batched tensors ready for model input. Key tensors include 'tokens', 'labels', 'loss_mask', 'position_ids', and 'attention_mask'.
- Return type:
dict
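A hedged sketch of wiring the dataset into a torch DataLoader with this collate_fn (the dataset ds is assumed to be constructed as in the earlier example):

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    ds,                       # assumed GPTSFTDataset instance (see above)
    batch_size=8,
    shuffle=False,            # ordering is typically driven by the samples mapping
    num_workers=2,
    collate_fn=ds.collate_fn,
)

batch = next(iter(loader))
# Per the docs above, the batch dict includes 'tokens', 'labels',
# 'loss_mask', 'position_ids', and 'attention_mask'.
print({k: getattr(v, "shape", v) for k, v in batch.items()})
```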
- class bridge.data.datasets.sft.GPTSFTPackedDataset(
- file_path: str,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- return_cu_seqlen: bool = True,
- pad_cu_seqlens: bool = False,
- pack_metadata_file_path: Optional[str] = None,
- **kwargs,
Bases:
bridge.data.datasets.sft.GPTSFTDataset
Initialization
file_path: See file_path in the parent class.
tokenizer: See tokenizer in the parent class.
return_cu_seqlen: Whether to return cu_seqlen to pass to the model. Having cu_seqlen in the model input enables the THD attention kernel, which is the correct format for training with packed sequences to prevent cross-sequence attention. This flag should be True unless you have a specific use case.
- __getitem__(idx)#
- _load_dataset()#
- _build_samples_mapping()#
- _build_loss_mask(processed_example)#
- _maybe_cast_to_list(x)#
- collate_fn(batch)#
Collates a list of packed sequence samples into a batch for the model.
This method is specifically designed for GPTSFTPackedDataset. It takes a list of packed sequence items (as returned by __getitem__) and prepares a batch of tensors. This includes handling cu_seqlens, which are crucial for the efficient processing of packed sequences with kernels like THD attention.
- Parameters:
batch (List[dict]) – A list of packed sequence samples.
- Returns:
A dictionary of batched tensors, including 'tokens', 'labels', 'loss_mask', 'position_ids', and potentially 'cu_seqlens', 'cu_seqlens_argmin', and 'max_seqlen' if return_cu_seqlen is True.
- Return type:
dict
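A hedged sketch of creating a packed-sequence dataset through the create_sft_dataset factory (file names are placeholders; the .npy file and metadata are assumed to come from a separate sequence-packing preprocessing step):

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

packed_ds = create_sft_dataset(
    path=Path("/data/sft/train_packed.npy"),                   # hypothetical packed file
    tokenizer=tokenizer,                                       # assumed MegatronTokenizer
    seq_length=4096,
    pack_metadata_file_path=Path("/data/sft/metadata.jsonl"),  # hypothetical metadata file
    pad_cu_seqlens=True,  # requires the metadata file, per the docs above
)

# Batches collated from this dataset additionally carry 'cu_seqlens' and
# related fields when return_cu_seqlen is True.
```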
- class bridge.data.datasets.sft.GPTSFTChatDataset(
- file_path: str,
- tokenizer: megatron.bridge.training.tokenizers.tokenizer.MegatronTokenizer,
- max_seq_length: int = 1024,
- min_seq_length: int = 1,
- pad_seq_length_to_mult: int = 16,
- add_bos: bool = False,
- add_eos: bool = True,
- add_sep: bool = False,
- sep_id: int = None,
- max_num_samples: int = None,
- seed: int = 1234,
- label_key: str = 'answer',
- answer_only_loss: bool = True,
- truncation_field: str = 'text',
- pad_to_max_length: bool = False,
- index_mapping_dir: str = None,
- prompt_template: str = None,
- virtual_tokens: int = 0,
- tokens_to_generate: int = 0,
- memmap_workers: Optional[int] = None,
- hf_dataset: bool = False,
- global_sample_mapping: bool = False,
- truncation_method: str = 'right',
- special_tokens: Optional[Mapping[str, str]] = None,
- is_test: bool = False,
- output_original_text: bool = False,
- ceil_to_power_2: bool = False,
- get_attention_mask_from_fusion: bool = True,
- sanity_check_dist_workers: bool = True,
Bases:
bridge.data.datasets.sft.GPTSFTDataset
Initialization
file_path: Path to a JSONL GPT supervised fine-tuning dataset. Data is formatted as multiple JSON lines, each of the form {'input': 'John von Neumann \n Von Neumann made fundamental contributions … Q: What did the math of artificial viscosity do?', 'output': 'smoothed the shock transition without sacrificing basic physics'}.
tokenizer: Tokenizer for the dataset. Instance of a class that inherits MegatronTokenizer (e.g. SentencePiece).
max_seq_length (int): Maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
min_seq_length (int): Minimum length of each data example in the dataset. Data examples will be dropped if they do not meet the minimum length requirement.
add_bos (bool): Whether to add a beginning-of-sentence token to each data example.
add_eos (bool): Whether to add an end-of-sentence token to each data example.
add_sep (bool): Whether to add a separation token to each data example (goes between prompt and answer).
tokens_to_generate (int): (Inference only) Number of tokens to generate during inference.
seed: Random seed for data shuffling.
max_num_samples: Maximum number of samples to load. This can be greater than the dataset length if you want to oversample data. If None, all samples will be loaded.
label_key: Key to use for the label in your JSONL file.
answer_only_loss: If True, compute the loss only on the answer part of the input. If False, compute the loss on the entire input.
truncation_field: Field to use for truncation (options: keys in prompt_template). This field is truncated if the combined length exceeds the max sequence length.
pad_to_max_length: Whether to pad the input to the max sequence length. If False, pads to the max length of the current batch.
index_mapping_dir: Directory to save the index mapping to. If None, writes to the same folder as the dataset.
prompt_template: Prompt template to inject via an f-string, formatted like 'Q: {context_key} \n A: {label_key}'.
hf_dataset: Whether to load the JSON file with the HuggingFace datasets library. Otherwise, loads the JSONL file with JSONLMemMapDataset.
global_sample_mapping: Whether to shuffle all data together, or shuffle the dataset within each epoch.
truncation_method: Position to truncate from. Options: ['left', 'right'].
special_tokens: Special tokens for the chat prompts, a dictionary of {token_type: token}. Default: {'system_turn_start': '<extra_id_0>', 'turn_start': '<extra_id_1>', 'label_start': '<extra_id_2>', 'end_of_turn': '\n', 'end_of_name': '\n'}.
is_test: Whether this dataset is the test split.
output_original_text (bool): If True, keeps the original text in the output alongside the tokenized ids.
get_attention_mask_from_fusion (bool): If True, lets the attention kernel handle creation of the causal mask instead of adding it to the batch dict.
sanity_check_dist_workers (bool): If True, runs a sanity check across workers when building the mapping.
- _maybe_validate_prompt_template()#
- _build_samples_mapping()#
- _process_example(example)#
Create an example by concatenating text and answer. Truncation is carried out when needed, but it is performed only on the prompt side. BOS, EOS, and SEP are added if specified.
- collate_fn(batch)#
Collates a list of processed chat examples into a batch for model input.
This function takes a list of individual processed chat samples (from __getitem__, which internally uses _process_example) and groups them into a batch. It handles padding of sequences to the maximum length in the batch (or self.max_seq_length if pad_to_max_length is True), and prepares all necessary tensors for the model, similar to the base class collate_fn but specific to the chat data structure.
- Parameters:
batch (List[dict]) – A list of dictionaries, where each dictionary is a sample processed by _process_example.
- Returns:
A dictionary of batched tensors ready for model input. Key tensors include 'tokens', 'labels', 'loss_mask', 'position_ids', and 'attention_mask'.
- Return type:
dict
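A hedged sketch of creating a chat dataset via the factory (the chat JSONL schema itself is not documented on this page, so the file path is just a placeholder):

```python
from pathlib import Path

from bridge.data.datasets.sft import create_sft_dataset

chat_ds = create_sft_dataset(
    path=Path("/data/sft/chat_train.jsonl"),  # hypothetical chat-format JSONL
    tokenizer=tokenizer,                      # assumed MegatronTokenizer instance
    seq_length=4096,
    chat=True,                                # returns a GPTSFTChatDataset
)
```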