nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset#
Module Contents#
Classes#

ColumnMappedTextInstructionDataset – Generic instruction-tuning dataset that maps arbitrary column names.

Functions#

make_iterable – Utility that converts val into an iterator of strings.
_str_is_hf_repo_id – Check if a string is a valid Hugging Face dataset id.
_load_dataset – Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.
_apply_tokenizer_with_chat_template – Tokenization path when the tokenizer supports a chat template.
_apply_tokenizer_plain – Tokenization path when chat_template is not available.
_has_chat_template – Check if the tokenizer supports a chat template.
_check_all_values_equal_length – Check if all values in the sample are of the same length.
API#
- class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnTypes(*args, **kwds)[source]#
Bases: enum.Enum
- Context#
‘context’
- Question#
‘question’
- Answer#
‘answer’
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.make_iterable(
- val: Union[str, List[str]],
- )[source]#
Utility that converts val into an iterator of strings.
The helper accepts either a single string or a list of strings and yields its contents. This is handy when we want to treat the two cases uniformly downstream (e.g. when iterating over data_files that can be provided as either a single path or a collection of paths).
- Parameters:
val – Either a single string or a list/tuple of strings.
- Yields:
str – The individual strings contained in val.
- Raises:
ValueError – If val is neither a string nor an iterable of strings.
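For illustration, a minimal usage sketch (the file names are placeholders):

```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import (
    make_iterable,
)

# A single path and a list of paths are handled uniformly.
for path in make_iterable("train.jsonl"):
    print(path)  # train.jsonl

for path in make_iterable(["a.jsonl", "b.jsonl"]):
    print(path)  # a.jsonl, then b.jsonl
```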
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._str_is_hf_repo_id(val: str) → bool[source]#
Check if a string is a valid Hugging Face dataset id.
- Parameters:
val – A string to check.
- Returns:
True if the string is a valid Hugging Face dataset id, False otherwise.
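A plausible re-implementation of this heuristic, for illustration only (the actual predicate may differ):

```python
import os
import re

# Illustrative: treat a string as a HF repo id when it looks like
# "org/name" and no file or directory with that name exists locally.
_HF_REPO_RE = re.compile(r"^[\w.\-]+/[\w.\-]+$")

def looks_like_hf_repo_id(val: str) -> bool:
    return bool(_HF_REPO_RE.match(val)) and not os.path.exists(val)

print(looks_like_hf_repo_id("tatsu-lab/alpaca"))  # True (unless such a local path exists)
print(looks_like_hf_repo_id("train.jsonl"))       # False: no org/name structure
```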
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._load_dataset(
- path_or_dataset_id: Union[str, List[str]],
- split: Optional[str] = None,
- streaming: bool = False,
- )[source]#
Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.
If path_or_dataset_id resembles a HF repo ID (i.e. of the form org/dataset) and the path does not exist on the local filesystem, we defer to datasets.load_dataset directly. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let datasets.load_dataset with the "json" script handle the parsing.
- Parameters:
path_or_dataset_id – Either a HF dataset identifier (org/name) or a path / list of paths to local .json / .jsonl files.
split – Optional split to load when retrieving a remote dataset. This parameter is ignored for local files, as the json script always returns a single split.
streaming – Whether to load the dataset in streaming mode.
- Returns:
The loaded dataset.
- Return type:
datasets.Dataset
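The dispatch described above can be mirrored with the public datasets API; the dataset id and file path below are placeholders:

```python
from datasets import load_dataset

# Remote: a HF repo id is forwarded to load_dataset directly.
remote = load_dataset("tatsu-lab/alpaca", split="train")

# Local: JSON/JSONL paths go through the "json" loader, which always
# returns a single "train" split regardless of the requested split.
local = load_dataset("json", data_files=["train.jsonl"])["train"]
```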
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._apply_tokenizer_with_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
- context: str,
- question: str,
- answer: str,
- start_of_turn_token: Optional[str] = None,
- answer_only_loss_mask: bool = True,
- )[source]#
Tokenization path when the tokenizer supports a chat template.
- Parameters:
tokenizer – The tokenizer to use.
context – The context of the sample.
question – The question of the sample.
answer – The answer of the sample.
start_of_turn_token – The token to use to indicate the start of a turn.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the tokenized columns.
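A minimal sketch of the chat-template path, assuming an instruct-tuned tokenizer; the message layout and model id are illustrative, not the exact implementation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Context and question form the user turn; the answer is the assistant turn.
messages = [
    {"role": "user", "content": "Some context.\n\nWhat is being asked?"},
    {"role": "assistant", "content": "The answer."},
]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True)
```

With answer_only_loss_mask=True, tokens before the assistant turn would typically be masked out of the loss, with start_of_turn_token indicating where the answer begins.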
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._apply_tokenizer_plain(
- tokenizer: transformers.PreTrainedTokenizer,
- context: str,
- question: str,
- answer: str,
- answer_only_loss_mask: bool = True,
- )[source]#
Tokenization path when chat_template is not available.
- Parameters:
tokenizer – The tokenizer to use.
context – The context of the sample.
question – The question of the sample.
answer – The answer of the sample.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
- Returns:
A dictionary with the tokenized columns.
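A minimal sketch of the fallback, assuming a tokenizer without a chat template (e.g. gpt2); the concatenation details are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # no chat template

prompt = "Some context.\n\nWhat is being asked?\n"
answer = "The answer."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + answer_ids
# With answer_only_loss_mask=True, only the answer tokens keep a loss of 1.
loss_mask = [0] * len(prompt_ids) + [1] * len(answer_ids)
```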
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._has_chat_template(
- tokenizer: transformers.PreTrainedTokenizer,
- )[source]#
Check if the tokenizer supports a chat template.
- Parameters:
tokenizer – The tokenizer to check.
- Returns:
True if the tokenizer supports a chat template, False otherwise.
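One plausible way to perform such a check (an assumption, not necessarily the actual code):

```python
from transformers import AutoTokenizer

def has_chat_template(tokenizer) -> bool:
    # Tokenizers shipped with a chat template expose it via this attribute.
    return getattr(tokenizer, "chat_template", None) is not None

print(has_chat_template(AutoTokenizer.from_pretrained("gpt2")))  # False
```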
- nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._check_all_values_equal_length(
- sample: Dict[str, List[int]],
- )[source]#
Check if all values in the sample are of the same length.
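The invariant being checked, sketched for illustration: every tokenized field must have the same length.

```python
from typing import Dict, List

def all_values_equal_length(sample: Dict[str, List[int]]) -> bool:
    # A sample is consistent when e.g. input_ids, labels and loss_mask
    # all have identical lengths.
    lengths = {len(v) for v in sample.values()}
    return len(lengths) <= 1

print(all_values_equal_length({"input_ids": [1, 2, 3], "loss_mask": [0, 1, 1]}))  # True
```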
- class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset(
- path_or_dataset_id: Union[str, List[str]],
- column_mapping: Dict[str, str],
- tokenizer,
- *,
- split: Optional[str] = None,
- streaming: bool = False,
- answer_only_loss_mask: bool = True,
- start_of_turn_token: Optional[str] = None,
- )[source]#
Bases: torch.utils.data.Dataset
Generic instruction-tuning dataset that maps arbitrary column names.
The class is intentionally lightweight: it simply loads the raw samples (either from HF or from local JSON/JSONL files) and remaps the columns so that downstream components can rely on a consistent field interface.
Optionally, if answer_only_loss_mask is requested, the dataset will also compute a loss_mask indicating which tokens should contribute to the loss (typically only those belonging to the assistant answer).
Initialization
Initialize the dataset.
- Parameters:
path_or_dataset_id – The path or dataset id of the dataset.
column_mapping – The mapping of the columns.
tokenizer – The tokenizer to use.
split – The split of the dataset to load.
streaming – Whether to load the dataset in streaming mode.
answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.
start_of_turn_token – The token to use to indicate the start of a turn.
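A hypothetical instantiation; the dataset id and the direction of the column mapping (canonical field name → source column) are assumptions for illustration:

```python
from transformers import AutoTokenizer
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import (
    ColumnMappedTextInstructionDataset,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ds = ColumnMappedTextInstructionDataset(
    "tatsu-lab/alpaca",
    column_mapping={"context": "input", "question": "instruction", "answer": "output"},
    tokenizer=tokenizer,
    split="train",
    answer_only_loss_mask=True,
)
sample = ds[0]  # a dict of tokenized fields, e.g. input_ids and loss_mask
```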
- __len__() → int[source]#
Returns the length of the dataset.
- Returns:
The length of the dataset.
- Raises:
RuntimeError – If streaming is enabled.
- __getitem__(idx)[source]#
Returns the item at the given index.
- Parameters:
idx – The index of the item to return.
- Returns:
A dictionary with the mapped columns.
- Raises:
RuntimeError – If streaming is enabled.
- __iter__()[source]#
Iterate over the dataset yielding rows with the requested column mapping.
When streaming=True the underlying dataset is consumed lazily.
If the tokenizer is provided, it will be used to tokenize the dataset.
- Returns:
An iterator over the dataset.
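A sketch of streaming consumption, reusing the imports, tokenizer, and column mapping from the instantiation sketch above; with streaming=True, rows are pulled lazily through the iterator rather than by index:

```python
# streaming=True disables len()/indexing, so consume rows via iteration.
stream = ColumnMappedTextInstructionDataset(
    "tatsu-lab/alpaca",
    column_mapping={"context": "input", "question": "instruction", "answer": "output"},
    tokenizer=tokenizer,
    split="train",
    streaming=True,
)
for i, row in enumerate(stream):
    if i == 2:  # take a few rows, then stop
        break
```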
- _apply_tokenizer(
- sample: Dict[str, str],
- )[source]#
Tokenize a mapped sample and compute auxiliary fields.
If the tokenizer is provided:
- If the tokenizer supports a chat template, the sample is tokenized in a conversation style.
- Otherwise, the sample is tokenized in a simple prompt-completion style.
- Parameters:
sample – A dictionary with the mapped columns.
- Returns:
A dictionary with the tokenized columns.