nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset
nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset
Module Contents
Classes
Functions
Data
API
Bases: Dataset
Generic instruction-tuning dataset that maps arbitrary column names.
The class is intentionally lightweight: it simply loads the raw samples (either from HF or from local JSON/JSONL files) and remaps the columns so that downstream components can rely on a consistent field interface.
Optionally, if answer_only_loss_mask is requested, the dataset will also compute a loss_mask indicating which tokens should contribute to the loss (typically only those belonging to the assistant answer).
Returns the item at the given index.
Parameters:
The index of the item to return.
Returns:
A dictionary with the mapped columns.
Returns the length of the dataset.
Returns: int
The length of the dataset.
Tokenize a mapped sample and compute auxiliary fields.
If the tokenizer is provided:
- If the tokenizer supports a chat template, the dataset will be tokenized in a conversation style.
- Otherwise, the dataset will be tokenized in a simple prompt-completion style.
Parameters:
A dictionary with the mapped columns.
Returns: Dict[str, List[int]]
A dictionary with the tokenized columns.
Bases: enum.Enum
Supported logical column roles for text instruction datasets.
Check if all values in the sample are of the same length.
Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.
If path_or_dataset_id resembles a HF repo ID (i.e. of the form
org/dataset and the path does not exist on the local filesystem),
we defer to datasets.load_dataset directly. Otherwise, we assume the
argument points to one or more local JSON/JSONL files and let
datasets.load_dataset with the “json” script handle the parsing.
Parameters:
Either a HF dataset identifier (org/name) or
a path / list of paths to local .json / .jsonl files.
Optional split to load when retrieving a remote dataset. This parameter is ignored for local files as the json script always returns a single split.
Whether to stream the dataset.
Optional name of the dataset configuration/subset to load
Returns:
datasets.Dataset: The loaded dataset.
Check if a string is a valid huggingface dataset id.
Parameters:
A string to check.
Returns: bool
True if the string is a valid huggingface dataset id, False otherwise.
Utility that converts val into an iterator of strings.
The helper accepts either a single string or a list of strings and yields its contents. This is handy when we want to treat the two cases uniformly downstream (e.g. when iterating over data_files that can be provided as either a single path or a collection of paths).
Parameters:
Either a single string or a list/tuple of strings.
Raises:
ValueError: If val is neither a string nor an iterable of strings.