nemo_rl.data.datasets.utils#
Module Contents#
Functions#
| assert_no_double_bos | Assert that there are no double starting BOS tokens in the message. |
| pil_to_base64 | Converts a PIL Image object to a base64 encoded string. |
| load_dataset_from_path | Load a dataset from a JSON file, Hugging Face dataset, or Arrow dataset (saved with save_to_disk). |
| get_extra_kwargs | Get extra kwargs from the data config. |
| update_single_dataset_config | Fill the single dataset config with default dataset config. |
| extract_necessary_env_names | Extract the necessary environment names from the data config. |
Data#
API#
- nemo_rl.data.datasets.utils.TokenizerType#
None
- nemo_rl.data.datasets.utils.assert_no_double_bos(token_ids: torch.Tensor, tokenizer: nemo_rl.data.datasets.utils.TokenizerType)#
Assert that there are no double starting BOS tokens in the message.
- Parameters:
token_ids – Tensor of token IDs for the message.
tokenizer – The tokenizer that defines the BOS token.
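The check described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes the token IDs are a plain list of integers and that the BOS ID is passed in directly, and the names `has_double_bos` and `check_no_double_bos` are hypothetical.

```python
from typing import List, Optional


def has_double_bos(token_ids: List[int], bos_token_id: Optional[int]) -> bool:
    """Return True if the sequence starts with two consecutive BOS tokens.

    Hypothetical helper: the documented function takes a tensor and a
    tokenizer; a list and a raw BOS ID are used here to stay dependency-free.
    """
    # A tokenizer without a BOS token, or a sequence shorter than two
    # tokens, cannot start with a doubled BOS.
    if bos_token_id is None or len(token_ids) < 2:
        return False
    return token_ids[0] == bos_token_id and token_ids[1] == bos_token_id


def check_no_double_bos(token_ids: List[int], bos_token_id: Optional[int]) -> None:
    """Raise AssertionError if the sequence starts with a doubled BOS."""
    assert not has_double_bos(token_ids, bos_token_id), (
        "Found a double BOS token at the start of the message"
    )


# A sequence with a single leading BOS (ID 1 here) passes silently.
check_no_double_bos([1, 5, 6], bos_token_id=1)
```

A doubled BOS typically appears when a chat template already inserts the token and the tokenizer adds it again, which is why only the first two positions are inspected in this sketch.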
- nemo_rl.data.datasets.utils.pil_to_base64(image: PIL.Image.Image, format: str = 'PNG') → str#
Converts a PIL Image object to a base64 encoded string.
- Parameters:
image – The PIL Image object to convert.
format – The image format (e.g., “PNG”, “JPEG”). Defaults to “PNG”.
- Returns:
A base64 encoded string representation of the image.
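A conversion of this kind is commonly written by saving the image into an in-memory buffer and base64-encoding the bytes. The sketch below shows that pattern under the assumption that `image` behaves like a `PIL.Image.Image`, meaning it exposes `save(fp, format=...)`. The name `image_to_base64` is hypothetical, and this is not necessarily how the library implements it.

```python
import base64
import io


def image_to_base64(image, format: str = "PNG") -> str:
    """Encode an image as a base64 string.

    Assumption: `image` is a PIL.Image.Image or any object exposing a
    PIL-style save(fp, format=...) method.
    """
    buffer = io.BytesIO()
    # Serialize the image into memory instead of writing a file to disk.
    image.save(buffer, format=format)
    # b64encode returns bytes; decode so the result is a regular str.
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```

With Pillow installed, passing `Image.new("RGB", (2, 2))` returns a string that `base64.b64decode` turns back into the PNG bytes of that image.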
- nemo_rl.data.datasets.utils.load_dataset_from_path(data_path: str, data_split: Optional[str] = 'train')#
Load a dataset from a JSON file, Hugging Face dataset, or Arrow dataset (saved with save_to_disk).
- Parameters:
data_path – The path to the dataset.
data_split – The split to load from the dataset.
- nemo_rl.data.datasets.utils.get_extra_kwargs(data_config: dict, keys: list[str]) → dict#
Get extra kwargs from the data config.
Keys that are not present in the data config are ignored.
- Parameters:
data_config – The data config.
keys – The keys to get from the data config.
- Returns:
The extra kwargs.
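The documented behavior (copy the requested keys, skip the ones that are absent) can be sketched as a dictionary comprehension. This is an illustrative stand-in and not the library's code. The function name `pick_present_keys` and the config keys in the example are hypothetical.

```python
from typing import Dict, List


def pick_present_keys(data_config: dict, keys: List[str]) -> Dict[str, object]:
    """Return the subset of `data_config` restricted to `keys`.

    Keys that are missing from `data_config` are skipped rather than
    raising KeyError.
    """
    return {key: data_config[key] for key in keys if key in data_config}


# Hypothetical config: "seed" is present, "split" is not.
config = {"prompt_file": "prompts/math.txt", "seed": 42}
extra = pick_present_keys(config, ["seed", "split"])
print(extra)  # {'seed': 42}
```

Because missing keys are skipped, the result can be unpacked with `**extra` into a callee that supplies its own defaults for anything not configured.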
- nemo_rl.data.datasets.utils.update_single_dataset_config(data_config: dict, default_data_config: dict)#
Fill the single dataset config with default dataset config.
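One plausible reading of "fill with defaults" is sketched below. It assumes the fill is shallow, happens in place, and never overwrites a key the dataset config already sets. The reference does not state any of these details, so treat them as assumptions. The name `fill_with_defaults` and the example keys are hypothetical.

```python
def fill_with_defaults(data_config: dict, default_data_config: dict) -> None:
    """Copy missing top-level keys from the defaults into `data_config`.

    Assumptions: the update is in place, only top-level keys are
    considered, and existing values always win over defaults.
    """
    for key, value in default_data_config.items():
        # setdefault only writes when the key is absent.
        data_config.setdefault(key, value)


# Hypothetical configs for illustration.
dataset_cfg = {"dataset_name": "my_dataset", "seed": 1}
defaults = {"seed": 0, "split": "train"}
fill_with_defaults(dataset_cfg, defaults)
print(dataset_cfg)  # {'dataset_name': 'my_dataset', 'seed': 1, 'split': 'train'}
```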
- nemo_rl.data.datasets.utils.extract_necessary_env_names(data_config: dict) → list[str]#
Extract the necessary environment names from the data config.
Some environments are set in env_configs but not used in the data config. This function returns only the names of the environments that the data config actually uses.
- Parameters:
data_config – The data config.
- Returns:
The necessary environment names.
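The idea of keeping only the environments a data config refers to can be sketched as follows. The config layout here is entirely an assumption made for illustration: the reference does not say which keys hold environment names, so the `train` and `validation` sections, the `env_name` field, and the function name `collect_env_names` are all hypothetical.

```python
from typing import Dict, List


def collect_env_names(data_config: dict) -> List[str]:
    """Collect the distinct environment names referenced by a data config.

    Hypothetical layout: each of the "train" and "validation" sections
    may carry an "env_name" entry.
    """
    names: List[str] = []
    for section in ("train", "validation"):
        # A section may be missing or explicitly set to None.
        entry = data_config.get(section) or {}
        name = entry.get("env_name")
        if name is not None and name not in names:
            names.append(name)
    return names


# Two environments are configured, but the data config only uses one.
env_configs: Dict[str, dict] = {"math": {"num_workers": 8}, "code": {"num_workers": 4}}
data_cfg = {"train": {"env_name": "math"}, "validation": {"env_name": "math"}}
needed = collect_env_names(data_cfg)
active_envs = {name: env_configs[name] for name in needed}
print(needed)  # ['math']
```

Filtering `env_configs` down to `active_envs` this way avoids starting environments that no dataset will ever use.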