`nemo_rl.data.datasets.utils`

Module Contents

`assert_no_double_bos`	Assert that there are no double starting BOS tokens in the message.
`pil_to_base64`	Converts a PIL Image object to a base64 encoded string.
`load_dataset_from_path`	Load a dataset from a JSON file, a Hugging Face dataset, or an Arrow dataset (saved with `save_to_disk`).
`get_extra_kwargs`	Get extra kwargs from the data config.

nemo_rl.data.datasets.utils.assert_no_double_bos(token_ids: torch.Tensor, tokenizer: nemo_rl.data.datasets.utils.TokenizerType) → None

Assert that there are no double starting BOS tokens in the message.

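Parameters:

token_ids (torch.Tensor) – The token IDs of the message to check.

tokenizer (TokenizerType) – The tokenizer that defines the BOS token.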
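As a rough illustration of the condition this assertion guards against, the sketch below checks a plain sequence of token IDs rather than a `torch.Tensor`. The helper name `check_no_double_bos` and the explicit `bos_token_id` argument are assumptions made for this example. They are not part of `nemo_rl`, and the library's own handling (for example, of tokenizers without a BOS token) may differ.

```python
from typing import Optional, Sequence


def check_no_double_bos(token_ids: Sequence[int], bos_token_id: Optional[int]) -> None:
    """Hypothetical stand-in: fail if the sequence starts with two BOS tokens."""
    if bos_token_id is None:
        # Assumption: with no BOS token defined there is nothing to check.
        return
    starts_with_double_bos = (
        len(token_ids) >= 2
        and token_ids[0] == bos_token_id
        and token_ids[1] == bos_token_id
    )
    assert not starts_with_double_bos, "Found two BOS tokens at the start of the message."


# A single leading BOS token (ID 1 here, chosen arbitrarily) passes silently.
check_no_double_bos([1, 42, 43], bos_token_id=1)
```

A doubled BOS token typically appears when a chat template already inserts the BOS token and the tokenizer then adds another one during encoding.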

nemo_rl.data.datasets.utils.pil_to_base64(image: PIL.Image.Image, format: str = 'PNG') → str

Converts a PIL Image object to a base64 encoded string.

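Parameters:

image (PIL.Image.Image) – The PIL Image object to convert.

format (str) – The image format used for encoding. Defaults to 'PNG'.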

Returns:

A base64 encoded string representation of the image.
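The conversion can be pictured as two steps: serialize the image into an in-memory buffer, then base64-encode the resulting bytes. The sketch below is a minimal illustration under that assumption, not the library's implementation. `image_to_base64` and `FakeImage` are hypothetical names, and `FakeImage` only stands in for a real `PIL.Image.Image` (whose `save` method likewise accepts a file-like object and a `format`) so that the example runs without Pillow installed.

```python
import base64
import io


def image_to_base64(image, format: str = "PNG") -> str:
    """Hypothetical sketch: serialize an image in memory, then base64-encode the bytes."""
    buffer = io.BytesIO()
    # PIL.Image.Image.save accepts a file-like object and a format name.
    image.save(buffer, format=format)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


class FakeImage:
    """Stand-in for PIL.Image.Image so the sketch runs without Pillow installed."""

    def __init__(self, payload: bytes) -> None:
        self.payload = payload

    def save(self, fp, format=None) -> None:
        # A real image would write encoded PNG/JPEG bytes here.
        fp.write(self.payload)


encoded = image_to_base64(FakeImage(b"hello"))
print(encoded)  # aGVsbG8=
```

Decoding the string with `base64.b64decode` recovers the original encoded image bytes.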

nemo_rl.data.datasets.utils.load_dataset_from_path(data_path: str, data_split: Optional[str] = 'train')

Load a dataset from a JSON file, a Hugging Face dataset, or an Arrow dataset (saved with `save_to_disk`).

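Parameters:

data_path (str) – The path to the dataset.

data_split (Optional[str]) – The split to load from the dataset. Defaults to 'train'.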
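Because a single `data_path` string can refer to three kinds of sources, it may help to see one way such a path could be told apart. The sketch below is an assumption for illustration only: `guess_dataset_kind` is a hypothetical helper, and the rules it applies (file suffix for JSON, local directory for Arrow, anything else for a Hugging Face dataset name) are not confirmed to match how `load_dataset_from_path` decides.

```python
import os
import tempfile


def guess_dataset_kind(data_path: str) -> str:
    """Hypothetical helper: guess which of the three source kinds a path refers to."""
    suffix = os.path.splitext(data_path)[-1].lower()
    if suffix in (".json", ".jsonl"):
        return "json"
    if os.path.isdir(data_path):
        # Assumption: a local directory was written by Dataset.save_to_disk.
        return "arrow"
    # Assumption: anything else is treated as a Hugging Face dataset name.
    return "huggingface"


print(guess_dataset_kind("data/train.jsonl"))
with tempfile.TemporaryDirectory() as saved_dir:
    print(guess_dataset_kind(saved_dir))
```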

nemo_rl.data.datasets.utils.get_extra_kwargs(data_config: dict, keys: list[str]) → dict

Get extra kwargs from the data config.

Keys that are not present in the data config are ignored.

Parameters:

data_config (dict) – The data config.

keys (list[str]) – The keys to look up in the data config.

Returns:

The extra kwargs: a dict holding the requested keys that are present in the data config.
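The behavior described above, where requested keys are copied if present and skipped otherwise, can be sketched with a dict comprehension. `pick_existing_keys` and the config keys used here are hypothetical and chosen only for illustration. The sketch is not the library's implementation.

```python
from typing import Any


def pick_existing_keys(data_config: dict[str, Any], keys: list[str]) -> dict[str, Any]:
    """Hypothetical sketch: keep only the requested keys that exist in the config."""
    return {key: data_config[key] for key in keys if key in data_config}


# Example config with made-up keys; "prompt_file" is requested but absent.
data_config = {"dataset_name": "example", "seed": 42, "split": "train"}
extra_kwargs = pick_existing_keys(data_config, ["seed", "split", "prompt_file"])
print(extra_kwargs)  # {'seed': 42, 'split': 'train'}
```

Skipping missing keys instead of raising lets a caller pass the result straight through as optional keyword arguments, so unset options fall back to the callee's defaults.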