nemo_rl.data.datasets.utils

Module Contents

Functions

assert_no_double_bos

Assert that the message does not start with two consecutive BOS tokens.

pil_to_base64

Converts a PIL Image object to a base64 encoded string.

load_dataset_from_path

Load a dataset from a JSON file or a Hugging Face dataset.

get_extra_kwargs

Get extra kwargs from the data config.

Data

TokenizerType

API

nemo_rl.data.datasets.utils.TokenizerType

None

nemo_rl.data.datasets.utils.assert_no_double_bos(
token_ids: torch.Tensor,
tokenizer: nemo_rl.data.datasets.utils.TokenizerType,
) -> None

Assert that the message does not start with two consecutive BOS tokens.

Parameters:
  • token_ids – Tensor of token IDs for the message.

  • tokenizer – The tokenizer that defines the BOS token.
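The check described above can be illustrated with a minimal sketch. It is not the library's implementation: `StubTokenizer` and `check_no_double_bos` are hypothetical names, a plain list of integers stands in for the `torch.Tensor` argument, and skipping the check when the tokenizer has no BOS token is an assumption.

```python
from typing import Optional, Sequence


class StubTokenizer:
    """Hypothetical stand-in exposing only the attribute the check needs."""

    def __init__(self, bos_token_id: Optional[int]) -> None:
        self.bos_token_id = bos_token_id


def check_no_double_bos(token_ids: Sequence[int], tokenizer: StubTokenizer) -> None:
    """Illustrative re-implementation of the documented check."""
    bos = tokenizer.bos_token_id
    if bos is None:
        # Assumption: a tokenizer without a BOS token has nothing to check.
        return
    if len(token_ids) > 1:
        # Fail only when the first two positions are both BOS.
        assert not (token_ids[0] == bos and token_ids[1] == bos), (
            "message starts with two consecutive BOS tokens"
        )


tokenizer = StubTokenizer(bos_token_id=1)
check_no_double_bos([1, 42, 7], tokenizer)  # passes: a single leading BOS

try:
    check_no_double_bos([1, 1, 42], tokenizer)
    double_bos_detected = False
except AssertionError:
    double_bos_detected = True
```

A double BOS typically appears when a chat template already inserts the BOS token and the tokenizer adds another one during encoding, which is why a check on the first two positions is enough for this sketch.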

nemo_rl.data.datasets.utils.pil_to_base64(image: PIL.Image.Image, format: str = 'PNG') -> str

Converts a PIL Image object to a base64 encoded string.

Parameters:
  • image – The PIL Image object to convert.

  • format – The image format (e.g., “PNG”, “JPEG”). Defaults to “PNG”.

Returns:

A base64 encoded string representation of the image.
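As an illustration of this kind of conversion, the following sketch encodes an image with Pillow and the standard `base64` module. `image_to_base64` is a hypothetical helper, not the library function. Whether the library's return value carries any prefix (for example a data URI scheme) is not stated here, so the sketch returns the bare base64 payload.

```python
import base64
import io

from PIL import Image


def image_to_base64(image: Image.Image, format: str = "PNG") -> str:
    """Illustrative encoder; the library's exact output format may differ."""
    buffer = io.BytesIO()
    image.save(buffer, format=format)  # serialize the image in the requested format
    return base64.b64encode(buffer.getvalue()).decode("utf-8")


red_square = Image.new("RGB", (4, 4), color="red")
encoded = image_to_base64(red_square)

# Round trip: decode the string and reopen the bytes as an image.
decoded = Image.open(io.BytesIO(base64.b64decode(encoded)))
```

The round trip at the end is a quick way to confirm that the encoded string still holds a valid image in the requested format.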

nemo_rl.data.datasets.utils.load_dataset_from_path(
data_path: str,
data_split: Optional[str] = 'train',
)

Load a dataset from a JSON file or a Hugging Face dataset.

Parameters:
  • data_path – Path to a JSON file or to a Hugging Face dataset.

  • data_split – The split to load from the dataset. Defaults to 'train'.

nemo_rl.data.datasets.utils.get_extra_kwargs(data_config: dict, keys: list[str]) -> dict

Get extra kwargs from the data config.

If the key is not in the data config, it will be ignored.

Parameters:
  • data_config – The data config.

  • keys – The keys to get from the data config.

Returns:

A dict containing the requested keys that are present in the data config.
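The "ignore missing keys" behavior can be sketched in a few lines. `pick_extra_kwargs` is a hypothetical stand-in written for illustration, and the config keys below are made up for the example.

```python
from typing import Any


def pick_extra_kwargs(data_config: dict[str, Any], keys: list[str]) -> dict[str, Any]:
    """Illustrative filter: keep only the requested keys that exist in the config."""
    return {key: data_config[key] for key in keys if key in data_config}


# Hypothetical config; "test_size" is requested but absent, so it is skipped.
data_config = {"dataset_name": "my_dataset", "seed": 42, "max_length": 1024}
extra = pick_extra_kwargs(data_config, ["seed", "test_size"])
print(extra)  # {'seed': 42}
```

Skipping absent keys, rather than filling them with `None`, lets the caller forward the result as `**extra` so that the callee's own defaults apply for anything the config omits.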