nemo_rl.data.datasets.response_datasets.general_conversations_dataset#

Module Contents#

Classes#

GeneralConversationsJsonlDataset

Loads general conversation datasets where the JSON manifest files and the media files are stored as separate files (JSONL datasets).

Functions#

convert_metadata

conversation_process_message

Convert one conversation message from a string to a list of dictionaries representing media or text.

Data#

API#

nemo_rl.data.datasets.response_datasets.general_conversations_dataset.conversation_sender_mapping_sample_to_allowed#

None

nemo_rl.data.datasets.response_datasets.general_conversations_dataset.convert_metadata(metadata: Dict[str, Any])#
nemo_rl.data.datasets.response_datasets.general_conversations_dataset.conversation_process_message(
metadata: Dict[str, Any],
message: Dict[str, str],
media_index: dict,
raw: Optional[Dict[str, Any]] = None,
allow_empty_text: bool = False,
check_if_media_file_exist: bool = True,
tried_default_extensions: Optional[set] = None,
process_message_fragment: Callable = lambda tag, fragment: ...,
) list[Dict[str, Any]]#

Convert one conversation message from a string to a list of dictionaries representing media or text.

Parameters:
  • raw – dictionary with all WebDataset-compliant keys of a sample. Empty for JSONL datasets, non-empty otherwise.

  • metadata
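As an illustration of what this function does (not its actual implementation), a message value can be split on media tags into alternating text and media fragments. The helper name and the fragment dictionary shape below are assumptions for the sketch:

```python
import re

# Hypothetical sketch: split a conversation "value" string on media tags
# such as <sound>, <video>, <image> into a list of fragment dictionaries.
MEDIA_TAGS = ("sound", "video", "image")
TAG_PATTERN = re.compile(r"<(%s)>" % "|".join(MEDIA_TAGS))


def split_message_value(value: str) -> list[dict]:
    """Return fragments like {"type": "text", "text": ...} or {"type": "sound"}."""
    fragments = []
    pos = 0
    for match in TAG_PATTERN.finditer(value):
        text = value[pos:match.start()]
        if text.strip():
            fragments.append({"type": "text", "text": text})
        fragments.append({"type": match.group(1)})
        pos = match.end()
    tail = value[pos:]
    if tail.strip():
        fragments.append({"type": "text", "text": tail})
    return fragments
```

For example, `split_message_value("Describe the video: <video> Answer: ")` yields a text fragment, a video fragment, and a trailing text fragment, in order of appearance.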

class nemo_rl.data.datasets.response_datasets.general_conversations_dataset.GeneralConversationsJsonlDataset(
data_path: str,
media_data_dir: Optional[str] = None,
split_validation_size: float = 0,
seed: int = 42,
**kwargs,
)#

Bases: nemo_rl.data.datasets.raw_dataset.RawDataset

Loads general conversation datasets where the JSON manifest files and the media files are stored as separate files (JSONL datasets).

Each sample can be a single- or multi-turn conversation with multiple modalities, and each modality can have one or more media objects. There is no restriction on where a media tag (e.g. `<sound>`) may appear in the conversation.

The structure of the JSONL files can look like the following.

Example media filenames::

sample_000001.2345ew.flac
sample_000001.35tags.mp4
sample_000001.as23ds.jpg
sample_000001.gd1dtg.wav
sample_000001.gds233.jpg
sample_000002.asf234.wav
...

Example JSON structure::

{
  "sound": ["sample_000001.2345ew.flac", "sample_000001.gd1dtg.wav"],
  "video": "sample_000001.35tags.mp4",
  "image": ["sample_000001.as23ds.jpg", "sample_000001.gds233.jpg"],
  "conversations": [
    {
      "from": "user",
      "value": "<sound>"
    },
    {
      "from": "assistant",
      "value": "Automatic speech recognition is a technology that allows computers to recognize and transcribe spoken language. In the NeMo Framework, ASR is used for tasks such as speech-to-text and voice recognition."
    },
    {
      "from": "user",
      "value": "Describe what is NeMo based on the tutorial video: <video> and the information in the two images: <image> <image>. Combine that information with sound <sound>. Answer: "
    },
    {
      "from": "assistant",
      "value": "The NeMo Framework provides a range of tools and features for training and deploying ASR models, including model parallelism, data parallelism, and distributed checkpointing. This allows for faster training and inference times, as well as improved model accuracy and reliability."
    }
  ]
}
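A manifest entry like the one above can be produced programmatically. The sketch below builds the same structure (field names taken from the example; the assistant text is shortened for illustration) and serializes it as one JSONL line:

```python
import json

# Build a manifest entry matching the example structure above.
sample = {
    "sound": ["sample_000001.2345ew.flac", "sample_000001.gd1dtg.wav"],
    "video": "sample_000001.35tags.mp4",
    "image": ["sample_000001.as23ds.jpg", "sample_000001.gds233.jpg"],
    "conversations": [
        {"from": "user", "value": "<sound>"},
        {"from": "assistant", "value": "Transcription of the first recording."},
    ],
}

# Each JSONL line is one complete JSON object with no embedded newlines.
line = json.dumps(sample)
assert "\n" not in line
assert json.loads(line) == sample
```

Writing one such `json.dumps(sample)` line per sample produces a manifest this dataset class can be pointed at via `data_path`.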

Initialization

task_name#

‘general-conversation-jsonl’

classmethod process_message_fragment(
tag: str,
fragment: Any,
media_directory: Optional[str] = None,
) list[dict[str, Any]]#
classmethod _datum_preprocessor(
example: dict[str, Any],
media_directory: Optional[str] = None,
) dict[str, list[dict[str, Any]]]#

Convert the json structure into an OpenAI-API-like message log.
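As a rough illustration of the "OpenAI-API-like" shape, the manifest's `"from"`/`"value"` pairs map naturally onto `"role"`/`"content"` messages. This is a sketch only; the actual `_datum_preprocessor` also expands media tags and resolves media file paths, which is omitted here:

```python
# Hypothetical sketch: map the manifest's "from"/"value" conversation turns
# to OpenAI-style {"role": ..., "content": ...} messages. Media-tag
# expansion and path resolution are intentionally left out.
def to_message_log(example: dict) -> list[dict]:
    return [
        {"role": turn["from"], "content": turn["value"]}
        for turn in example["conversations"]
    ]
```

For instance, a two-turn conversation becomes a two-element list of role/content dictionaries in the same order as the source turns.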