SpeechLLM API#

Model Classes#

class nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel(*args: Any, **kwargs: Any)#

Bases: NLPModel

Megatron base class. All NeMo Megatron models inherit from this class.

  • Initializes the model parallel world for NeMo.

  • Turns on all of the NVIDIA optimizations.

  • If cfg.tokenizer is available, loads the tokenizer and pads the vocabulary to the correct size for tensor model parallelism.

  • If using the distributed optimizer, configures it to be compatible with O2-level optimizations and/or model parallelism.

  • Performs gradient clipping: grad_clip_pl_default triggers the PyTorch Lightning default implementation, with_distributed_adam triggers the distributed optimizer's implementation, megatron_amp_O2 triggers gradient clipping on the main grads, and otherwise gradient clipping is performed on the model grads.
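
As an illustration, a hedged config sketch of how these paths are commonly selected; the key names follow NeMo conventions but should be verified against your NeMo version:

from omegaconf import OmegaConf

# Illustrative only: which gradient-clipping path runs is driven by config.
cfg = OmegaConf.create({
    "megatron_amp_O2": True,               # clip on the main (fp32) grads
    "optim": {
        "name": "distributed_fused_adam",  # distributed optimizer -> its own clipping
        "lr": 1e-4,
    },
})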

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
no_lm_init=True,
)

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – PyTorch Lightning Trainer instance
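
For illustration, a minimal sketch of such a cfg (the keys inside each sub-config are placeholders for this example, not the full schema):

from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "train_ds": {"manifest_filepath": "train_manifest.json", "batch_size": 4},
    "validation_ds": {"manifest_filepath": "val_manifest.json", "batch_size": 4},
    "test_ds": {"manifest_filepath": "test_manifest.json", "batch_size": 4},
    "optim": {
        "name": "adamw",
        "lr": 1e-4,
        # learning rate scheduler attached to the optimizer
        "sched": {"name": "CosineAnnealing", "warmup_steps": 100},
    },
})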

class nemo.collections.multimodal.speech_llm.models.modular_models.ModularAudioGPTModel(*args: Any, **kwargs: Any)#

Bases: SpeechLLMAdapterMixin, MegatronGPTSFTModel

Modularized speech GPT model.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – Pytorch Lightning Trainer instance

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to be initialized first. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter)#

We pass the dataloader iterator function to the micro-batch scheduler. The input batch for each micro-batch is fetched using the dataloader function inside the micro-batch forward function.

validation_step(dataloader_iter)#

Our dataloader produces a micro-batch at a time; we fetch a number of micro-batches from it, determined by the global batch size and model parallel size, to produce a list of micro-batches. The list of micro-batches is then piped through the pipeline using Megatron-Core fwd/bwd functions.
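
To make the batching arithmetic concrete, a small illustrative calculation (the real accounting lives in the micro-batch scheduler):

# Illustrative only: how many micro-batches are drawn per global step.
global_batch_size = 128
micro_batch_size = 4
data_parallel_size = 8  # world_size // (tensor_parallel * pipeline_parallel)

num_micro_batches = global_batch_size // (micro_batch_size * data_parallel_size)
print(num_micro_batches)  # 4 micro-batches piped through fwd/bwd per step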

class nemo.collections.multimodal.speech_llm.models.modular_models.CrossAttendModularAudioGPTModel(*args: Any, **kwargs: Any)#

Bases: ModularAudioGPTModel

Modularized speech GPT model that uses cross-attention between speech and text embeddings.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – PyTorch Lightning Trainer instance

setup(stage=None)#

PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to be initialized first. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.

Parameters:

stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.

training_step(dataloader_iter)#

We pass the dataloader iterator function to the micro-batch scheduler. The input batch for each micro-batch is fetched using the dataloader function inside the micro-batch forward function.

validation_step(dataloader_iter)#

Our dataloader produces a micro-batch at a time; we fetch a number of micro-batches from it, determined by the global batch size and model parallel size, to produce a list of micro-batches. The list of micro-batches is then piped through the pipeline using Megatron-Core fwd/bwd functions.

class nemo.collections.multimodal.speech_llm.models.modular_t5_models.ModularizedAudioT5Model(*args: Any, **kwargs: Any)#

Bases: MegatronT5LoraModel

Modularized speech T5 model.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – PyTorch Lightning Trainer instance

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

class nemo.collections.multimodal.speech_llm.models.modular_t5_models.DecoderTextPromptModularizedAudioT5Model(*args: Any, **kwargs: Any)#

Bases: ModularizedAudioT5Model

Modularized speech T5 model that feeds the text prompt to the decoder.

__init__(
cfg: omegaconf.dictconfig.DictConfig,
trainer: pytorch_lightning.trainer.trainer.Trainer,
)#

Base class from which all NeMo models should inherit

Parameters:
  • cfg (DictConfig) –

    configuration object. The cfg object should have (optionally) the following sub-configs:

    • train_ds - to instantiate training dataset

    • validation_ds - to instantiate validation dataset

    • test_ds - to instantiate testing dataset

    • optim - to instantiate optimizer with learning rate scheduler

  • trainer (Optional) – PyTorch Lightning Trainer instance

setup(stage=None)#

Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.

Parameters:

stage – fit, validate, test or predict

Modules#

class nemo.collections.multimodal.speech_llm.modules.perception_modules.AudioPerceptionModule(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Audio perception module that consists of audio encoder(s) and modality adapter.
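
Conceptually, the encoder output is projected into the LLM embedding space by the modality adapter. A shape-level sketch (torch.nn.Linear is a stand-in for the real adapter, and the dimensions are made up for illustration):

import torch

batch, frames, d_encoder, d_llm = 2, 100, 512, 4096
encoded = torch.randn(batch, frames, d_encoder)  # audio encoder output
adapter = torch.nn.Linear(d_encoder, d_llm)      # stand-in modality adapter
audio_embeddings = adapter(encoded)              # (2, 100, 4096), ready to be
                                                 # combined with text embeddings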

class nemo.collections.multimodal.speech_llm.modules.perception_modules.MultiAudioPerceptionModule(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Audio perception module that consists of multiple audio encoders and shared modality adapter. This module is experimental. An example perception cfg is:

perception:
    modality_adapter:
        _target_: nemo.collections.multimodal.speechllm.modules.PoolingMLPConnectors
        hidden_dim: 512
        pooling: 'cat'
        pooling_factor: 2
        num_layers: 4
        input_dim: -1
        output_dim: -1

    spec_augment:
        _target_: nemo.collections.asr.modules.SpectrogramAugmentation
        freq_masks: 2 # set to zero to disable it
        time_masks: 10 # set to zero to disable it
        freq_width: 27
        time_width: 0.05

    encoders:
        asr_model:
            _target_: nemo.collections.asr.models.ASRModel
            output_key: d_model
            freeze: True
            pretrained_model: stt_en_fastconformer_transducer_large
        ssl_model:
            _target_: nemo.collections.asr.models.SpeechEncDecSelfSupervisedModel
            output_key: d_model
            freeze: True
            pretrained_model: ssl_en_conformer_large
            use_multi_layer_feat: True
            multi_layer_feat:
                layer_idx_list: [0,16]
                aggregator:
                    mode: "cat"
                    pooling: "avg"
                    rounding: "floor"

        speaker_model:
            segment_length_in_secs: 0.4
            freeze: True
            pretrained_model: titanet_large

    ref_model: asr_model
    aggregator:
        mode: "cat"
        pooling: "mean"
        rounding: "floor"

class nemo.collections.multimodal.speech_llm.modules.TransformerCrossAttention(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Transformer module for cross-attention between speech and text embeddings. The module allows optional projection from the input embeddings to a lower dimension before feeding them to the transformer.

Parameters:
  • cfg – DictConfig, configuration object for the module, which should include:

    • xattn – DictConfig, configuration object for the transformer decoder
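
As a concept-only sketch of the mechanism (torch.nn.MultiheadAttention is a stand-in, not the module's actual internals): text embeddings act as queries while speech embeddings provide keys and values.

import torch

d_model = 512
# Stand-in for the module's transformer decoder layers.
xattn = torch.nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

text = torch.randn(2, 32, d_model)     # text embeddings (queries)
speech = torch.randn(2, 100, d_model)  # speech embeddings (keys/values)
out, _ = xattn(query=text, key=speech, value=speech)  # out: (2, 32, 512)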

Dataset Classes#

class nemo.collections.multimodal.speech_llm.data.audio_text_dataset.AudioTextDataset(*args: Any, **kwargs: Any)#

Bases: TextProcessing, Dataset

Dataset that loads tensors via a JSON manifest file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "1.wav", "duration": 1.12, "question": "what is the capital of France?", "answer": "Paris"}
{"audio_filepath": "2.wav", "duration": 2.15, "question": "what is the capital of Italy?", "answer": "Rome"}
Parameters:
  • manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.

  • tokenizer – text tokenizer object

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • max_duration – If audio exceeds this length, do not include in dataset

  • min_duration – If audio is less than this length, do not include in dataset

  • max_utts – Limit number of utterances

  • trim – whether or not to trim silence. Defaults to False

  • channel_selector (int | Iterable[int] | str) –

    select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.

    Note: below args are NLP-specific.

  • max_seq_length (int) – maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.

  • min_seq_length (int) – min length of each data example in the dataset. Data examples will be dropped if they do not meet the min length requirements.

  • add_bos (bool) – Whether to add a beginning of sentence token to each data example

  • add_eos (bool) – Whether to add an end of sentence token to each data example

  • add_sep (bool) – Whether to add a separation token to each data example (goes between prompt and answer)

  • tokens_to_generate (int) – (inference only) Number of tokens to generate during inference

  • seed (int) – Random seed for data shuffling. Defaults to 1234.

  • max_num_samples – Maximum number of samples to load. This can be > dataset length if you want to oversample data. If None, all samples will be loaded.

  • context_key – Key to use for the context in your JSONL file

  • answer_key – Key to use for the label in your JSONL file

  • separate_prompt_and_response_with_newline – Adds a newline between prompt and response.

  • answer_only_loss – If True, will compute the loss only on the answer part of the input. If False, will compute the loss on the entire input.

  • truncation_field – Field to be used for truncation if the combined length exceeds the max sequence length (options: “answer”, “context”).

  • pad_to_max_length – Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.

  • prompt_template

    Prompt template to inject via an f-string. Formatted like:

    Q: {input}\n\nA: {output}
    

  • end_string

    Optional[str] = None, if not None, add this string to the end of the answer.

    Note: below args are for miscellaneous purposes.

  • context_file – Optional[Union[List[str], str]] = None, if provided, random questions will be drawn from this file when a question is not in the manifest.

  • sample_alpha – Optional[float] = None, for SPE subword sampling

  • audio_locator – Optional[str] = None, a special string to split the context into multiple audio segments.

class nemo.collections.multimodal.speech_llm.data.audio_text_dataset.TarredAudioTextDataset(*args: Any, **kwargs: Any)#

Bases: TextProcessing, IterableDataset

A Dataset similar to AudioTextDataset, but which loads tarred audio files.

Accepts a single JSON manifest file, or a comma-separated list of them (in the same style as for the AudioTextDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

Note: For brace expansion in (1), there may be cases where {x..y} syntax cannot be used due to shell interference. This occurs most commonly inside SLURM scripts. Therefore we provide a few equivalent replacements. Supported opening braces - { <=> (, [, < and the special tag _OP_. Supported closing braces - } <=> ), ], > and the special tag _CL_. For SLURM based tasks, we suggest the use of the special tags for ease of use.
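
A minimal sketch of the tag-to-brace mapping (only the string rewrite; the actual expansion is performed by the dataset machinery):

# Illustrative only: SLURM-safe tags rewritten to standard brace syntax.
path = "path/to/audio__OP_1..100_CL_.tar.gz"
expanded = path.replace("_OP_", "{").replace("_CL_", "}")
print(expanded)  # path/to/audio_{1..100}.tar.gz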

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters:
  • audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).

  • manifest_filepath (str) – Path to the manifest.

  • parser (callable) – A callable which is used to pre-process the text output.

  • sample_rate (int) – Sample rate to resample loaded audio to

  • int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.

  • augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio

  • shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.

  • min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.

  • max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.

  • blank_index (int) – Blank character index, defaults to -1.

  • unk_index (int) – Unknown character index, defaults to -1.

  • normalize (bool) – Dataset parameter. Whether to use automatic text cleaning. It is highly recommended to manually clean text for best results. Defaults to True.

  • trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.

  • bos_id (id) – Dataset parameter. Beginning of string symbol id used for seq2seq models. Defaults to None.

  • eos_id (id) – Dataset parameter. End of string symbol id used for seq2seq models. Defaults to None.

  • pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.

  • shard_strategy (str) –

    Tarred dataset shard distribution strategy chosen as a str value during DDP (see the sketch after this parameter list).

    • scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.

    • replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on value of shuffle_n.

    Warning: the replicated strategy allows every node to sample the entire set of available tarfiles; therefore more than one node may sample the same tarfile, and even sample the same data points! As such, there is no assured guarantee that all samples in the dataset will be sampled at least once during 1 epoch. The scattered strategy, on the other hand, on specific occasions (when the number of shards is not divisible by world_size) will not sample the entire dataset. For these reasons it is not advisable to use tarred datasets as validation or test datasets.

  • shard_manifests (bool) – Whether or not to try to shard manifests. Defaults to False.

  • global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.

  • world_size (int) –

    Total number of processes, used for partitioning shards. Defaults to 0.

    Note: below args are NLP-specific.

  • max_seq_length (int) – maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.

  • min_seq_length (int) – min length of each data example in the dataset. Data examples will be dropped if they do not meet the min length requirements.

  • add_bos (bool) – Whether to add a beginning of sentence token to each data example

  • add_eos (bool) – Whether to add an end of sentence token to each data example

  • add_sep (bool) – Whether to add a separation token to each data example (goes between prompt and answer)

  • tokens_to_generate (int) – (inference only) Number of tokens to generate during inference

  • seed (int) – Random seed for data shuffling. Defaults to 1234.

  • context_key – Key to use for the context in your JSONL file

  • answer_key – Key to use for the label in your JSONL file

  • separate_prompt_and_response_with_newline – Adds a newline between prompt and response.

  • answer_only_loss – If True, will compute the loss only on the answer part of the input. If False, will compute the loss on the entire input.

  • truncation_field – Field to be used for truncation if the combined length exceeds the max sequence length (options: “answer”, “context”).

  • pad_to_max_length – Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.

  • prompt_template

    Prompt template to inject via an f-string. Formatted like:

    Q: {input}\n\nA: {output}
    

  • end_string

    Optional[str] = None, if not None, add this string to the end of the answer.

    Note: below args are for miscellaneous purposes.

  • context_file – Optional[Union[List[str], str]] = None, if provided, random questions will be drawn from this file when a question is not in the manifest.

  • sample_alpha – Optional[float] = None, for SPE subword sampling
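
Tying back to the shard_strategy entry above, a toy sketch of how scatter partitions shards across ranks (illustrative only; the variable names are hypothetical and the real partitioning lives inside the tarred dataset implementation):

shards = [f"audio_{i}.tar" for i in range(8)]
world_size, global_rank = 4, 1

# scatter: each rank keeps a disjoint, fixed subset of shards
shards_per_rank = len(shards) // world_size
my_shards = shards[global_rank * shards_per_rank:(global_rank + 1) * shards_per_rank]
print(my_shards)  # ['audio_2.tar', 'audio_3.tar']

# replicate: every rank keeps all shards, so samples may repeat across ranks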

nemo.collections.multimodal.speech_llm.data.audio_text_dataset.get_tarred_audio_text_dataset_from_config(
config: omegaconf.DictConfig,
tokenizer,
augmentor,
global_rank: int = 0,
world_size: int = 1,
sep_id: int | None = None,
answer_only_loss: bool = True,
virtual_tokens: int = 0,
)#

nemo.collections.multimodal.speech_llm.data.audio_text_dataset.get_audio_text_dataset_from_config(
manifest_filepath: str,
config: omegaconf.DictConfig,
tokenizer,
augmentor,
is_train,
sep_id: int | None = None,
answer_only_loss: bool = True,
virtual_tokens: int = 0,
)#

class nemo.collections.multimodal.speech_llm.data.lhotse_dataset.LhotseAudioQuestionAnswerDataset(*args: Any, **kwargs: Any)#

Bases: Dataset

This dataset is based on the Lhotse ASR dataset from audio_to_text_lhotse.py and TarredAudioQuestionAnswerDataset from audio_text_qa_dataset.py.

Unlike native NeMo datasets, a Lhotse dataset defines only the mapping from a CutSet (metadata) to a mini-batch of PyTorch tensors. Specifically, it performs tokenization, I/O, augmentation, and feature extraction (if any). Managing data, sampling, and de-duplication across workers/nodes is handled by Lhotse samplers instead.
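
A minimal sketch of this inversion of control (a toy class, not the actual LhotseAudioQuestionAnswerDataset; it assumes Lhotse's map-style convention of passing a whole CutSet to __getitem__ and uses lhotse's collate_audio helper):

import torch
from lhotse import CutSet
from lhotse.dataset.collation import collate_audio

class ToyLhotseBatchDataset(torch.utils.data.Dataset):
    """Maps a CutSet (metadata) to padded audio tensors; sampling is Lhotse's job."""

    def __getitem__(self, cuts: CutSet) -> dict:
        audio, audio_lens = collate_audio(cuts)  # performs I/O and padding
        return {"audio_signal": audio, "audio_signal_length": audio_lens}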

Parameters:
  • text_processor – TextProcessing object

  • default_context – Default question to use if no question is provided

  • tokens_to_generate – Number of tokens to generate during inference

  • pad_to_max_length – Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.

  • max_seq_length – Maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.

  • context_key – Key to use for the context in your JSONL file

  • default_context_key – Key to use for the default context in lhotse yaml

nemo.collections.multimodal.speech_llm.data.build_dataset.build_speechllm_dataset(model_instance, data_cfg, is_train)#

nemo.collections.multimodal.speech_llm.data.build_dataset.build_speechllm_dataloader(
dataset,
data_cfg,
consumed_samples=0,
is_predict=False,
is_eval=False,
)#

Build a dataloader given an input dataset.
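
A hedged usage sketch wiring the two builders together (the model variable and the config path model.cfg.data.train_ds are assumptions for illustration; argument order follows the signatures above):

from nemo.collections.multimodal.speech_llm.data.build_dataset import (
    build_speechllm_dataloader,
    build_speechllm_dataset,
)

# `model` is assumed to be an already-constructed SpeechLLM model whose
# config carries a train_ds sub-config (the exact path is an assumption).
train_ds = build_speechllm_dataset(model, model.cfg.data.train_ds, is_train=True)
train_dl = build_speechllm_dataloader(train_ds, model.cfg.data.train_ds,
                                      consumed_samples=0)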