SpeechLLM API#
Model Classes#
- class nemo.collections.nlp.models.language_modeling.megatron_base_model.MegatronBaseModel(*args: Any, **kwargs: Any)#
Bases: NLPModel
Megatron base class. All NeMo Megatron models inherit from this class.
- Initializes the model parallel world for NeMo.
- Turns on all of the NVIDIA optimizations.
- If cfg.tokenizer is available, loads the tokenizer and pads the vocab to the correct size for tensor model parallelism.
- If using the distributed optimizer, configures it to be compatible with O2-level optimizations and/or model parallelism.
- Performs gradient clipping: grad_clip_pl_default triggers the PyTorch Lightning default implementation, with_distributed_adam triggers the distributed optimizer’s implementation, megatron_amp_O2 triggers gradient clipping on the main grads, and otherwise gradient clipping is performed on the model grads.
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer, no_lm_init=True)#
Base class from which all NeMo models should inherit
- Parameters:
cfg (DictConfig) – configuration object. The cfg object should optionally have the following sub-configs (see the sketch after this list):
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
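For orientation, here is a minimal sketch of how a cfg object with these sub-configs can be assembled with OmegaConf. The field names inside each sub-config (manifest paths, batch sizes, scheduler settings) are illustrative placeholders, not an authoritative schema; the exact fields depend on the concrete model and dataset classes.

from omegaconf import OmegaConf

# Illustrative only: sub-config fields are placeholders, not a full schema.
cfg = OmegaConf.create({
    "train_ds": {"manifest_filepath": "train_manifest.json", "batch_size": 4},
    "validation_ds": {"manifest_filepath": "val_manifest.json", "batch_size": 4},
    "test_ds": {"manifest_filepath": "test_manifest.json", "batch_size": 4},
    "optim": {
        "name": "adamw",
        "lr": 1e-4,
        "sched": {"name": "CosineAnnealing", "warmup_steps": 100},
    },
})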
- class nemo.collections.multimodal.speech_llm.models.modular_models.ModularAudioGPTModel(*args: Any, **kwargs: Any)#
Bases: SpeechLLMAdapterMixin, MegatronGPTSFTModel
Modularized speech GPT model.
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)#
Base class from which all NeMo models should inherit
- Parameters:
cfg (DictConfig) – configuration object. The cfg object should optionally have the following sub-configs (see the sketch after this list):
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
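As a rough usage sketch (not the official training recipe), the model can be constructed from a loaded config and a Lightning Trainer. In practice, NeMo’s Megatron-based models are launched through the provided training scripts, which set up a Megatron-aware Trainer; the config file path below is hypothetical.

import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy
from nemo.collections.multimodal.speech_llm.models.modular_models import ModularAudioGPTModel

# Hypothetical config file; real configs ship with the NeMo examples.
cfg = OmegaConf.load("modular_audio_gpt_config.yaml")
trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    max_steps=1000,
    strategy=NLPDDPStrategy(),  # Megatron-aware DDP strategy
)
model = ModularAudioGPTModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)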
- setup(stage=None)#
PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to be initialized before they can be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters:
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- training_step(dataloader_iter)#
We pass the dataloader iterator function to the micro-batch scheduler. The input for each micro-batch is fetched using the dataloader function inside the micro-batch fwd function.
- validation_step(dataloader_iter)#
Our dataloaders produce one micro-batch at a time; we fetch a number of micro-batches from the dataloader, determined by the global batch size and model parallel size, to produce a list of micro-batches. The list of micro-batches is then piped through the pipeline using megatron-core fwd/bwd functions.
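The number of micro-batches drained from the iterator per step follows the usual Megatron relation between global batch size, micro-batch size, and data-parallel size. A quick illustration of the arithmetic (illustrative numbers, not NeMo code):

# How many micro-batches one step consumes from the dataloader iterator.
global_batch_size = 128     # samples per optimizer step across all replicas
micro_batch_size = 4        # samples per forward/backward pass
data_parallel_size = 8      # number of data-parallel replicas

assert global_batch_size % (micro_batch_size * data_parallel_size) == 0
num_micro_batches = global_batch_size // (micro_batch_size * data_parallel_size)
print(num_micro_batches)  # -> 4 micro-batches piped through the fwd/bwd functions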
- class nemo.collections.multimodal.speech_llm.models.modular_models.CrossAttendModularAudioGPTModel(*args: Any, **kwargs: Any)#
Bases: ModularAudioGPTModel
Modularized speech GPT model.
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)#
Base class from which all NeMo models should inherit
- Parameters:
cfg (DictConfig) – configuration object. The cfg object should optionally have the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
- setup(stage=None)#
PTL hook that is executed after DDP spawns. We set up datasets here, as Megatron datasets require DDP to be initialized before they can be instantiated. See https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#setup for more information.
- Parameters:
stage (str, optional) – Can be ‘fit’, ‘validate’, ‘test’ or ‘predict’. Defaults to None.
- training_step(dataloader_iter)#
We pass the dataloader iterator function to the micro-batch scheduler. The input for each micro-batch is fetched using the dataloader function inside the micro-batch fwd function.
- validation_step(dataloader_iter)#
Our dataloaders produce one micro-batch at a time; we fetch a number of micro-batches from the dataloader, determined by the global batch size and model parallel size, to produce a list of micro-batches. The list of micro-batches is then piped through the pipeline using megatron-core fwd/bwd functions.
- class nemo.collections.multimodal.speech_llm.models.modular_t5_models.ModularizedAudioT5Model(*args: Any, **kwargs: Any)#
Bases: MegatronT5LoraModel
Modularized audio T5 model.
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)#
Base class from which all NeMo models should inherit
- Parameters:
cfg (DictConfig) – configuration object. The cfg object should optionally have the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
- setup(stage=None)#
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters:
stage – fit, validate, test or predict
- class nemo.collections.multimodal.speech_llm.models.modular_t5_models.DecoderTextPromptModularizedAudioT5Model(*args: Any, **kwargs: Any)#
Bases: ModularizedAudioT5Model
Modularized audio T5 model.
- __init__(cfg: omegaconf.dictconfig.DictConfig, trainer: pytorch_lightning.trainer.trainer.Trainer)#
Base class from which all NeMo models should inherit
- Parameters:
cfg (DictConfig) – configuration object. The cfg object should optionally have the following sub-configs:
train_ds - to instantiate training dataset
validation_ds - to instantiate validation dataset
test_ds - to instantiate testing dataset
optim - to instantiate optimizer with learning rate scheduler
trainer (Optional) – Pytorch Lightning Trainer instance
- setup(stage=None)#
Called at the beginning of fit, validate, test, or predict. This is called on every process when using DDP.
- Parameters:
stage – fit, validate, test or predict
Modules#
- class nemo.collections.multimodal.speech_llm.modules.perception_modules.AudioPerceptionModule(*args: Any, **kwargs: Any)#
Bases: NeuralModule, Exportable
Audio perception module that consists of audio encoder(s) and modality adapter.
- class nemo.collections.multimodal.speech_llm.modules.perception_modules.MultiAudioPerceptionModule(*args: Any, **kwargs: Any)#
Bases: NeuralModule, Exportable
Audio perception module that consists of multiple audio encoders and shared modality adapter. This module is experimental. An example perception cfg is:
perception:
  modality_adapter:
    _target_: nemo.collections.multimodal.speechllm.modules.PoolingMLPConnectors
    hidden_dim: 512
    pooling: 'cat'
    pooling_factor: 2
    num_layers: 4
    input_dim: -1
    output_dim: -1
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2 # set to zero to disable it
    time_masks: 10 # set to zero to disable it
    freq_width: 27
    time_width: 0.05
  encoders:
    asr_model:
      _target_: nemo.collections.asr.models.ASRModel
      output_key: d_model
      freeze: True
      pretrained_model: stt_en_fastconformer_transducer_large
    ssl_model:
      _target_: nemo.collections.asr.models.SpeechEncDecSelfSupervisedModel
      output_key: d_model
      freeze: True
      pretrained_model: ssl_en_conformer_large
      use_multi_layer_feat: True
      multi_layer_feat:
        layer_idx_list: [0, 16]
        aggregator:
          mode: "cat"
          pooling: "avg"
          rounding: "floor"
    speaker_model:
      segment_length_in_secs: 0.4
      freeze: True
      pretrained_model: titanet_large
    ref_model: asr_model
    aggregator:
      mode: "cat"
      pooling: "mean"
      rounding: "floor"
- class nemo.collections.multimodal.speech_llm.modules.TransformerCrossAttention(*args: Any, **kwargs: Any)#
Bases: NeuralModule, Exportable
Transformer module for cross-attention between speech and text embeddings. The module allows optional projection from the input embeddings to a lower dimension before feeding them to the transformer.
- Parameters:
cfg – DictConfig, configuration object for the module which should include:
xattn – DictConfig, configuration object for the transformer decoder
Dataset Classes#
- class nemo.collections.multimodal.speech_llm.data.audio_text_dataset.AudioTextDataset(*args: Any, **kwargs: Any)#
Bases: TextProcessing, Dataset
Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:
{"audio_filepath": "1.wav", "duration": 1.12, "question": "what is the capital of France?", "answer": "Paris"} {"audio_filepath": "2.wav", "duration": 2.15, "question": "what is the capital of Italy?", "answer": "Rome"}
- Parameters:
manifest_filepath – Path to manifest json as described above. Can be comma-separated paths.
tokenizer – text tokenizer object
sample_rate (int) – Sample rate to resample loaded audio to
int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.
augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio
max_duration – If audio exceeds this length, do not include in dataset
min_duration – If audio is less than this length, do not include in dataset
max_utts – Limit number of utterances
trim – whether or not to trim silence. Defaults to False
channel_selector (int | Iterable[int] | str) –
select a single channel or a subset of channels from multi-channel audio. If set to ‘average’, it performs averaging across channels. Disabled if set to None. Defaults to None. Uses zero-based indexing.
- note:
below args are NLP-specific
max_seq_length (int) – maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
min_seq_length (int) – min length of each data example in the dataset. Data examples will be dropped if they do not meet the min length requirements.
add_bos (bool) – Whether to add a beginning of sentence token to each data example
add_eos (bool) – Whether to add an end of sentence token to each data example
add_sep (bool) – Whether to add a separation token to each data example (goes between prompt and answer)
tokens_to_generate (int) – (inference only) Number of tokens to generate during inference
seed (int) – Random seed for data shuffling. Defaults to 1234.
max_num_samples – Maximum number of samples to load. This can be > dataset length if you want to oversample data. If None, all samples will be loaded.
context_key – Key to use for the context in your JSONL file
answer_key – Key to use for the label in your JSONL file
separate_prompt_and_response_with_newline – Adds a newline between prompt and response.
answer_only_loss – If True, will compute the loss only on the answer part of the input. If False, will compute the loss on the entire input.
truncation_field – Field to be used for truncation if the combined length exceeds the max sequence length (options: “answer”, “context”).
pad_to_max_length – Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.
prompt_template –
Prompt template to inject via an fstring. Formatted like:
Q: {input}\n\nA: {output}
end_string –
Optional[str] = None, if not None, add this string to the end of the answer.
- note:
below args are for miscellaneous purposes
context_file – Optional[Union[List[str], str]] = None, if provided, random questions will be drawn from this file when a question is not present in the manifest.
sample_alpha – Optional[float] = None, for SPE subword sampling
audio_locator – Optional[str] = None, a special string to split the context into multiple audio segments.
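A hedged construction sketch using the parameters above. In practice these datasets are usually built via get_audio_text_dataset_from_config (documented below); the tokenizer model path here is hypothetical, and the keyword values are illustrative.

from nemo.collections.common.tokenizers.sentencepiece_tokenizer import SentencePieceTokenizer
from nemo.collections.multimodal.speech_llm.data.audio_text_dataset import AudioTextDataset

tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")  # hypothetical path
dataset = AudioTextDataset(
    manifest_filepath="manifest.json",
    tokenizer=tokenizer,
    sample_rate=16000,
    max_duration=20.0,   # drop clips longer than 20 s
    min_duration=0.1,    # drop clips shorter than 0.1 s
    max_seq_length=512,
    add_bos=True,
    add_eos=True,
    context_key="question",
    answer_key="answer",
)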
- class nemo.collections.multimodal.speech_llm.data.audio_text_dataset.TarredAudioTextDataset(*args: Any, **kwargs: Any)#
Bases: TextProcessing, IterableDataset
A similar Dataset to the AudioTextDataset, but which loads tarred audio files.
Accepts a single comma-separated JSON manifest file (in the same style as for the AudioTextDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.
Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].
Note: For brace expansion in (1), there may be cases where {x..y} syntax cannot be used due to shell interference. This occurs most commonly inside SLURM scripts. Therefore we provide a few equivalent replacements. Supported opening braces - { <=> (, [, < and the special tag _OP_. Supported closing braces - } <=> ), ], > and the special tag _CL_. For SLURM based tasks, we suggest the use of the special tags for ease of use.
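For example, the following three values of audio_tar_filepaths point at the same 100 shards; the tag form survives SLURM/shell interpolation untouched:

# Equivalent shard specifications (the last two avoid literal braces):
audio_tar_filepaths = "path/to/audio_{1..100}.tar"        # plain brace expansion
audio_tar_filepaths = "path/to/audio_(1..100).tar"        # ( <=> {  and  ) <=> }
audio_tar_filepaths = "path/to/audio__OP_1..100_CL_.tar"  # special tags, SLURM-safe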
See the WebDataset documentation for more information about accepted data and input formats.
If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven.
Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.
- Parameters:
audio_tar_filepaths – Either a list of audio tarball filepaths, or a string (can be brace-expandable).
manifest_filepath (str) – Path to the manifest.
parser (callable) – A callable which is used to pre-process the text output.
sample_rate (int) – Sample rate to resample loaded audio to
int_values (bool) – If true, load samples as 32-bit integers. Defaults to False.
augmentor (nemo.collections.asr.parts.perturb.AudioAugmentor) – An AudioAugmentor object used to augment loaded audio
shuffle_n (int) – How many samples to look ahead and load to be shuffled. See WebDataset documentation for more details. Defaults to 0.
min_duration (float) – Dataset parameter. All training files which have a duration less than min_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to 0.1.
max_duration (float) – Dataset parameter. All training files which have a duration more than max_duration are dropped. Note: Duration is read from the manifest JSON. Defaults to None.
blank_index (int) – Blank character index, defaults to -1.
unk_index (int) – Unknown character index, defaults to -1.
normalize (bool) – Dataset parameter. Whether to use automatic text cleaning. It is highly recommended to manually clean text for best results. Defaults to True.
trim (bool) – Whether to trim silence from the beginning and end of the audio signal using librosa.effects.trim(). Defaults to False.
bos_id (id) – Dataset parameter. Beginning of string symbol id used for seq2seq models. Defaults to None.
eos_id (id) – Dataset parameter. End of string symbol id used for seq2seq models. Defaults to None.
pad_id (id) – Token used to pad when collating samples in batches. If this is None, pads using 0s. Defaults to None.
shard_strategy (str) – Tarred dataset shard distribution strategy, chosen as a str value during ddp (see the sketch after this parameter list).
scatter: The default shard strategy applied by WebDataset, where each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime.
replicate: Optional shard strategy, where each node gets all of the set of shards available in the tarred dataset, which are permanently pre-allocated and never changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on the value of shuffle_n.
- warning:
The replicate strategy allows every node to sample the entire set of available tarfiles, so more than one node may sample the same tarfile, and even the same data points. As such, there is no guarantee that all samples in the dataset will be sampled at least once during one epoch. The scatter strategy, on the other hand, will not sample the entire dataset when the number of shards is not divisible by world_size. For these reasons it is not advisable to use tarred datasets as validation or test datasets.
shard_manifests (bool) – Whether or not to try to shard manifests. Defaults to False.
global_rank (int) – Worker rank, used for partitioning shards. Defaults to 0.
world_size (int) –
Total number of processes, used for partitioning shards. Defaults to 0.
- note:
Below args are NLP-specific
max_seq_length (int) – maximum sequence length for each dataset example. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
min_seq_length (int) – min length of each data example in the dataset. Data examples will be dropped if they do not meet the min length requirements.
add_bos (bool) – Whether to add a beginning of sentence token to each data example
add_eos (bool) – Whether to add an end of sentence token to each data example
add_sep (bool) – Whether to add a separation token to each data example (goes between prompt and answer)
tokens_to_generate (int) – (inference only) Number of tokens to generate during inference
seed (int) – Random seed for data shuffling. Defaults to 1234.
context_key – Key to use for the context in your JSONL file
answer_key – Key to use for the label in your JSONL file
separate_prompt_and_response_with_newline – Adds a newline between prompt and response.
answer_only_loss – If True, will compute the loss only on the answer part of the input. If False, will compute the loss on the entire input.
truncation_field – Field to be used for truncation if the combined length exceeds the max sequence length (options: “answer”, “context”).
pad_to_max_length – Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.
prompt_template –
Prompt template to inject via an fstring. Formatted like:
Q: {input}\n\nA: {output}
end_string –
Optional[str] = None, if not None, add this string to the end of the answer.
- note:
Below args are for miscellaneous purposes
context_file – Optional[Union[List[str], str]] = None, if provided, random questions will be drawn from this file when a question is not present in the manifest.
sample_alpha – Optional[float] = None, for SPE subword sampling
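To make the shard_strategy parameter concrete, here is an illustrative sketch of the partitioning idea; this is not the library’s internal code, only the behavior it describes.

# Illustrative sketch of scatter vs. replicate shard distribution.
def partition_shards(shard_paths, shard_strategy, global_rank, world_size):
    if shard_strategy == "replicate":
        return list(shard_paths)  # every rank samples from every shard
    # "scatter": each rank gets a disjoint, fixed subset of the shards
    if len(shard_paths) % world_size != 0:
        print("warning: shards are not evenly divisible across workers")
    return shard_paths[global_rank::world_size]

shards = [f"audio_{i}.tar" for i in range(8)]
print(partition_shards(shards, "scatter", global_rank=1, world_size=4))
# -> ['audio_1.tar', 'audio_5.tar']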
- nemo.collections.multimodal.speech_llm.data.audio_text_dataset.get_tarred_audio_text_dataset_from_config(config: omegaconf.DictConfig, tokenizer, augmentor, global_rank: int = 0, world_size: int = 1, sep_id: int | None = None, answer_only_loss: bool = True, virtual_tokens: int = 0)#
- nemo.collections.multimodal.speech_llm.data.audio_text_dataset.get_audio_text_dataset_from_config(manifest_filepath: str, config: omegaconf.DictConfig, tokenizer, augmentor, is_train, sep_id: int | None = None, answer_only_loss: bool = True, virtual_tokens: int = 0)#
- class nemo.collections.multimodal.speech_llm.data.lhotse_dataset.LhotseAudioQuestionAnswerDataset(*args: Any, **kwargs: Any)#
Bases: Dataset
This dataset is based on the Lhotse ASR dataset from audio_to_text_lhotse.py and TarredAudioQuestionAnswerDataset from audio_text_qa_dataset.py.
Unlike native NeMo datasets, a Lhotse dataset defines only the mapping from a CutSet (metadata) to a mini-batch of PyTorch tensors. Specifically, it performs tokenization, I/O, augmentation, and feature extraction (if any). Managing data, sampling, and de-duplication across workers/nodes is all handled by Lhotse samplers instead (see the sketch after the parameter list below).
- Parameters:
text_processor – TextProcessing object
default_context – Default question to use if no question is provided
tokens_to_generate – Number of tokens to generate during inference
pad_to_max_length – Whether to pad the input to the max sequence length. If False, will pad to the max length of the current batch.
max_seq_length – Maximum sequence length for each dataset examples. Examples will either be truncated to fit this length or dropped if they cannot be truncated.
context_key – Key to use for the context in your JSONL file
default_context_key – Key to use for the default context in lhotse yaml
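A hedged end-to-end sketch of the Lhotse pattern described above: a sampler built from a CutSet drives the DataLoader, while this dataset class only maps each sampled CutSet to tensors. The text_processor object and the cuts manifest path are assumptions; the TextProcessing object must be built elsewhere.

from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler
from nemo.collections.multimodal.speech_llm.data.lhotse_dataset import LhotseAudioQuestionAnswerDataset

cuts = CutSet.from_file("cuts.jsonl.gz")  # hypothetical Lhotse manifest
sampler = DynamicBucketingSampler(cuts, max_duration=120.0, shuffle=True)
dataset = LhotseAudioQuestionAnswerDataset(
    text_processor=text_processor,  # assumed: a TextProcessing object built elsewhere
    default_context="what does the audio say?",
    tokens_to_generate=128,
    pad_to_max_length=False,
    max_seq_length=512,
)
# Lhotse samplers yield whole mini-batches, so batch_size=None here.
loader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)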
- nemo.collections.multimodal.speech_llm.data.build_dataset.build_speechllm_dataset(model_instance, data_cfg, is_train)#
- nemo.collections.multimodal.speech_llm.data.build_dataset.build_speechllm_dataloader(dataset, data_cfg, consumed_samples=0, is_predict=False, is_eval=False)#
Build a dataloader given an input dataset.
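A hedged sketch of the intended call pattern, assuming model and cfg objects like those in the earlier sketches; the config path model.data.train_ds is an assumption about the config layout, not a documented contract.

from nemo.collections.multimodal.speech_llm.data.build_dataset import (
    build_speechllm_dataset,
    build_speechllm_dataloader,
)

# Assumes `model` is a constructed SpeechLLM model and `cfg` its loaded config.
train_ds = build_speechllm_dataset(model, cfg.model.data.train_ds, is_train=True)
train_dl = build_speechllm_dataloader(train_ds, cfg.model.data.train_ds, consumed_samples=0)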