Configuration Files
The SpeechLM2 models use YAML configuration files to define model architecture, training parameters, and data settings. This page describes the configuration structure and important parameters for each model type in the collection.
Configuration Structure
SpeechLM2 configuration files typically have the following high-level structure:
model:
  # Model architecture settings
  ...
trainer:
  # PyTorch Lightning trainer settings
  ...
exp_manager:
  # Experiment logging settings
  ...
data:
  # Dataset settings
  ...
SALM Configuration
The SALM (Speech-Augmented Language Model) configuration includes settings for the pretrained LLM, audio perception module, and training parameters. See the SALM paper for more details.
model:
  # Pretrained model paths
  pretrained_llm: "TinyLlama/TinyLlama_v1.1" # HF model path
  pretrained_asr: "stt_en_fastconformer_hybrid_large_streaming_80ms" # NeMo checkpoint name
  pretrained_weights: True # Whether to load weights or just architecture
  # Fine-tune from a previous training checkpoint (weights only, fresh optimizer)
  init_from_checkpoint: null # path to .ckpt, DCP dir, or HF dir
  # Special token settings
  audio_locator_tag: "<audio>" # Tag to replace with audio embeddings
  # Freezing parameters
  freeze_params:
    - "^llm\\.model\\.layers\\.[0-4]\\..+$" # Regex patterns for parameters to freeze
  prevent_freeze_params: [] # Override freeze_params for specific submodules
  # Optional LoRA settings for efficient fine-tuning
  lora:
    task_type: CAUSAL_LM
    r: 8
    lora_alpha: 32
    lora_dropout: 0.1
  # Audio perception module configuration
  perception:
    target: nemo.collections.speechlm2.modules.perception.AudioPerceptionModule
    preprocessor:
      normalize: 'NA'
    encoder:
      self_attention_model: rel_pos
      att_context_size: [-1, -1]
      conv_context_size: regular
      conv_norm_type: batch_norm
    modality_adapter:
      _target_: nemo.collections.asr.modules.ConformerEncoder
      feat_in: 1024
      feat_out: -1
      n_layers: 2
      d_model: 1024
      subsampling: dw_striding
      subsampling_factor: 1
      subsampling_conv_channels: 256
      causal_downsampling: false
      ff_expansion_factor: 4
      self_attention_model: rel_pos
      n_heads: 8
      att_context_size: [-1, -1]
      att_context_style: regular
      xscaling: true
      untie_biases: true
      pos_emb_max_len: 5000
      conv_kernel_size: 9
      conv_norm_type: batch_norm
      conv_context_size: null
      dropout: 0
      dropout_pre_encoder: 0
      dropout_emb: 0.0
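The interaction between freeze_params and prevent_freeze_params can be sketched with plain regular expressions. This is a minimal illustration rather than the NeMo implementation: the is_frozen helper and the parameter names are hypothetical, and it assumes that a match in prevent_freeze_params takes priority over a match in freeze_params.

```python
import re

# Patterns as they look after YAML parsing: the doubled backslashes in the
# double-quoted YAML string collapse to single ones.
freeze_params = [r"^llm\.model\.layers\.[0-4]\..+$"]
prevent_freeze_params = []


def is_frozen(name, freeze_patterns, keep_patterns):
    """Assumed rule: keep patterns win; otherwise any freeze match freezes."""
    if any(re.match(p, name) for p in keep_patterns):
        return False
    return any(re.match(p, name) for p in freeze_patterns)


# Hypothetical parameter names, chosen only to exercise the pattern.
names = [
    "llm.model.layers.3.self_attn.q_proj.weight",
    "llm.model.layers.12.mlp.up_proj.weight",
    "perception.modality_adapter.layers.0.norm.weight",
]
frozen = {n: is_frozen(n, freeze_params, prevent_freeze_params) for n in names}
for n, f in frozen.items():
    print("frozen   " if f else "trainable", n)
```

Note that `[0-4]\.` matches only single-digit layer indices 0 through 4, so layer 12 stays trainable even though its index starts with a 1.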
SALMAutomodel Configuration
The SALMAutomodel configuration extends the SALM configuration with NeMo Automodel
support. The key difference is use_nemo_automodel: true and the use of
AutomodelParallelStrategy instead of DDPStrategy.
The example below shows a configuration for training with NVIDIA Nemotron Nano V3 MoE as the LLM backbone, with Expert Parallelism across 8 GPUs:
model:
  use_nemo_automodel: true # Selects SALMAutomodel in salm_train.py
  pretrained_llm: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
  pretrained_asr: "nvidia/canary-1b-flash"
  pretrained_weights: True
  freeze_params:
    - "^llm\\..+$"
    - "^perception\\.preprocessor\\..+$"
    - "^perception\\.encoder\\..+$"
  prevent_freeze_params: []
  # LoRA uses Automodel-native format (not HF PEFT):
  # lora:
  #   dim: 128
  #   alpha: 256
  #   dropout: 0.01
  #   target_modules: ["q_proj", "v_proj"]
  perception:
    target: nemo.collections.speechlm2.modules.perception.AudioPerceptionModule
    output_dim: 2048
    modality_adapter:
      _target_: nemo.collections.speechlm2.modules.perception.IdentityConnector
      d_model: 1024
trainer:
  strategy:
    _target_: nemo.collections.speechlm2.parts.parallel.AutomodelParallelStrategy
    ep_size: 8 # Expert Parallelism across 8 GPUs for MoE
    # tp_size: 1
    # dp_size: null # inferred
NeMo Automodel applies MoE-specific optimizations automatically when an MoE model is detected:
Grouped GEMM — fuses expert computations into a single batched matrix multiply for higher GPU throughput.
DeepEP (Deep Expert Parallelism) — efficient all-to-all expert routing across GPUs, minimizing communication overhead for MoE layers.
Note the differences from the SALM configuration:
model.use_nemo_automodel: true selects SALMAutomodel in the training script.
model.pretrained_llm can point to MoE models like nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.
trainer.strategy._target_ uses AutomodelParallelStrategy instead of ModelParallelStrategy.
ep_size controls Expert Parallelism on the FSDP data-parallel axis: dense layers are sharded via FSDP2, while MoE layers use EP for expert routing on the same GPUs.
LoRA config uses dim/alpha keys (Automodel native) instead of r/lora_alpha (HF PEFT).
No embed_tokens freeze pattern: embeddings stay inside the LLM.
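The LoRA key difference can be illustrated with a small conversion sketch. The peft_to_automodel_lora helper is hypothetical and not part of NeMo. The r to dim and lora_alpha to alpha renames come from the list above, the lora_dropout to dropout rename is inferred from the commented example, and dropping keys without a known counterpart (such as task_type) is an assumption.

```python
# Key renames inferred from the two LoRA examples on this page.
PEFT_TO_AUTOMODEL = {"r": "dim", "lora_alpha": "alpha", "lora_dropout": "dropout"}


def peft_to_automodel_lora(peft_cfg):
    """Rename HF PEFT-style LoRA keys to Automodel-style keys.

    Hypothetical helper: target_modules is passed through unchanged and
    keys without a known counterpart (e.g. task_type) are dropped.
    """
    out = {}
    for key, value in peft_cfg.items():
        if key in PEFT_TO_AUTOMODEL:
            out[PEFT_TO_AUTOMODEL[key]] = value
        elif key == "target_modules":
            out[key] = value
    return out


salm_lora = {"task_type": "CAUSAL_LM", "r": 8, "lora_alpha": 32, "lora_dropout": 0.1}
automodel_lora = peft_to_automodel_lora(salm_lora)
print(automodel_lora)  # {'dim': 8, 'alpha': 32, 'dropout': 0.1}
```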
DuplexS2SModel Configuration
The DuplexS2SModel adds speech generation capabilities to the configuration:
model:
  # Pretrained model paths
  pretrained_llm: "TinyLlama/TinyLlama_v1.1"
  pretrained_audio_codec: "path/to/audio_codec.nemo"
  pretrained_asr: "stt_en_fastconformer_hybrid_large_streaming_80ms"
  scoring_asr: "stt_en_fastconformer_transducer_large" # used only in validation
  # Loss weights
  audio_loss_weight: 4
  text_loss_weight: 3
  # Perception module config (similar to SALM)
  perception:
    # ... (similar to SALM perception module)
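As a rough illustration of what the loss weights do, the sketch below combines a text loss and an audio loss as a weighted sum. This assumes a simple linear combination; the combine_losses function and the example loss values are hypothetical, and the model's actual loss computation may differ.

```python
def combine_losses(text_loss, audio_loss, text_loss_weight=3.0, audio_loss_weight=4.0):
    """Weighted sum of the two loss terms (assumed form, defaults from the config above)."""
    return text_loss_weight * text_loss + audio_loss_weight * audio_loss


# Made-up per-batch losses, only to show the arithmetic: 3 * 2.0 + 4 * 1.5.
total = combine_losses(text_loss=2.0, audio_loss=1.5)
print(total)  # 12.0
```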
DuplexS2SSpeechDecoderModel Configuration
The DuplexS2SSpeechDecoderModel is similar to DuplexS2SModel, but adds a dedicated transformer decoder for speech generation:
model:
  # Pretrained model paths
  pretrained_llm: "TinyLlama/TinyLlama_v1.1"
  pretrained_audio_codec: "path/to/audio_codec.nemo"
  pretrained_asr: "stt_en_fastconformer_hybrid_large_streaming_80ms"
  # Speech decoder settings
  speech_decoder:
    target: nemo.collections.speechlm2.modules.speech_generation.TransformerARSpeechDecoder
    d_model: 1024
    n_layers: 12
    n_heads: 16
    d_kv: 64
    d_ff: 4096
    max_seq_len: 2048
    dropout: 0.1
    layernorm_epsilon: 1e-5
    activation_function: "gelu_new"
    init_method_std: 0.02
    use_cache: True
  # ... other settings
DuplexSTTModel Configuration
The DuplexSTTModel is a speech-to-text model that processes duplex audio conversations and generates agent text responses:
model:
  # Pretrained model paths
  pretrained_llm: "TinyLlama/TinyLlama_v1.1"
  pretrained_asr: "stt_en_fastconformer_hybrid_large_streaming_80ms"
  # ... other settings
Trainer Configuration
The trainer section contains PyTorch Lightning Trainer settings:
trainer:
  devices: 1
  num_nodes: 1
  accelerator: gpu
  precision: bf16-true
  logger: false
  enable_checkpointing: false # handled by exp_manager
  replace_sampler_ddp: false # handled by lhotse (this flag is named use_distributed_sampler in Lightning 2.0 and later)
  max_epochs: null
  max_steps: 100000
  log_every_n_steps: 10
  val_check_interval: 2000
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
Experiment Manager Configuration
The exp_manager section contains settings for experiment logging and model checkpointing:
exp_manager:
  explicit_log_dir: path/to/output_dir
  exp_dir: null
  name: ${name}
  create_wandb_logger: false # set to true if you want to use wandb
  wandb_logger_kwargs:
    project: null
    name: null
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: val_loss
    filename: "{step}" # checkpoint name will be step=<step>.ckpt
    save_top_k: 1
    mode: min
  create_tensorboard_logger: false # set to true if you want to use tensorboard
  version: null
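The comment on filename says that "{step}" becomes step=<step>.ckpt. The sketch below mimics that expansion with the standard library only. The checkpoint_name function is a hypothetical stand-in for illustration and does not call the actual checkpoint callback.

```python
import re


def checkpoint_name(template, metrics, extension=".ckpt"):
    """Expand each "{name}" placeholder to "name=<value>" and add the extension."""
    # "{step}" -> "step={step}", then fill in the values with str.format.
    expanded = re.sub(r"\{(\w+)", r"\1={\1", template)
    return expanded.format(**metrics) + extension


name = checkpoint_name("{step}", {"step": 6375})
print(name)  # step=6375.ckpt
```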
Data Configuration
The data section defines dataset paths, preprocessing, and data loading parameters:
data:
  train_ds:
    sample_rate: ${data.target_sample_rate}
    input_cfg:
      - type: lhotse_shar
        shar_path: /path/to/train_data
    seed: 42
    shard_seed: "randomized"
    num_workers: 4
    batch_size: 16
    # Optional bucketing settings
    # batch_duration: 100
    # bucket_duration_bins: [8.94766,10.1551,11.64118,19.30376,42.85]
    # use_bucketing: true
    # num_buckets: 5
    # bucket_buffer_size: 5000
  validation_ds:
    datasets:
      val_set_name:
        shar_path: /path/to/validation_data
    sample_rate: ${data.target_sample_rate}
    batch_size: 1
    seed: 42
    shard_seed: "randomized"
Depending on the model, additional options may be available under the data namespace; these are passed to the corresponding Dataset class.
For example, S2S models have:
data:
  frame_length: 0.08
  source_sample_rate: 16000
  target_sample_rate: 22050
  input_roles: ["user", "User"]
  output_roles: ["agent", "Assistant"]
  train_ds: ...
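To make the relationship between frame_length and the two sample rates concrete, the following sketch computes how many audio samples fall into one frame at each rate. It is plain arithmetic on the values above; the samples_per_frame helper is hypothetical.

```python
def samples_per_frame(frame_length, sample_rate):
    """Number of audio samples covered by one frame of frame_length seconds."""
    return round(frame_length * sample_rate)


frame_length = 0.08  # seconds, from the config above
source_samples = samples_per_frame(frame_length, 16000)  # input (source) audio
target_samples = samples_per_frame(frame_length, 22050)  # output (target) audio
frames_per_second = 1 / frame_length

print(source_samples, target_samples)  # 1280 1764
```

With these settings, one second of audio corresponds to 12.5 frames on both the input and output side.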
Important Configuration Parameters
Model Parameters
pretrained_llm: Path to the pretrained HuggingFace LLM
pretrained_asr: Name of the pretrained NeMo ASR model used for perception
pretrained_audio_codec: Path to the pretrained audio codec model (for speech generation)
init_from_checkpoint: Path to a training checkpoint to initialize model weights from (see Fine-Tuning from a Previous Checkpoint below)
freeze_params: Regex patterns of parameters to freeze during training
audio_loss_weight/text_loss_weight: Weighting of different loss components
Perception Module
self_attention_model: Type of attention mechanism (“rel_pos” or “abs_pos”)
att_context_size: Context window size for attention ([left, right])
conv_context_size: Context type for convolutions (“causal” or “regular”)
n_layers: Number of encoder layers
d_model: Model dimension size
Data Parameters
frame_length: Frame duration in seconds
source_sample_rate/target_sample_rate: Sample rates for input/output audio
input_roles/output_roles: Speaker roles for input and output
batch_size: Number of samples per batch
use_bucketing: Whether to use length-based bucketing for efficient batching
Example Configuration Files
Example configurations for all model types can be found in the examples directory:
SALM: examples/speechlm2/conf/salm.yaml
SALMAutomodel: examples/speechlm2/conf/salm_automodel.yaml
DuplexS2SModel: examples/speechlm2/conf/s2s_duplex.yaml
DuplexS2SSpeechDecoderModel: examples/speechlm2/conf/s2s_duplex_speech_decoder.yaml
DuplexSTTModel: examples/speechlm2/conf/duplex_stt.yaml
Using Configuration Files
You can use these configurations with the training scripts by specifying the config path:
# Train SALM model
python examples/speechlm2/salm_train.py \
  --config-path=conf \
  --config-name=salm
# Train SALMAutomodel
python examples/speechlm2/salm_train.py \
  --config-name=salm_automodel
You can also override configuration values from the command line:
python examples/speechlm2/salm_train.py \
  --config-path=conf \
  --config-name=salm \
  model.pretrained_llm="different/llm/path" \
  trainer.max_steps=1000 \
  data.train_ds.batch_size=8
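Each override is a dotted path into the nested configuration followed by a new value. The sketch below shows that idea on a plain dictionary. It is a simplified illustration and not Hydra's override parser: apply_override is a hypothetical helper that only distinguishes integers from strings.

```python
def apply_override(config, override):
    """Apply one "dotted.key=value" override to a nested dict, in place."""
    dotted_key, raw_value = override.split("=", 1)
    # A leading "+" or "++" (add or force-add in Hydra) is ignored here.
    *parents, leaf = dotted_key.lstrip("+").split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    # Simplified typing: integers are converted, everything else stays a string.
    value = raw_value.strip('"')
    node[leaf] = int(value) if value.lstrip("-").isdigit() else value
    return config


config = {
    "model": {"pretrained_llm": "TinyLlama/TinyLlama_v1.1"},
    "trainer": {"max_steps": 100000},
    "data": {"train_ds": {"batch_size": 16}},
}
overrides = [
    'model.pretrained_llm="different/llm/path"',
    "trainer.max_steps=1000",
    "data.train_ds.batch_size=8",
]
for item in overrides:
    apply_override(config, item)
print(config)
```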
Fine-Tuning from a Previous Checkpoint
To start a new training run initialized from a previous checkpoint — with a fresh
optimizer, LR scheduler, and step counter — set model.init_from_checkpoint:
model:
  init_from_checkpoint: /path/to/checkpoints/step=6375.ckpt
Or pass it as a Hydra override:
python examples/speechlm2/salm_train.py \
  --config-name=salm_automodel \
  ++model.init_from_checkpoint=/path/to/checkpoints/step=6375.ckpt
This differs from exp_manager.resume_from_checkpoint which restores the
full training state (optimizer, scheduler, step counter) to continue an
interrupted run. init_from_checkpoint only loads model weights, giving you a
clean starting point for fine-tuning on different data or with different
hyperparameters.
Supported Checkpoint Formats
Three checkpoint formats are supported:
Distributed checkpoints (DCP): Directories with a .metadata file, produced by ModelParallelStrategy/AutomodelParallelStrategy. This is the default format when training with FSDP2 or TP. DCP loading handles automatic resharding when the parallelism configuration differs between the source and target runs.
HuggingFace model directories: Directories containing model.safetensors, such as the output of to_hf.py.
Single-file checkpoints: Standard .ckpt or .pt files with a state_dict key.
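The three formats can be told apart from the file system layout alone. The sketch below shows one way to do that, based only on the descriptions above. The detect_checkpoint_format helper and its return labels are hypothetical, and the actual loading code may use different checks.

```python
import tempfile
from pathlib import Path


def detect_checkpoint_format(path):
    """Classify a checkpoint path as "dcp", "hf", "single_file", or "unknown"."""
    p = Path(path)
    if p.is_dir():
        if (p / ".metadata").is_file():
            return "dcp"  # distributed checkpoint directory
        if (p / "model.safetensors").is_file():
            return "hf"  # HuggingFace model directory
        return "unknown"
    if p.suffix in {".ckpt", ".pt"}:
        return "single_file"
    return "unknown"


# Build one throwaway example of each layout and classify it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dcp").mkdir()
    (root / "dcp" / ".metadata").touch()
    (root / "hf").mkdir()
    (root / "hf" / "model.safetensors").touch()
    (root / "step=6375.ckpt").touch()
    detected = {
        name: detect_checkpoint_format(root / name)
        for name in ["dcp", "hf", "step=6375.ckpt"]
    }
print(detected)
```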
The model architecture is still defined by pretrained_llm and pretrained_asr
(needed for config and tokenizer initialization), but all weights are overridden by
the checkpoint.
This feature works with both SALM and SALMAutomodel.
Note
init_from_checkpoint requires the source and target models to use the
same model class (e.g., both SALMAutomodel). Cross-model loading
(e.g., SALM checkpoint into SALMAutomodel) will encounter state dict
key mismatches because the two classes structure the embedding layer differently.