Datasets#
The speechlm2 collection supports datasets that contain both audio and text data for training models that can understand speech and generate appropriate responses. This section describes the dataset format, preparation, and usage with the speechlm2 models.
Dataset Format#
The speechlm2 collection uses the Lhotse framework for audio data management. The primary dataset classes are:
DuplexS2SDataset: For general duplex speech-to-speech models.
SALMDataset: Specifically for the Speech-Augmented Language Model (SALM), which processes speech+text inputs and outputs text.
DuplexS2S Dataset Structure#
A typical dataset for speechlm2 models consists of:
Audio files: Source audio (input speech) and, optionally, target audio (output speech)
Text transcriptions: Associated text for both input and output speech
Role identifiers: To distinguish between speakers (e.g., “user” vs “agent”)
The dataset organization is built around the concept of conversation turns, with each turn containing audio and text from either a user or an agent/assistant.
The datasets are primarily managed using Lhotse’s CutSet format, which provides efficient handling of audio data and annotations. A typical Lhotse manifest includes:
Audio recording information (path, duration, sample rate)
Supervision information (transcripts, speaker roles, timing)
Optional additional annotations
Example of a Lhotse cut:
{
    "id": "conversation_1",
    "start": 0,
    "duration": 10.7,
    "channel": 0,
    "supervisions": [
        {
            "id": "conversation_1_turn_0",
            "text": "Can you help me with this problem?",
            "start": 0,
            "duration": 5.2,
            "speaker": "user"
        },
        {
            "id": "conversation_1_turn_1",
            "text": "I can help you with that.",
            "start": 5.2,
            "duration": 3.1,
            "speaker": "assistant"
        }
    ],
    "recording": {
        "id": "conversation_1_user",
        "path": "/path/to/audio/conversation_1_user.wav",
        "sampling_rate": 16000,
        "num_samples": 171200,
        "duration": 10.7
    },
    "custom": {
        "target_audio": {
            "id": "conversation_1_assistant",
            "path": "/path/to/audio/conversation_1_assistant.wav",
            "sampling_rate": 22050,
            "num_samples": 235935,
            "duration": 10.7
        }
    }
}
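If you want to sanity-check such a manifest programmatically, it can be loaded and inspected with Lhotse directly. The snippet below is a minimal sketch; the manifest path is a placeholder.
from lhotse import CutSet

# Load a manifest like the one shown above (the path is a placeholder).
cuts = CutSet.from_file("/path/to/manifest.jsonl.gz")

for cut in cuts:
    print(cut.id, f"{cut.duration:.1f}s")
    # Each supervision carries the text and the speaker role that the dataset
    # later uses to split turns into model input vs. model output.
    for sup in cut.supervisions:
        print(f"  [{sup.speaker}] {sup.start:.2f}s +{sup.duration:.2f}s: {sup.text}")
    # The agent's audio, if present, is stored as a custom field.
    if getattr(cut, "target_audio", None) is not None:
        print("  has target_audio")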
The DuplexS2SDataset performs several key operations when processing data:
Turn Identification: Each cut contains a list of lhotse.SupervisionSegment supervisions that represent conversation turns with the corresponding text and speaker information.
Speaker Role Separation: The text of each supervision is tokenized and treated as the model’s output when supervision.speaker is in output_roles (e.g., “agent” or “Assistant”), or as the model’s input when it is in input_roles (e.g., “user” or “User”).
Token Sequence Generation: target_tokens and source_tokens arrays are created with a length equal to lhotse.utils.compute_num_frames(cut.duration, frame_length, cut.sampling_rate). The frame_length parameter (typically 80 ms) determines the temporal resolution of token assignments, and each token is assigned to a position based on the timing of its corresponding audio segment.
Token Offset Calculation: The starting position of each turn’s tokens is determined with lhotse.utils.compute_num_frames(supervision.start, frame_length, cut.sampling_rate), which keeps tokens aligned with their corresponding audio segments.
Length Validation: If a token sequence is too long for the audio duration, a warning is emitted and tokens that extend beyond the audio length are truncated.
This process ensures that the model can correctly align audio input with corresponding text, and learn to generate appropriate responses based on the conversation context.
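To make the alignment concrete, the following sketch mimics the logic described above (it is not the actual dataset implementation): it allocates one token slot per 80 ms frame and places each output turn's tokens at its start-time offset. The manifest path is a placeholder, and a trivial character-code stand-in replaces the real tokenizer.
import torch
from lhotse import CutSet
from lhotse.utils import compute_num_frames

frame_length = 0.08  # 80 ms per token position, matching the dataset configuration
output_roles = {"agent", "Assistant"}

cut = next(iter(CutSet.from_file("/path/to/manifest.jsonl.gz")))  # placeholder path

# One token slot per frame of the cut; -1 is used here purely as an illustrative pad value.
total_frames = compute_num_frames(cut.duration, frame_length, cut.sampling_rate)
target_tokens = torch.full((total_frames,), -1, dtype=torch.long)

for sup in cut.supervisions:
    if sup.speaker not in output_roles:
        continue
    # Stand-in for the real tokenizer call (e.g., tokenizer.text_to_ids(sup.text)).
    token_ids = [ord(ch) for ch in sup.text]
    # Place the turn's tokens at the frame offset of its start time, truncating
    # anything that would extend past the end of the audio.
    offset = compute_num_frames(sup.start, frame_length, cut.sampling_rate)
    token_ids = token_ids[: max(0, total_frames - offset)]
    target_tokens[offset : offset + len(token_ids)] = torch.tensor(token_ids, dtype=torch.long)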
DuplexS2SDataset#
This dataset class is designed for models that handle both speech understanding and speech generation. It processes audio inputs and prepares them for the model along with corresponding text.
from nemo.collections.speechlm2.data import DuplexS2SDataset
dataset = DuplexS2SDataset(
    tokenizer=model.tokenizer,            # Text tokenizer
    frame_length=0.08,                    # Frame length in seconds
    source_sample_rate=16000,             # Input audio sample rate
    target_sample_rate=22050,             # Output audio sample rate
    input_roles=["user", "User"],         # Roles considered as input
    output_roles=["agent", "Assistant"],  # Roles considered as output
)
SALMDataset Structure#
Data used for SALM can be either regular speech-to-text data (in any NeMo or Lhotse format) or a dataset of multi-turn conversations. For the most part, please refer to the Configuring multimodal dataloading section in the ASR documentation.
When using speech-to-text data, you’ll need to read it with the special lhotse_as_conversation data reader, which creates two-turn (query + response) multi-modal conversation examples out of regular Lhotse cuts. This approach makes SALM training more flexible, allowing straightforward combination of single-turn and multi-turn data.
Each audio turn is represented by a single token, defined by the audio_locator_tag property and automatically added to the model’s tokenizer inside the model code. During the training/generation pass, this token is replaced with the representation of its corresponding audio segment.
Example YAML configuration using existing ASR datasets with lhotse_as_conversation:
data:
  train_ds:
    prompt_format: "llama3"  # Choose based on your model
    token_equivalent_duration: 0.08
    input_cfg:
      # Example 1: Using standard ASR Lhotse manifests (JSONL)
      - type: lhotse_as_conversation
        cuts_path: /path/to/librispeech_train_clean_100.jsonl.gz
        audio_locator_tag: "<|audioplaceholder|>"
        tags:
          context: "Transcribe the following audio:"
          # Optional system prompt can be uncommented
          # system_prompt: "You are a helpful assistant that transcribes audio accurately."
      # Example 2: Using tarred NeMo manifests
      - type: lhotse_as_conversation
        manifest_filepath: /path/to/tedlium_train_manifest.jsonl.gz
        tarred_audio_filepaths: /path/to/tedlium_shards/shard-{000000..000009}.tar
        audio_locator_tag: "<|audioplaceholder|>"
        tags:
          context: "Write down what is said in this recording:"
      # Example 3: Using Lhotse SHAR format
      - type: lhotse_as_conversation
        shar_path: /path/to/fisher_shar/
        audio_locator_tag: "<|audioplaceholder|>"
        tags:
          context: "Listen to this clip and write a transcript:"
    # ... other settings
Alternatively, one can provide an existing YAML file with their dataset composition and wrap it in a lhotse_as_conversation reader as follows:
data:
  train_ds:
    input_cfg:
      - type: lhotse_as_conversation
        input_cfg: /path/to/dataset_config.yaml
        audio_locator_tag: "<|audioplaceholder|>"
        tags:
          context: "Transcribe the following audio:"
          # Optional system prompt can be uncommented
          # system_prompt: "You are a helpful assistant that transcribes audio accurately."
The lhotse_as_conversation reader automatically creates a two-turn conversation from each ASR example:
1. Optionally, if the system_prompt tag is provided, it’s added as a special system turn for LLM models that support system prompts.
2. A user turn containing the audio and a text context (from the context tag).
3. An assistant turn containing the transcription (from the cut’s supervision text).
If a context tag is provided in the configuration, it’s added as a text turn before the audio, as illustrated below.
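For intuition, one ASR cut expands into roughly the following turn sequence. This is a conceptual sketch only: the reader produces NeMo conversation objects rather than plain dictionaries, and the transcript text here is made up.
# Conceptual view of one lhotse_as_conversation example (not the actual data type).
example_conversation = [
    # Optional: only present when the `system_prompt` tag is set.
    {"role": "system", "content": "You are a helpful assistant that transcribes audio accurately."},
    # User turn: the `context` tag text followed by the audio placeholder token,
    # which is later swapped for the audio segment's representation.
    {"role": "user", "content": "Transcribe the following audio: <|audioplaceholder|>"},
    # Assistant turn: the transcript taken from the cut's supervision text.
    {"role": "assistant", "content": "hello and welcome to the show"},
]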
SALMDataset#
This dataset class is specialized for the SALM model, which focuses on understanding speech input and generating text output.
from nemo.collections.speechlm2.data import SALMDataset
dataset = SALMDataset(
    tokenizer=model.tokenizer,  # Text tokenizer
)
DataModule#
The DataModule class in the speechlm2 collection manages dataset loading, preparation, and batching for PyTorch Lightning training:
from nemo.collections.speechlm2.data import DataModule
datamodule = DataModule(
    cfg_data,                   # Configuration dictionary for data
    tokenizer=model.tokenizer,  # Text tokenizer
    dataset=dataset,            # Instance of DuplexS2SDataset or SALMDataset
)
The DataModule takes care of:
1. Setting up proper data parallel ranks for dataloaders
2. Instantiating the dataloaders with configuration from YAML
3. Managing multiple datasets for validation/testing
Bucketing for Efficient Training#
The DataModule supports bucketing for more efficient training. Bucketing groups samples of similar lengths together, which reduces padding and improves training efficiency. The key bucketing parameters are:
batch_duration: Target cumulative duration (in seconds) of samples in a batch
bucket_duration_bins: List of duration thresholds for bucketing
use_bucketing: Flag to enable/disable bucketing
num_buckets: Number of buckets to create
bucket_buffer_size: Number of samples to load into memory for bucket assignment
Example bucketing configuration:
train_ds:
  # ... other settings
  batch_duration: 100  # Target 100 seconds per batch
  bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]  # Duration thresholds
  use_bucketing: true  # Enable bucketing
  num_buckets: 5  # Create 5 buckets
  bucket_buffer_size: 5000  # Buffer size for bucket assignment
When bucketing is enabled:
Samples are grouped into buckets based on their duration
Each batch contains samples from the same bucket
The actual batch size can vary to maintain a consistent total duration
The target batch_duration ensures efficient GPU memory usage
Bucketing helps to:
Reduce padding and increase the effective batch size
Improve training efficiency and convergence
Manage memory usage with variable-length inputs
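If you need bucket_duration_bins for your own data, one simple way to estimate them is to take equal-count quantiles of the cut durations. The sketch below assumes NumPy is available and that the manifest path points at your training cuts; it prints one upper edge per bucket, the last being the maximum duration.
import numpy as np
from lhotse import CutSet

num_buckets = 5
cuts = CutSet.from_file("/path/to/manifest.jsonl.gz")  # placeholder path
durations = np.array([cut.duration for cut in cuts])

# Equal-count quantiles: one upper edge per bucket, the last edge being the max duration.
edges = np.quantile(durations, np.linspace(1 / num_buckets, 1.0, num_buckets))
print("bucket_duration_bins:", [round(float(e), 5) for e in edges])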
Data Configuration#
A typical data configuration in YAML includes:
data:
  train_ds:
    sample_rate: ${data.target_sample_rate}
    input_cfg:
      - type: lhotse_shar
        shar_path: /path/to/train_data
    seed: 42
    shard_seed: "randomized"
    num_workers: 4
    # Optional bucketing settings
    batch_duration: 100
    bucket_duration_bins: [8.94766, 10.1551, 11.64118, 19.30376, 42.85]
    use_bucketing: true
    num_buckets: 5
    bucket_buffer_size: 5000
    # batch_size: 4  # alternative to bucketing
  validation_ds:
    datasets:
      val_set_name_0:
        shar_path: /path/to/validation_data_0
      val_set_name_1:
        shar_path: /path/to/validation_data_1
    sample_rate: ${data.target_sample_rate}
    batch_size: 4
    seed: 42
    shard_seed: "randomized"
Note that the actual dataset paths and blend are defined by the YAML config, not Python code. This makes it easy to change the dataset composition without modifying the code. To learn more about the YAML data config, see the Extended multi-dataset configuration format section in the ASR documentation.
Preparing S2S Datasets#
Creating Lhotse Manifests#
To prepare your own dataset, you’ll need to create Lhotse manifests from your audio files and transcripts:
from lhotse import AudioSource, CutSet, Recording, SupervisionSegment

# Create a recording for the user and the assistant.
# (Recording.from_file("/path/to/audio.wav") can fill in this metadata automatically
# when the audio file actually exists.)
recording_user = Recording(
    id="conversation_1_user",
    sources=[AudioSource(type="file", channels=[0], source="/path/to/audio/conversation_1_user.wav")],
    sampling_rate=16000,
    num_samples=171200,
    duration=10.7,
)
recording_assistant = Recording(
    id="conversation_1_assistant",
    sources=[AudioSource(type="file", channels=[0], source="/path/to/audio/conversation_1_assistant.wav")],
    sampling_rate=22050,
    num_samples=235935,
    duration=10.7,
)

# Create the supervisions (conversation turns) for this recording
supervisions = [
    SupervisionSegment(
        id="conversation_1_turn_0",
        recording_id="conversation_1_user",
        start=0,
        duration=5.2,
        text="Can you help me with this problem?",
        speaker="user",
    ),
    SupervisionSegment(
        id="conversation_1_turn_1",
        recording_id="conversation_1_user",
        start=5.5,
        duration=3.1,
        text="I can help you with that.",
        speaker="assistant",
    ),
]

# Create a CutSet.
# The assistant's response is stored in the custom target_audio field, which makes it easy
# to replace when using multiple models or speakers for synthetic data generation.
cut = recording_user.to_cut()
cut.supervisions = supervisions
cut.target_audio = recording_assistant
cutset = CutSet([cut])

# Save to disk
cutset.to_file("path/to/manifest.jsonl.gz")
Converting to SHAR Format#
For efficient training, it’s recommended to convert your Lhotse manifests to SHAR (SHarded ARchive) format:
from lhotse import CutSet

cutset = CutSet.from_file("path/to/manifest.jsonl.gz")
cutset.to_shar(
    "path/to/train_shar",
    fields={"recording": "flac", "target_audio": "flac"},
    shard_size=100,
)
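To verify the shards, you can read them back with Lhotse; the call below assumes a recent Lhotse version that provides CutSet.from_shar.
from lhotse import CutSet

# Read the shards back for a quick check (assumes CutSet.from_shar is available).
cuts = CutSet.from_shar(in_dir="path/to/train_shar")
first = next(iter(cuts))
print(first.id, first.duration)

# In the training YAML, the same directory is referenced as:
#   input_cfg:
#     - type: lhotse_shar
#       shar_path: path/to/train_shar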
Training with Prepared Datasets#
Once your datasets are prepared, you can use them to train a model:
from omegaconf import OmegaConf

from nemo.collections.speechlm2.data import DataModule, DuplexS2SDataset

# Load configuration
config_path = "path/to/config.yaml"
cfg = OmegaConf.load(config_path)

# The training data paths are available in the config file:
# cfg.data.train_ds.input_cfg[0].shar_path = "path/to/train_shar"

# Create the dataset and datamodule (assumes `model` and `trainer` were built earlier)
dataset = DuplexS2SDataset(
    tokenizer=model.tokenizer,
    frame_length=cfg.data.frame_length,
    source_sample_rate=cfg.data.source_sample_rate,
    target_sample_rate=cfg.data.target_sample_rate,
    input_roles=cfg.data.input_roles,
    output_roles=cfg.data.output_roles,
)
datamodule = DataModule(cfg.data, tokenizer=model.tokenizer, dataset=dataset)

# Train the model
trainer.fit(model, datamodule)
Example S2S Datasets#
While there are no publicly available datasets specifically formatted for Duplex S2S models yet, you can adapt conversation datasets with audio recordings such as:
Fisher Corpus
Switchboard Corpus
CallHome
Synthetic conversation datasets generated using TTS
You would need to format these datasets as Lhotse manifests with appropriate speaker role annotations to use them with the speechlm2 S2S models.
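As a sketch of such an adaptation, the snippet below relabels the speakers of an existing conversational CutSet (for example, one built with Lhotse's Fisher recipe) so that they match the expected roles. The path and the speaker-to-role mapping are hypothetical and corpus-specific.
from lhotse import CutSet
from lhotse.utils import fastcopy

# Hypothetical mapping from corpus speaker labels to speechlm2 roles.
speaker_to_role = {"A": "user", "B": "agent"}

cuts = CutSet.from_file("/path/to/fisher_cuts.jsonl.gz")  # placeholder path

def relabel(cut):
    # Replace each supervision with a copy whose speaker is one of the expected roles.
    return fastcopy(
        cut,
        supervisions=[
            fastcopy(sup, speaker=speaker_to_role.get(sup.speaker, sup.speaker))
            for sup in cut.supervisions
        ],
    )

cuts.map(relabel).to_file("/path/to/fisher_cuts_roles.jsonl.gz")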