Magpie-TTS Longform Inference#

This document describes how longform (multi-sentence) text-to-speech inference works in Magpie-TTS.

Overview#

Magpie-TTS supports generating speech for long text inputs by processing them in smaller, sentence-level chunks while maintaining prosodic continuity across the entire utterance. This approach overcomes the context window limitations of the underlying transformer architecture.

When Longform is Used#

Longform inference is triggered automatically when the input exceeds a per-language word-count threshold, corresponding to roughly 20 seconds of audio:

Language Word Thresholds#

| Language   | Word Threshold |
|------------|----------------|
| English    | 45 words       |
| Spanish    | 73 words       |
| French     | 69 words       |
| German     | 50 words       |
| Italian    | 53 words       |
| Vietnamese | 50 words       |

Note

Longform is best supported for English. Mandarin currently falls back to standard inference.
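The auto-detection logic can be sketched as a simple word-count check against the per-language thresholds above. The function name and dictionary here are illustrative, not the actual NeMo API:

```python
# Thresholds from the table above; language codes are an assumption.
LONGFORM_WORD_THRESHOLDS = {
    "en": 45, "es": 73, "fr": 69, "de": 50, "it": 53, "vi": 50,
}

def should_use_longform(text: str, language: str = "en") -> bool:
    """Return True when the input exceeds the per-language word threshold."""
    threshold = LONGFORM_WORD_THRESHOLDS.get(language)
    if threshold is None:  # e.g. Mandarin: fall back to standard inference
        return False
    return len(text.split()) > threshold
```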

Algorithm#

The longform inference algorithm processes long text through the following steps:

Step 1: Sentence Splitting#

The input text is split into individual sentences at punctuation markers (., ?, !, ...). The splitter handles abbreviations such as “Dr.”, “Mr.”, and “a.m.” so that their trailing periods are not mistaken for sentence boundaries.

Example:

Input:  "Dr. Smith arrived early. How are you today?"
Output: ["Dr. Smith arrived early.", "How are you today?"]
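A minimal sketch of this splitting behavior is shown below; the actual split_by_sentence implementation may differ, and the abbreviation list here is an assumption:

```python
import re

# Hypothetical abbreviation list; the real implementation may use a
# different mechanism to detect abbreviations.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "a.m.", "p.m.", "e.g.", "i.e."}

def split_by_sentence_sketch(text: str) -> list:
    """Split on ., ?, ! while keeping known abbreviations intact."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token ending in ., ?, or ! closes a sentence unless it is a
        # known abbreviation like "Dr." or "a.m.".
        if re.search(r"[.?!]$", token) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```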

Step 2: State Initialization#

A LongformChunkState object is created to track information across sentence chunks:

  • History text tokens: Text from previous chunks for context

  • History encoder context: Encoder outputs that provide continuity

  • Attention tracking: Monitors which positions have been attended to

Step 3: Iterative Chunk Processing#

For each sentence chunk, the following sub-steps are performed:

  1. Context Preparation: Prepend history text and encoder context from previous chunks to maintain prosodic continuity.

  2. Attention Prior Application: Apply a learned attention prior that guides the model to attend to the correct text positions, preventing repetition or skipping.

  3. Autoregressive Generation: Generate audio codes token-by-token using the transformer decoder with temperature sampling.

  4. State Update: Update the chunk state with:

    • New history text (last N tokens)

    • New encoder context

    • Updated attention tracking

  5. Code Collection: Store the generated audio codes for this chunk.
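The per-chunk loop above can be illustrated with a runnable toy. The state class and the stand-in “generation” step are purely illustrative, not the actual Magpie-TTS API; they show only how history is carried forward and trimmed between chunks:

```python
class ToyChunkState:
    """Toy stand-in for LongformChunkState: tracks trailing text history."""

    def __init__(self, history_len=20):
        self.history_tokens = []
        self.history_len = history_len  # analogous to history_len_heuristic

    def update(self, new_tokens):
        # Keep only the last N tokens as history for the next chunk.
        self.history_tokens = (self.history_tokens + new_tokens)[-self.history_len:]

def process_chunks(sentences, state):
    all_codes = []
    for sentence in sentences:
        tokens = sentence.split()
        # Context preparation: history tokens are prepended for continuity.
        context = state.history_tokens + tokens
        # Stand-in "generation": one fake code per context token.
        codes = [hash(t) % 256 for t in context]
        # State update: retain trailing history for the next chunk.
        state.update(tokens)
        # Code collection: concatenate along the time dimension.
        all_codes.extend(codes)
    return all_codes
```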

Step 4: Code Concatenation#

After all chunks are processed, concatenate the audio codes from each chunk along the time dimension into a single sequence.

Step 5: Audio Decoding#

Pass the concatenated codes through the neural audio codec decoder to produce the final waveform.
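Steps 4 and 5 can be sketched as follows, under the assumption that each chunk yields a code tensor of shape (num_codebooks, T_chunk); the codec decoder call is shown as a comment because the exact method name depends on the codec model:

```python
import torch

# Two chunks of generated codes: 8 codebooks, 50 and 30 timesteps.
chunk_codes = [torch.randint(0, 1024, (8, 50)), torch.randint(0, 1024, (8, 30))]

# Step 4: concatenate along the time (last) dimension.
full_codes = torch.cat(chunk_codes, dim=-1)  # shape (8, 80)

# Step 5: the concatenated codes would then be decoded to a waveform, e.g.:
# audio = codec_model.decode(full_codes.unsqueeze(0))
```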

Key Components#

  1. Sentence Splitting (split_by_sentence): Intelligently splits text on sentence boundaries while handling abbreviations (e.g., “Dr.”, “Mr.”).

  2. Chunk State (LongformChunkState): Maintains context across chunks:

    • history_text: Text tokens from previous chunks

    • history_context_tensor: Encoder outputs for continuity

    • last_attended_timesteps: Attention tracking for smooth transitions

  3. Attention Prior: Guides the model’s attention to maintain proper alignment and prevent repetition/skipping.
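To illustrate the attention-prior idea, the sketch below builds a diagonal Gaussian prior that biases each decoder step toward the text positions near its expected alignment. The Gaussian form and parameters are a common choice and an assumption here, not necessarily what Magpie-TTS uses internally:

```python
import math

def diagonal_prior(num_audio_steps, num_text_tokens, sigma=2.0):
    """Soft diagonal prior: row t peaks at the text position proportional to t."""
    prior = []
    for t in range(num_audio_steps):
        center = t * num_text_tokens / max(num_audio_steps, 1)
        row = [math.exp(-((n - center) ** 2) / (2 * sigma ** 2))
               for n in range(num_text_tokens)]
        total = sum(row)
        prior.append([v / total for v in row])  # normalize to a distribution
    return prior
```

Multiplying (or adding, in log space) such a prior into the attention weights discourages the decoder from jumping backward (repetition) or forward (skipping) in the text.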

Usage#

Method 2: Using CLI (magpietts_inference.py)#

For batch inference from manifests:

# Auto-detect longform based on text length (default)
python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/magpietts.nemo \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --longform_mode auto

# Force longform inference for all inputs
python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/magpietts.nemo \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --longform_mode always \
    --longform_max_decoder_steps 50000

Longform CLI Options:

| Option | Default | Description |
|--------|---------|-------------|
| --longform_mode | auto | auto: detect from text, always: force longform, never: disable |

Configuration Dataclasses#

LongformConfig#

Immutable tuning parameters (set in model):

from dataclasses import dataclass
from typing import Tuple

@dataclass
class LongformConfig:
    history_len_heuristic: int = 20      # Max history tokens retained
    prior_weights_init: Tuple = (0.5, 1.0, 0.8, 0.2, 0.2)  # Initial attention weights
    prior_weights: Tuple = (0.2, 1.0, 0.6, 0.4, 0.2, 0.2)  # Generation weights
    finished_limit_with_eot: int = 5     # Steps after text end before EOS
    short_sentence_threshold: int = 35   # Skip prior for short sentences
    attention_sink_threshold: int = 10   # Attention sink detection

LongformChunkState#

Mutable state passed between chunk iterations:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

import torch

@dataclass
class LongformChunkState:
    batch_size: int
    history_text: Optional[torch.Tensor] = None       # (B, T)
    history_text_lens: Optional[torch.Tensor] = None  # (B,)
    history_context_tensor: Optional[torch.Tensor] = None  # (B, T, E)
    end_indices: Dict[int, int] = field(default_factory=dict)
    overall_idx: int = 0
    left_offset: List[int] = field(default_factory=list)
    last_attended_timesteps: List[List[int]] = field(default_factory=list)

Best Practices#

  1. Use ``apply_TN=True`` for raw text to ensure proper normalization before synthesis.

  2. Increase ``max_decoder_steps`` if generation is cut off on very long texts; the default of 50000 is usually sufficient.

  3. Use ``longform_mode="auto"`` (default) to let the system decide based on text length.

  4. For non-English languages, be aware that longform performance may vary. English is best supported.

Limitations#

  • Mandarin (zh): Currently falls back to standard inference due to character-based tokenization complexities.

  • Prosodic boundaries: While the algorithm maintains continuity, natural paragraph breaks may not always be perfectly preserved in non-English languages.

See Also#