Magpie-TTS Longform Inference#

This document describes how longform (multi-sentence) text-to-speech inference works in Magpie-TTS.

Overview#

Magpie-TTS supports generating speech for long text inputs by processing them in smaller, sentence-level chunks while maintaining prosodic continuity across the entire utterance. This approach overcomes the context window limitations of the underlying transformer architecture.

When Longform is Used#

Longform inference is triggered automatically when the input exceeds a per-language word-count threshold, corresponding to roughly 20 seconds of audio:

Language Word Thresholds#

| Language   | Word Threshold |
|------------|----------------|
| English    | 45 words       |
| Spanish    | 73 words       |
| French     | 69 words       |
| German     | 50 words       |
| Italian    | 53 words       |
| Vietnamese | 50 words       |

Note

Longform is best supported for English. Mandarin currently falls back to standard inference.
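The auto-detection logic can be sketched as a simple word-count check against the per-language thresholds above. The function name and dictionary here are illustrative, not the actual NeMo API:

```python
# Thresholds from the table above; language codes are an assumption.
LONGFORM_WORD_THRESHOLDS = {
    "en": 45, "es": 73, "fr": 69, "de": 50, "it": 53, "vi": 50,
}

def should_use_longform(text: str, language: str = "en") -> bool:
    """Return True when the input exceeds the per-language word threshold."""
    threshold = LONGFORM_WORD_THRESHOLDS.get(language)
    if threshold is None:  # e.g. Mandarin: fall back to standard inference
        return False
    return len(text.split()) > threshold
```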

Algorithm#

The longform inference algorithm processes long text through the following steps:

Step 1: Sentence Splitting#

The input text is split into individual sentences at punctuation markers (., ?, !, ...). The splitter handles abbreviations such as “Dr.”, “Mr.”, and “a.m.” so that their trailing periods are not mistaken for sentence boundaries.

Example:

Input:  "Dr. Smith arrived early. How are you today?"
Output: ["Dr. Smith arrived early.", "How are you today?"]
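A minimal sketch of this splitting behavior is shown below; the actual split_by_sentence implementation may differ, and the abbreviation list here is an assumption:

```python
import re

# Hypothetical abbreviation list; the real implementation may use a
# different mechanism to detect abbreviations.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "a.m.", "p.m.", "e.g.", "i.e."}

def split_by_sentence_sketch(text: str) -> list:
    """Split on ., ?, ! while keeping known abbreviations intact."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # A token ending in ., ?, or ! closes a sentence unless it is a
        # known abbreviation like "Dr." or "a.m.".
        if re.search(r"[.?!]$", token) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```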

Step 2: State Initialization#

A LongformChunkState object is created to track information across sentence chunks:

  • History text tokens: Text from previous chunks for context

  • History encoder context: Encoder outputs that provide continuity

  • Attention tracking: Monitors which positions have been attended to

Step 3: Iterative Chunk Processing#

For each sentence chunk, the following sub-steps are performed:

  1. Context Preparation: Prepend history text and encoder context from previous chunks to maintain prosodic continuity.

  2. Attention Prior Application: Apply a learned attention prior that guides the model to attend to the correct text positions, preventing repetition or skipping.

  3. Autoregressive Generation: Generate audio codes token-by-token using the transformer decoder with temperature sampling.

  4. State Update: Update the chunk state with:

    • New history text (last N tokens)

    • New encoder context

    • Updated attention tracking

  5. Code Collection: Store the generated audio codes for this chunk.
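The per-chunk loop above can be illustrated with a runnable toy. The state class and the stand-in “generation” step are purely illustrative, not the actual Magpie-TTS API; they show only how history is carried forward and trimmed between chunks:

```python
class ToyChunkState:
    """Toy stand-in for LongformChunkState: tracks trailing text history."""

    def __init__(self, history_len=20):
        self.history_tokens = []
        self.history_len = history_len  # analogous to history_len_heuristic

    def update(self, new_tokens):
        # Keep only the last N tokens as history for the next chunk.
        self.history_tokens = (self.history_tokens + new_tokens)[-self.history_len:]

def process_chunks(sentences, state):
    all_codes = []
    for sentence in sentences:
        tokens = sentence.split()
        # Context preparation: history tokens are prepended for continuity.
        context = state.history_tokens + tokens
        # Stand-in "generation": one fake code per context token.
        codes = [hash(t) % 256 for t in context]
        # State update: retain trailing history for the next chunk.
        state.update(tokens)
        # Code collection: concatenate along the time dimension.
        all_codes.extend(codes)
    return all_codes
```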

Step 4: Code Concatenation#

After all chunks are processed, concatenate the audio codes from each chunk along the time dimension into a single sequence.

Step 5: Audio Decoding#

Pass the concatenated codes through the neural audio codec decoder to produce the final waveform.
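Steps 4 and 5 can be sketched as follows, under the assumption that each chunk yields a code tensor of shape (num_codebooks, T_chunk); the codec decoder call is shown as a comment because the exact method name depends on the codec model:

```python
import torch

# Two chunks of generated codes: 8 codebooks, 50 and 30 timesteps.
chunk_codes = [torch.randint(0, 1024, (8, 50)), torch.randint(0, 1024, (8, 30))]

# Step 4: concatenate along the time (last) dimension.
full_codes = torch.cat(chunk_codes, dim=-1)  # shape (8, 80)

# Step 5: the concatenated codes would then be decoded to a waveform, e.g.:
# audio = codec_model.decode(full_codes.unsqueeze(0))
```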

Key Components#

  1. Sentence Splitting (split_by_sentence): Intelligently splits text on sentence boundaries while handling abbreviations (e.g., “Dr.”, “Mr.”).

  2. Chunk State (LongformChunkState): Maintains context across chunks:

    • history_text: Text tokens from previous chunks

    • history_context_tensor: Encoder outputs for continuity

    • last_attended_timesteps: Attention tracking for smooth transitions

  3. Attention Prior: Guides the model’s attention to maintain proper alignment and prevent repetition/skipping.
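To illustrate the attention-prior idea, the sketch below builds a diagonal Gaussian prior that biases each decoder step toward the text positions near its expected alignment. The Gaussian form and parameters are a common choice and an assumption here, not necessarily what Magpie-TTS uses internally:

```python
import math

def diagonal_prior(num_audio_steps, num_text_tokens, sigma=2.0):
    """Soft diagonal prior: row t peaks at the text position proportional to t."""
    prior = []
    for t in range(num_audio_steps):
        center = t * num_text_tokens / max(num_audio_steps, 1)
        row = [math.exp(-((n - center) ** 2) / (2 * sigma ** 2))
               for n in range(num_text_tokens)]
        total = sum(row)
        prior.append([v / total for v in row])  # normalize to a distribution
    return prior
```

Multiplying (or adding, in log space) such a prior into the attention weights discourages the decoder from jumping backward (repetition) or forward (skipping) in the text.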

Usage#

Method 2: Using CLI (magpietts_inference.py)#

For batch inference from manifests:

# Auto-detect longform based on text length (default)
python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/magpietts.nemo \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --longform_mode auto

# Force longform inference for all inputs
python examples/tts/magpietts_inference.py \
    --nemo_files /path/to/magpietts.nemo \
    --datasets_json_path /path/to/evalset_config.json \
    --out_dir /path/to/output \
    --codecmodel_path /path/to/codec.nemo \
    --longform_mode always \
    --longform_max_decoder_steps 50000

Longform CLI Options:

| Option | Default | Description |
|--------|---------|-------------|
| --longform_mode | auto | auto: detect from text, always: force longform, never: disable |

Configuration Dataclasses#

LongformConfig#

Immutable tuning parameters (set in model):

from dataclasses import dataclass
from typing import Tuple

@dataclass
class LongformConfig:
    history_len_heuristic: int = 20      # Max history tokens retained
    prior_weights_init: Tuple = (0.5, 1.0, 0.8, 0.2, 0.2)  # Initial attention weights
    prior_weights: Tuple = (0.2, 1.0, 0.6, 0.4, 0.2, 0.2)  # Generation weights
    finished_limit_with_eot: int = 5     # Steps after text end before EOS
    short_sentence_threshold: int = 35   # Skip prior for short sentences
    attention_sink_threshold: int = 10   # Attention sink detection

LongformChunkState#

Mutable state passed between chunk iterations:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

import torch

@dataclass
class LongformChunkState:
    batch_size: int
    history_text: Optional[torch.Tensor] = None       # (B, T)
    history_text_lens: Optional[torch.Tensor] = None  # (B,)
    history_context_tensor: Optional[torch.Tensor] = None  # (B, T, E)
    end_indices: Dict[int, int] = field(default_factory=dict)
    overall_idx: int = 0
    left_offset: List[int] = field(default_factory=list)
    last_attended_timesteps: List[List[int]] = field(default_factory=list)

Best Practices#

  1. Use ``apply_TN=True`` for raw text to ensure proper normalization before synthesis.

  2. Increase ``max_decoder_steps`` if generation is cut off on very long texts; the default of 50000 is usually sufficient.

  3. Use ``longform_mode="auto"`` (default) to let the system decide based on text length.

  4. For non-English languages, be aware that longform performance may vary. English is best supported.

Limitations#

  • Mandarin (zh): Currently falls back to standard inference due to character-based tokenization complexities.

  • Prosodic boundaries: While the algorithm maintains continuity, natural paragraph breaks may not always be perfectly preserved in non-English languages.

See Also#