bridge.models.nemotron_omni.nemotron_omni_utils

Module Contents

Functions

load_audio

Load an audio file and resample to target_sr Hz.

compute_mel_features

Convert a raw waveform to a mel spectrogram tensor.

compute_audio_token_count

Compute the expected number of audio tokens for a waveform.

API

bridge.models.nemotron_omni.nemotron_omni_utils.load_audio(path: str, target_sr: int = 16000) -> numpy.ndarray

Load an audio file and resample to target_sr Hz.

Supports WAV, FLAC, MP3, and other formats. Files are read with soundfile, with librosa as a fallback for MP3 and other FFmpeg-decoded formats.

Parameters:
  • path – Path to the audio file.

  • target_sr – Target sampling rate in Hz.

Returns:

1-D float32 numpy array of the mono waveform at target_sr.
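The return contract above (mono, float32, 1-D, at target_sr) can be illustrated with a minimal NumPy sketch. This is not the module's implementation: the helper names to_mono_float32 and resample_linear are hypothetical, the (frames, channels) input layout is assumed to match what soundfile returns for multichannel files, and linear interpolation stands in for whatever resampler the module actually uses.

```python
import numpy as np


def to_mono_float32(samples: np.ndarray) -> np.ndarray:
    """Downmix (frames, channels) audio to a 1-D float32 waveform.

    Hypothetical helper; assumes channels are on the last axis.
    """
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 2:
        # Average the channels to obtain a mono signal.
        samples = samples.mean(axis=1)
    return samples


def resample_linear(waveform: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a 1-D waveform by linear interpolation.

    Stand-in for a proper band-limited resampler; used here only to show
    how the output length scales with target_sr.
    """
    if orig_sr == target_sr:
        return waveform.astype(np.float32)
    num_out = int(round(len(waveform) * target_sr / orig_sr))
    src_t = np.arange(len(waveform)) / orig_sr  # source sample times (s)
    dst_t = np.arange(num_out) / target_sr      # target sample times (s)
    return np.interp(dst_t, src_t, waveform).astype(np.float32)


# One second of silent stereo audio at 44.1 kHz, converted to 16 kHz mono.
stereo = np.zeros((44100, 2), dtype=np.float64)
mono_16k = resample_linear(to_mono_float32(stereo), orig_sr=44100, target_sr=16000)
```

After the conversion, mono_16k has one axis, dtype float32, and a length equal to the clip duration times target_sr, which is the shape the downstream functions expect.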

bridge.models.nemotron_omni.nemotron_omni_utils.compute_mel_features(
waveform: Union[numpy.ndarray, list],
sampling_rate: int = 16000,
num_mel_bins: int = 128,
) -> torch.Tensor

Convert a raw waveform to a mel spectrogram tensor.

Uses HF ParakeetFeatureExtractor (from transformers) to produce mel features compatible with BridgeSoundEncoder / ParakeetEncoder.

Parameters:
  • waveform – 1-D float32 numpy array (or list) of the mono waveform.

  • sampling_rate – Sampling rate of waveform in Hz (must match the rate the feature extractor expects).

  • num_mel_bins – Number of mel frequency bins.

Returns:

Float tensor of shape (frames, num_mel_bins) – a single clip ready to be batched and passed as sound_clips to the model.

bridge.models.nemotron_omni.nemotron_omni_utils.compute_audio_token_count(
waveform: Union[numpy.ndarray, list],
hop_length: int = 160,
subsampling_factor: int = 8,
) -> int

Compute the expected number of audio tokens for a waveform.

Uses the same Conv2D subsampling math as ParakeetEncoder / ParakeetEncoderSubsamplingConv2D: kernel_size=3, stride=2, padding=1, applied log2(subsampling_factor) times to the mel frame count.

Parameters:
  • waveform – 1-D waveform array (only its length is used).

  • hop_length – Hop length in samples for mel feature extraction.

  • subsampling_factor – Subsampling factor of the conformer encoder.

Returns:

Number of audio tokens (at least 1).
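The subsampling arithmetic described above can be sketched in plain Python. This is an illustration under stated assumptions, not the module's code: the names conv_out_length and estimate_audio_tokens are hypothetical, and the mel frame count is assumed to be num_samples // hop_length (a centered STFT would typically yield one more frame). Only the convolution parameters (kernel_size=3, stride=2, padding=1), the log2(subsampling_factor) repetition count, and the minimum of 1 come from the description.

```python
def conv_out_length(length: int, kernel_size: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of one strided convolution along the time axis."""
    return (length + 2 * padding - kernel_size) // stride + 1


def estimate_audio_tokens(num_samples: int, hop_length: int = 160, subsampling_factor: int = 8) -> int:
    """Estimate the token count for a waveform of num_samples samples.

    Hypothetical helper mirroring the documented math.
    """
    if subsampling_factor < 1 or subsampling_factor & (subsampling_factor - 1):
        raise ValueError("subsampling_factor must be a power of two")

    # Assumption: one mel frame per hop, without an extra centered frame.
    length = num_samples // hop_length

    # log2(subsampling_factor) stride-2 stages, e.g. 3 stages for a factor of 8.
    num_stages = subsampling_factor.bit_length() - 1
    for _ in range(num_stages):
        length = conv_out_length(length)

    # The documented return value is never smaller than 1.
    return max(1, length)


# One second of 16 kHz audio: 100 frames -> 50 -> 25 -> 13 tokens.
tokens_1s = estimate_audio_tokens(16000)
```

With the default kernel, stride, and padding, each stage maps a length n to (n - 1) // 2 + 1, which is roughly a halving that rounds up. This is why the result can differ slightly from a plain division of the frame count by subsampling_factor.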