bridge.models.nemotron_omni.nemotron_omni_utils
Module Contents
Functions
load_audio | Load an audio file and resample to target_sr Hz. |
compute_mel_features | Convert a raw waveform to a mel spectrogram tensor. |
compute_audio_token_count | Compute the expected number of audio tokens for a waveform. |
API
- bridge.models.nemotron_omni.nemotron_omni_utils.load_audio(path: str, target_sr: int = 16000) -> numpy.ndarray
Load an audio file and resample to target_sr Hz.
Supports WAV, MP3, FLAC, and other formats handled by soundfile (with librosa as a fallback for MP3 and other FFmpeg-decoded formats).
- Parameters:
path – Path to the audio file.
target_sr – Target sampling rate in Hz.
- Returns:
1-D float32 numpy array of the mono waveform at target_sr.
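The return contract above (mono, resampled to target_sr) can be illustrated with a small standard-library sketch. It is not how load_audio is implemented: the helper names to_mono and resample_linear are hypothetical, channel averaging is an assumed downmix, and linear interpolation stands in for the band-limited resampling a real audio library would apply.

```python
def to_mono(frames):
    """Average the channels of each frame (a tuple of per-channel samples)."""
    return [sum(channels) / len(channels) for channels in frames]


def resample_linear(samples, orig_sr, target_sr):
    """Resample by linear interpolation (illustrative stand-in only)."""
    if orig_sr == target_sr or not samples:
        return list(samples)
    # The duration is preserved, so the sample count scales with the rate ratio.
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr      # fractional position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One second of constant stereo audio at 44.1 kHz, downmixed and resampled.
stereo = [(0.2, 0.4)] * 44100
mono_16k = resample_linear(to_mono(stereo), orig_sr=44100, target_sr=16000)
print(len(mono_16k))  # 16000 samples, i.e. one second at 16 kHz
```

The point of the sketch is the shape of the result: one value per sample, with a length equal to the clip duration multiplied by target_sr.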
- bridge.models.nemotron_omni.nemotron_omni_utils.compute_mel_features(waveform: Union[numpy.ndarray, list], sampling_rate: int = 16000, num_mel_bins: int = 128)
Convert a raw waveform to a mel spectrogram tensor.
Uses the Hugging Face ParakeetFeatureExtractor (from transformers) to produce mel features compatible with BridgeSoundEncoder / ParakeetEncoder.
- Parameters:
waveform – 1-D float32 numpy array (or list) of the mono waveform.
sampling_rate – Sampling rate of waveform in Hz (must match the rate the feature extractor expects).
num_mel_bins – Number of mel frequency bins.
- Returns:
Float tensor of shape (frames, num_mel_bins), a single clip ready to be batched and passed as sound_clips to the model.
- bridge.models.nemotron_omni.nemotron_omni_utils.compute_audio_token_count(waveform: Union[numpy.ndarray, list], hop_length: int = 160, subsampling_factor: int = 8)
Compute the expected number of audio tokens for a waveform.
Uses the same Conv2D subsampling math as ParakeetEncoder / ParakeetEncoderSubsamplingConv2D: kernel_size=3, stride=2, padding=1, applied log2(subsampling_factor) times to the mel frame count.
- Parameters:
waveform – 1-D waveform array (only its length is used).
hop_length – Hop length in samples for mel feature extraction.
subsampling_factor – Subsampling factor of the conformer encoder.
- Returns:
Number of audio tokens (at least 1).
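The subsampling arithmetic described for compute_audio_token_count can be sketched in plain Python. This is a minimal illustration, not the library's code: conv_out_len and estimate_audio_tokens are hypothetical names, and the mel frame count is assumed to be num_samples // hop_length, which may differ from the real feature extractor by a frame depending on its padding.

```python
import math


def conv_out_len(length, kernel_size=3, stride=2, padding=1):
    """Standard convolution output-length formula along one axis."""
    return (length + 2 * padding - kernel_size) // stride + 1


def estimate_audio_tokens(num_samples, hop_length=160, subsampling_factor=8):
    # ASSUMPTION: one mel frame per hop; the real extractor may pad differently.
    length = num_samples // hop_length
    # Each stride-2 convolution roughly halves the frame count, so a
    # subsampling factor of 8 corresponds to log2(8) = 3 layers.
    num_layers = int(math.log2(subsampling_factor))
    for _ in range(num_layers):
        length = conv_out_len(length)
    return max(1, length)  # at least one token, as documented


# 1 s at 16 kHz: 100 frames -> 50 -> 25 -> 13
print(estimate_audio_tokens(16000))  # 13
```

Because each layer rounds up on odd lengths, the result can exceed num_frames // subsampling_factor (here 13 rather than 12), which is why the count is computed layer by layer instead of with a single division.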