nvidia.dali.fn.mel_filter_bank

nvidia.dali.fn.mel_filter_bank(*inputs, **kwargs)

Converts a spectrogram to a mel spectrogram by applying a bank of triangular filters.

The frequency (‘f’) dimension is selected from the input layout. If the input has no layout, “f”, “ft”, or “*ft” is assumed, depending on the number of dimensions.
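
The layout rule above can be sketched in plain Python. This is an illustration only, not DALI code: the helper name frequency_axis is hypothetical, and it assumes that “*ft” means any number of leading dimensions followed by frequency and time.

```python
def frequency_axis(ndim, layout=None):
    """Return the index of the 'f' dimension (hypothetical helper, not a DALI API)."""
    if ndim < 1:
        raise ValueError("input must have at least one dimension")
    if layout:
        # An explicit layout names every dimension; 'f' marks frequency.
        if len(layout) != ndim:
            raise ValueError("layout length must match the number of dimensions")
        if "f" not in layout:
            raise ValueError("layout has no 'f' dimension")
        return layout.index("f")
    # No layout: assume "f" (1-D), "ft" (2-D), or "*ft" (3-D and above).
    # Assumption: "*ft" places frequency second to last.
    return 0 if ndim <= 2 else ndim - 2


# A 2-D spectrogram without a layout is treated as "ft".
axis_2d = frequency_axis(2)
# A 3-D input without a layout is treated as "*ft".
axis_3d = frequency_axis(3)
```

Under these assumptions, a 3-D input without a layout has its filters applied along the second-to-last axis.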

Supported backends

‘cpu’
‘gpu’

Parameters

input (TensorList) – Input to the operator.

Keyword Arguments

bytes_per_sample_hint (int or list of int, optional, default = [0]) –
Output size hint, in bytes per sample.

If specified, the operator’s outputs residing in GPU or page-locked host memory will be preallocated to accommodate a batch of samples of this size.
freq_high (float, optional, default = 0.0) –
The maximum frequency.

If this value is not provided, sample_rate/2 is used.
freq_low (float, optional, default = 0.0) – The minimum frequency.
mel_formula (str, optional, default = ‘slaney’) –
Determines the formula that will be used to convert frequencies from hertz to mel and from mel to hertz.

The mel scale is a perceptual scale of pitches, so there is no single formula.

The supported values are:
- slaney, which follows Slaney’s MATLAB Auditory Modelling Work behavior.
  
  This formula is linear under 1 kHz and logarithmic above this value. The implementation is consistent with Librosa’s default implementation.
- htk, which follows O’Shaughnessy’s book formula, m = 2595 * log10(1 + (f/700)).
  
  This value is consistent with the implementation of the Hidden Markov Model Toolkit (HTK).
nfilter (int, optional, default = 128) – Number of mel filters.
normalize (bool, optional, default = True) –
Determines whether to normalize the triangular filter weights by the width of their frequency bands.
- If set to True, the integral of the filter function is 1.
- If set to False, the peak of the filter function will be 1.
preserve (bool, optional, default = False) – Prevents the operator from being removed from the graph even if its outputs are not used.
sample_rate (float, optional, default = 44100.0) – Sampling rate of the audio signal.
seed (int, optional, default = -1) –
Random seed.

If not provided, it will be populated based on the global seed of the pipeline.
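
The mel_formula, normalize, freq_low, and freq_high arguments can be illustrated with a minimal pure-Python sketch. It is not DALI's implementation, and all function names below are hypothetical. The slaney constants are those of Librosa's default conversion (200/3 Hz per mel below 1 kHz, a logarithmic step of ln(6.4)/27 above it), and the sketch assumes that a freq_high of 0.0 means "not provided".

```python
import math

# Slaney-style constants, as in Librosa's default conversion.
F_SP = 200.0 / 3.0               # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0              # the scale becomes logarithmic at 1 kHz
MIN_LOG_MEL = MIN_LOG_HZ / F_SP  # mel value at 1 kHz (15)
LOG_STEP = math.log(6.4) / 27.0  # step size in the logarithmic region


def hz_to_mel(f, formula="slaney"):
    """Convert a frequency in Hz to mel with the chosen formula."""
    if formula == "htk":
        return 2595.0 * math.log10(1.0 + f / 700.0)
    if f < MIN_LOG_HZ:
        return f / F_SP
    return MIN_LOG_MEL + math.log(f / MIN_LOG_HZ) / LOG_STEP


def mel_to_hz(m, formula="slaney"):
    """Inverse of hz_to_mel."""
    if formula == "htk":
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    if m < MIN_LOG_MEL:
        return m * F_SP
    return MIN_LOG_HZ * math.exp(LOG_STEP * (m - MIN_LOG_MEL))


def band_edges(nfilter, sample_rate, freq_low=0.0, freq_high=0.0, formula="slaney"):
    """Return nfilter + 2 edge frequencies in Hz, evenly spaced on the mel scale."""
    if freq_high <= 0.0:
        # Assumption: the default 0.0 stands for "not provided".
        freq_high = sample_rate / 2.0
    lo = hz_to_mel(freq_low, formula)
    hi = hz_to_mel(freq_high, formula)
    step = (hi - lo) / (nfilter + 1)
    return [mel_to_hz(lo + i * step, formula) for i in range(nfilter + 2)]


def triangle_weight(f, left, center, right, normalize=True):
    """Weight of one triangular filter at frequency f (Hz)."""
    if f <= left or f >= right:
        return 0.0
    if f <= center:
        w = (f - left) / (center - left)
    else:
        w = (right - f) / (right - center)
    if normalize:
        # A unit-peak triangle has area (right - left) / 2; rescale it to area 1.
        w *= 2.0 / (right - left)
    return w


# Filter i spans edges[i] .. edges[i + 2] and peaks at edges[i + 1].
edges = band_edges(nfilter=4, sample_rate=16000.0)
left, center, right = edges[0], edges[1], edges[2]
peak_unnormalized = triangle_weight(center, left, center, right, normalize=False)
```

With normalize=False the weight at the center frequency is 1, while with normalize=True the weights of each filter integrate to 1 over its band, so wider high-frequency filters get proportionally lower peaks.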