nvidia.dali.fn.mel_filter_bank(*inputs, **kwargs)

Converts a spectrogram to a mel spectrogram by applying a bank of triangular filters.

The frequency (‘f’) dimension is selected from the input layout. In case of no layout, “f”, “ft”, or “*ft” is assumed, depending on the number of dimensions.

Supported backends
  • ‘cpu’

  • ‘gpu’


input (TensorList) – Input to the operator.

Keyword Arguments
  • bytes_per_sample_hint (int or list of int, optional, default = [0]) –

    Output size hint, in bytes per sample.

    If specified, the operator’s outputs residing in GPU or page-locked host memory will be preallocated to accommodate a batch of samples of this size.

  • freq_high (float, optional, default = 0.0) –

    The maximum frequency.

    If this value is not provided, sample_rate/2 is used.

  • freq_low (float, optional, default = 0.0) – The minimum frequency.

  • mel_formula (str, optional, default = ‘slaney’) –

    Determines the formula that will be used to convert frequencies from hertz to mel and from mel to hertz.

    The mel scale is a perceptual scale of pitches, so there is no single formula.

    The supported values are:

    • slaney, which follows Slaney’s MATLAB Auditory Modelling Work behavior.
      This formula is linear under 1 KHz and logarithmic above this value. The implementation is consistent with Librosa’s default implementation.
    • htk, which follows O’Shaughnessy’s book formula, m = 2595 * log10(1 + (f/700)).
      This value is consistent with the implementation of the Hidden Markov Toolkit (HTK).

  • nfilter (int, optional, default = 128) – Number of mel filters.

  • normalize (bool, optional, default = True) –

    Determines whether to normalize the triangular filter weights by the width of their frequency bands.

    • If set to True, the integral of the filter function is 1.

    • If set to False, the peak of the filter function will be 1.

  • preserve (bool, optional, default = False) – Prevents the operator from being removed from the graph even if its outputs are not used.

  • sample_rate (float, optional, default = 44100.0) – Sampling rate of the audio signal.

  • seed (int, optional, default = -1) –

    Random seed.

    If not provided, it will be populated based on the global seed of the pipeline.