Audio Decoder in DALI

This tutorial shows how to set up a simple pipeline that loads and decodes audio data using DALI. We will use a sample from the Speech Commands Data Set. Although this dataset consists of samples in .wav format, the same procedure works for most well-known digital audio coding formats.

Step-by-Step Guide

  1. Let’s start by importing DALI and a handful of utils.

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
import matplotlib.pyplot as plt
import numpy as np

batch_size = 1
audio_files = "../data/audio"

We use a batch_size of 1 to keep things simple.

  2. Next, let’s implement the pipeline. First, we need to load data from disk (or any other source). readers.file loads the data as well as its labels. For more information, refer to the documentation. As with image data, you can also use other reader operators that are specific to a given dataset or dataset format (see readers.caffe). After loading the input data, the pipeline decodes the audio data. As stated above, the decoders.audio operator can decode most of the well-known audio formats.

    Note: Remember to pass the proper data type (the dtype argument) to the operator. Supported data types can be found in the documentation. If you have 24-bit audio data and you set dtype=INT16, some information from the samples will be lost. The default dtype for this operator is INT16.

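@pipeline_def
def audio_decoder_pipe():
    # Read the encoded files; the labels are not needed here.
    encoded, _ = fn.readers.file(file_root=audio_files)
    # Decode to 16-bit samples; the second output is the sampling rate.
    audio, sr = fn.decoders.audio(encoded, dtype=types.INT16)
    return audio, sr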
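
To see why the dtype note above matters, here is a minimal pure-Python sketch of what happens when a signed 24-bit sample is squeezed into 16 bits. It is an illustration only: the helper names are hypothetical, and the bit-shift conversion is an assumption made for this example, not necessarily how DALI converts samples internally.

```python
def to_int16(sample_24bit):
    """Hypothetical helper: keep the top 16 bits of a signed 24-bit sample."""
    return sample_24bit >> 8


def back_to_24bit(sample_16bit):
    """Hypothetical helper: scale a 16-bit sample back to the 24-bit range."""
    return sample_16bit << 8


# A few signed 24-bit values, including the extremes of the range.
original = [1, 255, 256, -1, 8_388_607, -8_388_608]

reduced = [to_int16(s) for s in original]
restored = [back_to_24bit(s) for s in reduced]

for o, r16, r24 in zip(original, reduced, restored):
    # The low 8 bits of each sample cannot be recovered after narrowing.
    print(f"24-bit {o:>10} -> 16-bit {r16:>7} -> back {r24:>10} (error {o - r24})")
```

Whatever the exact conversion, 16 bits cannot hold 24 bits of precision, so choosing a dtype at least as wide as the source data avoids this kind of loss.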
  3. Now let’s build and run the pipeline.

pipe = audio_decoder_pipe(batch_size=batch_size, num_threads=1, device_id=0)
pipe.build()
cpu_output = pipe.run()

The outputs of decoders.audio consist of a tensor with the decoded data and some metadata (e.g. the sampling rate). To access the metadata, read the second output. In addition, decoders.audio returns data in interleaved format, so we need to reshape the output tensor to display it properly. Here’s how to do that:

audio_data = cpu_output[0].at(0)
sampling_rate = cpu_output[1].at(0)
print("Sampling rate:", sampling_rate, "[Hz]")
print("Audio data:", audio_data)
audio_data = audio_data.flatten()
print("Audio data flattened:", audio_data)
Sampling rate: 16000.0 [Hz]
Audio data: [[  -5]
 [ -95]
 [-156]
 ...
 [ 116]
 [ 102]
 [  82]]
Audio data flattened: [  -5  -95 -156 ...  116  102   82]
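
The reshaping step can be illustrated with plain NumPy. The sketch below assumes the (samples, channels) layout seen in the output above and uses made-up arrays rather than real DALI output, so the values and variable names are hypothetical.

```python
import numpy as np

# Made-up stereo clip: 4 frames x 2 channels (left, right).
stereo = np.array([[1, -1], [2, -2], [3, -3], [4, -4]], dtype=np.int16)

# Flattening keeps the samples interleaved: L0 R0 L1 R1 ...
interleaved = stereo.flatten()
print("Interleaved:", interleaved)

# Selecting a column yields one channel as a 1-D signal.
left = stereo[:, 0]
right = stereo[:, 1]
print("Left: ", left)
print("Right:", right)

# A mono clip has shape (n_samples, 1), so flatten() only drops the channel axis.
mono = np.array([[-5], [-95], [-156]], dtype=np.int16)
print("Mono flattened:", mono.flatten())
```

For the single-channel file used in this tutorial, flatten() is enough. For multi-channel data, selecting a column per channel is usually the safer way to obtain a signal that can be plotted or compared.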


Let’s verify that decoders.audio actually works. The method presented here can also come in handy for debugging a DALI pipeline when something doesn’t go as planned.

We will use an external tool to decode the same data and compare the results against the data decoded by DALI.


The following snippet installs the external dependency (simpleaudio). If you already have it, or don’t want to install it, skip this snippet.

import sys

!{sys.executable} -m pip install simpleaudio

Below is a side-by-side comparison of the decoded data. If you have the simpleaudio module installed, you can run the snippet and see it for yourself.

import simpleaudio as sa

wav = sa.WaveObject.from_wave_file("../data/audio/wav/three.wav")
three_audio = np.frombuffer(wav.audio_data, dtype=np.int16)

print("src: simpleaudio")
print("shape: ", three_audio.shape)
print("data: ", three_audio)
print("src: DALI")
print("shape: ", audio_data.shape)
print("data: ", audio_data)
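print(
    "\nAre the arrays equal?",
    "YES" if np.all(audio_data == three_audio) else "NO",
)

# Plot both waveforms side by side for a visual check.
fig, ax = plt.subplots(1, 2)
ax[0].plot(three_audio)
ax[1].plot(audio_data)
plt.show()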
src: simpleaudio
shape:  (16000,)
data:  [  -5  -95 -156 ...  116  102   82]

src: DALI
shape:  (16000,)
data:  [  -5  -95 -156 ...  116  102   82]

Are the arrays equal? YES
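
If installing simpleaudio is not an option, a similar reference check can be sketched with Python's standard-library wave module. This is an alternative to the approach above and not part of the original tutorial: read_wav_int16 is a hypothetical helper, it only handles 16-bit PCM, and the file is a tiny synthetic clip written to a temporary directory so that the snippet runs without the dataset.

```python
import os
import struct
import tempfile
import wave


def read_wav_int16(path):
    """Return (sampling_rate, n_channels, samples) for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wav.getframerate()
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
    # WAV stores 16-bit samples as little-endian signed integers,
    # interleaved by channel.
    samples = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    return rate, channels, samples


# Write a tiny mono file so the example is self-contained.
expected = [-5, -95, -156, 116, 102, 82]
tmp_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(tmp_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<%dh" % len(expected), *expected))

rate, channels, samples = read_wav_int16(tmp_path)
print("Sampling rate:", rate, "channels:", channels)
print("Round trip OK:", samples == expected)
```

To compare against DALI, the returned samples could be converted with np.array(samples, dtype=np.int16) and checked against audio_data in the same way as in the simpleaudio snippet.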