Getting Started


NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks and an execution engine that accelerates the data pipeline for computer vision and audio deep learning applications.

Input and augmentation pipelines provided by deep learning frameworks typically fit into one of two categories:

  • fast, but inflexible - written in C++, they are exposed as a single monolithic Python object that provides a very specific set and ordering of operations

  • slow, but flexible - a set of building blocks written in either C++ or Python that can be used to compose arbitrary data pipelines, which end up being slow. One of the biggest overheads for this type of data pipeline is the Global Interpreter Lock (GIL) in Python. This forces developers to use multiprocessing, complicating the design of efficient input pipelines.

DALI stands out by providing both performance and flexibility when accelerating different data pipelines. It achieves this by exposing optimized building blocks, which are executed by a simple and efficient engine, and by enabling operations to be offloaded to the GPU (thus enabling scaling to multi-GPU systems).

It is a single library that can be easily integrated into different deep learning training and inference applications.

DALI offers ease of use and flexibility across GPU-enabled systems with direct framework plugins, multiple input data formats, and configurable graphs. DALI can help achieve an overall speedup on deep learning workflows that are bottlenecked on I/O pipelines due to limited CPU cycles. Typically, systems with a high GPU-to-CPU ratio (such as Amazon EC2 P3.16xlarge, NVIDIA DGX1-V or NVIDIA DGX-2) are constrained on the host CPU, thereby under-utilizing the available GPU compute capabilities. DALI significantly accelerates input processing on such dense GPU configurations to improve overall throughput.


At the core of data processing with DALI lies the concept of a data processing pipeline. It is composed of multiple operations connected in a directed graph and contained in an object of class nvidia.dali.Pipeline. This class provides the functions necessary for defining, building and running data processing pipelines.

from nvidia.dali.pipeline import Pipeline

Defining the Pipeline

Let us start with defining a very simple pipeline for a classification task determining whether a picture contains a dog or a kitten. We prepared a directory structure containing pictures of dogs and kittens in our repository.

Our simple pipeline will read images from this directory, decode them and return (image, label) pairs.

The easiest way to create a pipeline is by using the pipeline_def decorator. In the simple_pipeline function we define the operations to be performed and the flow of the computation between them.

  1. Use fn.readers.file to read jpegs (encoded images) and labels from the hard drive.

  2. Use the fn.decoders.image operation to decode images from jpeg to RGB.

  3. Specify which of the intermediate variables should be returned as the outputs of the pipeline.

For more information about pipeline_def, see the documentation.

from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types

image_dir = "data/images"
max_batch_size = 8

@pipeline_def
def simple_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir)
    images = fn.decoders.image(jpegs, device='cpu')

    return images, labels

Building the Pipeline

In order to use the pipeline defined with simple_pipeline, we need to create and build it. This is achieved by calling simple_pipeline, which creates an instance of the pipeline. Then we call build on this newly created instance:

pipe = simple_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0)
pipe.build()

Notice that decorating a function with pipeline_def adds new named arguments to it. They can be used to control various aspects of the pipeline, such as:

  • max batch size,

  • number of threads used to perform computation on the CPU,

  • which GPU device to use (pipeline created with simple_pipeline does not yet use GPU for compute though),

  • seed for random number generation.

For more information about Pipeline arguments, see the Pipeline documentation.
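
To illustrate how a decorator can add named arguments to a function that never declared them, here is a toy sketch in plain Python. It does not use DALI and is not how pipeline_def is implemented; toy_pipeline_def, ToyPipeline and toy_simple_pipeline are hypothetical names invented for this illustration, and the default values are assumptions.

```python
import functools


class ToyPipeline:
    """Hypothetical stand-in for a pipeline object; this is not a DALI class."""

    def __init__(self, outputs, batch_size, num_threads, device_id, seed):
        self.outputs = outputs
        self.batch_size = batch_size
        self.num_threads = num_threads
        self.device_id = device_id
        self.seed = seed


def toy_pipeline_def(func):
    """Wrap func so that it accepts extra keyword arguments it never declared."""

    @functools.wraps(func)
    def wrapper(*args, batch_size, num_threads=1, device_id=None, seed=-1, **kwargs):
        # The wrapped function only describes the outputs; the extra keyword
        # arguments are consumed here and stored on the returned object.
        outputs = func(*args, **kwargs)
        return ToyPipeline(outputs, batch_size, num_threads, device_id, seed)

    return wrapper


@toy_pipeline_def
def toy_simple_pipeline():
    return ("images", "labels")


toy_pipe = toy_simple_pipeline(batch_size=8, num_threads=1, device_id=0)
print(toy_pipe.batch_size, toy_pipe.outputs)
```

The point of the sketch is only that the decorated function can be called with batch_size, num_threads, device_id and seed even though its own signature is empty.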

Running the Pipeline

After the pipeline is built, we can run it to get a batch of results.

pipe_out = pipe.run()
print(pipe_out)
(<nvidia.dali.backend_impl.TensorListCPU object at 0x7f789448ddb0>, <nvidia.dali.backend_impl.TensorListCPU object at 0x7f789448dc70>)

The output of the pipeline, which we saved to the pipe_out variable, is a tuple of 2 elements (as expected, since we specified 2 outputs in the simple_pipeline function). Both of these elements are TensorListCPU objects, each containing a list of CPU tensors.

To show the results (for debugging purposes only; during actual training we would skip this step, as it would make our batch of images do a round trip from GPU to CPU and back), we can convert our data from DALI’s Tensor to a NumPy array. Not every TensorList can be accessed that way, though: TensorList is more general than a NumPy array and can hold tensors with different shapes. To check whether we can convert it to NumPy directly, we can call the is_dense_tensor function of TensorList:

images, labels = pipe_out
print("Images is_dense_tensor: " + str(images.is_dense_tensor()))
print("Labels is_dense_tensor: " + str(labels.is_dense_tensor()))
Images is_dense_tensor: False
Labels is_dense_tensor: True

As it turns out, TensorList containing labels can be represented by a tensor, while the TensorList containing images cannot.
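
The distinction can be sketched in plain Python, assuming the check comes down to whether all samples in the batch share one shape. The helper shapes_are_uniform and the image sizes below are made up for illustration; they are not part of DALI and not taken from the dataset.

```python
def shapes_are_uniform(shapes):
    """Return True when every sample in the batch has the same shape."""
    return len(set(tuple(shape) for shape in shapes)) <= 1


# Labels: one scalar per sample, so all shapes match.
label_shapes = [(1,)] * 8

# Decoded images: height and width usually differ between files.
# These sizes are invented for the example.
image_shapes = [(427, 640, 3), (480, 640, 3), (597, 640, 3)]

print("labels uniform:", shapes_are_uniform(label_shapes))
print("images uniform:", shapes_are_uniform(image_shapes))
```

A batch with uniform shapes can be stacked into one array with an extra leading batch dimension; a batch with mixed shapes has to be handled sample by sample.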

Let us look at the shape and contents of the returned labels.

import numpy as np

labels_tensor = labels.as_tensor()

print(labels_tensor.shape())
print(np.array(labels_tensor))
[8, 1]

In order to see the images, we will need to loop over all tensors contained in TensorList, accessed with its at method.

import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
%matplotlib inline

def show_images(image_batch):
    columns = 4
    rows = (max_batch_size + 1) // (columns)
    fig = plt.figure(figsize = (32,(32 // columns) * rows))
    gs = gridspec.GridSpec(rows, columns)
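    for j in range(rows*columns):
        # Draw each image of the batch in its own grid cell
        plt.subplot(gs[j])
        plt.axis("off")
        plt.imshow(image_batch.at(j))

show_images(images)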

Adding Augmentations

Random Shuffle

As we can see from the example above, the first batch of images returned by our pipeline contains only dogs. That is because we did not shuffle our dataset, and so fn.readers.file returns images in lexicographic order.

Let us make a new pipeline that shuffles the images.

@pipeline_def
def shuffled_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True, initial_fill=21)
    images = fn.decoders.image(jpegs, device='cpu')

    return images, labels

We made 2 changes to the simple_pipeline to obtain the shuffled_pipeline: we added 2 arguments to the fn.readers.file operation.

  • random_shuffle enables shuffling of images in the reader. Shuffling is performed by using a buffer of images read from disk. When the reader is asked to provide the next image, it randomly selects an image from the buffer, outputs it and immediately replaces that spot in the buffer with a freshly read image.

  • initial_fill sets the capacity of the buffer. The default value of this parameter (1000), well suited for datasets containing thousands of examples, is too big for our very small dataset, which contains only 21 images. This could result in frequent duplicates in the returned batch. That is why in this example we set it to the size of our dataset.
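
The buffer mechanism described above can be sketched in plain Python. This is a simplified illustration and not DALI's implementation: buffered_shuffle is a hypothetical helper, it makes a single pass over the data, and draining the buffer in random order at the end is an assumption of the sketch.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order randomized through a fixed-size buffer."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []

    # Initial fill: read items until the buffer reaches its capacity.
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            break

    # Steady state: output a random slot, then refill that slot
    # with the next freshly read item.
    for item in source:
        slot = rng.randrange(len(buffer))
        yield buffer[slot]
        buffer[slot] = item

    # End of data: emit whatever is left in the buffer, in random order.
    rng.shuffle(buffer)
    yield from buffer


filenames = ["img_{:02d}.jpg".format(i) for i in range(21)]
shuffled = list(buffered_shuffle(filenames, buffer_size=21, seed=1234))
print(shuffled[:8])
```

With buffer_size=1 the sketch returns the items in their original order, which shows why a buffer much smaller than the dataset gives only a weak shuffle.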

Let us test the result of this modification.

pipe = shuffled_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0, seed=1234)
pipe.build()
pipe_out = pipe.run()
images, labels = pipe_out
show_images(images)

Now the images returned by the pipeline are shuffled properly.


DALI can not only read images from disk and batch them into tensors, it can also perform various augmentations on those images to improve Deep Learning training results.

One example of such augmentations is rotation. Let us make a new pipeline, which rotates the images before outputting them.

@pipeline_def
def rotated_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True, initial_fill=21)
    images = fn.decoders.image(jpegs, device='cpu')
    rotated_images = fn.rotate(images, angle=10.0, fill_value=0)

    return rotated_images, labels

To do that, we added a new operation to our pipeline: fn.rotate.

As we can see in the documentation, rotate can take multiple arguments, but only one of them beyond input is required - angle tells the operator how much it should rotate images. We also specified fill_value to better visualise the results.

Let us test the newly created pipeline:

pipe = rotated_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0, seed=1234)
pipe.build()
pipe_out = pipe.run()
images, labels = pipe_out
show_images(images)

Tensors as Arguments and Random Number Generation

Rotating every image by 10 degrees is not that interesting. To make a meaningful augmentation, we would like an operator that rotates our images by a random angle in a given range.

Rotate’s angle parameter accepts either a float or a float tensor. The second option, a float tensor, lets us feed the operator a different rotation angle for every image, via a tensor produced by another operation.

Random number generators are examples of operations that one can use with DALI. Let us use fn.random.uniform to make a pipeline that rotates images by a random angle.

@pipeline_def
def random_rotated_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True, initial_fill=21)
    images = fn.decoders.image(jpegs, device='cpu')
    angle = fn.random.uniform(range=(-10.0, 10.0))
    rotated_images = fn.rotate(images, angle=angle, fill_value=0)

    return rotated_images, labels

This time, instead of providing a fixed value for the angle argument, we set it to the output of the fn.random.uniform operator.
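
What "a different angle for every image" means can be sketched without DALI. The helper sample_angles below is hypothetical and uses Python's random module rather than DALI's generator, so the values will not match what fn.random.uniform produces for the same seed.

```python
import random


def sample_angles(batch_size, low=-10.0, high=10.0, seed=None):
    """Draw one rotation angle per sample, uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(batch_size)]


angles = sample_angles(batch_size=8, seed=1234)
for index, angle in enumerate(angles):
    print("sample {}: rotate by {:.2f} degrees".format(index, angle))
```

Fixing the seed makes the sequence of angles reproducible, which is the role the seed argument plays when the pipeline is created.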

Let us check the result:

pipe = random_rotated_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0, seed=1234)
pipe.build()
pipe_out = pipe.run()
images, labels = pipe_out
show_images(images)

This time, the rotation angle is randomly selected from a value range.

Adding GPU Acceleration

DALI offers access to GPU-accelerated operators that can increase the speed of the input and augmentation pipeline and let it scale to multi-GPU systems.

Copying Tensors to GPU

Let us modify the previous example of the random_rotated_pipeline to use the GPU for the rotation.

@pipeline_def
def random_rotated_gpu_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True, initial_fill=21)
    images = fn.decoders.image(jpegs, device='cpu')
    angle = fn.random.uniform(range=(-10.0, 10.0))
    rotated_images = fn.rotate(images.gpu(), angle=angle, fill_value=0)

    return rotated_images, labels

In order to tell DALI that we want to use the GPU, we needed to make only one change to the pipeline. We changed input to the rotate operation from images, which is a tensor on the CPU, to images.gpu() which copies it to the GPU.

pipe = random_rotated_gpu_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0, seed=1234)
pipe.build()
pipe_out = pipe.run()
print(pipe_out)
(<nvidia.dali.backend_impl.TensorListGPU object at 0x7f77f819a070>, <nvidia.dali.backend_impl.TensorListCPU object at 0x7f77f819a0b0>)

pipe_out still contains 2 TensorLists, but this time the first output, the result of the rotate operation, is on the GPU. We cannot access the contents of a TensorListGPU directly from the CPU, so to visualize the result we need to copy it to the CPU by using the as_cpu method.

images, labels = pipe_out
show_images(images.as_cpu())

Important Notice

DALI does not support moving the data from the GPU to the CPU within the pipeline. That is why a CPU operation cannot follow a GPU one.
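
The rule in this notice can be expressed as a small check over the devices of operators along one data path. This is an illustrative sketch only: first_invalid_transition is a hypothetical helper, not a DALI function, and it models a single linear chain of operators rather than a full graph.

```python
def first_invalid_transition(devices):
    """Return the index of the first CPU operator placed after GPU data, or None."""
    on_gpu = False
    for index, device in enumerate(devices):
        if device in ("gpu", "mixed"):
            # Outputs of 'gpu' and 'mixed' operators live on the GPU.
            on_gpu = True
        elif device == "cpu" and on_gpu:
            # A CPU operator would need the data copied back from the GPU.
            return index
    return None


# reader (cpu) -> decoder (cpu) -> rotate (gpu): allowed
print(first_invalid_transition(["cpu", "cpu", "gpu"]))

# reader (cpu) -> decoder (mixed) -> rotate (cpu): not allowed
print(first_invalid_transition(["cpu", "mixed", "cpu"]))
```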

Hybrid Decoding

Sometimes, especially for higher resolution images, decoding images stored in JPEG format may become a bottleneck. To address this problem, nvJPEG and nvJPEG2000 libraries were developed. They split the decoding process between CPU and GPU, significantly reducing the decoding time.

Setting the device parameter of fn.decoders.image to “mixed” enables nvJPEG and nvJPEG2000 support. Other file formats are still decoded on the CPU.

@pipeline_def
def hybrid_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True, initial_fill=21)
    images = fn.decoders.image(jpegs, device='mixed')

    return images, labels

fn.decoders.image with device=mixed uses a hybrid approach of computation that employs both the CPU and the GPU. This means that it accepts CPU inputs, but returns GPU outputs. That is why images objects returned from the pipeline are of type TensorListGPU.

pipe = hybrid_pipeline(batch_size=max_batch_size, num_threads=1, device_id=0, seed=1234)
pipe.build()
pipe_out = pipe.run()
images, labels = pipe_out
show_images(images.as_cpu())

Let us compare the speed of fn.decoders.image for ‘cpu’ and ‘mixed’ backends by measuring speed of shuffled_pipeline and hybrid_pipeline with 4 CPU threads.

from timeit import default_timer as timer

test_batch_size = 64

def speedtest(pipeline, batch, n_threads):
    pipe = pipeline(batch_size=batch, num_threads=n_threads, device_id=0)
    pipe.build()
    # warmup
    for i in range(5):
        pipe.run()
    # test
    n_test = 20
    t_start = timer()
    for i in range(n_test):
        pipe.run()
    t = timer() - t_start
    print("Speed: {} imgs/s".format((n_test * batch)/t))
speedtest(shuffled_pipeline, test_batch_size, 4)
Speed: 3148.9324633140664 imgs/s
speedtest(hybrid_pipeline, test_batch_size, 4)
Speed: 5963.145339307848 imgs/s

As we can see, using GPU-accelerated decoding resulted in a significant speedup.
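
To put a number on it, the ratio of the two throughputs printed above can be computed directly. The figures are copied from the output of this particular run; results on other hardware will differ.

```python
# Throughputs reported by speedtest above, in images per second
cpu_speed = 3148.9324633140664    # shuffled_pipeline, device='cpu'
mixed_speed = 5963.145339307848   # hybrid_pipeline, device='mixed'

speedup = mixed_speed / cpu_speed
print("Speedup: {:.2f}x".format(speedup))
```

For this run the mixed decoder is a little under twice as fast as the CPU decoder.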