Data Loading Bottleneck Detection

You have a PyTorch model and a dataset, and training is under way. But is data loading slowing it down? This is a common question in deep learning: whenever the GPU sits idle waiting for data to be loaded and preprocessed, you are not getting full performance out of expensive hardware.

This quick-start guide shows how to detect data loading bottlenecks by wrapping your existing dataloader.

In this tutorial, you will learn how to:

  • Set up a baseline training scenario

  • Identify if data loading is the bottleneck

  • Measure the potential performance gain if data loading were optimized

1. Setup

First, let’s import the necessary libraries and set up our environment.

[1]:
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, RandomSampler
from torchvision import datasets, transforms
from tqdm import tqdm
import numpy as np

from nvidia.dali.plugin.pytorch.loader_evaluator import LoaderEvaluator

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
Using device: cuda

2. Model & Data Setup

We’ll use an ultra-light model to make data loading the bottleneck. This helps us clearly see the impact of data loading performance.

The DALI_EXTRA_PATH environment variable should point to a local copy of the DALI extra repository. Make sure the release tag that matches your DALI version is checked out.
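
The cell below reads DALI_EXTRA_PATH straight from the environment, so a missing variable surfaces as a bare KeyError. A minimal sketch of a friendlier check follows. The helper name resolve_test_data is hypothetical and not part of DALI; the db/single/jpeg subdirectory is the one this tutorial uses.

```python
import os


def resolve_test_data(env=None, subdir=("db", "single", "jpeg")):
    """Return the image directory under DALI_EXTRA_PATH, or raise a clear error.

    Hypothetical helper for illustration only; not part of DALI.
    """
    env = os.environ if env is None else env
    root = env.get("DALI_EXTRA_PATH")
    if not root:
        raise RuntimeError(
            "DALI_EXTRA_PATH is not set; point it at your DALI extra checkout."
        )
    path = os.path.join(root, *subdir)
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Expected test images in {path}")
    return path
```

Calling resolve_test_data() with no arguments would return the same path that the tutorial builds by hand, while failing early with a readable message if the data is missing.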

[2]:
# Our model
class UltraLightModel(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.classifier = nn.Linear(3 * 224 * 224, num_classes)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


# Dataloader
def create_dataloader(data_path, batch_size=32, num_workers=4, target_len=1000):
    transform = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(
                mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
            ),
        ]
    )

    dataset = datasets.ImageFolder(root=data_path, transform=transform)
    # Our toy dataset is small, so oversample with replacement to artificially
    # enlarge each epoch.
    if target_len > len(dataset):
        sampler = RandomSampler(
            dataset, replacement=True, num_samples=target_len
        )
        shuffle = False  # mutually exclusive with sampler
    else:
        sampler = None
        shuffle = True
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=sampler,
        shuffle=shuffle,
        num_workers=num_workers,
        pin_memory=True,
    )
    return dataloader


# Create your dataloader; for simplicity we'll use a small vision dataset
test_data_root = os.environ["DALI_EXTRA_PATH"]
test_data_path = os.path.join(test_data_root, "db", "single", "jpeg")

dataloader = create_dataloader(
    test_data_path, batch_size=32, num_workers=4, target_len=1000
)
print(f"Samples per epoch: {len(dataloader.sampler)}")
Samples per epoch: 1000
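
Before involving the model at all, it can help to time the loader on its own. The following is a minimal sketch that assumes only that the loader is an iterable of batches. measure_loader is a hypothetical helper, and the slow_batches generator merely simulates a slow loader so the snippet runs without a dataset.

```python
import time


def measure_loader(loader, max_batches=None):
    """Iterate over `loader` without training and return (batches, seconds)."""
    start = time.perf_counter()
    count = 0
    for _ in loader:
        count += 1
        if max_batches is not None and count >= max_batches:
            break
    return count, time.perf_counter() - start


def slow_batches(n, delay=0.01):
    """Stand-in for a real DataLoader: yields n dummy batches, sleeping for each."""
    for i in range(n):
        time.sleep(delay)
        yield i


batches, seconds = measure_loader(slow_batches(5))
print(f"{batches} batches in {seconds:.3f}s ({batches / seconds:.1f} batches/s)")
```

With the objects from this tutorial the call would be measure_loader(dataloader). A loader-only time close to the full epoch time is a first hint that loading dominates.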

3. Training Function

Now let’s create a training function that will train our ultra-light model and collect performance metrics.

[3]:
def train_one_epoch(model, dataloader, criterion, optimizer, device, epoch=0):
    model.train()

    epoch_start_time = time.time()
    progress_bar = tqdm(dataloader, desc=f"Epoch {epoch}")

    for data, target in progress_bar:
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        progress_bar.set_postfix(
            {"Time": f"{time.time() - epoch_start_time:.1f}s"}
        )

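    # CUDA kernels run asynchronously; wait for pending work before reading the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    epoch_time = time.time() - epoch_start_time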
    print(f"Epoch {epoch} - Time: {epoch_time:.2f}s")

    return {"epoch": epoch, "epoch_time": epoch_time}
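
Epoch time alone does not show where the time goes. One possible refinement, sketched here with hypothetical names, times the wait for each batch separately from the training step. The loader and step function are stand-ins; on a GPU you would also need to call torch.cuda.synchronize() before reading the clock, because CUDA kernels run asynchronously.

```python
import time


def timed_epoch(loader, step_fn):
    """Return (data_wait_seconds, step_seconds) for one pass over `loader`."""
    data_wait = 0.0
    step_time = 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent blocked waiting for the next batch
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # forward/backward/optimizer step would go here
        t2 = time.perf_counter()
        data_wait += t1 - t0
        step_time += t2 - t1
    return data_wait, step_time


def fake_loader(n, delay):
    """Stand-in loader that sleeps to simulate decoding and augmentation."""
    for i in range(n):
        time.sleep(delay)
        yield i


data_wait, step_time = timed_epoch(fake_loader(5, 0.02), lambda batch: None)
print(f"data wait: {data_wait:.3f}s, step: {step_time:.3f}s")
```

A data-wait total that dwarfs the step total points at the loader, which is the same conclusion the wrapper-based comparison later in this tutorial reaches with less instrumentation.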

4. Baseline Training

Now let’s set up the training loop and run our baseline training to see how the ultra-light model performs.

[4]:
# Your existing training setup
model = UltraLightModel(num_classes=1000).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train with your existing dataloader
print("Baseline Training (Real Data Loading)")
baseline_metrics = []
for epoch in range(2):
    metrics = train_one_epoch(
        model, dataloader, criterion, optimizer, device, epoch
    )
    baseline_metrics.append(metrics)

baseline_avg_time = np.mean([m["epoch_time"] for m in baseline_metrics])
print(f"Baseline average epoch time: {baseline_avg_time:.2f}s")
Baseline Training (Real Data Loading)
Epoch 0: 100%|██████████| 32/32 [00:03<00:00,  8.56it/s, Time=3.7s]
Epoch 0 - Time: 3.74s
Epoch 1: 100%|██████████| 32/32 [00:03<00:00,  8.50it/s, Time=3.7s]
Epoch 1 - Time: 3.76s
Baseline average epoch time: 3.75s

5. No-Overhead Training (One Line Change!)

Now let’s use the Data Loader Evaluator Tool (LoaderEvaluator) to simulate ideal data loading. In replay mode it caches a small number of batches (here, a tenth of an epoch) and replays them, so serving a batch adds close to zero overhead. Comparing this run against the baseline shows whether training is bottlenecked by data loading and how much could be gained by accelerating it.

[5]:
# Wrap your dataloader with LoaderEvaluator (this is the only change!)
dataloader = LoaderEvaluator(
    dataloader, mode="replay", num_cached_batches=len(dataloader) // 10
)
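
To build intuition for what replaying cached batches means, here is a rough conceptual sketch. It is not DALI's implementation: ReplayLoader is a hypothetical class that caches the first few batches of a sized iterable and then cycles through them for the original number of iterations, so every later batch costs almost nothing to produce.

```python
import itertools


class ReplayLoader:
    """Conceptual illustration only; NOT how LoaderEvaluator is implemented."""

    def __init__(self, loader, num_cached_batches):
        self._length = len(loader)
        # Pay the real loading cost once, for a handful of batches.
        self._cache = list(itertools.islice(iter(loader), num_cached_batches))
        if not self._cache:
            raise ValueError("need at least one cached batch")

    def __len__(self):
        return self._length

    def __iter__(self):
        # Serve cached batches round-robin for the original epoch length.
        return itertools.islice(itertools.cycle(self._cache), self._length)


batches = ["b0", "b1", "b2", "b3", "b4", "b5", "b6"]
replay = ReplayLoader(batches, num_cached_batches=2)
print(list(replay))
```

In the tutorial cell above, len(dataloader) // 10 evaluates to 3 for a 32-batch epoch. Because the same few batches repeat, a run like this is useful for timing only; the resulting model should not be treated as meaningfully trained.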

Now let’s train with the “no-overhead” dataloader to see the performance difference.

[6]:
# Train with the same setup, just different dataloader
print("No-Overhead Training (Cached Data Loading)")
sol_metrics = []
for epoch in range(2):
    metrics = train_one_epoch(
        model, dataloader, criterion, optimizer, device, epoch
    )
    sol_metrics.append(metrics)

sol_avg_time = np.mean([m["epoch_time"] for m in sol_metrics])
print(f"No-Overhead average epoch time: {sol_avg_time:.2f}s")
No-Overhead Training (Cached Data Loading)
Epoch 0: 100%|██████████| 32/32 [00:00<00:00, 198.03it/s, Time=0.2s]
Epoch 0 - Time: 0.16s
Epoch 1: 100%|██████████| 32/32 [00:00<00:00, 191.73it/s, Time=0.2s]
Epoch 1 - Time: 0.17s
No-Overhead average epoch time: 0.17s
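
Epoch times are easier to compare across batch sizes and datasets once converted to throughput. A small sketch using the rounded numbers printed above (1000 samples per epoch, 3.75 s and 0.17 s); the helper name is made up for illustration.

```python
def samples_per_second(num_samples, epoch_seconds):
    """Convert an epoch time into throughput."""
    if epoch_seconds <= 0:
        raise ValueError("epoch_seconds must be positive")
    return num_samples / epoch_seconds


baseline_tput = samples_per_second(1000, 3.75)
ideal_tput = samples_per_second(1000, 0.17)
print(f"Baseline:    {baseline_tput:.0f} samples/s")
print(f"No-overhead: {ideal_tput:.0f} samples/s")
```

The second figure is an upper bound for this model on this hardware: no data loading improvement can push throughput beyond what the compute side sustains.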

6. Results - Is Data Loading Your Bottleneck?

Let’s compare the performance between baseline training and no-overhead training to determine if we have a data loading bottleneck.

[7]:
# Compare performance
speedup = baseline_avg_time / sol_avg_time
time_reduction = baseline_avg_time - sol_avg_time
reduction_percentage = (time_reduction / baseline_avg_time) * 100

print(f"\nPerformance Comparison:")
print(f"Baseline:     {baseline_avg_time:.2f}s per epoch")
print(f"No-Overhead:  {sol_avg_time:.2f}s per epoch")
print(f"Speedup:      {speedup:.2f}x")
print(
    f"Time saved:   {time_reduction:.2f}s per epoch ({reduction_percentage:.1f}%)"
)

# Bottleneck detection
if speedup > 1.5:
    print(f"\n*** DATA LOADING BOTTLENECK DETECTED ***")
    print(
        f"You could cut epoch time by up to {reduction_percentage:.1f}% by optimizing data loading."
    )
elif speedup > 1.1:
    print(f"\n** POTENTIAL DATA LOADING BOTTLENECK **")
    print(f"Consider optimizing data loading for better performance.")
else:
    print(f"\n** NO DATA LOADING BOTTLENECK **")
    print(f"Your training is not significantly limited by data loading.")

Performance Comparison:
Baseline:     3.75s per epoch
No-Overhead:  0.17s per epoch
Speedup:      22.71x
Time saved:   3.59s per epoch (95.6%)

*** DATA LOADING BOTTLENECK DETECTED ***
You could cut epoch time by up to 95.6% by optimizing data loading.
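
The speedup factor and the percentage of time saved describe the same measurement in two ways, related by reduction = 1 - 1/speedup. The short sketch below (function name chosen for illustration) shows what the 1.1x and 1.5x thresholds used in the cell above correspond to in terms of time saved.

```python
def reduction_from_speedup(speedup):
    """Fraction of epoch time saved for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


for s in (1.1, 1.5, 22.71):
    pct = reduction_from_speedup(s) * 100
    print(f"{s:>6.2f}x speedup -> {pct:.1f}% less time per epoch")
```

So the 1.5x threshold means at least a third of each epoch is spent waiting on data, and the 22.71x measured here matches the 95.6% reported above.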

That’s It!

Summary:

  • Wrap your existing dataloader: LoaderEvaluator(your_dataloader, mode="replay")

  • Run the same training code

  • Compare performance to detect bottlenecks

Next Steps:

  • If bottleneck detected: optimize your data loading (increase num_workers, use faster storage, etc.)

  • If no bottleneck: focus optimization efforts elsewhere
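
One way to act on the first suggestion is to time a short run for several worker counts and keep the fastest. The sketch below assumes a caller-supplied factory make_loader(num_workers), which is hypothetical; in this tutorial it could wrap create_dataloader. The fake loader, whose per-batch delay shrinks as workers are added, exists purely so the snippet runs on its own.

```python
import time


def sweep_num_workers(make_loader, worker_counts, max_batches=10):
    """Time up to `max_batches` batches per worker count; return {count: seconds}."""
    results = {}
    for n in worker_counts:
        loader = make_loader(n)
        start = time.perf_counter()
        for i, _ in enumerate(loader, start=1):
            if i >= max_batches:
                break
        results[n] = time.perf_counter() - start
    return results


def fake_loader(num_workers, batches=10, cost=0.02):
    """Stand-in loader: pretends loading parallelizes perfectly across workers."""
    for i in range(batches):
        time.sleep(cost / max(num_workers, 1))
        yield i


timings = sweep_num_workers(fake_loader, worker_counts=[1, 4])
best = min(timings, key=timings.get)
print(timings, "-> best num_workers:", best)
```

With the tutorial's code, make_loader could be lambda n: create_dataloader(test_data_path, num_workers=n). The first batches include worker start-up cost, so longer runs tend to give steadier numbers.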

Key Insight: If no-overhead training is significantly faster, your data loading is the bottleneck!