Data Loading at Scale#

This guide covers how Megatron’s data pipeline works and how to configure it for efficient training at 256 nodes and beyond. At this scale, the primary bottlenecks are index building and barrier synchronization – not raw data bandwidth.

How Data Loading Works#

Understanding the architecture helps explain why specific flags matter.

Megatron builds three index arrays for each dataset: a document index (shuffled document order), a sample index (mapping samples to document offsets), and a shuffle index (final sample permutation). This happens once during initialization:

  1. Rank 0 builds all three indices and writes them to a cache directory as .npy files.

  2. All ranks synchronize at a torch.distributed.barrier().

  3. All other ranks load the cached indices via memory-mapped reads (numpy.load(mmap_mode='r')).

After initialization, data access is read-only and lock-free. Each data-parallel rank consumes a disjoint subset of samples, and no cross-rank coordination is needed during training because all ranks derive the same deterministic permutation from a shared random seed.
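
In pseudocode, the sequence looks like the sketch below. The function and file names are illustrative, not Megatron's actual internals; what matters is the build-on-rank-0, barrier, mmap-load pattern:

```python
import os
import numpy as np
import torch.distributed as dist

def load_or_build_indices(cache_dir, build_fn):
    # Illustrative file names; Megatron's actual cache layout differs.
    names = ("document_index", "sample_index", "shuffle_index")
    paths = [os.path.join(cache_dir, f"{name}.npy") for name in names]

    # Step 1: only rank 0 builds and writes the three indices.
    if dist.get_rank() == 0 and not all(os.path.exists(p) for p in paths):
        for path, array in zip(paths, build_fn()):
            np.save(path, array)

    # Step 2: every other rank blocks here until rank 0 finishes.
    dist.barrier()

    # Step 3: read-only memory-mapped loads; pages are faulted in on
    # demand and shared through the OS page cache.
    return [np.load(p, mmap_mode="r") for p in paths]
```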

The Problem at 256+ Nodes#

Two things break down at large node counts:

  1. Barrier synchronization: All ranks block while rank 0 builds indices. On a 512-node job with 8 GPUs per node, that is 4,095 GPUs sitting idle (see the timing sketch after this list).

  2. Simultaneous memory-mapping: All ranks mmap three large .npy files at once after the barrier, causing a burst of page faults and I/O.
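
To quantify the barrier stall on your cluster, you can time the barrier directly. This is generic instrumentation, not a built-in Megatron facility:

```python
import time
import torch.distributed as dist

t0 = time.monotonic()
dist.barrier()  # non-zero ranks sit here while rank 0 builds indices
stall = time.monotonic() - t0

if dist.get_rank() == 1:  # report from one waiting rank to avoid log spam
    print(f"waited {stall:.1f}s at the index-build barrier")
```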

Baseline: Establish Maximum Achievable Performance#

Before tuning data loading, establish a performance ceiling by running with --mock-data. This bypasses the data pipeline entirely and shows the maximum throughput your configuration can achieve without any dataloader overhead. The gap between --mock-data performance and real-data performance tells you exactly how much time the dataloader is costing you.
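
For example, with hypothetical numbers: if --mock-data sustains 1.20M tokens/s and the same configuration on real data sustains 1.08M tokens/s, the dataloader is costing you 10% of throughput:

```python
mock_tps = 1.20e6  # tokens/s with --mock-data (hypothetical ceiling)
real_tps = 1.08e6  # tokens/s with the real dataset (hypothetical)

overhead = 1 - real_tps / mock_tps
print(f"dataloader overhead: {overhead:.1%}")  # -> 10.0%
```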

Scaling Characteristics#

| Aspect | Behavior | Why it works |
|---|---|---|
| Cross-rank contention | None after init | All index files are read-only; numpy.memmap uses the OS page cache with no locking |
| Sampling determinism | All ranks produce the same permutation | Shared numpy.random.RandomState(seed) with epoch-based seed variation |
| Data-parallel sharding | Each DP rank gets a disjoint subset of samples | No overlap during training; assignment happens in the sampler rather than via extra dataset coordination |
| Index broadcast | Via shared filesystem, not collectives | Rank 0 writes .npy files; other ranks read them. No explicit torch.distributed.broadcast |
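
The determinism and sharding rows come down to a few lines of logic. The sketch below assumes a simple seed-plus-epoch combination and a contiguous per-rank slice; Megatron's actual seed derivation and sampler are more involved:

```python
import numpy as np

def build_shuffle_index(num_samples, seed, epoch):
    # Same seed and epoch on every rank -> identical permutation,
    # with no cross-rank communication.
    rng = np.random.RandomState(seed + epoch)  # assumed seed combination
    index = np.arange(num_samples, dtype=np.int64)
    rng.shuffle(index)
    return index

def rank_sample_ids(step, global_batch_size, dp_rank, dp_world_size):
    # Each data-parallel rank consumes a disjoint slice of the global batch.
    per_rank = global_batch_size // dp_world_size
    start = step * global_batch_size + dp_rank * per_rank
    return range(start, start + per_rank)
```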

Troubleshooting#

Symptom: Training hangs at startup for minutes

  • Likely cause: Rank 0 is building indices while all other ranks wait at the barrier.

  • Fix: Pre-build the cache with tools/prepare_cache.py and enable --dataloader-fast-cache-load.

Symptom: Spike in I/O at training start, then normal

  • Likely cause: All ranks simultaneously memory-mapping index files after the barrier.

  • Fix: Enable --dataloader-defer-npy-index-mmap to overlap index loading with training.
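
Conceptually, deferring the mmap means each index file is opened on first access instead of in a synchronized burst right after the barrier. A minimal sketch of the idea (the class is illustrative, not Megatron's implementation):

```python
import numpy as np

class LazyIndex:
    """Memory-map a .npy index on first access rather than at init,
    spreading page faults into the training loop."""

    def __init__(self, path):
        self._path = path
        self._array = None

    def __getitem__(self, idx):
        if self._array is None:
            self._array = np.load(self._path, mmap_mode="r")
        return self._array[idx]
```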

Symptom: Slow data loading during training (not just startup)

  • Run with --mock-data to confirm the dataloader is the bottleneck.

  • If startup, not steady-state throughput, is the main issue, try --dataloader-defer-npy-index-mmap.

  • If you are blending many dataset prefixes, try --per-dataset-sequences-path.

  • Test with --no-mmap-bin-files – the optimal setting depends on your filesystem.
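
The two modes differ as follows: with mmap, token reads go through the OS page cache; without it, each read is an explicit file read. A sketch assuming int32 tokens (the actual dtype depends on how the dataset was preprocessed):

```python
import numpy as np

def read_tokens_mmap(path, offset, length):
    # Page-cache backed: fast when pages stay resident, but can cause
    # page-fault storms on some network filesystems.
    arr = np.memmap(path, dtype=np.int32, mode="r")
    return np.array(arr[offset:offset + length])

def read_tokens_explicit(path, offset, length):
    # Explicit read: predictable I/O pattern, no page faults.
    with open(path, "rb") as f:
        f.seek(offset * 4)  # int32 -> 4 bytes per token
        return np.frombuffer(f.read(length * 4), dtype=np.int32)
```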