GetBatch: Distributed Multi-Object Retrieval
Paper: GetBatch: Distributed Multi-Object Retrieval for ML Data Loading - implementation, benchmarks, and discussion.
GetBatch is AIStore’s high-performance API for retrieving multiple objects and/or archived files in a single request. Behind the scenes, GetBatch assembles the requested data from across the cluster and delivers the result as a continuous serialized stream.
Regardless of retrieval source (in-cluster objects, remote objects, or shard extractions), GetBatch always preserves the exact ordering of request entries in both streaming and buffered modes.
Other supported capabilities include:
- Fetching thousands of objects in strict user-specified order
- Extracting specific files from distributed shard archives (TAR, TAR.GZ, TAR.LZ4, ZIP)
- Cross-bucket retrieval in a single request
- Graceful handling of missing data
- Streaming or buffered delivery
GetBatch is implemented as an eXtended Action (job). Internally, the job is codenamed
x-moss. The respective API endpoint is: /v1/ml/moss.
Note: buffered mode always returns both metadata (that describes the output) and the resulting serialized archive containing all requested data items.
Note: for TAR.GZ, both .tgz and .tar.gz are accepted (and interchangeable) aliases - both denote exactly the same format.
A Note on HTTP Semantics
GetBatch uses HTTP GET with a JSON request body, which is:
- Permitted by RFC 9110 - HTTP semantics allow message bodies in GET requests.
- Necessary for this API, as the list of requested objects can contain thousands of entries that would exceed URL length limits.
- Semantically correct - the operation is idempotent (pure data retrieval with no server-side state changes).
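To make the body-size point concrete, the following local sketch serializes a hypothetical 10,000-entry request. The `in`, `strm`, and `coer` field names follow the control structures described later in this document; `objname` is an assumed key used purely for illustration.

```python
import json

# Build a GetBatch-style request body with many entries; "objname"
# is an assumed key for illustration purposes.
entries = [{"objname": f"sample_{i:06d}.wav"} for i in range(10_000)]
req = {"in": entries, "strm": True, "coer": True}
body = json.dumps(req)

# A 10K-entry list is far too large for a URL query string (practical
# URL limits are commonly around 8 KB), hence the JSON request body.
assert len(body) > 8 * 1024
```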
The rest of this document is structured as follows:
Table of Contents
- Supported APIs
- Terminology
- When to Use GetBatch
- When NOT to Use GetBatch
- Go API Structure
- Python Integrations
- Usage Examples
- Performance Characteristics
- Error Handling
- Configuration
- Output Formats
- Naming Conventions
- Monitoring & Observability
- Advanced Use Cases
- Limitations & Future Work
- References
Supported APIs
GetBatch is typically called from:
- Go services via the api/ml.go client bindings,
- Python via the AIStore SDK, and
- Third-party tooling such as Lhotse.

The respective Go and Python usage examples follow in the sections below.
Terminology
When to Use GetBatch
ML Training Pipelines
- Loading training data in deterministic order (reproducible epochs)
- Fetching 1000s-100Ks of objects per training batch
- Consuming sharded datasets where each shard contains many samples
Distributed Shard Processing
- Extracting specific files from archives distributed across the cluster
- Processing subsets of large TAR/ZIP collections
- Parallel extraction from 1000s of shards
Ordered Batch Retrieval
- Any workflow requiring strict ordering guarantees
- Sequential processing pipelines
- Deterministic data sampling
When NOT to Use GetBatch
- Single objects => Use regular GET
- Small batches (<16 entries) => Overhead not justified; use parallel GETs
- Order doesn’t matter => Use list-objects + concurrent GETs for better parallelism
- Real-time random access => GetBatch optimizes for sequential streaming
Whether a given batch size qualifies as “small” ultimately depends on the workload and cluster; any specific numbers in this document are rules of thumb rather than prescriptions or hard limits.
Go API Structure
Request: apc.MossReq
Field Details:
Request Entry: apc.MossIn
Each entry in the in array can specify:
Response: apc.MossResp
Streaming mode (strm: true):
- Single HTTP response body containing TAR stream
- Files appear in exact order requested
- No separate metadata structure
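Client-side, the streaming TAR can be consumed entry-by-entry as it arrives. The sketch below synthesizes such a stream in memory (no cluster assumed) and shows that, because entries arrive in request order, no client-side reordering is needed:

```python
import io
import tarfile

# Simulate the single TAR stream that streaming mode returns; in real
# use this buffer would be the HTTP response body.
buf = io.BytesIO()
requested = ["bucket/a.bin", "bucket/b.bin", "bucket/c.bin"]
with tarfile.open(fileobj=buf, mode="w") as tw:
    for name in requested:
        data = name.encode()
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tw.addfile(info, io.BytesIO(data))
buf.seek(0)

# Consume sequentially in stream mode ("r|"): entries appear in the
# exact order they were requested.
with tarfile.open(fileobj=buf, mode="r|") as tr:
    received = [member.name for member in tr]

assert received == requested
```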
Buffered mode (strm: false):
- Multipart HTTP response with two parts:
  - Metadata part (MossResp JSON)
  - Data part (TAR archive)
Response Entry: apc.MossOut
Python Integrations
GetBatch also offers robust integration with Python data pipelines, with official support in both the AIStore Python SDK and third-party libraries:
1. AIStore Python SDK
The Python SDK provides a Batch class that wraps the /v1/ml/moss endpoint with a Pythonic fluent API.
Key features:
- Pydantic models (MossReq, MossIn, MossOut) that mirror Go structs exactly
- Fluent interface for building batch requests
- Automatic TAR stream extraction
- Support for archpath (shard extraction), opaque metadata, and cross-bucket requests
The SDK mirrors Go structures, with minor naming conventions:
This mapping helps translate examples between Go and Python.
Basic usage:
Batch class methods:
- add(obj, archpath=None, opaque=None, start=None, length=None) - add an object, with advanced parameters
- get(raw=False, decode_as_stream=False, clear_batch=True) - execute and return a generator of (MossOut, bytes) tuples
- clear() - clear the batch for reuse
Response structure:
- Streaming mode (streaming_get=True, default): returns a TAR stream; MossOut metadata is inferred from the request
- Multipart mode (streaming_get=False): returns server-validated MossOut with actual sizes and errors
See Python SDK Batch API for complete documentation.
2. Lhotse Integration
Lhotse is a speech/audio data toolkit used by NVIDIA NeMo and other frameworks. It includes native AIStore support via AISBatchLoader.
How it works:
- CutSet Manifests - Lhotse manages audio/feature manifests with AIStore URLs
- Batch Loading - AISBatchLoader collects all URLs from a CutSet batch
- GetBatch Execution - issues a single batch request via the AIStore Python SDK
- Archive Extraction - automatically extracts files from sharded archives (TAR/TGZ)
- In-Memory Injection - returns a CutSet with data loaded into memory
Usage:
See also:
A complete, runnable example of batch-loading audio from AIStore with Lhotse is available here:
Archive extraction example:
Lhotse URLs like ais://mybucket/shard-0000.tar.gz/audio/sample_42.wav are automatically:
- Split into object (shard-0000.tar.gz) + archpath (audio/sample_42.wav)
- Sent to GetBatch with the archpath parameter
- Extracted server-side by AIStore
- Returned as raw audio bytes directly to the training loop
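A heuristic sketch of that object/archpath split follows; the actual Lhotse and SDK parsing logic may differ, and the archive-extension list is taken from the formats named earlier in this document.

```python
# Order matters: check compound extensions (.tar.gz) before .tar.
ARCH_EXTS = (".tar.gz", ".tgz", ".tar.lz4", ".tar", ".zip")

def split_shard_url(url: str):
    """Split an ais:// URL into (bucket, object, archpath-or-None)."""
    path = url.removeprefix("ais://")
    bucket, _, rest = path.partition("/")
    for ext in ARCH_EXTS:
        marker = ext + "/"
        if marker in rest:
            obj, _, archpath = rest.partition(marker)
            return bucket, obj + ext, archpath
    return bucket, rest, None  # plain object, nothing to extract

print(split_shard_url("ais://mybucket/shard-0000.tar.gz/audio/sample_42.wav"))
# → ('mybucket', 'shard-0000.tar.gz', 'audio/sample_42.wav')
```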
Architecture:
Key benefits:
- Single batch request instead of N individual GETs
- Server-side extraction from sharded archives
- Deterministic ordering for reproducible training
- Zero client-side decompression overhead
See also:
- AIStore in NeMo workflows (data loading section)
3. NeMo Framework Integration
NVIDIA NeMo is an end-to-end platform for building and training state-of-the-art AI models, including Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS).
GetBatch is now integrated into the Lhotse-based ASR dataloader. Instead of fetching individual audio files from AIStore during training, the dataloader batches all required samples for an epoch or mini-batch into a single GetBatch request. This reduces network overhead and improves data-loading throughput for large-scale ASR training.
Enabling GetBatch:
A single environment variable activates batch loading for ASR training pipelines using Lhotse+AIStore. No code changes required.
How it works:
- Dataloader collects all audio file paths needed for the current batch
- Issues single GetBatch request to AIStore (replaces N individual GETs)
- AIStore returns TAR archive with all samples in order
- Dataloader extracts and feeds samples to the training loop
This same pattern can be integrated in other NeMo training pipelines (NLP, TTS, multimodal) where datasets are stored in AIStore, providing similar performance benefits for data-intensive workloads.
See also:
4. PyTorch Integration
The AIStore PyTorch Plugin provides AISBatchIterDataset, an iterable-style dataset that uses GetBatch API for efficient multi-worker data loading with automatic batching and streaming support.
Usage Examples
Note: curl examples in this section are purely illustrative - do not copy/paste.
Example 1: Retrieve Plain Objects
Result: batch.tar containing:
Example 2: Extract Files from Shards
Result: extracted.tar containing:
Example 3: Cross-Bucket Retrieval
The request bucket (default-bucket in the URL) is used when bucket is omitted in an in entry.
Result: TAR containing objects from three different buckets in one request.
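The default-bucket rule can be sketched as a small resolution function; the `bucket`/`objname` key names here are assumptions for illustration, not the SDK's exact surface.

```python
# Entries that omit "bucket" fall back to the bucket named in the
# request URL; an explicit per-entry bucket always wins.
def resolve_bucket(entry: dict, request_bucket: str) -> str:
    return entry.get("bucket") or request_bucket

entries = [
    {"objname": "a.bin"},                     # no bucket → default
    {"objname": "b.bin", "bucket": "other"},  # explicit bucket wins
]
resolved = [resolve_bucket(e, "default-bucket") for e in entries]
assert resolved == ["default-bucket", "other"]
```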
Example 4: Handle Missing Data Gracefully
Response metadata shows which items failed:
TAR contains:
Performance Characteristics
Throughput by Workload Type
Note: Actual throughput will vary significantly based on object sizes, network topology, storage backend, CPU capability, and cluster configuration. Numbers below are purely indicative ranges rather than guaranteed performance targets.
Note: Archived file extraction from compressed formats is CPU-intensive. A single target extracting from 1000s of TAR/ZIP shards will see significant CPU load.
Latency Components
First-byte latency: 50-500ms (can vary based on cluster size and load)
- Designated Target (DT) selection
- Peer coordination
- First data arrival
Streaming throughput: Wire-speed after first byte
- Limited by network bandwidth or disk I/O
- No additional per-object overhead once streaming starts
Memory & Resource Usage
Memory: Bounded by DT capacity + load-based throttling
- System monitors memory pressure
- Automatically throttles or rejects (429) new requests under stress
- See Monitoring GetBatch for observability
CPU: Varies by workload
- Plain objects: minimal CPU (file I/O bound)
- Compressed archives: moderate CPU (decompression)
- Many small files: higher CPU (archive parsing overhead)
Error Handling
Strict Mode (coer: false)
Behavior: First error aborts entire request
Use when:
- Data completeness is critical
- Prefer fail-fast over partial results
- Small batches where retry cost is low
Error response: HTTP 4xx/5xx, no partial data
Graceful Mode (coer: true) ✅ Recommended
Behavior: Continue processing, mark missing items
Use when:
- Large batches (1000s+ items)
- Some missing data is acceptable
- Want to maximize throughput despite occasional 404s
Missing items:
- Appear in the TAR under __404__/bucket/object with size=0
- Metadata includes err_msg describing the failure
- Extracting the TAR shows all missing items grouped under __404__/
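Client-side handling of graceful-mode output reduces to a prefix check. The sketch below builds a TAR resembling coer: true output locally (one delivered object, one zero-size miss under __404__/) and separates the two:

```python
import io
import tarfile

# Synthesize graceful-mode output: one real object plus one missing
# item recorded under the __404__/ prefix with size=0.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tw:
    data = b"payload"
    ok = tarfile.TarInfo(name="bucket/good.bin")
    ok.size = len(data)
    tw.addfile(ok, io.BytesIO(data))
    tw.addfile(tarfile.TarInfo(name="__404__/bucket/gone.bin"))  # size=0
buf.seek(0)

# Separate delivered items from misses by prefix.
found, missed = [], []
with tarfile.open(fileobj=buf, mode="r") as tr:
    for m in tr:
        (missed if m.name.startswith("__404__/") else found).append(m.name)

assert found == ["bucket/good.bin"]
assert missed == ["__404__/bucket/gone.bin"]
```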
Soft error limit: Configurable per work item (default: 6)
- Prevents cascading failures
- Aborts work item after N transient errors
- See Configuration section below
Configuration
get_batch.max_soft_errs (default: 6)
Maximum transient errors per work item before aborting.
When to increase:
- Large batches with expected missing data
- Tolerate more GFN (get-from-neighbor) fallbacks
- Unstable network environments
When to decrease:
- Strict availability requirements
- Fail faster on systemic issues
get_batch.warmup_workers (default: 2)
Pagecache warming pool size (best-effort read-ahead).
When to increase:
- Fast NVMe storage
- Reduce first-access latency
- CPU/memory headroom available
When to disable (set to -1):
- High memory pressure
- Slow disks where read-ahead adds no value
- CPU-constrained environments
To disable warmup/look-ahead operation:
Output Formats
GetBatch supports multiple archive formats via the mime field:
Recommendation: Use .tar for maximum throughput unless network bandwidth is constrained.
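The bandwidth/CPU tradeoff behind this recommendation can be observed locally: gzip shrinks the archive (saving wire bandwidth) at the cost of compression work on the server and decompression work on the client. A self-contained sketch using Python's tarfile:

```python
import io
import tarfile

payload = b"0123456789" * 10_000  # compressible sample data

def pack(mode: str) -> int:
    """Pack the payload into an in-memory archive; return its size."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tw:
        info = tarfile.TarInfo("sample.bin")
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))
    return buf.getbuffer().nbytes

plain, gz = pack("w"), pack("w:gz")
# Plain TAR is larger on the wire but costs no (de)compression CPU,
# which is why it maximizes throughput on fast networks.
assert gz < plain
```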
Naming Conventions
Default Naming (onob: false)
Files in output TAR include bucket prefix:
Object-Only Naming (onob: true)
Files in output TAR omit bucket:
Archived Files
When extracting from shards with archpath:
Default (onob: false):
Object-only (onob: true):
Monitoring & Observability
GetBatch exposes Prometheus metrics for:
- Throughput (objects vs archived files)
- Resource pressure (throttling, RxWait stalls)
- Error rates (soft vs hard errors)
See: Monitoring GetBatch for detailed metrics, PromQL queries, and operational guidance.
CLI monitoring example
The CLI will render CtlMsg output on multiple lines when it includes multiple aggregated messages.
Advanced Use Cases
ML Training: Deterministic Epoch Loading
Distributed Processing: Scatter-Gather Pattern
Limitations & Future Work
Current limitations:
- Range reads (start/length) not yet implemented
- Shard extraction is sequential within each archive
Roadmap:
- Range read support for partial object retrieval
- Multi-file extraction from single shard (performance optimization)
- Finer-grained work item abort controls
References
- Release Notes 4.0 - GetBatch introduction
- Go API - API control structures
- Python Integrations
- Monitoring GetBatch - Metrics and observability
- AIS CLI - Command-line tools