Monitoring get-batch
Overview
GetBatch (a.k.a. get-batch or x-moss) is the high-performance multi-object retrieval subsystem.
Paper: GetBatch: Distributed Multi-Object Retrieval for ML Data Loading - implementation, benchmarks, and discussion.
It streams objects and archived files in strict user-specified order, assembling them on the designated target (DT) and serving them as a TAR archive (buffered or streaming).
TAR is the default output format; the compressed alternatives (TAR.GZ, ZIP, and TAR.LZ4) are also fully supported.
See also:
- GetBatch: Multi-Object Retrieval API for overview, capabilities and APIs, operational guidance, usage, Go and Python examples, and more.
- Release Notes 4.0 - GetBatch introduction.
Unlike ordinary GET requests, get-batch:
- pulls objects concurrently from many targets,
- relies on intra-cluster streaming via SharedDM,
- performs archive extraction (for shards),
- and obeys load-based throttling and soft-error recovery.
This page documents how to observe and monitor get-batch jobs at scale.
Key Metrics
All metrics below are per-target Prometheus counters/totals.
Use rate() or increase() over a window for meaningful rates.
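Because these are monotonically increasing counters, a raw value says little on its own; what matters is the change over a window. The sketch below mimics, in simplified form, what increase() and rate() derive from a series of counter samples, including the usual convention of treating a drop in value as a counter reset (for example, after a target restart). The function names and the sample layout are hypothetical, and real Prometheus also extrapolates to the window boundaries, which this sketch omits.

```python
from typing import Sequence, Tuple

# Hypothetical sample layout: (unix_timestamp_seconds, counter_value)
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # After a reset the counter restarts from zero, so the whole
            # current value counts as new increase.
            total += curr
    return total


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second rate between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return 0.0
    return counter_increase(samples) / elapsed
```

The reset handling is why querying a raw counter (or subtracting two raw values by hand) can mislead after a restart, while a windowed rate stays meaningful.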
Workload Volume & Mix
These metrics count the payload actually delivered, excluding error placeholders.
Latency, Throttling & Backpressure
Interpretation:
- High rxwait → the DT is waiting on its peers: look at clustering, peer-to-peer streaming, SharedDM (SDM) performance, or transient disconnects.
- High throttle → DT-level resource pressure (memory load, CPU load, configured Advice).
Taken together, these two metrics help answer the question “Why is my job slower?”
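The interpretation above can be turned into a rough triage rule. The sketch below is a minimal illustration, assuming you have already computed the increase of the rxwait and throttle totals over the same window and in the same time unit. The function name, the 2x dominance ratio, and the returned labels are hypothetical choices made for this example; they are not part of AIStore.

```python
def classify_slowdown(rxwait: float, throttle: float, ratio: float = 2.0) -> str:
    """Label the dominant source of get-batch delay over one window.

    rxwait and throttle are increases of the respective totals over the
    same window, in the same time unit. The ratio threshold is an
    arbitrary illustrative choice, not an AIStore default.
    """
    if rxwait <= 0 and throttle <= 0:
        return "neither"
    if rxwait >= ratio * throttle:
        # Time is mostly spent waiting on peers (streaming, disconnects).
        return "peer-induced"
    if throttle >= ratio * rxwait:
        # Time is mostly spent backing off under DT memory/CPU pressure.
        return "resource-induced"
    return "mixed"
```

For instance, `classify_slowdown(10.0, 1.0)` yields "peer-induced", while comparable values such as `classify_slowdown(5.0, 4.0)` yield "mixed", which suggests looking at both peers and DT load.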
Soft vs Hard Errors
Soft errors reflect recoverable situations; hard errors reflect a request-level failure or a 429 (“too-many-requests”) rejection.
PromQL examples
For general PromQL examples, refer to the Observability: Prometheus document.
This section highlights only a few less obvious queries that can help analyze GetBatch latency, in particular to distinguish peer-induced stalls from resource-induced throttling.