Monitoring get-batch
GetBatch (a.k.a. get-batch or x-moss) is the high-performance multi-object retrieval subsystem.
Paper: GetBatch: Distributed Multi-Object Retrieval for ML Data Loading - implementation, benchmarks, and discussion.
It streams objects and archived files in strict user-specified order, assembling them on the designated target (DT) and serving them as a TAR archive (buffered or streaming).
TAR is the default output format; the compressed options TAR.GZ, ZIP, and TAR.LZ4 are also fully supported.
See also:
Unlike ordinary GET requests, get-batch:
This page documents how to observe and monitor get-batch jobs at scale.
All metrics below are per-target Prometheus counters/totals.
Use rate() or increase() over a window for meaningful rates.
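To make the counter semantics concrete, here is a minimal Python sketch of what an increase or rate over a window means for a cumulative counter. It is an illustration only: the function names and sample layout are hypothetical, and unlike Prometheus's own `rate()` and `increase()` it does not extrapolate to the window boundaries. It does mimic the reset handling, where a drop in value is treated as a restart from zero.

```python
from typing import Sequence, Tuple

# One scrape: (unix timestamp in seconds, cumulative counter value).
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Sum the growth between consecutive samples of a cumulative counter.

    A drop in value is treated as a counter reset (for example, a target
    restart): the counter restarted from zero, so the new value itself is
    the growth since the reset.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def counter_rate(samples: Sequence[Sample]) -> float:
    """Average per-second rate over the span covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Four scrapes, 15 seconds apart, with a reset between the 2nd and 3rd.
samples = [(0.0, 100.0), (15.0, 160.0), (30.0, 20.0), (45.0, 80.0)]
window_increase = counter_increase(samples)  # 60 + 20 + 60
window_rate = counter_rate(samples)          # increase divided by 45 seconds
```

The raw counter value on its own says little; only the difference across a window (and that difference divided by the window length) is comparable between targets and over time.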
These represent actual payload delivered, not including error placeholders.
Interpretation:
These two combined provide a full picture of “Why is my job slower?”
Soft errors reflect recoverable situations; hard errors reflect request-level failure or 429 (“too-many-requests”) rejection.
For PromQL examples, refer to the Observability: Prometheus document.
In this section, we highlight only a few less obvious queries that may help analyze GetBatch latency, in particular to distinguish peer-induced stalls from resource-induced throttling.
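The comparison can be sketched in Python, assuming the per-window increases of the relevant counters have already been obtained (for example, via `increase()`). The field names, the `WindowDeltas` type, and the 1 ms threshold are all hypothetical placeholders chosen for illustration; substitute the actual metric names exposed by your targets.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class WindowDeltas:
    """Per-window increases of three cumulative counters (hypothetical names)."""
    peer_wait_ns: float  # assumed: time spent waiting on peer targets, in ns
    throttle_ns: float   # assumed: time spent throttled on local resources, in ns
    requests: float      # completed get-batch requests in the same window


def per_request_ms(d: WindowDeltas) -> Tuple[float, float]:
    """Average (peer wait, throttle) time per request, in milliseconds."""
    if d.requests <= 0:
        return (0.0, 0.0)
    return (d.peer_wait_ns / d.requests / 1e6,
            d.throttle_ns / d.requests / 1e6)


def dominant_cause(d: WindowDeltas, min_ms: float = 1.0) -> str:
    """Name the larger latency contributor, or 'neither' below the threshold."""
    peer_ms, throttle_ms = per_request_ms(d)
    if max(peer_ms, throttle_ms) < min_ms:
        return "neither"
    if peer_ms >= throttle_ms:
        return "peer-induced stall"
    return "resource-induced throttling"


# Example window: 1000 requests, 6 s of peer wait, 1 s of throttling in total.
window = WindowDeltas(peer_wait_ns=6e9, throttle_ns=1e9, requests=1000)
cause = dominant_cause(window)
```

Normalizing by the request count keeps the comparison meaningful when load differs between targets or between time windows.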