Monitoring get-batch | NVIDIA AIStore

Overview

GetBatch (a.k.a. get-batch or x-moss) is the high-performance multi-object retrieval subsystem of AIStore.

Paper: GetBatch: Distributed Multi-Object Retrieval for ML Data Loading - implementation, benchmarks, and discussion.

It streams objects and archived files in strict user-specified order, assembling them on the designated target (DT) and serving them as a TAR archive (buffered or streaming).

TAR is the default output format; the compressed formats TAR.GZ, ZIP, and TAR.LZ4 are also fully supported.

See also:

GetBatch: Multi-Object Retrieval API for overview, capabilities and APIs, operational guidance, usage, Go and Python examples, and more.

Release Notes 4.0 - GetBatch introduction.

Unlike ordinary GET requests, get-batch:

pulls objects concurrently from many targets,
relies on intra-cluster streaming via SharedDM (SDM),
performs archive extraction (for shards),
and obeys load-based throttling and soft-error recovery.

This page documents how to observe and monitor get-batch jobs at scale.

Key Metrics

All metrics below are per-target Prometheus counters/totals. Use rate() or increase() over a window for meaningful rates.
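To make the advice about `rate()` concrete, here is a minimal Python sketch of the underlying idea: sum the increases between consecutive scrapes and divide by the elapsed time. It is a simplification (PromQL's `rate()` also extrapolates to the window boundaries), and the sample values are hypothetical.

```python
def counter_rate(samples):
    """Approximate a per-second rate from (timestamp_seconds, counter_value) samples.

    Follows the idea behind PromQL rate(): sum the increases between
    consecutive samples, treating any decrease as a counter reset.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (e.g. target restart),
        # so the whole current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes of getbatch.n taken 15 seconds apart.
scrapes = [(0, 100), (15, 160), (30, 250), (45, 400)]
print(f"work items per second: {counter_rate(scrapes):.2f}")
```

A raw counter value on its own says little; the slope between scrapes is what shows current load.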

Workload Volume & Mix

Metric	Description
`getbatch.n`	Total number of get-batch work items (WIs) processed (successful or failed).
`getbatch.obj.n`	Number of whole objects retrieved and delivered in the output TAR.
`getbatch.file.n`	Number of files extracted from shard archives.
`getbatch.obj.size`	Cumulative size (bytes) of whole objects retrieved.
`getbatch.file.size`	Cumulative size (bytes) of shard-extracted files.

These represent actual payload delivered, not including error placeholders.
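As an illustration of how these five counters combine, the sketch below derives the workload mix from two snapshots of the counters. The snapshot dictionaries and the names of the derived fields are hypothetical; only the metric names come from the table above.

```python
def workload_mix(start, end):
    """Summarize get-batch payload delivered between two counter snapshots."""
    delta = {name: end[name] - start[name] for name in start}
    objs = delta["getbatch.obj.n"]
    files = delta["getbatch.file.n"]
    items = delta["getbatch.n"]
    entries = objs + files
    return {
        # Share of delivered entries that were extracted from shards.
        "file_fraction": files / entries if entries else 0.0,
        "avg_obj_bytes": delta["getbatch.obj.size"] / objs if objs else 0.0,
        "avg_file_bytes": delta["getbatch.file.size"] / files if files else 0.0,
        # Delivered entries per processed work item (failed items included).
        "entries_per_work_item": entries / items if items else 0.0,
    }


# Hypothetical snapshots of one target's counters.
before = {"getbatch.n": 0, "getbatch.obj.n": 0, "getbatch.file.n": 0,
          "getbatch.obj.size": 0, "getbatch.file.size": 0}
after = {"getbatch.n": 10, "getbatch.obj.n": 300, "getbatch.file.n": 100,
         "getbatch.obj.size": 3_000_000, "getbatch.file.size": 50_000}
print(workload_mix(before, after))
```

A shift in `file_fraction` or in the average sizes is often the first hint that a change in latency comes from a change in workload rather than from the cluster.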

Latency, Throttling & Backpressure

Metric	Description
`getbatch.rxwait.ns`	Total nanoseconds the DT spent waiting to receive missing/out-of-order entries from peer targets. Reflects SDM/peer/network-induced stalls.
`getbatch.throttle.ns`	Total nanoseconds slept due to load-based throttling (memory/CPU pressure). Intentional back-pressure applied before serving the next get-batch request.

Interpretation:

High rxwait → the cause lies in clustering, peer-to-peer streaming, SDM performance, or transient disconnects.
High throttle → DT-level resource pressure (memory load, CPU load, configured Advice).

Together, these two metrics help answer the question “Why is my job slower?”
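Because both counters are in nanoseconds, dividing their increase by the window length gives a dimensionless "fraction of time stalled" that is easier to compare. The sketch below does that and reports which side dominates. The 5% threshold is a hypothetical choice, and the fractions may exceed 1.0 if waits from concurrent requests are summed on the same target (an assumption not confirmed here).

```python
NS_PER_SECOND = 1_000_000_000


def stall_fractions(rxwait_ns_delta, throttle_ns_delta, window_seconds):
    """Convert nanosecond counter increases into stalled-seconds per second."""
    window_ns = window_seconds * NS_PER_SECOND
    return rxwait_ns_delta / window_ns, throttle_ns_delta / window_ns


def dominant_stall(rx_fraction, throttle_fraction, threshold=0.05):
    """Name the larger stall source, or 'neither' if both are below threshold."""
    if max(rx_fraction, throttle_fraction) < threshold:
        return "neither"
    return "rxwait" if rx_fraction >= throttle_fraction else "throttle"


# Hypothetical increases observed on one target over a 5-minute window.
rx, thr = stall_fractions(30 * NS_PER_SECOND, 3 * NS_PER_SECOND, 300)
print(f"rxwait={rx:.2%} throttle={thr:.2%} dominant={dominant_stall(rx, thr)}")
```

A result of "rxwait" points toward peers and the network; "throttle" points toward memory or CPU pressure on the DT itself.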

Soft vs Hard Errors

Metric	Description
`err_soft.getbatch.n`	Number of soft error events (GFN, i.e. get-from-neighbor, recoveries and missing-path insertions). A single work item (WI) may contribute multiple increments.
`err_getbatch.n`	Number of hard errors (unrecoverable WIs, failures including 429 rejections).

Soft errors reflect recoverable situations; hard errors reflect request-level failure or 429 (“too-many-requests”) rejection.
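One way to put the two error counters in context is to normalize them by the number of work items processed in the same window. The sketch below is illustrative only: the function and field names are hypothetical, and the results are ratios rather than true percentages, since a single work item can produce several soft errors and it is not stated here whether 429 rejections are also counted in `getbatch.n`.

```python
def error_ratios(work_items, soft_errors, hard_errors):
    """Normalize error counter increases by work items in the same window."""
    if work_items <= 0:
        return {"soft_per_work_item": 0.0, "hard_per_work_item": 0.0}
    return {
        # May exceed 1.0: one work item can increment the soft counter repeatedly.
        "soft_per_work_item": soft_errors / work_items,
        "hard_per_work_item": hard_errors / work_items,
    }


# Hypothetical increases of getbatch.n, err_soft.getbatch.n, err_getbatch.n.
print(error_ratios(work_items=2000, soft_errors=500, hard_errors=10))
```

A rising soft ratio with a flat hard ratio suggests that recovery is working but costing latency; a rising hard ratio means requests are actually failing or being rejected.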

PromQL examples

For general PromQL examples, refer to the Observability: Prometheus document.

In this section, we highlight only a few less obvious queries that may help analyze GetBatch latency, in particular to distinguish peer-induced stalls from resource-induced throttling.

Rx Stall Rate (peer-induced waits)

rate(ais_target_getbatch_rxwait_ns[5m])

Throttle Stall Rate (load-induced waits)

rate(ais_target_getbatch_throttle_ns[5m])

Error Behavior

Soft Error Rate

rate(ais_target_err_soft_getbatch_n[5m])

Hard Error Rate

rate(ais_target_err_getbatch_n[5m])
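The queries above suggest a naming pattern: the Prometheus series name looks like the metric name from the tables with dots replaced by underscores and an `ais_target_` prefix. The helper below encodes that pattern as inferred from the examples on this page only; it is a sketch, not a documented rule, and the function names are hypothetical.

```python
def prometheus_name(metric, prefix="ais_target_"):
    """Map a dotted metric name to the series name pattern seen on this page."""
    return prefix + metric.replace(".", "_")


def rate_query(metric, window="5m"):
    """Build a PromQL rate() expression for the given dotted metric name."""
    return f"rate({prometheus_name(metric)}[{window}])"


for name in ("getbatch.rxwait.ns", "getbatch.throttle.ns",
             "err_soft.getbatch.n", "err_getbatch.n"):
    print(rate_query(name))
```

Before relying on a generated name, check it against the series your Prometheus instance actually exposes.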