
Copyright © 2026, NVIDIA Corporation.

Monitoring get-batch

Overview

GetBatch (a.k.a. get-batch or x-moss) is the high-performance multi-object retrieval subsystem.

Paper: GetBatch: Distributed Multi-Object Retrieval for ML Data Loading - implementation, benchmarks, and discussion.

It streams objects and archived files in strict user-specified order, assembling them on the designated target (DT) and serving them as a TAR archive (buffered or streaming).

TAR is the default output format. The other serialization formats, TAR.GZ, ZIP, and TAR.LZ4, are also fully supported.

See also:

  • GetBatch: Multi-Object Retrieval API for overview, capabilities and APIs, operational guidance, usage, Go and Python examples, and more.
  • Release Notes 4.0 - GetBatch introduction.

Unlike ordinary GET requests, get-batch:

  • pulls objects concurrently from many targets,
  • relies on intra-cluster streaming via SharedDM,
  • performs archive extraction (for shards),
  • and obeys load-based throttling and soft-error recovery.

This page documents how to observe and monitor get-batch jobs at scale.

Key Metrics

All metrics below are per-target Prometheus counters/totals. Use rate() or increase() over a window for meaningful rates.
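To illustrate why a raw counter value needs a window, here is a minimal sketch of a per-second rate computed from two samples of the same counter. The function name `counter_rate` is hypothetical, and the reset handling is a simplification: Prometheus' own `rate()` additionally extrapolates to the window boundaries.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate of a monotonically increasing counter between two samples.

    Timestamps are in seconds. If the counter went backwards (for example,
    the target restarted), the current value is treated as the increase
    since the reset.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:  # counter reset between the two samples
        delta = curr_value
    return delta / elapsed
```

For example, a getbatch.n counter that grows from 100 to 400 over 60 seconds corresponds to 5 work items per second on that target.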

Workload Volume & Mix

| Metric | Description |
| --- | --- |
| getbatch.n | Total number of get-batch work items processed (successful or failed). |
| getbatch.obj.n | Number of whole objects retrieved and delivered in the output TAR. |
| getbatch.file.n | Number of files extracted from shard archives. |
| getbatch.obj.size | Cumulative size (bytes) of whole objects retrieved. |
| getbatch.file.size | Cumulative size (bytes) of shard-extracted files. |

These counters reflect the payload actually delivered; error placeholders are not included.
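The four object and file counters can be combined into a quick view of the workload mix. The sketch below assumes you already have the increase of each counter over the same window (for instance from `increase()`); the `VolumeDelta` type and `summarize_mix` function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class VolumeDelta:
    """Increase of each counter over the same window on one target."""
    obj_n: float      # getbatch.obj.n
    file_n: float     # getbatch.file.n
    obj_size: float   # getbatch.obj.size, in bytes
    file_size: float  # getbatch.file.size, in bytes


def summarize_mix(d):
    """Average entry sizes and the share of shard-extracted files."""
    entries = d.obj_n + d.file_n
    return {
        "avg_obj_bytes": d.obj_size / d.obj_n if d.obj_n else 0.0,
        "avg_file_bytes": d.file_size / d.file_n if d.file_n else 0.0,
        "file_share": d.file_n / entries if entries else 0.0,
        "total_bytes": d.obj_size + d.file_size,
    }
```

A high `file_share` combined with a small `avg_file_bytes` suggests that the workload is dominated by archive extraction of small files rather than by whole-object reads.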


Latency, Throttling & Backpressure

| Metric | Description |
| --- | --- |
| getbatch.rxwait.ns | Total nanoseconds the DT spent waiting to receive missing/out-of-order entries from peer targets. Reflects SDM/peer/network-induced stalls. |
| getbatch.throttle.ns | Total nanoseconds slept due to load-based throttling (memory/CPU pressure). Intentional back-pressure applied before serving the next get-batch request. |

Interpretation:

  • High rxwait → look at clustering, peer-to-peer streaming, SharedDM (SDM) performance, or transient disconnects.
  • High throttle → look at DT-level resource pressure (memory load, CPU load, configured Advice).

Taken together, these two metrics help answer the question “Why is my job slower?”
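As a rough sketch of this interpretation, the per-second rates of the two nanosecond counters can be converted into seconds stalled per second of wall-clock time and compared. The function names and the 0.05 threshold below are illustrative assumptions, not AIStore defaults. Since wait time can accumulate across concurrent requests, the resulting values may exceed 1.0.

```python
NS_PER_SEC = 1e9


def stall_seconds_per_second(rxwait_ns_rate, throttle_ns_rate):
    """Convert ns-per-second rates into seconds stalled per second."""
    return rxwait_ns_rate / NS_PER_SEC, throttle_ns_rate / NS_PER_SEC


def dominant_stall(rxwait_ns_rate, throttle_ns_rate, min_stall=0.05):
    """Return 'rxwait', 'throttle', or 'neither' for one target.

    min_stall is an arbitrary illustrative threshold: below it, neither
    kind of stall is considered significant.
    """
    rx, thr = stall_seconds_per_second(rxwait_ns_rate, throttle_ns_rate)
    if max(rx, thr) < min_stall:
        return "neither"
    return "rxwait" if rx >= thr else "throttle"
```

Here `rxwait` points to peer-induced waits and `throttle` to load-induced back-pressure on the DT.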


Soft vs Hard Errors

| Metric | Description |
| --- | --- |
| err_soft.getbatch.n | Number of soft error events (GFN recoveries and missing-path insertions). A single work item (WI) may contribute multiple increments. |
| err_getbatch.n | Number of hard errors (unrecoverable WIs, failures including 429 rejections). |

Soft errors reflect recoverable situations; hard errors reflect request-level failure or 429 (“too-many-requests”) rejection.
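One way to put the two error counters in context is to normalize them by the number of work items over the same window. The sketch below uses a hypothetical `error_ratios` helper and assumes that failed and rejected requests are also counted in getbatch.n, which this page does not confirm for 429 rejections.

```python
def error_ratios(work_items, soft_errors, hard_errors):
    """Soft and hard errors per work item over the same window.

    All three arguments are counter increases (getbatch.n,
    err_soft.getbatch.n, err_getbatch.n). The soft ratio may exceed 1.0
    because a single work item can contribute several soft-error increments.
    """
    if work_items <= 0:
        return {"soft_per_wi": 0.0, "hard_per_wi": 0.0}
    return {
        "soft_per_wi": soft_errors / work_items,
        "hard_per_wi": hard_errors / work_items,
    }
```

A rising `soft_per_wi` with a flat `hard_per_wi` indicates that recovery is working but costing extra effort, whereas a rising `hard_per_wi` indicates requests that are actually failing or being rejected.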


PromQL examples

For general PromQL examples, see the Observability: Prometheus document.

This section highlights only a few less obvious queries that may help analyze GetBatch latency, in particular by distinguishing peer-induced stalls from resource-induced throttling.

Rx Stall Rate (peer-induced waits)

rate(ais_target_getbatch_rxwait_ns[5m])

Throttle Stall Rate (load-induced waits)

rate(ais_target_getbatch_throttle_ns[5m])

Error Behavior

Soft Error Rate

rate(ais_target_err_soft_getbatch_n[5m])

Hard Error Rate

rate(ais_target_err_getbatch_n[5m])
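The queries on this page suggest a naming pattern: the dotted metric name with dots replaced by underscores and an `ais_target_` prefix. The sketch below builds rate queries on that inferred assumption and parses the JSON body returned by the Prometheus instant-query endpoint (`/api/v1/query`). The helper names are hypothetical, and the `instance` label is the standard Prometheus target label; your deployment may identify targets by a different label.

```python
import json


def prom_name(ais_metric, prefix="ais_target_"):
    """Map a dotted metric name to its assumed Prometheus name."""
    return prefix + ais_metric.replace(".", "_")


def rate_query(ais_metric, window="5m"):
    """Build a PromQL rate() expression for the given metric and window."""
    return f"rate({prom_name(ais_metric)}[{window}])"


def parse_instant_vector(body, label="instance"):
    """Return {label value: float} from an instant-query response body."""
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise ValueError(f"query failed: {doc.get('error', 'unknown error')}")
    data = doc["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample carries its value as [timestamp, "value-as-string"].
    return {
        sample["metric"].get(label, ""): float(sample["value"][1])
        for sample in data["result"]
    }
```

The string returned by `rate_query("getbatch.rxwait.ns")` matches the Rx stall query above and can be sent as the `query` parameter of an instant query; the parsed result then gives one value per target.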