v4.5 | NVIDIA AIStore

AIStore 4.5 focuses on three major areas: global rebalance scalability, indexed archive access via the new shard index, and a get-batch flow reordering that substantially reduces memory pressure under load.

The work on global rebalance is one of the headlines. AIS removes the legacy per-object ACK machinery that had become a scalability ceiling for very large rebalances (counting millions of migrated objects), replaces the cleanup behavior that depended on that machinery with a new explicit cleanup mode, and optimizes the lifecycle around data movers, transport endpoints, stage transitions, and peer status queries.

Remote list-objects (R-flow) is another practical driver for this release. Multi-target remote-bucket listing now starts the (N - 1) targets before the designated target (DT) begins backend listing and page distribution. The same area also gets flow-control cleanup, corrected page accounting, and a dedicated CLI job view for remote-list xactions.

The shard index is a new experimental subsystem for indexed extraction from TAR shards. It lets GET and get-batch read files from TAR shards directly using a persisted index instead of scanning the full archive. The 4.5 implementation includes the index format and binary pack/unpack support, persistence in a system bucket, a bucket-scoped indexing xaction, CLI support, read-path integration, tests, and a micro-benchmark.

Streaming get-batch responses now use explicit write deadlines while sending data to the client. This lets AIS detect terminated, stalled, or unreachable clients promptly and abort the request instead of continuing to assemble and transmit a large batch that no client is still reading. The flow was reordered and optimized, work-item cancellation now propagates across senders, and admission control is stricter under load.

Authentication and access control add support for externally provisioned RSA signing keys, JWKS refresh on cache miss for rotated key pairs, configurable maximum token age, and a more general signing-key configuration. Intra-cluster request validation is also tightened: spoofed caller headers are rejected on the public network, and internal-network checks require a validated Smap entry.

CLI and observability gain several user-visible improvements: dynamic ais show cluster cpu and ais show cluster memory views, a new ais performance intra-data view, shard-index commands, better rebalance rendering including cleanup mode, and improved help for force-join / split-brain recovery workflows.

This release preserves backward compatibility. The few additive API fields, configurable behavior changes, and operational migration notes are summarized in Upgrade Notes.

Global Rebalance

AIStore 4.5 delivers the largest rebalance update in several release cycles. The main theme is replacing per-object state with stage-level coordination and explicit cleanup semantics.

Per-object ACKs removed

Rebalance no longer tracks individual object acknowledgments. The previous mechanism - ACK messages back to the sender, sender-side maps of unacknowledged objects, retransmit and wait loops, and an ACK-driven lazy-delete path - has been removed in favor of stage-level coordination.

In the new model, traversal sends objects without keeping per-object state on the sender, and a post-traverse barrier takes the place of the old wait-for-ACK drain. Intra-cluster transport headers carry a compact opaque payload with the rebalance generation, so receivers can recognize and reject objects that arrive late from a previous run. Object and byte totals reported by ais show rebalance and ais show job now come directly from the transmitted counters rather than from ACK accounting.
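The generation check can be pictured with a minimal sketch. It is not AIS code: the Receiver class and its fields are hypothetical names, and the only assumption is that every incoming object header carries the generation of the run that sent it.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    """Hypothetical receive side of one rebalance run."""
    generation: int      # generation of the rebalance currently running
    accepted: int = 0
    rejected: int = 0

    def on_object(self, hdr_generation: int) -> bool:
        """Accept an object only if its header carries the current generation."""
        if hdr_generation != self.generation:
            self.rejected += 1   # late arrival from a previous run
            return False
        self.accepted += 1
        return True
```

No per-object state is kept on either side: the receiver only compares one integer per arriving object.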

The practical effect is lower memory pressure and simpler lifecycle behavior during large rebalances, where per-object ACK state had become the limiting factor.

Cleanup mode

Removing per-object ACKs also removed the old incidental mechanism that trimmed misplaced source copies after migration. AIStore 4.5 adds an explicit replacement: rebalance cleanup mode.

Cleanup mode walks local mountpaths, identifies misplaced object copies using Smap HRW ownership, and removes a misplaced copy only after verifying that the canonical copy exists in the expected location with matching object identity. Identity checks include size and checksum, and use version / ETag when available.
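The removal rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the AIS implementation: ObjMeta, hrw_owner, same_identity, and should_remove are hypothetical names, blake2b stands in for whatever hash AIS actually uses for HRW, and keeping the local copy when no canonical copy is found is an assumption that holds even with force.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ObjMeta:
    """Hypothetical object identity: size and checksum, plus version/ETag if known."""
    size: int
    checksum: str
    version: Optional[str] = None


def hrw_owner(obj_name: str, targets: Sequence[str]) -> str:
    """Highest-random-weight choice: the target with the top score owns the object."""
    def score(target: str) -> int:
        digest = hashlib.blake2b(f"{target}/{obj_name}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")
    return max(targets, key=score)


def same_identity(local: ObjMeta, canonical: ObjMeta) -> bool:
    """Size and checksum must match; version is compared only when both sides have one."""
    if (local.size, local.checksum) != (canonical.size, canonical.checksum):
        return False
    if local.version is not None and canonical.version is not None:
        return local.version == canonical.version
    return True


def should_remove(this_target: str, obj_name: str, targets: Sequence[str],
                  local: ObjMeta, canonical: Optional[ObjMeta], force: bool = False) -> bool:
    """Decide whether the local copy may be deleted."""
    if hrw_owner(obj_name, targets) == this_target:
        return False     # well-placed copy: never touched
    if canonical is None:
        return False     # assumption: without a canonical copy, keep what we have
    return force or same_identity(local, canonical)
```

The ordering matters: ownership is checked first, existence of the canonical copy second, and identity last, so a misplaced copy is never removed on placement alone.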

CLI:

$ ais start rebalance --cleanup
$ ais start rebalance --cleanup --force

By default, cleanup mode keeps diverged copies. With --force, it can also remove copies that differ from the canonical peer version. This is an advanced operator option.

Cleanup mode is intentionally distinct from regular rebalance:

it has its own preflight checks;
it refuses to start while rebalance or resilver is active;
it requires at least two active targets;
it bypasses config.Rebalance.Enabled;
it uses no data mover, no streams, and no GFN;
it skips EC-enabled buckets and busy objects.

ais show job and ais show rebalance render cleanup-mode runs with a dedicated view that reports removed objects and bytes rather than migration TX/RX counters.

Lifecycle and transport

A series of lifecycle changes makes rebalance more robust through abort, preempt, renew, and finalization paths: fresh data mover construction per run, safer handling of duplicate transport endpoints after abort, narrower mutex scope in the finalization path, a consistent same-targets predicate across preempt and renew, and corrected stage-reached detection.

One change is operator-visible: rebalance CtlMsg now carries per-stage timing and final status, and ais show job --all continues to surface this information after the run completes.

CtlMsg is the control-message interface supported by AIS xactions (jobs). Each xaction uses it to report job-specific runtime state - counters, stage information, start parameters, and error summaries - so that ais show job and related commands can render a current view without requiring separate CLI plumbing for every xaction.

Commits

05eddc1e1: global rebalance: remove per-object ACKs (major upd)
d8d4f9931: global rebalance: introduce safe cleanup mode
28d55f6c9: global rebalance: cleanup-mode observability
0cab3b877: global rebalance: consolidate primary’s logic to run cleanup mode
b6037a3da: cli: ais show job to correctly show rebalance in cleanup mode
785a1b769: cli: ‘start rebalance [--cleanup]’ to show ID; add formatted action helpers
0df4c7549: global-rebalance/cleanup-mode: add CLI-based scripted test
3c5ead654: space cleanup: detect cluster-wide misplaced objects via Smap HRW
ad5256b96: docs: update rebalance.md (add cleanup mode; examples)
a10928632: apply rebalance conflict/abort policy consistently across jobs
913f4bd89: lazy-delete: reduce noisy logs and bump work channel cap

Shard Index

Experimental in 4.5. The on-disk format, xaction semantics, and CLI surface may change in subsequent releases.

The shard index is a new subsystem for indexed archive extraction, primarily intended for ML and data-lake workloads that repeatedly fetch named samples from large TAR shards. When a client requests a file inside a TAR shard via archpath, AIS can now resolve the lookup through a persisted index rather than scanning the archive. The fast path is integrated into both GET and get-batch.

A shard index is a compact binary structure built in a single pass over a source TAR. It supports USTAR, GNU, and PAX variants, encodes entries with Pack/Unpack, and embeds the source object’s checksum and size so the index can detect re-uploads and treat itself as stale when the underlying shard has changed. Indexes are persisted as objects in a new system bucket (ais://.sys-shardidx); shards that have an index carry a dedicated flag in their on-disk metadata.

Indexing is performed by a bucket-scoped xaction (job). It walks the bucket, skips non-TAR objects, rebuilds stale or corrupted indexes, and treats per-object failures as non-fatal - a single bad shard does not abort the run.

Index presence, lookup, validation, and fallback are handled entirely by AIS; clients continue to use the same requests to read archived content.

On the read path, index use is gated by a dedicated bit in each shard’s on-disk metadata. That bit is set and cleared atomically together with the index object in the system bucket. When the bit is set, AIS loads the index and seeks directly to the requested entry; otherwise the read falls back to the existing sequential archive traversal, with no functional difference visible to the client.
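The idea behind the fast path can be illustrated with Python's standard tarfile module. This is a sketch only: the dictionary layout, SHA-256 as the source checksum, and the function names are assumptions for illustration and say nothing about the actual AIS index format or its Pack/Unpack encoding.

```python
import hashlib
import io
import tarfile


def build_index(shard: bytes) -> dict:
    """Single pass over the TAR: map each regular file to (data offset, size)."""
    entries = {}
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                entries[member.name] = (member.offset_data, member.size)
    # Embed the source size and checksum so a re-uploaded shard can be detected.
    return {"size": len(shard), "cksum": hashlib.sha256(shard).hexdigest(), "entries": entries}


def is_stale(index: dict, shard: bytes) -> bool:
    """The index is stale when the shard it describes has changed."""
    return index["size"] != len(shard) or index["cksum"] != hashlib.sha256(shard).hexdigest()


def read_indexed(shard: bytes, index: dict, archpath: str) -> bytes:
    """Fast path: seek straight to the entry instead of scanning the archive."""
    offset, size = index["entries"][archpath]
    f = io.BytesIO(shard)
    f.seek(offset)
    return f.read(size)


def make_shard(files: dict) -> bytes:
    """Demo helper: pack {name: bytes} into an in-memory TAR."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

A lookup costs one seek and one bounded read regardless of where the entry sits in the archive, which is the whole point for workloads that repeatedly fetch named samples from large shards.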

CLI

$ ais shard-index ais://my-shards
$ ais show shard-index ais://my-shards/object.tar

The release also includes a micro-benchmark for indexed extraction and an integration test that runs both indexed and non-indexed variants of TestGetFromArch and TestMoss and asserts result equivalence.

Commits

05df64267: cmn/archive: add TAR shard index
ed1ec6977: cmn/archive: Pack/Unpack for TAR shard index via compact binary
08db651bb: shard index: embed source cksum and size; detect stale index on load
4ac3bc91e: core: save and load shard index as object in system bucket
18e27fdae: shard index: update SaveShardIndex into two non-overlapping phases
26a8c8366: xact/xs: add shard index xaction
e0d0c13fd: shard index: integrate with GET and get-batch archpath; fast path
bfebaa127: cli: add shard index command; add e2e test
7bbaa365b: bench: add shard index micro benchmark (indexed archive extraction)

Get-Batch

The 4.5 cycle continues the work on get-batch - proxy memory pressure under load, sender behavior under failure, admission control, and streaming response reliability.

Redirect before sender fan-out

The flow is reordered so that the client redirect to the designated target happens before sender fan-out, with fan-out then running asynchronously. Previously the proxy held assembly-related state while waiting for sender startup; the new ordering moves that ownership to the designated target and reduces proxy-side memory pressure substantially.

Work-item cancellation

A new abort opcode over the target-to-target control path lets the designated target cancel an already-failed work item across all senders. This bounds wasted sender work under error and overload conditions. The cancellation path includes sender-side negative caching of aborted work IDs with periodic cleanup.
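A sender-side negative cache of this kind might look like the sketch below. The AbortedCache class, the TTL value, and the injectable clock are all hypothetical and chosen for illustration; the actual retention policy is not described here.

```python
import time


class AbortedCache:
    """Hypothetical sender-side negative cache of aborted work IDs."""

    def __init__(self, ttl: float = 60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._aborted = {}   # work ID -> time the abort was recorded

    def abort(self, wid: str) -> None:
        """Record that the designated target cancelled this work item."""
        self._aborted[wid] = self._clock()

    def is_aborted(self, wid: str) -> bool:
        """Senders check this before doing (more) work for the item."""
        return wid in self._aborted

    def cleanup(self) -> int:
        """Periodic housekeeping: drop entries older than the TTL; return how many."""
        now = self._clock()
        stale = [w for w, t in self._aborted.items() if now - t >= self._ttl]
        for w in stale:
            del self._aborted[w]
        return len(stale)
```

Remembering the abort for a while lets a sender discard work that was already queued for the cancelled item, while the periodic cleanup keeps the cache bounded.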

Admission control

Get-Batch admission control now accounts for pending work items and system load more directly. Critical memory or disk pressure returns HTTP 429. High memory pressure combined with a high pending-work level can throttle or reject depending on severity. The intent is to keep throughput steady in normal conditions while making overload behavior predictable.
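As a rough illustration, the decision can be written as a small function. The Pressure levels, the admit name, and in particular the rule that twice the high watermark rejects rather than throttles are assumptions made for this sketch; only the overall shape (critical pressure rejects, high memory plus high pending work throttles or rejects) comes from the description above.

```python
from enum import IntEnum


class Pressure(IntEnum):
    LOW = 0
    HIGH = 1
    CRITICAL = 2


def admit(mem: Pressure, disk: Pressure, pending: int, pending_high_wm: int) -> str:
    """Return "admit", "throttle", or "reject" (the latter maps to HTTP 429)."""
    if Pressure.CRITICAL in (mem, disk):
        return "reject"
    if mem == Pressure.HIGH and pending >= pending_high_wm:
        # Assumed severity rule: far above the watermark rejects, otherwise throttle.
        return "reject" if pending >= 2 * pending_high_wm else "throttle"
    return "admit"
```

Under normal conditions every request falls through to "admit", so the checks add no throttling when the node is healthy.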

Streaming responses

Get-batch streaming responses can run for a long time as the designated target (DT) assembles and writes the output. If a client terminates or simply stops reading mid-stream, AIS now detects this promptly and stops the assembly, rather than continuing to assemble output for a socket nobody is reading from. The implementation sets explicit write deadlines on the response stream so that a hung write fails fast and surfaces the abandoned client to the xaction.
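The effect of a write deadline is easy to demonstrate with plain sockets. The sketch below is generic Python rather than AIS code; stream_chunks and its parameters are hypothetical names, and the socket timeout plays the role of the per-write deadline.

```python
import socket


def stream_chunks(conn: socket.socket, chunks, write_deadline: float) -> bool:
    """Send chunks in order; give up as soon as one write exceeds the deadline.

    Returns True if everything was written, False if the peer stalled or went away.
    """
    conn.settimeout(write_deadline)   # applies to each sendall() call
    try:
        for chunk in chunks:
            conn.sendall(chunk)
    except (socket.timeout, BrokenPipeError, ConnectionResetError):
        return False                  # abandoned client: stop producing output
    return True
```

Because the chunks are consumed lazily, a False result means the producer stops assembling further output instead of buffering it for a client that is no longer reading.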

Phase-3 startup acknowledgment

Senders acknowledge phase-3 startup to the designated target over the T2T control path. The designated target tracks expected senders per work item and fails the request if startup acknowledgments don’t arrive within the startup window - turning a class of silent hangs into prompt, observable failures. Object transfer still uses SDM; only the startup notification moves to the control path.
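The bookkeeping on the designated target can be sketched as follows. StartupTracker and its state names are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class StartupTracker:
    """Hypothetical per-work-item tracker of phase-3 startup acknowledgments."""

    def __init__(self, senders, window: float, now: float):
        self._pending = set(senders)    # senders that have not acknowledged yet
        self._deadline = now + window   # end of the startup window

    def ack(self, sender: str) -> None:
        """Record a startup acknowledgment received over the control path."""
        self._pending.discard(sender)

    def check(self, now: float) -> str:
        """Return "ready", "waiting", or "failed" for this work item."""
        if not self._pending:
            return "ready"
        if now >= self._deadline:
            return "failed"             # fail the request instead of hanging
        return "waiting"
```

A missing acknowledgment therefore produces a definite "failed" state at the end of the window rather than an indefinite wait.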

Observability

Get-Batch xaction IDs are now prefixed gbt- for easier identification in CLI output and logs. CtlMsg output separates job-level and node-level counters, and per-work-item CtlMsg is surfaced on assembly errors and in super-verbose mode.

Commits

4570e2af6: get-batch: amend the flow: swap phase2 <=> phase3
2c1cc23ec: get-batch: the capability to cancel work-item (DT => senders)
34a48d0ec: get-batch: admission control (bump num-pending-high-wm)
4db06b1e5: get-batch: streaming write w/ response controller (part four)
814750bcc: get-batch: streaming write w/ response controller (fix)
70ff9ea6c: get-batch: fail request when phase-3 senders do not ping at startup
e8778acad: get-batch: revise CtlMsg; add work-item’s CtlMsg; reduce log noise
799fd50d4: xact: cleanup recv-abort; get-batch prefix and abort path
bdd7f62df: apc: add omitempty json tags for prefetch blob-download params

Intra-Cluster Control Plane

AIS 4.5 continues separating data-plane traffic from xaction control traffic and tightens validation of requests that claim to be intra-cluster.

Remote List-Objects (R-flow)

Multi-target remote-bucket list-objects - the “R-flow” mechanism, one of several distinct list-objects paths in AIS - gets a startup-ordering fix and an accounting cleanup.

In a multi-target R-flow run, one target is designated to perform the backend listing and distribute pages to the others. In 4.5, DT waits a brief grace period before issuing the backend call and beginning page distribution. This is a temporary workaround until the flow gets a proper prepare phase.

Separately, the list-objects flow-control code is cleaned up - local context is threaded through where it was previously passed implicitly - and page-receive accounting is made consistent between DM and the list-objects xaction so that received pages are no longer double-counted.

Further, list-objects jobs now get dedicated ais show job rendering, including a remote-pagination view with TX/RX page counters.

Target-to-target xaction control path

A generic target-to-target xaction control path is added under:

POST /v1/xactions/t2tctrl/{kind}/{xid}/{wid}/{opcode}

It runs over cmn.NetIntraControl and is used by get-batch for sender-started notifications and abort propagation, keeping get-batch control messages off the SDM data path. Future xactions that need out-of-band control signaling can use the same route.

Intra-cluster request validation

Requests that claim intra-cluster identity are now validated more strictly. Caller headers received on the public network are rejected; self-joining nodes may still use the public network but those headers are stripped from their requests. Internal-network checks require a validated current-Smap entry, so a node that isn’t in the current cluster map can’t claim intra-cluster status.

Underlying these checks, /reverse, /cluster, and /daemon each gain separate public and intra-control entry points, with public entry points performing access checks before dispatching. The split makes it harder for an external client to spoof intra-cluster identity or bypass public access controls.

Error propagation

ErrHTTP handling is reinforced across proxy and target boundaries. Forwarded HTTP errors now preserve their original status codes - previously, certain forwarded errors had their status overwritten, which broke clients (including the Python SDK) that retry on specific codes such as 429 from admission control.
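Why the preserved status matters is clearest from the client side. The sketch below is a generic retry loop, not the Python SDK's implementation; call_with_retry, the RETRYABLE set, and the backoff schedule are assumptions for illustration.

```python
import time

RETRYABLE = {429}   # assumption: retry only on admission-control rejections


def call_with_retry(do_request, attempts: int = 3, backoff: float = 0.0, sleep=time.sleep) -> int:
    """Call do_request() until it returns a non-retryable status or attempts run out."""
    status = do_request()
    for i in range(1, attempts):
        if status not in RETRYABLE:
            break
        sleep(backoff * (2 ** (i - 1)))   # exponential backoff between attempts
        status = do_request()
    return status
```

If a proxy rewrote a forwarded 429 into, say, a generic 500, a loop like this would give up immediately instead of backing off and retrying.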

Commits

4b339f3fb: list-objects temp workaround: add R-flow startup grace
604a46db5: list-objects: refactor and cleanup flow-control (add local context)
e86e5daa0: core: “broadcast to all targets except one” pattern; feat reserved
3b9b79db2: xactions: introduce out-of-band (target-to-target) control path
d21e45a36: reject intra-cluster “caller” headers on dedicated pub-net
b20cd0113: auth: require validated cluster map entry for internal network check
0cd8ff107: access control: split public and intra-control handlers
86321f015: reinforce & refactor: checking intra-cluster request; external-watchdog
aee738be8: cmn package, common http helpers: content-type
813688cb1: (de)serialize errors across boundaries: keep and reinforce ErrHTTP status
6fb51788b: cmn: preserve ErrHTTP.Status on proxy error forwarding
e8fc0a88c: intra-cluster API change: HEAD(object) will now return properties

AuthN

Authentication gets several changes around RSA signing keys, JWKS, and token validation.

RSA signing configuration is restructured under auth.rsa.{bits,mode}. The new mode: "external" setting indicates that the RSA signing key is managed outside the AuthN process, which is useful for multi-replica deployments, secret managers, and HSM-backed key management.

In external mode:

a missing key file at startup is fatal;
AuthN does not auto-generate the key;
POST /v1/auth/keys/rotate returns 400;
JWKS is built from the key file during initialization;
live reload is not supported: picking up a new key still requires a restart.

JWKS lookup now supports rotated key pairs better: on key-cache miss, AuthN refreshes from the issuer before failing validation. A new max_token_age config option decouples JWKS cache age from token expiration.
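The refresh-on-miss behavior can be sketched with a small cache. KeyCache and the fetch_jwks callable are hypothetical; a real validator would fetch and parse the issuer's JWKS document and would likely rate-limit refreshes, which this sketch leaves out.

```python
class KeyCache:
    """Hypothetical key cache: refresh from the issuer once on a key-ID miss."""

    def __init__(self, fetch_jwks):
        self._fetch = fetch_jwks   # callable returning {key ID: public key}
        self._keys = {}

    def get(self, kid: str):
        key = self._keys.get(kid)
        if key is None:
            # Cache miss: the issuer may have rotated keys, so refresh before failing.
            self._keys = dict(self._fetch())
            key = self._keys.get(kid)
        if key is None:
            raise KeyError(f"unknown signing key: {kid}")
        return key
```

Tokens signed with a freshly rotated key validate on first use, at the cost of one extra fetch, while tokens naming a key the issuer does not know still fail.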

The older environment-controlled signing-mode mechanism is removed in favor of the explicit config model.

Commits

0caf82d3b: authn: RSA config restructured under auth.rsa; externally provisioned key support (#294)
7962433fd: authn: follow-up lint fixes
88917b7c9: auth: support rotated key pairs with JWKS refresh on cache miss
3a2c527cc: authn: add max_token_age config option and decouple jwks max-age from token expiry
d7b55ddbc: authn: update config for generic signing keys with backwards compatibility
95528e522: authn: remove support for environment-controlled signing mode

Stats and Observability

GET-not-found stats: now off by default

GET(object) 404s are no longer counted in error stats by default. This avoids inflating Prometheus error counters for workloads where object-not-found is expected or common (probe-then-PUT patterns, content-addressed lookups, etc.).

Operators that prefer the previous behavior can enable the Count-Object-NotFound-Stats feature flag at cluster or bucket scope.

See Upgrade Notes for the CLI invocation.

Cgroup v2 memory pressure

The cgroup v2 actual-free memory estimate is refined to use total file-backed cache minus dirty/writeback pages. This improves memory-pressure signaling in containerized deployments - particularly under heavy write load, where the previous estimate could overcount reclaimable memory.
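A rough rendering of the refined estimate, using the standard cgroup v2 memory.stat keys file, file_dirty, and file_writeback. The function names and the way the reclaimable cache is combined with the unused part of the limit are assumptions for this sketch, not the exact AIS formula.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse the "key value" lines of a cgroup v2 memory.stat file."""
    return {k: int(v) for k, v in (line.split() for line in text.splitlines() if line.strip())}


def actual_free(limit: int, usage: int, stat: dict) -> int:
    """Estimate free memory: unused headroom plus cleanly reclaimable file cache."""
    # Dirty and writeback pages cannot be dropped immediately, so exclude them.
    reclaimable = max(stat["file"] - stat["file_dirty"] - stat["file_writeback"], 0)
    return max(limit - usage, 0) + reclaimable
```

Under heavy write load file_dirty and file_writeback grow, so the estimate shrinks accordingly instead of treating the whole page cache as free.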

Cluster CPU and memory views

apc.MemCPUInfo gains structured memory and process-stat fields. The existing flat fields remain, and the new fields are optional. The CLI uses the richer structure to render the new ais show cluster cpu and ais show cluster memory views - see CLI for examples.

Intra-data performance view

A new ais performance intra-data view reports intra-cluster data movement separately from generic stream counters: per-target RX/TX object counts, bytes, throughput, and average object size, with cluster-wide totals and an idle indicator.

To support the view, transport and DM counters now count data PDUs only - control-message PDUs no longer contribute. This is a small behavior change for anyone reading existing transport counters: numbers shown in ais show job and the generic stream views may be slightly lower than they were under 4.4 for the same workload, with the difference accounted for as control traffic.

Msgpack-encoded stats

stats.NodeStatus and related control-plane structures can now be encoded as msgpack as an alternative to JSON. Clients opt in via Accept: application/msgpack; clients that don’t continue to receive JSON. The motivation is lower allocation and faster decoding on hot stats-collection paths.

Commits

60d2e263a: msgpack-encode one of the central control plane structures (part one)
f52af0a2b: [API change]: add structured memory and process stats; cli: ‘ais show cluster’
80a8c459e: GET(object) not found is now not counted by default; new feature flag
ac37f2e33: sys: refine cgroup v2 ‘actual-free’ estimate for memory pressure
ffdbc37d4: cli: ais performance intra-data (new); transport+DM: data-only Tx/Rx counters
a18ee7eab: intra-cluster streams/DM: skip generic (in/out) stats; CLI: dedicated template
e6dbeddf3: sys: use nlog in Init(); cli: drop sys.Init() call

Blob Downloader and Prefetch

Per-chunk read timeout and abort propagation

The blob downloader gains a per-chunk backend read timeout: BlobMsg.ChunkReadTimeout is a new optional field (zero selects the default); the GET path can also pass it via apc.HdrBlobReadTimeout.

Per-range reads now derive their context from the xaction context, so aborting the xaction cancels in-flight backend readers and body copies. Error paths close range readers consistently.

Admission and adaptive parallelism

Blob-download jobs now go through the same load-aware admission and worker-tuning that other heavy-duty xactions (list-range, TCB) use. At job start, AIS checks memory, CPU, and goroutine load. Under critical memory pressure the job is refused with HTTP 429; under high pressure, start may be briefly delayed and the worker count is reduced.

Worker count itself is the minimum of three caps: the load-tuned count (which folds in the user request, mountpath count, media type, and live system load), an object-size cap of roughly one worker per 256 MiB of data, and a hard ceiling. So a small object on a fast disk gets a single worker even if the user asks for more; a multi-GiB object on NVMe scales up; nothing exceeds the hard cap. The same heuristic now backs blob-download, list-range jobs, and TCB.
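The minimum-of-three-caps rule can be written down directly. In this sketch num_workers and its parameters are hypothetical names, the load-tuned count is taken as an input rather than computed, and rounding the size cap up (so that any object gets at least one worker) is an assumption about what "roughly one worker per 256 MiB" means.

```python
MiB = 1 << 20


def num_workers(load_tuned: int, obj_size: int, hard_cap: int,
                bytes_per_worker: int = 256 * MiB) -> int:
    """Worker count = min(load-tuned count, object-size cap, hard ceiling)."""
    # Ceiling division: about one worker per 256 MiB, never fewer than one.
    size_cap = max(1, -(-obj_size // bytes_per_worker))
    return max(1, min(load_tuned, size_cap, hard_cap))
```

For example, with a hard cap of 16, a 100 MiB object gets one worker no matter what was requested, while a 4 GiB object with a load-tuned count of 8 gets all 8.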

Prefetch integration

Prefetch decides per-object whether to use the blob-downloader: any object at or above BlobThreshold (when configured) is fetched as a BLOB, otherwise as a normal cold GET.

Starting prefetch on a large remote bucket can therefore spawn a series of blob-download jobs autonomously - which is why the admission and worker-tuning improvements above matter most in this context. Each blob-download job runs through the same checks, so a prefetch fan-out adapts to live memory, CPU, and disk pressure rather than launching workers blindly.

Prefetch can now pass blob-downloader worker and chunk-size settings through to the underlying jobs:

$ ais prefetch ... --blob-chunk-size <size> --blob-num-workers <n>

Commits

28b6ee302: [API change] blob-downloader: per-chunk read timeout and context-driven abort
5788880fe: blob-downloader: refactor and cleanup; revise throttling logic
19c0bfa6b: xs/blob_download: cap blob-download workers with prefetch
12b57ed98: cli/prefetch: add --blob-chunk-size and blob-num-workers

CLI

Cluster CPU and memory views

ais show cluster cpu and ais show cluster memory now provide dynamic per-node views. The tables skip all-zero columns, support node filtering, and work with JSON and no-header output modes.

Examples:

$ ais show cluster cpu
$ ais show cluster memory
$ ais show cluster cpu <NODE_ID>

Intra-cluster data bandwidth and throughput

New CLI view ais performance intra-data shows intra-cluster data movement separately from generic performance counters. It reports per-target RX/TX object counts, RX/TX bytes, throughput, average object size, and cluster-wide totals.

Rebalance rendering

Rebalance output is more specific and easier to correlate across commands:

ais start rebalance reports the started rebalance ID;
ais show job uses a rebalance-specific renderer;
cleanup-mode rebalances show removed objects/bytes instead of migration TX/RX counters;
completed rebalance CtlMsg remains available via ais show job --all.

List objects

List-objects jobs now use ais show job templates selected specifically for list xactions: remote-pagination runs show TX/RX page counters, while bucket-scoped and generic list jobs use the appropriate bucket/no-bucket layouts.

Shard index commands

The CLI adds commands to build and inspect shard indexes:

$ ais shard-index ais://bucket[/prefix]
$ ais show shard-index ais://bucket/object.tar

Force-join / split-brain recovery help

The advanced (force-join clusters) workflow around ais cluster set-primary --force gets clearer help, stronger confirmation text, destination URL validation, and more explicit guidance for the follow-up rebalance.

Help text and templates

Several help-template and rendering details are cleaned up:

ais show cluster help gets inline examples;
command [arguments...] is shown only when a command truly requires a subcommand;
NAME vs DESCRIPTION usage is clarified;
table helpers are consolidated and reused.

Commits

9f0c6ae49: cli: ‘ais show cluster cpu/mem’ - dynamic tables, node selection
9142bc905: cli: merge clusters and/or resolve split-brain (advanced usage)
8f2b8d620: cli: extend ‘ais show cluster’ help - add inline examples
7be9fd308: cli/help: show “command” only when the command in question requires completion
65a4b873c: cli: NAME vs DESCRIPTION in templates (minor)

Core, Config, and xactions

Chunk-size validation

Chunk-size constants are exported, and bucket-config validation now backfills and checks chunks.chunk_size even when automatic chunking is disabled. This prevents legacy bucket properties from persisting ChunkSize == 0 and later causing divide-by-zero paths on oversized PUT or cold GET.
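The backfill-then-check order can be sketched as below. The function name and the numeric bounds used in the usage note are placeholders, not the exported AIS constants.

```python
def validate_chunk_size(chunk_size: int, default: int, lo: int, hi: int) -> int:
    """Backfill a zero chunk size with the default, then range-check the result."""
    if chunk_size == 0:
        chunk_size = default   # legacy bucket props: never persist zero
    if not lo <= chunk_size <= hi:
        raise ValueError(f"chunk size {chunk_size} outside [{lo}, {hi}]")
    return chunk_size
```

With placeholder bounds, validate_chunk_size(0, default=64, lo=1, hi=1024) returns 64, so later code that divides an object size by the chunk size never sees zero.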

Workfile basename overflow

Workfile creation now handles borderline-length object names safely. If decorated temporary workfile names would exceed filesystem basename limits, AIS shortens the temporary basename while preserving the final object path.
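One way to shorten a decorated temporary name is sketched below. The scheme shown (truncate the object basename and insert a short hash to keep names unique) is an assumption for illustration and not necessarily what AIS does; NAME_MAX of 255 bytes is the common limit on Linux filesystems, and the suffix is assumed to be short.

```python
import hashlib

NAME_MAX = 255   # typical per-component limit, in bytes


def workfile_basename(obj_basename: str, suffix: str, name_max: int = NAME_MAX) -> str:
    """Return "<basename>.<suffix>", shortened if it would exceed name_max bytes."""
    name = f"{obj_basename}.{suffix}"
    if len(name.encode()) <= name_max:
        return name
    # Too long: keep a prefix of the basename and add a hash of the full basename.
    digest = hashlib.sha256(obj_basename.encode()).hexdigest()[:16]
    tail = f".{digest}.{suffix}"
    keep = name_max - len(tail.encode())
    head = obj_basename.encode()[:keep].decode("utf-8", errors="ignore")
    return head + tail
```

Only the temporary name changes; the final object path is whatever the workfile is renamed to once the write completes.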

Force-join target initialization

Joining targets now create bucket directories during force-join commit before installing the destination BMD. This prevents missing local bucket directories after cross-cluster force-join workflows.

Space cleanup

Space cleanup now distinguishes local mountpath placement from cluster-wide HRW ownership. A locally well-placed object can still be cluster-misplaced after rebalance or target rejoin; cleanup now detects that case and removes the local stale copy only after the cluster-HRW peer confirms identical content.

This complements, but does not replace, rebalance cleanup mode. ais space-cleanup remains a local cleanup tool; ais start rebalance --cleanup is the explicit cluster-wide cleanup workflow for misplaced object copies.

Space cleanup also skips chunk manifests as a temporary safeguard.

Cloud/backend fixes

AWS multipart upload now handles S3 client creation failure. OCI region header spelling is canonicalized.

Commits

77b922b48: xact: extend BckJog with internalized worker pool; migrate TCB
cb8329ab7: config: export chunk (min, max, default) size and max mon-size
d1df788a1: config: fix chunks.chunk_size backfill logic in validation
90362e3dd: fix: workfile basename overflow on borderline-length object names
215252b03: force-join clusters: joining targets must create bucket dirs
4cf1616e7: space-cleanup: skip chunk manifests (temp workaround)
02ee82d3e: aws: handle s3 client creation failure for aws multipart
7a43c4708: apc: canonicalize OCI region header name

Python SDK, ETL, and aisloader

Python SDK v1.24.0

The Python SDK release includes several reliability and compatibility fixes:

RetryConfig and RetryManager are picklable, unblocking PyTorch DataLoader workers under spawn/forkserver-style multiprocessing;
object reads retry after clean short EOF by resuming from the current byte offset;
retry logging emits the final exhausted-retry stack once, while preserving concise per-attempt warnings;
the legacy aistore/client.py compatibility re-export is removed;
missing __init__.py files are added to SDK subpackages so generated API docs include the expected modules.
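The resume-after-short-EOF behavior in the list above can be illustrated with a generic loop. This is not the SDK's code: read_with_resume and fetch_from are hypothetical names, with fetch_from(offset) standing in for a ranged read that may end cleanly before the object is complete.

```python
def read_with_resume(fetch_from, total_size: int, max_retries: int = 3) -> bytes:
    """Read total_size bytes, resuming from the current offset after a short read."""
    data = bytearray(fetch_from(0))
    retries = 0
    while len(data) < total_size:
        if retries == max_retries:
            raise EOFError(f"got {len(data)} of {total_size} bytes after {retries} retries")
        retries += 1
        data += fetch_from(len(data))   # resume where the previous read stopped
    return bytes(data)
```

Bytes already received are never fetched again, which keeps retries cheap for large objects.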

ETL streaming

ETL streaming paths are tightened across FastAPI, Flask, and HTTPMultiThreadedServer implementations.

Highlights:

no-FQN FastAPI hpull GET streams through the shared session without buffering the full upstream object;
no-FQN FastAPI hpush PUT bridges async request streams into sync readers for constant-memory transform_stream consumption;
HTTPMultiThreadedServer hpush PUT uses a bounded reader around rfile and drains unread bytes on close to preserve keep-alive correctness;
one-shot no-FQN PUT bodies skip local retry and rely on AIS-level retry when the server returns the direct-put transient signal;
transient direct-put bailouts return 503 with Ais-Etl-Retry-Reason: direct-put-transient; exhausted-retry cases continue to use 502.

Python tests

Test infrastructure is cleaned up to reduce flakes:

FastAPI ETL tests eager-import pydantic.v1 on Python 3.9 to avoid an importlib race;
environment mutations are scoped to avoid leakage between xdist workers;
sync and async FastAPI ETL tests are split according to how they exercise the server.

aisloader

aisloader adds -arch.list to enable read-only archpath-selective get-batch listing with -pctput=0. -arch.pct > 0 continues to imply archive listing.

Commits

094fd87d7: python/sdk: retry object reads after clean short EOF
fc53cda99: python/sdk: tidy retry logging
896d5552b: python: remove legacy aistore/client.py compat re-export
df1f8cf20: python/etl: stream HTTPMultiThreadedServer no-FQN hpush PUT
98dd2c231: python/etl: stream FastAPI no-FQN hpull GET
0b3920275: python/etl: scope direct-put 503-retry contract to streaming no-FQN PUT
8f56caf5b: python/etl: close Flask stream GET response on error
a4777582f: python/tests: fix Py3.9 importlib race in FastAPI ETL tests
9fd845d92: python/tests: normalize FastAPI ETL sync/async test split
7a4a4a34a: aisloader: add -arch.list flag for read-only archpath workloads

Documentation and Website

GenDocs and OpenAPI 3.0

GenDocs is rewritten to emit OpenAPI 3.0 instead of Swagger 2.0, and the swag dependency is removed.

Follow-up work improves generated reference quality:

action-message endpoints render as typed per-action variants;
struct godoc comments become schema descriptions;
HTTP header parameters are emitted;
aliased string constants are resolved;
endpoints with the same method and path are merged;
+gen:optional lets fields opt out of required.

The generated HTTP API reference remains available at aistore.nvidia.com/docs/http-api.

Documentation site

The production documentation site is now docs.nvidia.com/aistore. The older aistore.nvidia.com/docs path redirects there, while aistore.nvidia.com/docs/http-api continues to serve the generated HTTP API reference.

The Jekyll site at aistore.nvidia.com remains the project landing page and blog home.

The pydoc-generated Python SDK and PyTorch markdown files are removed from the repository. Links now point to the Fern-rendered documentation under docs.nvidia.com/aistore/python/aistore/....

Blog

Published during this cycle:

Eliminating Cluster Authentication Risks: AIStore with RSA and OIDC Issuer Discovery

Commits

05cb5bdff: docs: GenDocs to OpenAPI 3.0
66220facc: gendocs: emit struct godoc as schema descriptions
ac5302961: gendocs: support HTTP header parameters + resolve aliased string constants
ba73a750f: gendocs: merge endpoints sharing the same method + path
8adf5dfd0: website: redirect docs to docs.nvidia.com/aistore
1745b3f6f: website/fern: add Python API reference docs

Build, CI, and Tools

Build and CI updates include dependency refreshes, more targeted workflow execution, and release-image publishing changes.

Highlights:

OSS dependencies and Go modules are updated;
Dependabot updates receive a dedicated label;
PyTorch short tests auto-run on relevant labels or path changes;
website deployment no longer runs on every PR and can be manually dispatched;
release pipelines publish ais-util and cluster-minimal images;
selected tests are adjusted to reduce overhead or avoid known flakes.

Commits

e9ebe3534: build: upgrade all OSS packages (minor rev-s)
a46329f7f: up modules; main readme
eb9cf5aa0: build(deps): Bump actions/setup-python from 5 to 6
b4d49b8bc: github-ci: add dependabot label to Dependabot updates
9aaaf6345: ci: auto-run test:short:pytorch on label or pytorch-dir changes
95905a719: ci: don’t run deploy-website on PRs; allow manual dispatch
ae258a50b: gitlab-ci: publish ais-util and cluster-minimal images on release
158b500ee: tests: reduce overhead for inventory prefix test
0f7d73cd3: ec/test: skip one; fatalf another (minor)
d40704efd: bump version
8293b87b9: cli: bump version

Upgrade Notes

AuthN config migration

The top-level rsa_key_bits and externally_provisioned settings are replaced by the auth.rsa.{bits,mode} config block.

Use:

{
  "auth": {
    "rsa": {
      "bits": 2048,
      "mode": "external"
    }
  }
}

for externally provisioned RSA keys.

In external mode, a missing key file at startup is fatal and key rotation through POST /v1/auth/keys/rotate returns 400. Environment-controlled signing mode is removed in favor of explicit config.

Additive APIs

The following API changes are additive and backward-compatible for existing JSON clients:

BlobMsg.ChunkReadTimeout - optional per-chunk backend read timeout for blob downloader; zero selects the default;
apc.HdrBlobReadTimeout - optional GET-path header for blob read timeout;
apc.MemCPUInfo gains optional structured memory and process stats;
Accept: application/msgpack requests msgpack for selected stats/control-plane structures; clients that do not request it continue to receive JSON.

GET-not-found stats

GET(object) 404s are no longer counted in error stats by default. This avoids inflating Prometheus error counters for workloads where object-not-found responses are expected.

To preserve the previous behavior, enable the Count-Object-NotFound-Stats feature flag at cluster or bucket scope.

Example:

$ ais config cluster set features=Count-Object-NotFound-Stats

Stream bundle API cleanup

Stream bundles no longer listen for Smap changes and no longer support listener-driven resync APIs such as ManualResync or ListenSmapChanged. Internal callers should pass the intended Smap at construction time and use DM.Smap() when they need to inspect the pinned Smap.

Shard index is experimental

Shard index is available in 4.5 but remains experimental. The index format, xaction behavior, and CLI surface may change in a later release.

Rebalance cleanup mode

ais start rebalance --cleanup is the new explicit cluster-wide path for safely removing misplaced object copies. It is separate from regular rebalance and from local space cleanup.

Use --force only when intentionally removing diverged misplaced copies after verifying that this is the desired operator action.
