AIStore 4.5 Release Notes

AIStore 4.5 focuses on three major areas: global rebalance scalability, indexed archive access via the new shard index, and a get-batch flow reordering that substantially reduces memory pressure under load.

The work on global rebalance is one of the headlines. AIS removes the legacy per-object ACK machinery that had become a scalability ceiling for very large rebalances (counting millions of migrated objects), replaces the cleanup behavior that depended on that machinery with a new explicit cleanup mode, and optimizes the lifecycle around data movers, transport endpoints, stage transitions, and peer status queries.

Remote list-objects (R-flow) is another practical driver for this release. Multi-target remote-bucket listing now starts the (N - 1) targets before the designated target (DT) begins backend listing and page distribution. The same area also gets flow-control cleanup, corrected page accounting, and a dedicated CLI job view for remote-list xactions.

The shard index is a new experimental subsystem for indexed extraction from TAR shards. It lets GET and get-batch read files from TAR shards directly using a persisted index instead of scanning the full archive. The 4.5 implementation includes the index format and binary pack/unpack support, persistence in a system bucket, a bucket-scoped indexing xaction, CLI support, read-path integration, tests, and a micro-benchmark.

Streaming get-batch responses now use explicit write deadlines while sending data to the client. This lets AIS detect terminated, stalled, or unreachable clients promptly and abort the request instead of continuing to assemble and transmit a large batch that no client is still reading. The flow was reordered and optimized, work-item cancellation now propagates across senders, and admission control is stricter under load.

Authentication and access control add support for externally provisioned RSA signing keys, JWKS refresh on cache miss for rotated key pairs, configurable maximum token age, and a more general signing-key configuration. Intra-cluster request validation is also tightened: spoofed caller headers are rejected on the public network, and internal-network checks require a validated Smap entry.

CLI and observability gain several user-visible improvements: dynamic ais show cluster cpu and ais show cluster memory views, a new ais performance intra-data view, shard-index commands, better rebalance rendering including cleanup mode, and improved help for force-join / split-brain recovery workflows.

This release preserves backward compatibility. The few additive API fields, configurable behavior changes, and operational migration notes are summarized in Upgrade Notes.


Table of Contents

  1. Global Rebalance
  2. Shard Index
  3. Get-Batch
  4. Intra-Cluster Control Plane
  5. AuthN
  6. Stats and Observability
  7. Blob Downloader and Prefetch
  8. CLI
  9. Core, Config, and xactions
  10. Python SDK, ETL, and aisloader
  11. Documentation and Website
  12. Build, CI, and Tools
  13. Upgrade Notes

Global Rebalance

AIStore 4.5 delivers the largest rebalance update in several release cycles. The main theme is replacing per-object state with stage-level coordination and explicit cleanup semantics.

Per-object ACKs removed

Rebalance no longer tracks individual object acknowledgments. The previous mechanism - ACK messages back to the sender, sender-side maps of unacknowledged objects, retransmit and wait loops, and an ACK-driven lazy-delete path - has been removed in favor of stage-level coordination.

In the new model, traversal sends objects without keeping per-object state on the sender, and a post-traverse barrier takes the place of the old wait-for-ACK drain. Intra-cluster transport headers carry a compact opaque payload with the rebalance generation, so receivers can recognize and reject objects that arrive late from a previous run. Object and byte totals reported by ais show rebalance and ais show job now come directly from the transmitted counters rather than from ACK accounting.
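The generation check described above can be illustrated with a minimal sketch. It is not the AIS implementation: `ObjHeader`, `reb_gen`, and `Receiver` are hypothetical names, and the real opaque payload format is not shown in these notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjHeader:
    """Hypothetical stand-in for an intra-cluster transport header."""
    name: str
    reb_gen: int  # generation of the rebalance run that sent the object


class Receiver:
    """Accepts objects only from the rebalance run it currently participates in."""

    def __init__(self, current_gen):
        self.current_gen = current_gen
        self.accepted = []
        self.rejected = []

    def recv(self, hdr):
        if hdr.reb_gen != self.current_gen:
            # Late arrival from a different (previous) run: reject it.
            self.rejected.append(hdr.name)
            return False
        self.accepted.append(hdr.name)
        return True


rx = Receiver(current_gen=7)
rx.recv(ObjHeader("a.bin", reb_gen=7))  # current run
rx.recv(ObjHeader("b.bin", reb_gen=6))  # straggler from the previous run
```

Because the generation travels with every object, the receiver needs no per-object bookkeeping to tell a straggler from a current transfer.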

The practical effect is lower memory pressure and simpler lifecycle behavior during large rebalances, where per-object ACK state had become the limiting factor.

Cleanup mode

Removing per-object ACKs also removed the old incidental mechanism that trimmed misplaced source copies after migration. AIStore 4.5 adds an explicit replacement: rebalance cleanup mode.

Cleanup mode walks local mountpaths, identifies misplaced object copies using Smap HRW ownership, and removes a misplaced copy only after verifying that the canonical copy exists in the expected location with matching object identity. Identity checks include size and checksum, and use version / ETag when available.

CLI:

$ ais start rebalance --cleanup
$ ais start rebalance --cleanup --force

By default, cleanup mode keeps diverged copies. With --force, it can also remove copies that differ from the canonical peer version. This is an advanced operator option.
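As a rough illustration of the removal rule, the sketch below combines rendezvous (HRW) hashing with an identity check. The hash function (SHA-256), the `Copy` fields, and the helper names are assumptions made for this example; AIS uses its own HRW implementation and additionally consults version / ETag when available.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def hrw_owner(obj_name: str, targets: list) -> str:
    """Rendezvous (HRW) hashing: the target with the highest score owns the object."""
    def score(tid: str) -> int:
        digest = hashlib.sha256(f"{tid}/{obj_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(targets, key=score)


@dataclass(frozen=True)
class Copy:
    """Hypothetical object identity: size plus checksum."""
    size: int
    checksum: str


def should_remove(local_tid: str, obj_name: str, targets: list, local: Copy,
                  canonical: Optional[Copy], force: bool = False) -> bool:
    if hrw_owner(obj_name, targets) == local_tid:
        return False  # the local copy is the canonical one
    if canonical is None:
        return False  # assumption: never remove when no canonical copy was found
    if local == canonical:
        return True   # verified duplicate of the canonical copy
    return force      # diverged copy: kept by default, removed only with --force
```

The ordering matters: ownership is decided first, existence of the canonical copy second, and identity last, so a misplaced copy is never dropped on placement alone.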

Cleanup mode is intentionally distinct from regular rebalance:

  • it has its own preflight checks;
  • it refuses to start while rebalance or resilver is active;
  • it requires at least two active targets;
  • it bypasses config.Rebalance.Enabled;
  • it uses no data mover, no streams, and no GFN;
  • it skips EC-enabled buckets and busy objects.

ais show job and ais show rebalance render cleanup-mode runs with a dedicated view that reports removed objects and bytes rather than migration TX/RX counters.

Lifecycle and transport

A series of lifecycle changes makes rebalance more robust through abort, preempt, renew, and finalization paths: fresh data mover construction per run, safer handling of duplicate transport endpoints after abort, narrower mutex scope in the finalization path, a consistent same-targets predicate across preempt and renew, and corrected stage-reached detection.

One change is operator-visible: rebalance CtlMsg now carries per-stage timing and final status, and ais show job --all continues to surface this information after the run completes.

CtlMsg is the control-message interface supported by AIS xactions (jobs). Each xaction uses it to report job-specific runtime state - counters, stage information, start parameters, and error summaries - so that ais show job and related commands can render a current view without requiring separate CLI plumbing for every xaction.

See also

  • docs/rebalance.md - user-facing guide: concepts, configuration, and CLI workflows including cleanup mode.
  • reb/README.md - internal design notes: execution flow, stages, and lifecycle.

Commits

  • 05eddc1e1: global rebalance: remove per-object ACKs (major upd)
  • d8d4f9931: global rebalance: introduce safe cleanup mode
  • 28d55f6c9: global rebalance: cleanup-mode observability
  • 0cab3b877: global rebalance: consolidate primary’s logic to run cleanup mode
  • b6037a3da: cli: ais show job to correctly show rebalance in cleanup mode
  • 785a1b769: cli: ‘start rebalance [--cleanup]’ to show ID; add formatted action helpers
  • 0df4c7549: global-rebalance/cleanup-mode: add CLI-based scripted test
  • 3c5ead654: space cleanup: detect cluster-wide misplaced objects via Smap HRW
  • ad5256b96: docs: update rebalance.md (add cleanup mode; examples)
  • a10928632: apply rebalance conflict/abort policy consistently across jobs
  • 913f4bd89: lazy-delete: reduce noisy logs and bump work channel cap

Shard Index

Experimental in 4.5. The on-disk format, xaction semantics, and CLI surface may change in subsequent releases.

The shard index is a new subsystem for indexed archive extraction, primarily intended for ML and data-lake workloads that repeatedly fetch named samples from large TAR shards. When a client requests a file inside a TAR shard via archpath, AIS can now resolve the lookup through a persisted index rather than scanning the archive. The fast path is integrated into both GET and get-batch.

A shard index is a compact binary structure built in a single pass over a source TAR. It supports USTAR, GNU, and PAX variants, encodes entries with Pack/Unpack, and embeds the source object’s checksum and size so the index can detect re-uploads and treat itself as stale when the underlying shard has changed. Indexes are persisted as objects in a new system bucket (ais://.sys-shardidx); shards that have an index carry a dedicated flag in their on-disk metadata.

Indexing is performed by a bucket-scoped xaction (job). It walks the bucket, skips non-TAR objects, rebuilds stale or corrupted indexes, and treats per-object failures as non-fatal - a single bad shard does not abort the run.

Index presence, lookup, validation, and fallback are handled entirely by AIS; clients continue to use the same requests to read archived content.

On the read path, index use is gated by a dedicated bit in each shard’s on-disk metadata. That bit is set and cleared atomically together with the index object in the system bucket. When the bit is set, AIS loads the index and seeks directly to the requested entry; otherwise the read falls back to the existing sequential archive traversal, with no functional difference visible to the client.
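To make the idea concrete, here is a small self-contained sketch of an offset index over a TAR, with stale detection and sequential fallback. It uses Python's standard `tarfile` module and a plain dictionary; the actual AIS index is a compact binary format, and all function names here are hypothetical.

```python
import hashlib
import io
import tarfile


def make_tar(files: dict) -> bytes:
    """Helper for the example: build an uncompressed TAR in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_index(shard: bytes) -> dict:
    """Single pass over the TAR: record where each regular file's data starts."""
    entries = {}
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r:") as tar:
        for member in tar:
            if member.isfile():
                entries[member.name] = (member.offset_data, member.size)
    return {
        "src_size": len(shard),
        "src_cksum": hashlib.sha256(shard).hexdigest(),
        "entries": entries,
    }


def read_indexed(shard: bytes, index, name: str):
    """Fast path; returns None when the index is absent, stale, or lacks the entry."""
    if index is None or index["src_size"] != len(shard):
        return None
    # Simplification: hashing the shard on every read is only for illustration;
    # a real store would compare against checksums it already keeps in metadata.
    if index["src_cksum"] != hashlib.sha256(shard).hexdigest():
        return None
    entry = index["entries"].get(name)
    if entry is None:
        return None
    offset, size = entry
    return shard[offset:offset + size]


def read_sequential(shard: bytes, name: str):
    """Fallback: scan the archive member by member."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r:") as tar:
        for member in tar:
            if member.isfile() and member.name == name:
                return tar.extractfile(member).read()
    return None


def get_archived(shard: bytes, name: str, index=None):
    data = read_indexed(shard, index, name)
    if data is None:
        data = read_sequential(shard, name)
    return data
```

Both paths return the same bytes; the index only changes how the entry is located.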

CLI

$ ais shard-index ais://my-shards
$ ais show shard-index ais://my-shards/object.tar

The release also includes a micro-benchmark for indexed extraction and an integration test that runs both indexed and non-indexed variants of TestGetFromArch and TestMoss and asserts result equivalence.

Commits

  • 05df64267: cmn/archive: add TAR shard index
  • ed1ec6977: cmn/archive: Pack/Unpack for TAR shard index via compact binary
  • 08db651bb: shard index: embed source cksum and size; detect stale index on load
  • 4ac3bc91e: core: save and load shard index as object in system bucket
  • 18e27fdae: shard index: update SaveShardIndex into two non-overlapping phases
  • 26a8c8366: xact/xs: add shard index xaction
  • e0d0c13fd: shard index: integrate with GET and get-batch archpath; fast path
  • bfebaa127: cli: add shard index command; add e2e test
  • 7bbaa365b: bench: add shard index micro benchmark (indexed archive extraction)

Get-Batch

The 4.5 cycle continues the work on get-batch, targeting proxy memory pressure under load, sender behavior under failure, admission control, and streaming response reliability.

Redirect before sender fan-out

The flow is reordered so that the client redirect to the designated target happens before sender fan-out, with fan-out then running asynchronously. Previously the proxy held assembly-related state while waiting for sender startup; the new ordering moves that ownership to the designated target and reduces proxy-side memory pressure substantially.

Work-item cancellation

A new abort opcode over the target-to-target control path lets the designated target cancel an already-failed work item across all senders. This bounds wasted sender work under error and overload conditions. The cancellation path includes sender-side negative caching of aborted work IDs with periodic cleanup.
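Sender-side negative caching with periodic cleanup might look roughly like the following. This is an illustrative sketch only: the class name, the TTL value, and the injectable clock are assumptions, not AIS internals.

```python
import time


class AbortedWorkCache:
    """Remembers aborted work IDs so a sender can skip late work for them."""

    def __init__(self, ttl: float = 60.0, clock=time.monotonic):
        self._ttl = ttl        # hypothetical retention period, in seconds
        self._clock = clock    # injectable for testing
        self._aborted = {}     # work ID -> time the abort was recorded

    def mark_aborted(self, wid: str) -> None:
        self._aborted[wid] = self._clock()

    def is_aborted(self, wid: str) -> bool:
        return wid in self._aborted

    def cleanup(self) -> int:
        """Drop entries older than the TTL; intended to run periodically."""
        now = self._clock()
        expired = [w for w, t in self._aborted.items() if now - t >= self._ttl]
        for w in expired:
            del self._aborted[w]
        return len(expired)
```

A sender would consult `is_aborted` before starting or continuing work on an item, while a housekeeping timer calls `cleanup` so the cache stays bounded.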

Admission control

Get-Batch admission control now accounts for pending work items and system load more directly. Critical memory or disk pressure returns HTTP 429. High memory pressure combined with a high pending-work level can throttle or reject depending on severity. The intent is to keep throughput steady in normal conditions while making overload behavior predictable.
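The decision logic can be summarized in a short sketch. The pressure levels, the watermark parameter, and in particular the "twice the watermark" severity split are assumptions chosen for illustration; these notes do not specify the actual thresholds.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    HIGH = 1
    CRITICAL = 2


class Decision(Enum):
    ADMIT = "admit"
    THROTTLE = "throttle"
    REJECT = "reject"  # surfaced to the client as HTTP 429


def admit(mem: Level, disk: Level, pending: int, pending_high_wm: int) -> Decision:
    if mem is Level.CRITICAL or disk is Level.CRITICAL:
        return Decision.REJECT
    if mem is Level.HIGH and pending >= pending_high_wm:
        # Hypothetical severity split: far above the watermark means reject.
        if pending >= 2 * pending_high_wm:
            return Decision.REJECT
        return Decision.THROTTLE
    return Decision.ADMIT
```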

Streaming responses

Get-batch streaming responses can run for a long time as the designated target (DT) assembles and writes the output. If a client terminates or simply stops reading mid-stream, AIS now detects this promptly and stops the assembly rather than continuing to assemble and write into a socket that nobody is reading. The implementation sets explicit write deadlines on the response stream so that a hung write fails fast and surfaces the abandoned client to the xaction.
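The effect of a write deadline can be demonstrated with plain sockets. The sketch below is a generic illustration, not AIS code: a peer that never reads causes the kernel buffer to fill, and the timed write then fails instead of blocking indefinitely. The helper name and the 0.2-second deadline are arbitrary choices for the example.

```python
import socket


def send_with_deadline(sock: socket.socket, payload: bytes, deadline: float) -> bool:
    """Return True if the whole payload was written within the deadline."""
    sock.settimeout(deadline)
    try:
        sock.sendall(payload)
        return True
    except socket.timeout:
        return False  # the peer stopped reading: give up instead of blocking


writer, stalled_reader = socket.socketpair()
try:
    # A small write fits in the socket buffer even though nobody reads it.
    small_ok = send_with_deadline(writer, b"ok", deadline=0.2)
    # A large write cannot complete because the peer never drains the buffer.
    large_ok = send_with_deadline(writer, b"x" * (16 * 1024 * 1024), deadline=0.2)
finally:
    writer.close()
    stalled_reader.close()
```

On a timeout the caller can abort the remaining assembly work, which is the behavior described above for abandoned clients.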

Phase-3 startup acknowledgment

Senders acknowledge phase-3 startup to the designated target over the T2T control path. The designated target tracks expected senders per work item and fails the request if startup acknowledgments don’t arrive within the startup window - turning a class of silent hangs into prompt, observable failures. Object transfer still uses SDM; only the startup notification moves to the control path.
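A minimal sketch of per-work-item startup tracking follows. The class, its three status strings, and the injectable clock are hypothetical; the sketch only illustrates the "all expected senders must acknowledge within a window" rule.

```python
import time


class StartupTracker:
    """Tracks which senders have acknowledged startup for one work item."""

    def __init__(self, expected_senders, window: float, clock=time.monotonic):
        self._pending = set(expected_senders)
        self._clock = clock
        self._deadline = clock() + window  # end of the startup window

    def ack(self, sender_id: str) -> None:
        self._pending.discard(sender_id)

    def status(self) -> str:
        if not self._pending:
            return "started"
        if self._clock() >= self._deadline:
            return "failed"   # at least one sender never acknowledged
        return "waiting"
```

Turning a missing acknowledgment into an explicit "failed" state is what converts a silent hang into a prompt error.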

Observability

Get-Batch xaction IDs are now prefixed gbt- for easier identification in CLI output and logs. CtlMsg output separates job-level and node-level counters, and per-work-item CtlMsg is surfaced on assembly errors and in super-verbose mode.

Commits

  • 4570e2af6: get-batch: amend the flow: swap phase2 <=> phase3
  • 2c1cc23ec: get-batch: the capability to cancel work-item (DT => senders)
  • 34a48d0ec: get-batch: admission control (bump num-pending-high-wm)
  • 4db06b1e5: get-batch: streaming write w/ response controller (part four)
  • 814750bcc: get-batch: streaming write w/ response controller (fix)
  • 70ff9ea6c: get-batch: fail request when phase-3 senders do not ping at startup
  • e8778acad: get-batch: revise CtlMsg; add work-item’s CtlMsg; reduce log noise
  • 799fd50d4: xact: cleanup recv-abort; get-batch prefix and abort path
  • bdd7f62df: apc: add omitempty json tags for prefetch blob-download params

Intra-Cluster Control Plane

AIS 4.5 continues separating data-plane traffic from xaction control traffic and tightens validation of requests that claim to be intra-cluster.

Remote List-Objects (R-flow)

Multi-target remote-bucket list-objects - the “R-flow” mechanism, one of several distinct list-objects paths in AIS - gets a startup-ordering fix and an accounting cleanup.

In a multi-target R-flow run, one target, the designated target (DT), performs the backend listing and distributes pages to the others. In 4.5, the DT waits a brief grace period before issuing the backend call and beginning page distribution, giving the other (N - 1) targets time to start. This is a temporary workaround until the flow gets a proper prepare phase.

Separately, the list-objects flow-control code is cleaned up - local context is threaded through where it was previously passed implicitly - and page-receive accounting is made consistent between DM and the list-objects xaction so that received pages are no longer double-counted.

Further, list-objects jobs now get dedicated ais show job rendering, including a remote-pagination view with TX/RX page counters.

Target-to-target xaction control path

A generic target-to-target xaction control path is added under:

POST /v1/xactions/t2tctrl/{kind}/{xid}/{wid}/{opcode}

It runs over cmn.NetIntraControl and is used by get-batch for sender-started notifications and abort propagation, keeping get-batch control messages off the SDM data path. Future xactions that need out-of-band control signaling can use the same route.
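For illustration, the route template above can be filled in with a small helper. The helper itself and the sample values passed to it (kind, work ID, opcode name) are hypothetical; only the path layout comes from these notes.

```python
from urllib.parse import quote

T2T_CTRL_PREFIX = "/v1/xactions/t2tctrl"


def t2t_ctrl_path(kind: str, xid: str, wid: str, opcode: str) -> str:
    """Build the control path {kind}/{xid}/{wid}/{opcode} under the T2T prefix."""
    parts = (kind, xid, wid, opcode)
    if not all(parts):
        raise ValueError("kind, xid, wid, and opcode must all be non-empty")
    # Percent-encode each segment so that a stray "/" cannot alter the route.
    return T2T_CTRL_PREFIX + "/" + "/".join(quote(p, safe="") for p in parts)


# Hypothetical values, shown only to illustrate the shape of the path.
path = t2t_ctrl_path("get-batch", "gbt-123", "w1", "abort")
```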

Intra-cluster request validation

Requests that claim intra-cluster identity are now validated more strictly. Caller headers received on the public network are rejected; self-joining nodes may still use the public network but those headers are stripped from their requests. Internal-network checks require a validated current-Smap entry, so a node that isn’t in the current cluster map can’t claim intra-cluster status.

Underlying these checks, /reverse, /cluster, and /daemon each gain separate public and intra-control entry points, with public entry points performing access checks before dispatching. The split makes it harder for an external client to spoof intra-cluster identity or bypass public access controls.

Error propagation

ErrHTTP handling is reinforced across proxy and target boundaries. Forwarded HTTP errors now preserve their original status codes - previously, certain forwarded errors had their status overwritten, which broke clients (including the Python SDK) that retry on specific codes such as 429 from admission control.
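Why the preserved status matters to clients can be shown with a short sketch. `StatusError`, `forward`, and `with_retry` are hypothetical names, and backoff between attempts is omitted to keep the example short; this is not the Python SDK's retry code.

```python
class StatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int, message: str):
        super().__init__(message)
        self.status = status
        self.message = message


def forward(call):
    """Proxy-side forwarding that keeps the upstream status code intact."""
    try:
        return call()
    except StatusError as err:
        raise StatusError(err.status, f"forwarded: {err.message}") from err


def with_retry(call, retry_on=(429,), attempts=3):
    """Client-side retry keyed on specific status codes."""
    for attempt in range(attempts):
        try:
            return call()
        except StatusError as err:
            if err.status not in retry_on or attempt == attempts - 1:
                raise
    raise RuntimeError("attempts must be positive")
```

If the forwarding layer replaced 429 with a generic status, `with_retry` would give up on the first failure instead of retrying.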

Commits

  • 4b339f3fb: list-objects temp workaround: add R-flow startup grace
  • 604a46db5: list-objects: refactor and cleanup flow-control (add local context)
  • e86e5daa0: core: “broadcast to all targets except one” pattern; feat reserved
  • 3b9b79db2: xactions: introduce out-of-band (target-to-target) control path
  • d21e45a36: reject intra-cluster “caller” headers on dedicated pub-net
  • b20cd0113: auth: require validated cluster map entry for internal network check
  • 0cd8ff107: access control: split public and intra-control handlers
  • 86321f015: reinforce & refactor: checking intra-cluster request; external-watchdog
  • aee738be8: cmn package, common http helpers: content-type
  • 813688cb1: (de)serialize errors across boundaries: keep and reinforce ErrHTTP status
  • 6fb51788b: cmn: preserve ErrHTTP.Status on proxy error forwarding
  • e8fc0a88c: intra-cluster API change: HEAD(object) will now return properties


AuthN

Authentication gets several changes around RSA signing keys, JWKS, and token validation.

RSA signing configuration is restructured under auth.rsa.{bits,mode}. The new mode: "external" setting indicates that the RSA signing key is managed outside the AuthN process, which is useful for multi-replica deployments, secret managers, and HSM-backed key management.

In external mode:

  • a missing key file at startup is fatal;
  • AuthN does not auto-generate the key;
  • POST /v1/auth/keys/rotate returns 400;
  • JWKS is built from the key file during initialization;
  • picking up a changed key still requires a restart (there is no live reload).

JWKS lookup now supports rotated key pairs better: on key-cache miss, AuthN refreshes from the issuer before failing validation. A new max_token_age config option decouples JWKS cache age from token expiration.
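Refresh-on-miss can be sketched as a small key cache. The class and the `fetch_jwks` callable are hypothetical; a real validator would also rate-limit refreshes and verify the token signature with the returned key.

```python
class KeyCache:
    """Caches signing keys by key ID and refreshes once on a cache miss."""

    def __init__(self, fetch_jwks):
        self._fetch = fetch_jwks  # callable returning a {kid: key} mapping
        self._keys = {}
        self.refreshes = 0

    def get(self, kid: str):
        key = self._keys.get(kid)
        if key is None:
            # Miss: the issuer may have rotated keys, so refresh before failing.
            self._keys = dict(self._fetch())
            self.refreshes += 1
            key = self._keys.get(kid)
        if key is None:
            raise KeyError(f"unknown signing key: {kid}")
        return key
```

A token signed with a freshly rotated key therefore validates on first use, at the cost of one extra fetch from the issuer.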

The older environment-controlled signing-mode mechanism is removed in favor of the explicit config model.

Commits

  • 0caf82d3b: authn: RSA config restructured under auth.rsa; externally provisioned key support (#294)
  • 7962433fd: authn: follow-up lint fixes
  • 88917b7c9: auth: support rotated key pairs with JWKS refresh on cache miss
  • 3a2c527cc: authn: add max_token_age config option and decouple jwks max-age from token expiry
  • d7b55ddbc: authn: update config for generic signing keys with backwards compatibility
  • 95528e522: authn: remove support for environment-controlled signing mode

Stats and Observability

GET-not-found stats: now off by default

GET(object) 404s are no longer counted in error stats by default. This avoids inflating Prometheus error counters for workloads where object-not-found is expected or common (probe-then-PUT patterns, content-addressed lookups, etc.).

Operators who prefer the previous behavior can enable the Count-Object-NotFound-Stats feature flag at cluster or bucket scope.

See Upgrade Notes for the CLI invocation.

Cgroup v2 memory pressure

The cgroup v2 actual-free memory estimate is refined to use total file-backed cache minus dirty/writeback pages. This improves memory-pressure signaling in containerized deployments - particularly under heavy write load, where the previous estimate could overcount reclaimable memory.
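The arithmetic can be sketched against the cgroup v2 `memory.stat` fields `file`, `file_dirty`, and `file_writeback`. How AIS folds the result into its overall estimate is not described in these notes, so the `actual_free` formula below (headroom under the limit plus clean file cache) is an assumption for illustration.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines as found in a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def reclaimable_cache(stats: dict) -> int:
    """File-backed cache that can be dropped without being written back first."""
    clean = (stats.get("file", 0)
             - stats.get("file_dirty", 0)
             - stats.get("file_writeback", 0))
    return max(clean, 0)


def actual_free(limit: int, current: int, stats: dict) -> int:
    # Assumed composition: unused headroom plus clean, reclaimable file cache.
    return max(limit - current, 0) + reclaimable_cache(stats)


sample = "anon 1000\nfile 4096\nfile_dirty 1024\nfile_writeback 512\n"
stats = parse_memory_stat(sample)
```

Under heavy write load `file_dirty` and `file_writeback` grow, so subtracting them keeps the estimate from counting pages that cannot be reclaimed immediately.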

Cluster CPU and memory views

apc.MemCPUInfo gains structured memory and process-stat fields. The existing flat fields remain, and the new fields are optional. The CLI uses the richer structure to render the new ais show cluster cpu and ais show cluster memory views - see CLI for examples.

Intra-data performance view

A new ais performance intra-data view reports intra-cluster data movement separately from generic stream counters: per-target RX/TX object counts, bytes, throughput, and average object size, with cluster-wide totals and an idle indicator.

To support the view, transport and DM counters now count data PDUs only - control-message PDUs no longer contribute. This is a small behavior change for anyone reading existing transport counters: numbers shown in ais show job and the generic stream views may be slightly lower than they were under 4.4 for the same workload, with the difference accounted for as control traffic.

Msgpack-encoded stats

stats.NodeStatus and related control-plane structures can now be encoded as msgpack as an alternative to JSON. Clients opt in via Accept: application/msgpack; clients that don’t continue to receive JSON. The motivation is lower allocation and faster decoding on hot stats-collection paths.

Commits

  • 60d2e263a: msgpack-encode one of the central control plane structures (part one)
  • f52af0a2b: [API change]: add structured memory and process stats; cli: ‘ais show cluster’
  • 80a8c459e: GET(object) not found is now not counted by default; new feature flag
  • ac37f2e33: sys: refine cgroup v2 ‘actual-free’ estimate for memory pressure
  • ffdbc37d4: cli: ais performance intra-data (new); transport+DM: data-only Tx/Rx counters
  • a18ee7eab: intra-cluster streams/DM: skip generic (in/out) stats; CLI: dedicated template
  • e6dbeddf3: sys: use nlog in Init(); cli: drop sys.Init() call

Blob Downloader and Prefetch

Per-chunk read timeout and abort propagation

The blob downloader gains a per-chunk backend read timeout: BlobMsg.ChunkReadTimeout is a new optional field (zero selects the default); the GET path can also pass it via apc.HdrBlobReadTimeout.

Per-range reads now derive their context from the xaction context, so aborting the xaction cancels in-flight backend readers and body copies. Error paths close range readers consistently.

Admission and adaptive parallelism

Blob-download jobs now go through the same load-aware admission and worker-tuning that other heavy-duty xactions (list-range, TCB) use. At job start, AIS checks memory, CPU, and goroutine load. Under critical memory pressure the job is refused with HTTP 429; under high pressure, start may be briefly delayed and the worker count is reduced.

Worker count itself is the minimum of three caps: the load-tuned count (which folds in the user request, mountpath count, media type, and live system load), an object-size cap of roughly one worker per 256 MiB of data, and a hard ceiling. So a small object on a fast disk gets a single worker even if the user asks for more; a multi-GiB object on NVMe scales up; nothing exceeds the hard cap. The same heuristic now backs blob-download, list-range jobs, and TCB.
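The three-cap rule reduces to a few lines of arithmetic. In this sketch the hard ceiling of 16 and the use of a rounded-up division for the size cap are assumptions; the load-tuned count is taken as an input because its computation is not detailed here.

```python
import math

MiB = 1024 * 1024
SIZE_PER_WORKER = 256 * MiB  # roughly one worker per 256 MiB of data
HARD_CAP = 16                # hypothetical ceiling, for illustration only


def num_workers(load_tuned: int, obj_size: int) -> int:
    """Minimum of the load-tuned count, the object-size cap, and the hard cap."""
    size_cap = max(1, math.ceil(obj_size / SIZE_PER_WORKER))
    return max(1, min(load_tuned, size_cap, HARD_CAP))
```

For example, with a load-tuned count of 8, a 10 MiB object gets one worker and a 1 GiB object gets four, while a 100 GiB object is limited by whichever of the load-tuned count and the hard cap is smaller.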

Prefetch integration

Prefetch decides per-object whether to use the blob-downloader: any object at or above BlobThreshold (when configured) is fetched as a BLOB, otherwise as a normal cold GET.

Starting prefetch on a large remote bucket can therefore spawn a series of blob-download jobs autonomously - which is why the admission and worker-tuning improvements above matter most in this context. Each blob-download job runs through the same checks, so a prefetch fan-out adapts to live memory, CPU, and disk pressure rather than launching workers blindly.

Prefetch can now pass blob-downloader worker and chunk-size settings through to the underlying jobs:

$ ais prefetch ... --blob-chunk-size <size> --blob-num-workers <n>

Commits

  • 28b6ee302: [API change] blob-downloader: per-chunk read timeout and context-driven abort
  • 5788880fe: blob-downloader: refactor and cleanup; revise throttling logic
  • 19c0bfa6b: xs/blob_download: cap blob-download workers with prefetch
  • 12b57ed98: cli/prefetch: add --blob-chunk-size and blob-num-workers

CLI

Cluster CPU and memory views

ais show cluster cpu and ais show cluster memory now provide dynamic per-node views. The tables skip all-zero columns, support node filtering, and work with JSON and no-header output modes.

Examples:

$ ais show cluster cpu
$ ais show cluster memory
$ ais show cluster cpu <NODE_ID>

Intra-cluster data bandwidth and throughput

The new CLI view ais performance intra-data shows intra-cluster data movement separately from generic performance counters. It reports RX/TX object counts, RX/TX bytes, throughput, average size, and cluster-wide totals.

Rebalance rendering

Rebalance output is more specific and easier to correlate across commands:

  • ais start rebalance reports the started rebalance ID;
  • ais show job uses a rebalance-specific renderer;
  • cleanup-mode rebalances show removed objects/bytes instead of migration TX/RX counters;
  • completed rebalance CtlMsg remains available via ais show job --all.

List objects

List-objects jobs now use ais show job templates selected specifically for list xactions: remote-pagination runs show TX/RX page counters, while bucket-scoped and generic list jobs use the appropriate bucket/no-bucket layouts.

Shard index commands

The CLI adds commands to build and inspect shard indexes:

$ ais shard-index ais://bucket[/prefix]
$ ais show shard-index ais://bucket/object.tar

Force-join / split-brain recovery help

The advanced (force-join clusters) workflow around ais cluster set-primary --force gets clearer help, stronger confirmation text, destination URL validation, and more explicit guidance for the follow-up rebalance.

Help text and templates

Several help-template and rendering details are cleaned up:

  • ais show cluster help gets inline examples;
  • command [arguments...] is shown only when a command truly requires a subcommand;
  • NAME vs DESCRIPTION usage is clarified;
  • table helpers are consolidated and reused.

Commits

  • 9f0c6ae49: cli: ‘ais show cluster cpu/mem’ - dynamic tables, node selection
  • 9142bc905: cli: merge clusters and/or resolve split-brain (advanced usage)
  • 8f2b8d620: cli: extend ‘ais show cluster’ help - add inline examples
  • 7be9fd308: cli/help: show “command” only when the command in question requires completion
  • 65a4b873c: cli: NAME vs DESCRIPTION in templates (minor)

Core, Config, and xactions

Chunk-size validation

Chunk-size constants are exported, and bucket-config validation now backfills and checks chunks.chunk_size even when automatic chunking is disabled. This prevents legacy bucket properties from persisting ChunkSize == 0 and later causing divide-by-zero paths on oversized PUT or cold GET.

Workfile basename overflow

Workfile creation now handles borderline-length object names safely. If decorated temporary workfile names would exceed filesystem basename limits, AIS shortens the temporary basename while preserving the final object path.

Force-join target initialization

Joining targets now create bucket directories during force-join commit before installing the destination BMD. This prevents missing local bucket directories after cross-cluster force-join workflows.

Space cleanup

Space cleanup now distinguishes local mountpath placement from cluster-wide HRW ownership. A locally well-placed object can still be cluster-misplaced after rebalance or target rejoin; cleanup now detects that case and removes the local stale copy only after the cluster-HRW peer confirms identical content.

This complements, but does not replace, rebalance cleanup mode. ais space-cleanup remains a local cleanup tool; ais start rebalance --cleanup is the explicit cluster-wide cleanup workflow for misplaced object copies.

Space cleanup also skips chunk manifests as a temporary safeguard.

Cloud/backend fixes

AWS multipart upload now handles S3 client creation failure. OCI region header spelling is canonicalized.

Commits

  • 77b922b48: xact: extend BckJog with internalized worker pool; migrate TCB
  • cb8329ab7: config: export chunk (min, max, default) size and max mon-size
  • d1df788a1: config: fix chunks.chunk_size backfill logic in validation
  • 90362e3dd: fix: workfile basename overflow on borderline-length object names
  • 215252b03: force-join clusters: joining targets must create bucket dirs
  • 4cf1616e7: space-cleanup: skip chunk manifests (temp workaround)
  • 02ee82d3e: aws: handle s3 client creation failure for aws multipart
  • 7a43c4708: apc: canonicalize OCI region header name

Python SDK, ETL, and aisloader

Python SDK v1.24.0

The Python SDK release includes several reliability and compatibility fixes:

  • RetryConfig and RetryManager are picklable, unblocking PyTorch DataLoader workers under spawn/forkserver-style multiprocessing;
  • object reads retry after clean short EOF by resuming from the current byte offset;
  • retry logging emits the final exhausted-retry stack once, while preserving concise per-attempt warnings;
  • the legacy aistore/client.py compatibility re-export is removed;
  • missing __init__.py files are added to SDK subpackages so generated API docs include the expected modules.
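The resume-from-offset behavior in the second bullet can be sketched generically. This is not the SDK's code: `open_range` is a hypothetical callable that starts a new read at a given byte offset (for instance, by issuing a ranged GET), and the retry budget is an arbitrary choice.

```python
def read_all(open_range, total_size: int, max_retries: int = 3) -> bytes:
    """Read total_size bytes, resuming at the current offset after a short EOF.

    open_range(offset) must return an iterable of byte chunks starting at offset.
    """
    buf = bytearray()
    retries = 0
    while len(buf) < total_size:
        for chunk in open_range(len(buf)):
            buf.extend(chunk)
        if len(buf) < total_size:
            # The stream ended cleanly but early: resume rather than restart.
            retries += 1
            if retries > max_retries:
                raise EOFError(f"short read: got {len(buf)} of {total_size} bytes")
    return bytes(buf)
```

Resuming at `len(buf)` avoids re-downloading bytes that were already received, which matters for large objects.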

ETL streaming

ETL streaming paths are tightened across FastAPI, Flask, and HTTPMultiThreadedServer implementations.

Highlights:

  • no-FQN FastAPI hpull GET streams through the shared session without buffering the full upstream object;
  • no-FQN FastAPI hpush PUT bridges async request streams into sync readers for constant-memory transform_stream consumption;
  • HTTPMultiThreadedServer hpush PUT uses a bounded reader around rfile and drains unread bytes on close to preserve keep-alive correctness;
  • one-shot no-FQN PUT bodies skip local retry and rely on AIS-level retry when the server returns the direct-put transient signal;
  • transient direct-put bailouts return 503 with Ais-Etl-Retry-Reason: direct-put-transient; exhausted-retry cases continue to use 502.
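The bounded-reader idea from the third bullet can be illustrated as follows. The class is a simplified sketch, not the ETL server's implementation: it caps reads at the declared body length and drains whatever is left on close so that the next request on a keep-alive connection starts at a clean boundary.

```python
import io


class BoundedReader:
    """Reads at most `length` bytes from `raw`; drains the remainder on close."""

    def __init__(self, raw, length: int):
        self._raw = raw
        self._left = length

    def read(self, n: int = -1) -> bytes:
        if n < 0 or n > self._left:
            n = self._left
        if n == 0:
            return b""
        data = self._raw.read(n)
        self._left -= len(data)
        return data

    def close(self) -> None:
        # Consume bytes the handler did not read, in bounded chunks.
        while self._left > 0:
            chunk = self._raw.read(min(self._left, 64 * 1024))
            if not chunk:
                break  # underlying stream ended early
            self._left -= len(chunk)


# Two back-to-back "requests" on one connection, simulated with BytesIO.
conn = io.BytesIO(b"BODY-ONE" + b"NEXT-REQUEST")
body = BoundedReader(conn, length=8)
first = body.read(4)   # the handler reads only part of the body
body.close()           # the rest of the body is drained here
leftover = conn.read()
```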

Python tests

Test infrastructure is cleaned up to reduce flakes:

  • FastAPI ETL tests eager-import pydantic.v1 on Python 3.9 to avoid an importlib race;
  • environment mutations are scoped to avoid leakage between xdist workers;
  • sync and async FastAPI ETL tests are split according to how they exercise the server.

aisloader

aisloader adds -arch.list to enable read-only archpath-selective get-batch listing with -pctput=0. -arch.pct > 0 continues to imply archive listing.

Commits

  • 094fd87d7: python/sdk: retry object reads after clean short EOF
  • fc53cda99: python/sdk: tidy retry logging
  • 896d5552b: python: remove legacy aistore/client.py compat re-export
  • df1f8cf20: python/etl: stream HTTPMultiThreadedServer no-FQN hpush PUT
  • 98dd2c231: python/etl: stream FastAPI no-FQN hpull GET
  • 0b3920275: python/etl: scope direct-put 503-retry contract to streaming no-FQN PUT
  • 8f56caf5b: python/etl: close Flask stream GET response on error
  • a4777582f: python/tests: fix Py3.9 importlib race in FastAPI ETL tests
  • 9fd845d92: python/tests: normalize FastAPI ETL sync/async test split
  • 7a4a4a34a: aisloader: add -arch.list flag for read-only archpath workloads

Documentation and Website

GenDocs and OpenAPI 3.0

GenDocs is rewritten to emit OpenAPI 3.0 instead of Swagger 2.0, and the swag dependency is removed.

Follow-up work improves generated reference quality:

  • action-message endpoints render as typed per-action variants;
  • struct godoc comments become schema descriptions;
  • HTTP header parameters are emitted;
  • aliased string constants are resolved;
  • endpoints with the same method and path are merged;
  • +gen:optional lets fields opt out of required.

The generated HTTP API reference remains available at aistore.nvidia.com/docs/http-api.

Documentation site

The production documentation site is now docs.nvidia.com/aistore. The older aistore.nvidia.com/docs path redirects there, while aistore.nvidia.com/docs/http-api continues to serve the generated HTTP API reference.

The Jekyll site at aistore.nvidia.com remains the project landing page and blog home.

The pydoc-generated Python SDK and PyTorch markdown files are removed from the repository. Links now point to the Fern-rendered documentation under docs.nvidia.com/aistore/python/aistore/....

Blog

Published during this cycle:

Commits

  • 05cb5bdff: docs: GenDocs to OpenAPI 3.0
  • 66220facc: gendocs: emit struct godoc as schema descriptions
  • ac5302961: gendocs: support HTTP header parameters + resolve aliased string constants
  • ba73a750f: gendocs: merge endpoints sharing the same method + path
  • 8adf5dfd0: website: redirect docs to docs.nvidia.com/aistore
  • 1745b3f6f: website/fern: add Python API reference docs

Build, CI, and Tools

Build and CI updates include dependency refreshes, more targeted workflow execution, and release-image publishing changes.

Highlights:

  • OSS dependencies and Go modules are updated;
  • Dependabot updates receive a dedicated label;
  • PyTorch short tests auto-run on relevant labels or path changes;
  • website deployment no longer runs on every PR and can be manually dispatched;
  • release pipelines publish ais-util and cluster-minimal images;
  • selected tests are adjusted to reduce overhead or avoid known flakes.

Commits

  • e9ebe3534: build: upgrade all OSS packages (minor rev-s)
  • a46329f7f: up modules; main readme
  • eb9cf5aa0: build(deps): Bump actions/setup-python from 5 to 6
  • b4d49b8bc: github-ci: add dependabot label to Dependabot updates
  • 9aaaf6345: ci: auto-run test:short:pytorch on label or pytorch-dir changes
  • 95905a719: ci: don’t run deploy-website on PRs; allow manual dispatch
  • ae258a50b: gitlab-ci: publish ais-util and cluster-minimal images on release
  • 158b500ee: tests: reduce overhead for inventory prefix test
  • 0f7d73cd3: ec/test: skip one; fatalf another (minor)
  • d40704efd: bump version
  • 8293b87b9: cli: bump version

Upgrade Notes

AuthN config migration

The top-level rsa_key_bits and externally_provisioned settings are replaced by the auth.rsa.{bits,mode} config block.

Use:

{
  "auth": {
    "rsa": {
      "bits": 2048,
      "mode": "external"
    }
  }
}

for externally provisioned RSA keys.

In external mode, a missing key file at startup is fatal and key rotation through POST /v1/auth/keys/rotate returns 400. Environment-controlled signing mode is removed in favor of explicit config.

Additive APIs

The following API changes are additive and backward-compatible for existing JSON clients:

  • BlobMsg.ChunkReadTimeout - optional per-chunk backend read timeout for blob downloader; zero selects the default;
  • apc.HdrBlobReadTimeout - optional GET-path header for blob read timeout;
  • apc.MemCPUInfo gains optional structured memory and process stats;
  • Accept: application/msgpack requests msgpack for selected stats/control-plane structures; clients that do not request it continue to receive JSON.

GET-not-found stats

GET(object) 404s are no longer counted in error stats by default. This avoids inflating Prometheus error counters for workloads where object-not-found responses are expected.

To preserve the previous behavior, enable the Count-Object-NotFound-Stats feature flag at cluster or bucket scope.

Example:

$ ais config cluster set features=Count-Object-NotFound-Stats

Stream bundle API cleanup

Stream bundles no longer listen for Smap changes and no longer support listener-driven resync APIs such as ManualResync or ListenSmapChanged. Internal callers should pass the intended Smap at construction time and use DM.Smap() when they need to inspect the pinned Smap.

Shard index is experimental

Shard index is available in 4.5 but remains experimental. The index format, xaction behavior, and CLI surface may change in a later release.

Rebalance cleanup mode

ais start rebalance --cleanup is the new explicit cluster-wide path for safely removing misplaced object copies. It is separate from regular rebalance and from local space cleanup.

Use --force only when intentionally removing diverged misplaced copies after verifying that this is the desired operator action.