
AIStore Python SDK: Maintaining Resilient Connectivity During Lifecycle Events


Apr 02, 2025·Abhishek Gaikwad
aistore · python-sdk · long-running-workloads · sustained-operation · lifecycle-events

In distributed systems, maintaining seamless connectivity during lifecycle events is a key challenge. If the cluster’s state changes while read operations are in progress, transient errors might occur. To overcome these brief disruptions, we need an effective, intelligent retry mechanism.

Consider a simple GET request in AIStore: the request reaches the proxy (gateway), which redirects it to the appropriate target (storage node). If that node is restarting, the request fails. Standard Python retries (e.g., urllib3.Retry) are not effective here because they keep retrying the same unresponsive node. What is needed is a retry mechanism that recognizes these transient failures and re-issues the entire request through the proxy.
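The difference between retrying the redirected target and retrying the whole request can be illustrated with a small, self-contained simulation. This is only a sketch: the two-node cluster, `proxy_redirect`, `target_get`, and the `NodeDown` exception are hypothetical stand-ins and are not part of AIStore or its SDK.

```python
class NodeDown(Exception):
    """Raised when a simulated target refuses the connection."""


# Hypothetical two-target cluster: t1 is restarting, t2 is healthy.
TARGET_UP = {"t1": False, "t2": True}


def proxy_redirect(attempt):
    """Simulated proxy: the first lookup uses a stale map, later ones the updated map."""
    return "t1" if attempt == 0 else "t2"


def target_get(target, name):
    """Simulated target: fails while the node is down."""
    if not TARGET_UP[target]:
        raise NodeDown(target)
    return f"{name} served by {target}"


def retry_same_target(name, attempts=3):
    """Retry only after the redirect, so every attempt hits the same target."""
    target = proxy_redirect(0)
    for _ in range(attempts):
        try:
            return target_get(target, name)
        except NodeDown:
            continue
    return None


def retry_whole_request(name, attempts=3):
    """Re-issue the request through the proxy on every attempt."""
    for attempt in range(attempts):
        try:
            return target_get(proxy_redirect(attempt), name)
        except NodeDown:
            continue
    return None


same = retry_same_target("obj")
whole = retry_whole_request("obj")
print(same)   # None
print(whole)  # obj served by t2
```

In this toy setup the first strategy exhausts its attempts against the restarting node, while the second succeeds as soon as the proxy consults the updated map.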

Effective Retry Strategies

By default, urllib3.Retry only handles retries at the target level after redirection, which isn’t always sufficient for AIStore’s architecture. To address transient errors, we introduced a new exception class, AISRetryableError, along with a more comprehensive retry approach.

This approach separates retry logic into two parts:

  1. HTTP-level retries with urllib3.Retry
    These are triggered for specific HTTP status codes (e.g., 500, 502, and 504).

  2. Network-level retries with tenacity
    These handle the entire request (proxy → redirect → target) whenever network issues (e.g., timeouts, connection errors) or AISRetryableError occur.

Retry Config

Both configurations are bundled into a RetryConfig class in the AIStore Python SDK (version 1.13.0). This ensures robust handling of temporary failures at both HTTP and network levels:

# Exception classes below are assumed to come from requests.exceptions;
# the original excerpt omits these imports.
from requests.exceptions import (
    ChunkedEncodingError,
    ConnectionError,
    ConnectTimeout,
    ReadTimeout,
)
from tenacity import (
    Retrying,
    retry_if_exception_type,
    stop_after_delay,
    wait_exponential,
)
from urllib3.util.retry import Retry

# Assumed import paths for the SDK names used below (not shown in the original excerpt).
from aistore.sdk.errors import AISRetryableError
from aistore.sdk.retry_config import RetryConfig

NETWORK_RETRY_EXCEPTIONS = (
    ConnectTimeout,
    ReadTimeout,
    ChunkedEncodingError,
    ConnectionError,
    AISRetryableError,
)


def default():
    return RetryConfig(
        # http_retry handles retries for server errors (5xx status codes).
        # connect=0 and read=0 delegate connection/read failures to network_retry.
        http_retry=Retry(
            total=3,  # Allow up to 3 retry attempts for the listed HTTP status codes
            backoff_factor=0.5,  # Base for exponential backoff; delays double on consecutive retries
            status_forcelist=[500, 502, 504],  # Retry on these server error status codes
            connect=0,  # Connect retries handled by network_retry instead
            read=0,  # Read retries handled by network_retry instead
        ),
        # network_retry handles transient network errors and AISRetryableError.
        network_retry=Retrying(
            wait=wait_exponential(multiplier=1, min=1, max=10),  # Exponential backoff: 1s, 2s, 4s, up to 10s
            stop=stop_after_delay(60),  # Stop retrying once 60s have elapsed since the first attempt
            retry=retry_if_exception_type(NETWORK_RETRY_EXCEPTIONS),  # Retry only on recognized exceptions
            reraise=True,  # Once retries are exhausted, raise the original exception
        ),
    )
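To get a feel for what the network_retry settings above mean in practice, the wait schedule can be worked out by hand. The sketch below is a simplified model of the arithmetic, not tenacity's own code. It assumes that every attempt fails instantly, that the wait before retry n is `multiplier * 2**(n - 1)` clamped to `[min, max]`, and that the 60-second limit is checked after each failed attempt. The function name `simulate_schedule` is made up for illustration.

```python
def simulate_schedule(multiplier=1.0, min_wait=1.0, max_wait=10.0, stop_after=60.0):
    """Return the start time (in seconds) of each attempt in the simplified model."""
    starts = []
    now = 0.0
    attempt = 1
    while True:
        starts.append(now)
        if now >= stop_after:  # deadline is evaluated after the failed attempt
            break
        wait = multiplier * 2 ** (attempt - 1)
        wait = max(min_wait, min(wait, max_wait))  # clamp to [min_wait, max_wait]
        now += wait
        attempt += 1
    return starts


starts = simulate_schedule()
print(starts)
# [0.0, 1.0, 3.0, 7.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0]
print(len(starts))  # 10
```

Under these assumptions the waits are 1, 2, 4, 8 and then 10 seconds repeatedly, giving about ten attempts before the limit is reached. Real requests take time (up to the connect and read timeouts), so the actual number of attempts within the window will usually be smaller.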

Note:

  1. The default retry strategy keeps retrying until 60 seconds have elapsed since the first attempt. This works well for stable environments (e.g., training jobs) but may be reduced for faster feedback in other setups.
  2. We’ve also introduced a default timeout of (3, 20) (connect timeout = 3s, read timeout = 20s) in the AIStore Python Client. For operations involving cold reads or slow backends, you may wish to increase these values.
  3. For large objects, consider using the Object-file API and enabling Streaming-Cold-GET. For example:
    ais bucket props set $BUCKET features Streaming-Cold-GET
    This allows data to stream immediately without waiting for the entire object to be stored on AIStore.
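As a configuration sketch of the adjustments mentioned in these notes, the fragment below builds a shorter retry window and a longer read timeout. The import paths, the `retry_config` and `timeout` keyword arguments of `Client`, and the endpoint URL are assumptions made for illustration and should be checked against the SDK version in use. The specific values (15 seconds, a 60-second read timeout) are arbitrary examples, not recommendations from the post.

```python
from tenacity import Retrying, retry_if_exception_type, stop_after_delay, wait_exponential
from urllib3.util.retry import Retry

# Assumed import paths; verify against your installed SDK version.
from aistore.sdk import Client
from aistore.sdk.retry_config import NETWORK_RETRY_EXCEPTIONS, RetryConfig

# Shorter retry window for faster feedback (example values only).
fast_feedback = RetryConfig(
    http_retry=Retry(
        total=2,
        backoff_factor=0.5,
        status_forcelist=[500, 502, 504],
        connect=0,  # leave connection failures to network_retry
        read=0,     # leave read failures to network_retry
    ),
    network_retry=Retrying(
        wait=wait_exponential(multiplier=1, min=1, max=5),
        stop=stop_after_delay(15),
        retry=retry_if_exception_type(NETWORK_RETRY_EXCEPTIONS),
        reraise=True,
    ),
)

client = Client(
    "http://localhost:8080",     # hypothetical endpoint
    retry_config=fast_feedback,  # assumed keyword argument
    timeout=(3, 60),             # (connect, read): longer read timeout for cold reads
)
```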

Bucket-Specific Challenges and Considerations

  • Buckets without a Remote Backend (ais://)
    When buckets are configured without redundancy features such as n-way mirroring or erasure coding, a GET request can fail if the target node that holds the object is temporarily unavailable. Although the cluster map will eventually exclude the unresponsive node, the proxy may redirect to another node that does not have the requested object. This underscores the need for data redundancy to maintain high availability.

  • Buckets with a Remote Backend (s3://, gcp://, etc.)
    The remote backend acts as a centralized “source of truth,” enabling more flexible recovery. When a target node is down and removed from the cluster map, the AIStore proxy selects a new target—the target where the requested object is expected to be located based on the updated cluster map. If that node does not have the object, it retrieves it from the remote backend and stores it in-cluster according to the bucket policies and configured data redundancy. Once the original target recovers, a global rebalance redistributes objects across the cluster as required.
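The contrast between the two bucket types can be sketched with a toy placement model. The post does not describe how AIStore maps objects to targets, so the rendezvous-style hashing in `hrw_owner` is a stand-in assumption, and the `get` helper, target names, and object name are all hypothetical.

```python
import hashlib


def hrw_owner(object_name, targets):
    """Pick the target with the highest hash of (target, object): rendezvous hashing."""
    def weight(target):
        digest = hashlib.sha256(f"{target}/{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(targets, key=weight)


def get(object_name, cluster_map, stored, has_remote_backend):
    """Simulated GET: route by the current map, fall back to the backend if one exists."""
    owner = hrw_owner(object_name, cluster_map)
    if object_name in stored[owner]:
        return f"warm GET from {owner}"
    if has_remote_backend:
        stored[owner].add(object_name)  # cold GET: fetch remotely, then keep in-cluster
        return f"cold GET via {owner}"
    return "404: object not found"


stored = {"t1": set(), "t2": set(), "t3": set()}
full_map = ["t1", "t2", "t3"]
old_owner = hrw_owner("shard-0001.tar", full_map)
stored[old_owner].add("shard-0001.tar")

# The owner goes down and is removed from the cluster map.
reduced_map = [t for t in full_map if t != old_owner]
new_owner = hrw_owner("shard-0001.tar", reduced_map)

no_backend = get("shard-0001.tar", reduced_map, stored, has_remote_backend=False)
with_backend = get("shard-0001.tar", reduced_map, stored, has_remote_backend=True)
print(no_backend)    # 404: object not found
print(with_backend.startswith("cold GET via"))  # True
```

Without a backend (and without mirroring or erasure coding) the newly selected target has nothing to serve, whereas with a backend it can fetch the object and keep a copy for later requests.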

Addressing Common Failures with Enhanced Retry Strategy

There are many ways a request can fail. Below are some common issues and how our new retry strategy helps overcome these transient errors:


1. Cluster Changes and Instability (Solved by Proxy-Level Retries)
Proxy-level retries (the network-level retries described earlier, which restart the request at the proxy) send requests back through the proxy rather than continuing to use a failing target node. Because the updated cluster map may take time to propagate or fully synchronize, retrying through the proxy lets the request pick up the new map, so unreachable nodes are excluded and requests are redirected to healthy targets. Common scenarios include:

  • Node Failure: When a node goes down (e.g., due to hardware or network issues, Kubernetes-related issues, disk failures, or misconfigurations), the cluster map eventually excludes it. With proxy-level retries, a GET is rerouted to a different, healthy node.
  • Cluster Upgrades: During Kubernetes rollouts, each upgraded target restarts, causing frequent cluster map updates. This can cause temporary routing issues and request failures. Thanks to network-level retries, your request is retried in its entirety, taking into account the newly updated cluster map.
  • Maintenance and Shutdown: Nodes undergoing maintenance rebalance their data elsewhere. Delays in map updates can cause brief misrouting and failures, which are resolved by retrying the entire request after a short interval.

2. Unique Retryable Errors (Solved by Custom Exceptions)
Certain temporary errors—like missing objects during rebalance—are flagged as AISRetryableError, prompting a custom retry pathway in tenacity:

  • Misplaced Objects (404 Errors): During rebalance, some objects can appear missing (ErrObjNotFound). This triggers AISRetryableError, prompting a retry rather than failing outright. After rebalancing is complete, the object is found on the correct node.
  • Conflicts (409 Errors): When multiple threads simultaneously request the same object from a remote backend, an ErrGETConflict (409) can occur. Retrying the request after a short backoff resolves the conflict.
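The idea of promoting selected error responses to a retryable exception can be sketched as follows. This is a minimal illustration, not the SDK's implementation: `RetryableError`, `FatalError`, `classify`, and `get_with_retry` are hypothetical names, and treating every 404 and 409 as retryable is an assumption, since the post does not say exactly how the SDK decides.

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for the SDK's AISRetryableError."""


class FatalError(Exception):
    """Hypothetical non-retryable error."""


RETRYABLE_STATUSES = {404, 409}  # assumed: object not found, GET conflict


def classify(status):
    """Map an HTTP status code to a retryable or fatal exception."""
    if status in RETRYABLE_STATUSES:
        return RetryableError(f"HTTP {status}")
    return FatalError(f"HTTP {status}")


def get_with_retry(responses, max_attempts=5, base_delay=0.01):
    """Consume (status, body) pairs standing in for successive GET attempts."""
    for attempt in range(1, max_attempts + 1):
        status, body = next(responses)
        if status == 200:
            return body, attempt
        err = classify(status)
        if isinstance(err, FatalError) or attempt == max_attempts:
            raise err
        time.sleep(base_delay * 2 ** (attempt - 1))  # short exponential backoff


# A 404 during rebalance, then a 409 conflict, then success.
body, attempts = get_with_retry(iter([(404, None), (409, None), (200, b"data")]))
print(body, attempts)  # b'data' 3
```

A status outside the retryable set, such as 403, raises `FatalError` on the first attempt instead of consuming the retry budget.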

Wrap-Up

AIStore’s new RetryConfig class handles both HTTP and network-level errors, helping your requests gracefully recover from node restarts, rebalancing, and other cluster changes. While the defaults work well for NVIDIA’s training workloads, you can customize timeouts, retry intervals, and backoff strategies to fit your environment. Ultimately, this ensures more resilient, uninterrupted operations even amid dynamic cluster states.