aistore.sdk.retry_config
Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Module Contents
Classes
Functions
Data
API
Configuration class for retrying HEAD requests to objects that are not yet present in the cluster when attempting a cold GET.
Attributes:
est_bandwidth_bps (int): Estimated bandwidth in bytes per second from the AIS cluster to backend buckets. Used to determine retry intervals for fetching remote objects. Raising this will decrease the initial time we expect object fetch to take. Defaults to 1 Gbps.
max_cold_wait (int): Maximum total number of seconds to wait for an object to be present before re-raising a ReadTimeoutError to be handled by the top-level RetryConfig. Defaults to 3 minutes.
Returns the default cold get config options.
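To make the relationship between the two attributes concrete, here is a minimal sketch of one plausible model: the expected fetch time is the object size divided by the estimated bandwidth, capped at max_cold_wait. The ColdGetSketch class, the expected_fetch_seconds helper, and the numeric values are hypothetical stand-ins for illustration; the SDK's actual interval calculation is not shown in this document and may differ.

```python
from dataclasses import dataclass


@dataclass
class ColdGetSketch:
    """Hypothetical stand-in that mirrors the two documented attributes."""
    est_bandwidth_bps: int  # estimated bandwidth, bytes per second
    max_cold_wait: int      # upper bound on total wait, seconds


def expected_fetch_seconds(conf: ColdGetSketch, object_size_bytes: int) -> float:
    """Assumed model: transfer time = size / bandwidth, capped at max_cold_wait."""
    estimate = object_size_bytes / conf.est_bandwidth_bps
    return min(estimate, float(conf.max_cold_wait))


# Illustrative values only: 125_000_000 bytes/s equals 1 Gbit/s in decimal units.
conf = ColdGetSketch(est_bandwidth_bps=125_000_000, max_cold_wait=180)
faster = ColdGetSketch(est_bandwidth_bps=250_000_000, max_cold_wait=180)

one_gb = expected_fetch_seconds(conf, 1_000_000_000)           # 8.0 seconds
one_gb_faster = expected_fetch_seconds(faster, 1_000_000_000)  # 4.0 seconds
very_large = expected_fetch_seconds(conf, 10**12)              # capped at 180.0
```

Under this model, doubling the bandwidth estimate halves the initial expected wait, which matches the documented effect of raising est_bandwidth_bps.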
Configuration class for managing both HTTP and network retries in AIStore.
AIStore implements two types of retries to ensure reliability and fault tolerance:
- HTTP Retry (urllib3.Retry) - Handles HTTP errors based on status codes (e.g., 429, 500, 502, 503, 504).
- Network Retry (tenacity) - Recovers from connection failures, timeouts, and unreachable targets.
Why two types of retries?
- AIStore uses redirects for GET/PUT operations.
- If a target node is down, we must retry the request via the proxy instead of the same failing target.
- network_retry ensures that the request is reattempted at the proxy level, preventing unnecessary failures.
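The interplay of the two layers can be sketched with the standard library alone. This is a simplified simulation, not the SDK's implementation: ask_proxy and fetch_from_target are hypothetical callables standing in for the redirect and the redirected request, and backoff delays are omitted. The inner loop retries on retryable status codes against the same target, while the outer loop reacts to connection errors by returning to the proxy for a fresh redirect.

```python
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def get_with_two_layers(ask_proxy, fetch_from_target,
                        http_attempts=3, network_attempts=3):
    """Outer loop: network retries restart at the proxy.
    Inner loop: HTTP retries repeat the request on retryable status codes."""
    if network_attempts < 1 or http_attempts < 1:
        raise ValueError("attempt counts must be at least 1")
    last_error = None
    for _ in range(network_attempts):
        try:
            target = ask_proxy()  # the proxy picks a target (the redirect)
            status = None
            for _ in range(http_attempts):
                status = fetch_from_target(target)  # may raise ConnectionError
                if status not in RETRYABLE_STATUSES:
                    break
            return status
        except ConnectionError as err:
            last_error = err  # target unreachable: go back to the proxy
    raise last_error


# Simulated cluster: target t1 is down; t2 answers 503 once, then 200.
targets = iter(["t1", "t2"])
calls = []


def ask_proxy():
    return next(targets)


def fetch_from_target(target):
    calls.append(target)
    if target == "t1":
        raise ConnectionError("t1 is down")
    return 503 if calls.count("t2") == 1 else 200


result = get_with_two_layers(ask_proxy, fetch_from_target)
```

Retrying only at the HTTP layer would keep hitting t1; going back through the proxy is what lets the request land on a healthy target.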
Attributes:
http_retry (urllib3.Retry): Defines retry behavior for transient HTTP errors.
network_retry (tenacity.Retrying): Configured tenacity.Retrying instance managing retries for network-related
issues, such as connection failures, timeouts, or unreachable targets.
cold_get_conf (ColdGetConf): Configuration for retrying cold GET requests; see the ColdGetConf class.
Note on pickling (multi-process workloads):
network_retry is a tenacity Retrying object that internally uses
lambdas/closures and is not picklable. When this config crosses a
process boundary (e.g. PyTorch DataLoader(num_workers > 0) under the
forkserver/spawn start method, Ray, Dask, ProcessPoolExecutor),
network_retry is dropped during serialization and rebuilt from
RetryConfig.default() in the worker, so any caller-customized
tenacity policy is lost in workers. Other fields (http_retry,
cold_get_conf) survive pickling unchanged. Single-process usage is
unaffected.
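The drop-and-rebuild behavior described in this note can be illustrated with Python's __getstate__ and __setstate__ hooks. The sketch below uses hypothetical names (RetryConfigSketch, default_policy, http_retries, network_policy) and a plain dict holding a lambda in place of a tenacity object; it shows the general pattern only and is not the SDK's code.

```python
import pickle
from dataclasses import dataclass, field


def default_policy():
    """Stand-in for a default retry policy; the lambda makes it unpicklable."""
    return {"name": "default",
            "should_retry": lambda exc: isinstance(exc, ConnectionError)}


@dataclass
class RetryConfigSketch:
    http_retries: int = 3
    network_policy: dict = field(default_factory=default_policy)

    def __getstate__(self):
        # Drop the unpicklable field before serialization.
        state = self.__dict__.copy()
        state.pop("network_policy", None)
        return state

    def __setstate__(self, state):
        # Restore the picklable fields, then rebuild the dropped one from defaults.
        self.__dict__.update(state)
        self.network_policy = default_policy()


custom = RetryConfigSketch(
    http_retries=5,
    network_policy={"name": "custom", "should_retry": lambda exc: True},
)
restored = pickle.loads(pickle.dumps(custom))  # simulates crossing a process boundary

# The picklable field survives; the customized policy silently reverts to the default.
survived = restored.http_retries
reverted = restored.network_policy["name"]
```

With this pattern, a worker that needs a custom policy has to construct it inside the worker process rather than rely on the pickled configuration.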
Returns the default retry configuration for AIStore.
retry_error_callback: log the underlying error with full traceback,
then re-raise it. Per-retry attempts stay concise via before_sleep_log;
the call stack is emitted once, here, when retries are exhausted.
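The logging pattern described here (a concise line per retry, one full traceback at exhaustion, then re-raise) can be sketched without tenacity. The call_with_retries helper and the logger name are hypothetical, and the plain loop below stands in for the tenacity machinery; it omits sleeping between attempts.

```python
import logging

logger = logging.getLogger("retry_sketch")


def call_with_retries(fn, attempts=3):
    """Retry fn; log one short line per retry and one traceback when exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < attempts:
                # Concise per-retry notice, no traceback.
                logger.warning("attempt %d/%d failed: %r; retrying",
                               attempt, attempts, exc)
    # Retries exhausted: emit the full traceback once, then re-raise.
    logger.error("retries exhausted", exc_info=last_exc)
    raise last_exc


# Usage: a function that fails twice with a transient error, then succeeds.
attempts_seen = []


def flaky():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("transient")
    return "ok"


result = call_with_retries(flaky, attempts=3)
```

Keeping the traceback out of the per-retry lines avoids repeating the same call stack for every attempt while still preserving it for the final failure.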