aistore.sdk.retry_config
aistore.sdk.retry_config
aistore.sdk.retry_config
Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Configuration class for retrying HEAD requests to objects that are not present in cluster when attempting a cold GET.
Attributes: est_bandwidth_bps (int): Estimated bandwidth in bytes per second from the AIS cluster to backend buckets. Used to determine retry intervals for fetching remote objects. Raising this will decrease the initial time we expect object fetch to take. Defaults to 10 Gbps. max_cold_wait (int): Within an individual retry, the maximum seconds to wait for an object’s write lock to be released before re-raising a ReadTimeoutError to be handled by the top-level RetryConfig. Defaults to 3 minutes. enable_remote_head (bool): Whether to send a HEAD request to the backend bucket if no size information for an object exists locally. Used for retry optimization, but increases requests to the remote backend. Defaults to True.
Configuration class for managing both HTTP and network retries in AIStore.
AIStore implements two types of retries to ensure reliability and fault tolerance:
Why two types of retries?
network_retry ensures that the request is reattempted at the proxy level, preventing unnecessary failures.Attributes:
http_retry (urllib3.Retry): Defines retry behavior for transient HTTP errors.
network_retry (tenacity.Retrying): Configured tenacity.Retrying instance managing retries for network-related
issues, such as connection failures, timeouts, or unreachable targets.
cold_get_conf (ColdGetConf): Configuration for retrying COLD GET requests, see ColdGetConf class.
Note on pickling (multi-process workloads):
network_retry is a tenacity Retrying object that internally uses
lambdas/closures and is not picklable. When this config crosses a
process boundary (e.g. PyTorch DataLoader(num_workers > 0) under the
forkserver/spawn start method, Ray, Dask, ProcessPoolExecutor),
network_retry is dropped during serialization and rebuilt from
RetryConfig.default() in the worker — any caller-customized
tenacity policy is lost in workers. Other fields (http_retry,
cold_get_conf) survive pickling unchanged. Single-process usage is
unaffected.
Returns the default retry configuration for AIStore.
retry_error_callback: log the underlying error with full traceback,
then re-raise it. Per-retry attempts stay concise via before_sleep_log;
the call stack is emitted once, here, when retries are exhausted.