core.fault_injector#

Module Contents#

Classes#

FaultInjectorConfig

Configuration for fault injection testing via nvidia_resiliency_ext.

_FaultInjectorRNG

Minimal RNG interface used by fault injector helper functions.

Functions#

_require_nvidia_resiliency_ext

_require_rng

get_fault_ranks

Return list of ranks to inject faults on, from explicit list or random sample.

get_fault

Sample a fault type according to the configured types and probabilities.

should_setup_fault_injection_at_start

Return True when fault timing is anchored to training start.

should_setup_fault_injection_at_iteration

Return True when fault timing should start from the given iteration.

get_fault_delay

Return fault delay in seconds from the configured scheduling anchor.

setup_fault_injection

Broadcast fault plan across ranks and dispatch injection on target ranks.

Data#

API#

core.fault_injector.__all__#

['FaultInjectorConfig', 'setup_fault_injection', 'maybe_raise_workload_exception']

core.fault_injector._require_nvidia_resiliency_ext()#

core.fault_injector.logger#

‘getLogger(…)’

core.fault_injector._T#

‘TypeVar(…)’

class core.fault_injector.FaultInjectorConfig#

Configuration for fault injection testing via nvidia_resiliency_ext.

fault_injector_ranks: Optional[str]#

None

Comma-separated list of ranks to inject faults on.

fault_injector_num_ranks: Optional[int]#

None

Number of ranks to inject faults on (random selection).

fault_injector_fault_types: Optional[str]#

None

Comma-separated list of fault types to inject (e.g. ‘hang,crash’).

fault_injector_fault_probabilities: Optional[str]#

None

Comma-separated list of fault probabilities (normalized at runtime).

fault_injector_fault_delay: Optional[float]#

None

Force a specific fault delay in seconds, measured from training start or from fault_injector_delay_start_iteration.

fault_injector_delay_start_iteration: Optional[int]#

None

Start the fault delay timer after iteration N completes. If unset, fault delay timing starts from the beginning of training.

fault_injector_mtti_seconds: Optional[float]#

None

Mean time to inject (MTTI) in seconds; used when fault_injector_fault_delay is None.

fault_injector_offset_seconds: Optional[float]#

None

Offset seconds added to the sampled fault delay.

fault_injector_seed: Optional[int]#

None

RNG seed for the fault injector.
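
As an illustration of how the fields above fit together, the following sketch fills in a configuration that picks two random ranks, weights 'crash' three times as heavily as 'hang', and starts the delay timer after iteration 100. ExampleFaultConfig is a hypothetical stand-in that only mirrors the documented field names and defaults; it is not the real FaultInjectorConfig, whose constructor is not shown on this page, and all values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleFaultConfig:
    """Hypothetical stand-in mirroring the documented fields (not the real class)."""

    fault_injector_ranks: Optional[str] = None
    fault_injector_num_ranks: Optional[int] = None
    fault_injector_fault_types: Optional[str] = None
    fault_injector_fault_probabilities: Optional[str] = None
    fault_injector_fault_delay: Optional[float] = None
    fault_injector_delay_start_iteration: Optional[int] = None
    fault_injector_mtti_seconds: Optional[float] = None
    fault_injector_offset_seconds: Optional[float] = None
    fault_injector_seed: Optional[int] = None


# Illustrative values only.
example = ExampleFaultConfig(
    fault_injector_num_ranks=2,                 # sample two ranks at random
    fault_injector_fault_types="hang,crash",    # comma-separated fault types
    fault_injector_fault_probabilities="1,3",   # normalized at runtime
    fault_injector_delay_start_iteration=100,   # timer starts after iteration 100
    fault_injector_mtti_seconds=600.0,          # used because fault_delay is unset
    fault_injector_offset_seconds=60.0,         # added to the sampled delay
    fault_injector_seed=1234,
)
```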

class core.fault_injector._FaultInjectorRNG#

Bases: typing.Protocol

Minimal RNG interface used by fault injector helper functions.

sample(population: Sequence[int], k: int) -> list[int]#

Return k sampled items from the given population.

choices(
population: Sequence[core.fault_injector._T],
weights: Sequence[float],
k: int,
) -> list[core.fault_injector._T]#

Return k weighted samples from the given population.

random() -> float#

Return a floating-point value in the half-open interval [0.0, 1.0).
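
The three methods above match the standard library's random.Random, so a seeded random.Random instance is one object that would satisfy this interface structurally. The sketch below redefines an equivalent protocol under the hypothetical name RNGLike to demonstrate this; whether the module actually uses random.Random is not stated on this page.

```python
import random
from typing import Protocol, Sequence, TypeVar, runtime_checkable

T = TypeVar("T")


@runtime_checkable
class RNGLike(Protocol):
    """Hypothetical copy of the documented interface, for illustration only."""

    def sample(self, population: Sequence[int], k: int) -> list[int]: ...

    def choices(
        self, population: Sequence[T], weights: Sequence[float], k: int
    ) -> list[T]: ...

    def random(self) -> float: ...


rng = random.Random(1234)

# runtime_checkable only verifies that the methods exist, not their signatures.
satisfies_protocol = isinstance(rng, RNGLike)

picked_ranks = rng.sample(range(8), k=2)                             # two distinct ranks
picked_fault = rng.choices(["hang", "crash"], weights=[1, 3], k=1)   # one weighted pick
u = rng.random()                                                     # value in [0.0, 1.0)
```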

core.fault_injector.rng: core.fault_injector._FaultInjectorRNG | None#

None

core.fault_injector._require_rng() -> core.fault_injector._FaultInjectorRNG#

core.fault_injector.get_fault_ranks(config: core.fault_injector.FaultInjectorConfig)#

Return list of ranks to inject faults on, from explicit list or random sample.
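
A minimal sketch of the "explicit list or random sample" behavior, written as a standalone function. The name pick_fault_ranks, the world_size parameter, the precedence of the explicit list, and the empty-list fallback are all assumptions made for illustration; the real function takes only the config object.

```python
import random
from typing import Optional


def pick_fault_ranks(
    ranks: Optional[str],
    num_ranks: Optional[int],
    world_size: int,
    rng: random.Random,
) -> list[int]:
    """Hypothetical helper: explicit comma-separated ranks win over sampling."""
    if ranks is not None:
        # Parse e.g. "0, 3,5" into [0, 3, 5]; int() tolerates surrounding spaces.
        return [int(r) for r in ranks.split(",") if r.strip()]
    if num_ranks is not None:
        # Draw num_ranks distinct ranks from 0..world_size-1.
        return sorted(rng.sample(range(world_size), k=num_ranks))
    # Assumption: with neither option set, no rank is targeted.
    return []


rng = random.Random(1234)
explicit = pick_fault_ranks("0, 3,5", None, world_size=8, rng=rng)
sampled = pick_fault_ranks(None, 2, world_size=8, rng=rng)
```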

core.fault_injector.get_fault(config: core.fault_injector.FaultInjectorConfig)#

Sample a fault type according to the configured types and probabilities.
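
One way to picture this step is weighted sampling over the parsed fault types, with the probabilities normalized first as the fault_injector_fault_probabilities description suggests. The function below is a hedged sketch: pick_fault is a hypothetical name, and the uniform fallback and the validation rules are assumptions rather than documented behavior.

```python
import random
from typing import Optional


def pick_fault(
    fault_types: str,
    fault_probabilities: Optional[str],
    rng: random.Random,
) -> str:
    """Hypothetical helper: draw one fault type using normalized weights."""
    types = [t.strip() for t in fault_types.split(",") if t.strip()]
    if fault_probabilities is None:
        # Assumption: fall back to uniform weights when none are configured.
        weights = [1.0] * len(types)
    else:
        weights = [float(p) for p in fault_probabilities.split(",")]
    if len(weights) != len(types):
        raise ValueError("need exactly one probability per fault type")
    total = sum(weights)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    normalized = [w / total for w in weights]  # e.g. "1,3" -> [0.25, 0.75]
    return rng.choices(types, weights=normalized, k=1)[0]


rng = random.Random(1234)
uniform_pick = pick_fault("hang,crash", None, rng)
forced_pick = pick_fault("hang,crash", "0,5", rng)  # 'hang' has zero weight
```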

core.fault_injector.should_setup_fault_injection_at_start(
config: core.fault_injector.FaultInjectorConfig,
)#

Return True when fault timing is anchored to training start.

core.fault_injector.should_setup_fault_injection_at_iteration(
config: core.fault_injector.FaultInjectorConfig,
iteration,
)#

Return True when fault timing should start from the given iteration.
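
The two predicates can be read as complementary checks on fault_injector_delay_start_iteration. The sketch below is an interpretation, not the module's code: the function names are hypothetical, and the exact comparison (equality with the given iteration) is an assumption, since this page does not say how the iteration is matched.

```python
from typing import Optional


def anchored_at_start(delay_start_iteration: Optional[int]) -> bool:
    """Hypothetical: timing starts at training start when no iteration is set."""
    return delay_start_iteration is None


def anchored_at_iteration(delay_start_iteration: Optional[int], iteration: int) -> bool:
    """Hypothetical: timing starts once the configured iteration is reached."""
    # Assumption: an exact match triggers setup; the real check may differ.
    return delay_start_iteration is not None and iteration == delay_start_iteration


start_case = anchored_at_start(None)
iteration_case = anchored_at_iteration(100, 100)
not_yet_case = anchored_at_iteration(100, 99)
```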

core.fault_injector.get_fault_delay(config: core.fault_injector.FaultInjectorConfig)#

Return fault delay in seconds from the configured scheduling anchor.
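
Based on the field descriptions, the delay is either the forced fault_injector_fault_delay or a value sampled from fault_injector_mtti_seconds plus fault_injector_offset_seconds. The sampling distribution is not documented here; the sketch below assumes an exponential distribution with mean MTTI, drawn by inverse-transform sampling from rng.random(), purely for illustration. The name sample_fault_delay and the error handling are also hypothetical.

```python
import math
import random
from typing import Optional


def sample_fault_delay(
    fault_delay: Optional[float],
    mtti_seconds: Optional[float],
    offset_seconds: Optional[float],
    rng: random.Random,
) -> float:
    """Hypothetical helper: forced delay, else sampled delay plus offset."""
    if fault_delay is not None:
        # A forced delay is returned as-is; the offset applies to sampled delays.
        return fault_delay
    if mtti_seconds is None:
        raise ValueError("either a forced delay or an MTTI is required")
    u = rng.random()  # in [0.0, 1.0), so 1.0 - u is in (0.0, 1.0] and log is defined
    # Assumption: exponential inter-arrival time with mean mtti_seconds.
    sampled = -mtti_seconds * math.log(1.0 - u)
    return sampled + (offset_seconds or 0.0)


forced = sample_fault_delay(30.0, 600.0, 60.0, random.Random(1234))
sampled_a = sample_fault_delay(None, 600.0, 60.0, random.Random(1234))
sampled_b = sample_fault_delay(None, 600.0, 60.0, random.Random(1234))
```

Seeding the generator, as fault_injector_seed allows, makes the sampled delay reproducible across runs in this sketch.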

core.fault_injector.setup_fault_injection(config: core.fault_injector.FaultInjectorConfig)#

Broadcast fault plan across ranks and dispatch injection on target ranks.