core.fault_injector#
Module Contents#
Classes#
FaultInjectorConfig | Configuration for fault injection testing via nvidia_resiliency_ext.
_FaultInjectorRNG | Minimal RNG interface used by fault injector helper functions.
Functions#
get_fault_ranks | Return list of ranks to inject faults on, from explicit list or random sample.
get_fault | Sample a fault type according to the configured types and probabilities.
should_setup_fault_injection_at_start | Return True when fault timing is anchored to training start.
should_setup_fault_injection_at_iteration | Return True when fault timing should start from the given iteration.
get_fault_delay | Return fault delay in seconds from the configured scheduling anchor.
setup_fault_injection | Broadcast fault plan across ranks and dispatch injection on target ranks.
Data#
API#
- core.fault_injector.__all__#
['FaultInjectorConfig', 'setup_fault_injection', 'maybe_raise_workload_exception']
- core.fault_injector._require_nvidia_resiliency_ext()#
- core.fault_injector.logger#
‘getLogger(…)’
- core.fault_injector._T#
‘TypeVar(…)’
- class core.fault_injector.FaultInjectorConfig#
Configuration for fault injection testing via nvidia_resiliency_ext.
- fault_injector_ranks: Optional[str]#
None
Comma-separated list of ranks to inject faults on.
- fault_injector_num_ranks: Optional[int]#
None
Number of ranks to inject faults on (random selection).
- fault_injector_fault_types: Optional[str]#
None
Comma-separated list of fault types to inject (e.g. ‘hang,crash’).
- fault_injector_fault_probabilities: Optional[str]#
None
Comma-separated list of fault probabilities (normalized at runtime).
- fault_injector_fault_delay: Optional[float]#
None
Force a specific fault delay in seconds, measured from training start or, when fault_injector_delay_start_iteration is set, from the completion of that iteration.
- fault_injector_delay_start_iteration: Optional[int]#
None
Start the fault delay timer after iteration N completes. If unset, fault delay timing starts from the beginning of training.
- fault_injector_mtti_seconds: Optional[float]#
None
Mean time to inject (MTTI) in seconds; used when fault_injector_fault_delay is None.
- fault_injector_offset_seconds: Optional[float]#
None
Offset seconds added to the sampled fault delay.
- fault_injector_seed: Optional[int]#
None
RNG seed for the fault injector.
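The fields above can be illustrated with a small stand-in. This is only a sketch: `FaultInjectorConfigSketch` is a hypothetical dataclass that mirrors the documented field names and defaults, and whether the real `FaultInjectorConfig` is a dataclass or accepts keyword arguments this way is an assumption. The fault type names come from the `hang,crash` example in the field description; the probability and delay values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultInjectorConfigSketch:
    """Hypothetical stand-in mirroring the documented fields; not the real class."""

    fault_injector_ranks: Optional[str] = None
    fault_injector_num_ranks: Optional[int] = None
    fault_injector_fault_types: Optional[str] = None
    fault_injector_fault_probabilities: Optional[str] = None
    fault_injector_fault_delay: Optional[float] = None
    fault_injector_delay_start_iteration: Optional[int] = None
    fault_injector_mtti_seconds: Optional[float] = None
    fault_injector_offset_seconds: Optional[float] = None
    fault_injector_seed: Optional[int] = None


# Inject on ranks 0 and 3, favoring hangs 3:1 over crashes,
# with a fixed delay of 120 seconds and a reproducible seed.
cfg = FaultInjectorConfigSketch(
    fault_injector_ranks="0,3",
    fault_injector_fault_types="hang,crash",
    fault_injector_fault_probabilities="3,1",
    fault_injector_fault_delay=120.0,
    fault_injector_seed=1234,
)
```

Fields left unset stay `None`, matching the documented defaults.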
- class core.fault_injector._FaultInjectorRNG#
Bases: typing.Protocol
Minimal RNG interface used by fault injector helper functions.
- sample(population: Sequence[int], k: int) → list[int]#
Return k sampled items from the given population.
- choices(population: Sequence[core.fault_injector._T], weights: Sequence[float], k: int)#
Return k weighted samples from the given population.
- random() → float#
Return a floating-point value in the half-open interval [0.0, 1.0).
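The three methods above match the shape of Python's standard `random.Random`, so a seeded instance is one plausible object that satisfies the interface. The sketch below re-declares the protocol locally under the hypothetical name `FaultInjectorRNGSketch`; the return type of `choices` is not shown in this reference, so `list[_T]` is an assumption, as is the idea that the module itself uses `random.Random`.

```python
import random
from typing import Protocol, Sequence, TypeVar, runtime_checkable

_T = TypeVar("_T")


@runtime_checkable
class FaultInjectorRNGSketch(Protocol):
    """Local re-declaration of the documented interface (hypothetical name)."""

    def sample(self, population: Sequence[int], k: int) -> list[int]: ...

    # Return type assumed; the reference above does not show it.
    def choices(
        self, population: Sequence[_T], weights: Sequence[float], k: int
    ) -> list[_T]: ...

    def random(self) -> float: ...


# A seeded standard-library generator exposes all three methods.
rng = random.Random(1234)
is_compatible = isinstance(rng, FaultInjectorRNGSketch)

picked = rng.sample(range(8), k=2)               # two distinct ranks out of 0..7
fault = rng.choices(["hang", "crash"], weights=[3.0, 1.0], k=1)[0]
u = rng.random()                                 # value in [0.0, 1.0)
```

Note that `runtime_checkable` only verifies that the methods exist, not that their signatures match.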
- core.fault_injector.rng: core.fault_injector._FaultInjectorRNG | None#
None
- core.fault_injector._require_rng() → core.fault_injector._FaultInjectorRNG#
- core.fault_injector.get_fault_ranks(config: core.fault_injector.FaultInjectorConfig)#
Return list of ranks to inject faults on, from explicit list or random sample.
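As a rough illustration of "explicit list or random sample", the sketch below parses a comma-separated rank string and otherwise draws ranks at random. The helper name `pick_fault_ranks`, the `world_size` parameter, the precedence of the explicit list over the random sample, and the empty-list fallback are all assumptions; the real function takes only a `FaultInjectorConfig`.

```python
import random
from typing import Optional


def pick_fault_ranks(
    ranks: Optional[str],
    num_ranks: Optional[int],
    world_size: int,
    rng: random.Random,
) -> list[int]:
    """Hypothetical helper: explicit ranks win, else sample without replacement."""
    if ranks:
        # "0, 3" -> [0, 3]; int() tolerates surrounding whitespace.
        return [int(r) for r in ranks.split(",") if r.strip()]
    if num_ranks:
        return sorted(rng.sample(range(world_size), k=num_ranks))
    # Assumed fallback: no ranks configured means nothing to inject.
    return []


explicit = pick_fault_ranks("0, 3", None, world_size=8, rng=random.Random(0))
sampled = pick_fault_ranks(None, 2, world_size=8, rng=random.Random(0))
```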
- core.fault_injector.get_fault(config: core.fault_injector.FaultInjectorConfig)#
Sample a fault type according to the configured types and probabilities.
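The weighted draw can be sketched as follows, assuming the types and probabilities arrive as the comma-separated strings described under `FaultInjectorConfig` and that "normalized at runtime" means dividing each weight by the total. `sample_fault_type` is a hypothetical helper, and the error handling is illustrative rather than the module's documented behavior.

```python
import random


def sample_fault_type(fault_types: str, probabilities: str, rng: random.Random) -> str:
    """Hypothetical helper: draw one fault type using normalized weights."""
    types = [t.strip() for t in fault_types.split(",") if t.strip()]
    weights = [float(p) for p in probabilities.split(",")]
    if len(types) != len(weights):
        raise ValueError("fault types and probabilities must have the same length")

    total = sum(weights)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")

    # Normalize so the weights sum to 1.0, e.g. "3,1" -> [0.75, 0.25].
    normalized = [w / total for w in weights]
    return rng.choices(types, weights=normalized, k=1)[0]


always_hang = sample_fault_type("hang,crash", "1,0", random.Random(0))
mixed = sample_fault_type("hang,crash", "3,1", random.Random(0))
```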
- core.fault_injector.should_setup_fault_injection_at_start( )#
Return True when fault timing is anchored to training start.
- core.fault_injector.should_setup_fault_injection_at_iteration(config: core.fault_injector.FaultInjectorConfig, iteration)#
Return True when fault timing should start from the given iteration.
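The two anchor predicates can be read together: timing starts at training start when `fault_injector_delay_start_iteration` is unset, and at a given iteration otherwise. The sketch below captures only that reading. The helper names are hypothetical, the exact-equality comparison is an assumption, and the real functions may apply further conditions (for example, whether fault injection is enabled at all).

```python
from typing import Optional


def anchored_at_start(delay_start_iteration: Optional[int]) -> bool:
    """Hypothetical: timing starts at training start when no iteration is set."""
    return delay_start_iteration is None


def anchored_at_iteration(delay_start_iteration: Optional[int], iteration: int) -> bool:
    """Hypothetical: timing starts once the configured iteration is reached."""
    return delay_start_iteration is not None and iteration == delay_start_iteration


# With delay_start_iteration=100, only iteration 100 triggers setup.
triggers = [it for it in (0, 99, 100, 101) if anchored_at_iteration(100, it)]
```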
- core.fault_injector.get_fault_delay(config: core.fault_injector.FaultInjectorConfig)#
Return fault delay in seconds from the configured scheduling anchor.
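One way to picture the delay computation, based on the field descriptions under `FaultInjectorConfig`: a fixed `fault_injector_fault_delay` is returned as-is, otherwise a delay is sampled around the MTTI and the offset is added. The sampling distribution is not stated in this reference; the exponential distribution used below (inverse-transform sampling with mean equal to the MTTI) is purely an assumption, as is the helper name `sample_fault_delay`.

```python
import math
import random
from typing import Optional


def sample_fault_delay(
    fault_delay: Optional[float],
    mtti_seconds: Optional[float],
    offset_seconds: Optional[float],
    rng: random.Random,
) -> float:
    """Hypothetical helper: fixed delay if given, else a sampled delay plus offset."""
    if fault_delay is not None:
        return fault_delay
    if mtti_seconds is None:
        raise ValueError("either a fixed delay or an MTTI is required")

    # ASSUMPTION: exponential inter-arrival time with mean mtti_seconds.
    # rng.random() is in [0.0, 1.0), so 1.0 - u is in (0.0, 1.0] and log() is finite.
    u = rng.random()
    sampled = -mtti_seconds * math.log(1.0 - u)
    return sampled + (offset_seconds or 0.0)


fixed = sample_fault_delay(30.0, 600.0, 5.0, random.Random(0))
sampled_delay = sample_fault_delay(None, 600.0, 5.0, random.Random(0))
```

In this sketch the offset applies only to the sampled delay, following the wording of the `fault_injector_offset_seconds` description.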
- core.fault_injector.setup_fault_injection(config: core.fault_injector.FaultInjectorConfig)#
Broadcast fault plan across ranks and dispatch injection on target ranks.