Partition Health
A partition's health can be monitored through the health state maintained within the partition. The states a partition can enter depend on the resiliency mode specified for it and are listed below for each mode:
Full Bandwidth Mode:
HEALTHY: A partition marked as healthy is expected to have full bandwidth and full compute capacity. This state is considered optimal.
DEGRADED: In this state, some of the GPUs may be marked as “parked” and their GPU health may be NO_NVLINK. The remaining GPUs can still communicate at full bandwidth. This state is considered operational.
UNHEALTHY: A partition can become unhealthy for various reasons, such as the loss of a switch or other internal failures. This state is not considered operational.
Adaptive Bandwidth Mode:
HEALTHY: A partition marked as healthy is expected to have full bandwidth and full compute capacity. This state is considered optimal.
BANDWIDTH: In this state, some of the trunk links might be missing. However, all of the GPUs can still communicate with one another at reduced bandwidth. This state is considered operational.
UNHEALTHY: A partition can become unhealthy for various reasons, such as the loss of a switch or other internal failures. This state is not considered operational.
User Action Required Mode:
HEALTHY: A partition marked as healthy is expected to have full bandwidth and full compute capacity. This state is considered optimal.
DEGRADED_BANDWIDTH: In this state, some of the trunk links might be missing. However, all of the GPUs can still communicate with one another at reduced bandwidth. This state is considered operational.
UNHEALTHY: A partition can become unhealthy for various reasons, such as the loss of a switch or other internal failures. This state is not considered operational.
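The state lists above can be summarized as a small lookup table. The following Python sketch is illustrative only: `ResiliencyMode`, `PARTITION_STATES`, and `is_operational` are hypothetical names and are not part of any vendor API. The state names are copied as they appear in the lists above, and the sketch assumes that the optimal HEALTHY state also counts as operational.

```python
from enum import Enum


class ResiliencyMode(Enum):
    """Hypothetical labels for the three resiliency modes described above."""
    FULL_BANDWIDTH = "Full Bandwidth Mode"
    ADAPTIVE_BANDWIDTH = "Adaptive Bandwidth Mode"
    USER_ACTION_REQUIRED = "User Action Required Mode"


# For each mode: state name -> whether the text treats it as operational.
# Assumption: HEALTHY ("optimal") is also treated as operational.
PARTITION_STATES = {
    ResiliencyMode.FULL_BANDWIDTH: {
        "HEALTHY": True,
        "DEGRADED": True,       # some GPUs may be parked (NO_NVLINK)
        "UNHEALTHY": False,
    },
    ResiliencyMode.ADAPTIVE_BANDWIDTH: {
        "HEALTHY": True,
        "BANDWIDTH": True,      # name as listed above; reduced bandwidth
        "UNHEALTHY": False,
    },
    ResiliencyMode.USER_ACTION_REQUIRED: {
        "HEALTHY": True,
        "DEGRADED_BANDWIDTH": True,  # some trunk links may be missing
        "UNHEALTHY": False,
    },
}


def is_operational(mode: ResiliencyMode, state: str) -> bool:
    """Return True if `state` is considered operational in `mode`.

    Raises ValueError if the state is not listed for that mode.
    """
    states = PARTITION_STATES[mode]
    if state not in states:
        raise ValueError(f"{state!r} is not a listed state for {mode.value}")
    return states[state]
```

For example, `is_operational(ResiliencyMode.FULL_BANDWIDTH, "DEGRADED")` returns `True`, while passing `"DEGRADED"` with `ResiliencyMode.ADAPTIVE_BANDWIDTH` raises `ValueError`, because that state is not listed for that mode. Keeping the table keyed by mode makes it explicit that the same state name is not valid in every mode.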