CSP Health Monitor Configuration

Overview

The CSP Health Monitor detects cloud provider maintenance events and triggers automated node quarantine workflows. This document covers all Helm configuration options.

Module Enable/Disable

Controls whether the csp-health-monitor module is deployed in the cluster.

global:
  cspHealthMonitor:
    enabled: true

Cloud Provider Selection

The cspName field determines which cloud provider to monitor. Only one provider can be active at a time.

csp-health-monitor:
  cspName: "gcp"  # Options: "gcp" or "aws"

Global Settings

Settings that apply regardless of cloud provider.

csp-health-monitor:
  logLevel: info  # Options: debug, info, warn, error

  configToml:
    # Cluster identifier used in health events
    clusterName: "my-cluster"

    # How often the sidecar polls MongoDB for maintenance events (seconds)
    maintenanceEventPollIntervalSeconds: 60

    # Minutes before maintenance start time to trigger quarantine
    triggerQuarantineWorkflowTimeLimitMinutes: 30

    # Minutes after maintenance ends to send healthy event
    postMaintenanceHealthyDelayMinutes: 15

    # Timeout for node to become ready after maintenance (minutes)
    nodeReadinessTimeoutMinutes: 60
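To make the two timing windows concrete, here is a minimal sketch that reads the comments above literally: quarantine is triggered a fixed number of minutes before the maintenance start, and the healthy event is sent a fixed number of minutes after the maintenance end. The function names are hypothetical, and the real sidecar evaluates events on its poll interval, so actual trigger times may lag by up to one poll.

```python
from datetime import datetime, timedelta, timezone


def quarantine_trigger_time(maintenance_start, trigger_limit_minutes=30):
    """Earliest time quarantine would start (assumed: start minus the limit)."""
    return maintenance_start - timedelta(minutes=trigger_limit_minutes)


def healthy_event_time(maintenance_end, healthy_delay_minutes=15):
    """Time the healthy event would be sent (assumed: end plus the delay)."""
    return maintenance_end + timedelta(minutes=healthy_delay_minutes)


# Illustrative maintenance window, not real data.
start = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
end = datetime(2025, 1, 15, 13, 0, tzinfo=timezone.utc)

print(quarantine_trigger_time(start).isoformat())  # 2025-01-15T11:30:00+00:00
print(healthy_event_time(end).isoformat())         # 2025-01-15T13:15:00+00:00
```

With the default values, a one-hour maintenance window keeps the node out of service for roughly 1 hour 45 minutes, before any node readiness wait.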

GCP Configuration

Required Fields

csp-health-monitor:
  cspName: "gcp"

  configToml:
    clusterName: "my-gke-cluster"

    gcp:
      # GCP project ID where the cluster runs
      targetProjectId: "my-gcp-project-id"

      # GCP Service Account name (without @project.iam.gserviceaccount.com)
      # Must match the GCP SA created in IAM setup
      gcpServiceAccountName: "csp-health-monitor"

      # How often to poll Cloud Logging API (seconds)
      apiPollingIntervalSeconds: 60

      # Cloud Logging filter for maintenance events
      logFilter: 'logName="projects/my-gcp-project-id/logs/cloudaudit.googleapis.com%2Fsystem_event" AND protoPayload.methodName="compute.instances.upcomingMaintenance"'

GCP Parameters

targetProjectId

GCP project ID where the GKE cluster is running. The monitor queries Cloud Logging in this project.

gcpServiceAccountName

Name of the GCP Service Account (without the @project.iam.gserviceaccount.com suffix). Used to generate the Workload Identity annotation on the Kubernetes ServiceAccount.
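As a rough illustration of how the short name expands, the sketch below builds the full service account email and the resulting annotation. It assumes the chart combines gcpServiceAccountName with targetProjectId and uses the standard GKE Workload Identity annotation key, iam.gke.io/gcp-service-account; the helper names are hypothetical.

```python
def gcp_service_account_email(name, project_id):
    """Expand a short service account name into its full email form."""
    return f"{name}@{project_id}.iam.gserviceaccount.com"


def workload_identity_annotation(name, project_id):
    """Annotation GKE Workload Identity expects on the Kubernetes ServiceAccount.

    Assumption: the chart derives the project from targetProjectId.
    """
    return {"iam.gke.io/gcp-service-account": gcp_service_account_email(name, project_id)}


print(workload_identity_annotation("csp-health-monitor", "my-gcp-project-id"))
# {'iam.gke.io/gcp-service-account': 'csp-health-monitor@my-gcp-project-id.iam.gserviceaccount.com'}
```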

apiPollingIntervalSeconds

How frequently the monitor polls the Cloud Logging API for new maintenance events. Lower values provide faster detection but increase API usage.
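The trade-off can be estimated with simple arithmetic. The sketch below assumes one API request per poll, which is a simplification: paginated responses would add requests. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def polls_per_day(interval_seconds):
    """Number of polls in 24 hours for a given polling interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return SECONDS_PER_DAY // interval_seconds


for interval in (15, 60, 300):
    # Worst-case detection delay is roughly one full interval.
    print(f"{interval:>3}s interval: {polls_per_day(interval)} polls/day")
```

At the default of 60 seconds this comes to 1440 polls per day; dropping to 15 seconds quadruples that to 5760.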

logFilter

Cloud Logging filter expression to select maintenance events. Common filters:

# Standard GCE instance maintenance
'logName="projects/{PROJECT_ID}/logs/cloudaudit.googleapis.com%2Fsystem_event" AND protoPayload.methodName="compute.instances.upcomingMaintenance"'

# Include termination events
'logName="projects/{PROJECT_ID}/logs/cloudaudit.googleapis.com%2Fsystem_event" AND (protoPayload.methodName="compute.instances.upcomingMaintenance" OR protoPayload.methodName="compute.instances.terminateOnHostMaintenance")'
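Because the project ID appears inside the filter string, it is easy for logFilter and targetProjectId to drift apart. One way to avoid that is to generate the filter from the project ID, as in this sketch. The helper is hypothetical and only reproduces the two filter shapes shown above.

```python
SYSTEM_EVENT_LOG = "cloudaudit.googleapis.com%2Fsystem_event"
UPCOMING_MAINTENANCE = "compute.instances.upcomingMaintenance"


def maintenance_log_filter(project_id, methods=(UPCOMING_MAINTENANCE,)):
    """Build a Cloud Logging filter for the given project and audit-log methods."""
    if not methods:
        raise ValueError("at least one method name is required")
    log_name = f'logName="projects/{project_id}/logs/{SYSTEM_EVENT_LOG}"'
    clauses = [f'protoPayload.methodName="{m}"' for m in methods]
    # A single method needs no parentheses; several are OR-ed inside a group.
    method_expr = clauses[0] if len(clauses) == 1 else "(" + " OR ".join(clauses) + ")"
    return f"{log_name} AND {method_expr}"


print(maintenance_log_filter("my-gcp-project-id"))
print(maintenance_log_filter(
    "my-gcp-project-id",
    (UPCOMING_MAINTENANCE, "compute.instances.terminateOnHostMaintenance"),
))
```

The printed strings can be pasted into logFilter, wrapped in single quotes so that YAML keeps the inner double quotes intact.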

Complete GCP Example

global:
  cspHealthMonitor:
    enabled: true

csp-health-monitor:
  cspName: "gcp"
  logLevel: info

  configToml:
    clusterName: "production-gke-cluster"
    maintenanceEventPollIntervalSeconds: 60
    triggerQuarantineWorkflowTimeLimitMinutes: 30
    postMaintenanceHealthyDelayMinutes: 15
    nodeReadinessTimeoutMinutes: 60

    gcp:
      targetProjectId: "my-production-project"
      gcpServiceAccountName: "csp-health-monitor"
      apiPollingIntervalSeconds: 60
      logFilter: 'logName="projects/my-production-project/logs/cloudaudit.googleapis.com%2Fsystem_event" AND protoPayload.methodName="compute.instances.upcomingMaintenance"'

AWS Configuration

Required Fields

csp-health-monitor:
  cspName: "aws"

  configToml:
    clusterName: "my-eks-cluster"

    aws:
      # AWS Account ID (12-digit number)
      accountId: "123456789012"

      # AWS region where the EKS cluster runs
      region: "us-east-1"

      # How often to poll AWS Health API (seconds)
      pollingIntervalSeconds: 60

      # (Optional) Custom IAM role name for IRSA
      iamRoleName: ""

AWS Parameters

accountId

AWS account ID (12-digit number) where the EKS cluster is running. Used to construct the IAM role ARN annotation.

region

AWS region where the EKS cluster is deployed. The monitor queries the AWS Health API in this region.

pollingIntervalSeconds

How frequently the monitor polls the AWS Health API for maintenance events. Lower values provide faster detection but increase API usage.

iamRoleName

Custom IAM role name for IRSA (IAM Roles for Service Accounts). When set, the ServiceAccount annotation uses this role name directly instead of constructing one from clusterName.

If left empty (default), the role name is generated as <clusterName>-nvsentinel-health-monitor-assume-role-policy.

Important (EKS): AWS IAM role names have a maximum of 64 characters. The default suffix -nvsentinel-health-monitor-assume-role-policy is 45 characters, leaving only 19 characters for the cluster name. If your EKS cluster name exceeds 19 characters, you must set iamRoleName to a custom value.
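The limit is easy to check before deploying. The sketch below mirrors the naming rule described above: use iamRoleName if set, otherwise append the default suffix to the cluster name, then verify the 64-character limit. The function names are hypothetical, and the ARN uses the standard arn:aws:iam::<accountId>:role/<roleName> format.

```python
IAM_ROLE_NAME_MAX = 64
DEFAULT_SUFFIX = "-nvsentinel-health-monitor-assume-role-policy"  # 45 characters


def effective_role_name(cluster_name, iam_role_name=""):
    """Return the role name the annotation would use, or raise if it is too long."""
    name = iam_role_name or f"{cluster_name}{DEFAULT_SUFFIX}"
    if len(name) > IAM_ROLE_NAME_MAX:
        raise ValueError(
            f"role name {name!r} is {len(name)} characters; "
            f"the IAM limit is {IAM_ROLE_NAME_MAX}, so set iamRoleName explicitly"
        )
    return name


def role_arn(account_id, role_name):
    """Standard IAM role ARN for the given account and role name."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


name = effective_role_name("my-eks-cluster")  # 14 + 45 = 59 characters, within the limit
print(len(name), role_arn("123456789012", name))
```

Calling effective_role_name with a 22-character cluster name such as "production-eks-cluster" raises ValueError, because the default name would be 67 characters long.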

Complete AWS Example

global:
  cspHealthMonitor:
    enabled: true

csp-health-monitor:
  cspName: "aws"
  logLevel: info

  configToml:
    # 16 characters: within the 19-character limit for the default IAM role name
    clusterName: "prod-eks-cluster"
    maintenanceEventPollIntervalSeconds: 60
    triggerQuarantineWorkflowTimeLimitMinutes: 30
    postMaintenanceHealthyDelayMinutes: 15
    nodeReadinessTimeoutMinutes: 60

    aws:
      accountId: "123456789012"
      region: "us-east-1"
      pollingIntervalSeconds: 60

AWS Example with Custom IAM Role Name

For clusters with long names (>19 characters), set iamRoleName explicitly:

csp-health-monitor:
  cspName: "aws"

  configToml:
    clusterName: "my-very-long-production-eks-cluster-name"

    aws:
      accountId: "123456789012"
      region: "us-east-1"
      pollingIntervalSeconds: 60
      iamRoleName: "my-custom-nvsentinel-role"

CSP-Specific IAM Requirements

Each cloud provider handles IAM identity for the CSP Health Monitor differently:

| Provider | IAM Identity Configuration | Naming Flexibility |
| --- | --- | --- |
| GCP | gcp.gcpServiceAccountName: the user provides any GCP Service Account name. The ServiceAccount annotation is built as <name>@<project>.iam.gserviceaccount.com. | Fully flexible. No naming convention enforced. |
| AWS (EKS) | aws.iamRoleName (optional): the user provides a custom IAM role name. If omitted, the role name defaults to <clusterName>-nvsentinel-health-monitor-assume-role-policy. | Flexible when iamRoleName is set. The default convention imposes a 19-character cluster name limit (AWS IAM role names max 64 chars, default suffix is 45 chars). |

Recommendation for EKS users: If your cluster name is longer than 19 characters, always set aws.iamRoleName explicitly and create the corresponding IAM role with that name. See IAM Setup for detailed instructions.

Advanced Configuration

Out-of-Cluster Monitoring

For monitoring a tenant cluster from a separate management cluster:

csp-health-monitor:
  configToml:
    # Path to kubeconfig for tenant cluster
    kubeconfigPath: "/etc/kubeconfig/tenant-cluster.yaml"

When kubeconfigPath is set, the monitor uses the specified kubeconfig to connect to the tenant cluster’s Kubernetes API for node mapping. If it is empty, the monitor uses the in-cluster config.

Resources

Configure resource requests and limits for the main container and sidecar.

csp-health-monitor:
  # Main container resources
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"
    requests:
      cpu: "200m"
      memory: "256Mi"

  # Sidecar (Quarantine Trigger Engine) resources
  quarantineTriggerEngine:
    resources:
      limits:
        cpu: "500m"
        memory: "512Mi"
      requests:
        cpu: "100m"
        memory: "128Mi"

Scheduling

Configure pod placement using node selectors, tolerations, and affinity rules.

csp-health-monitor:
  nodeSelector:
    node-role.kubernetes.io/control-plane: ""

  tolerations:
    - key: "node-role.kubernetes.io/control-plane"
      operator: "Exists"
      effect: "NoSchedule"

  affinity: {}