Copyright © 2026, NVIDIA Corporation.

CSP Health Monitor - IAM Requirements

Overview

The CSP Health Monitor requires IAM permissions to monitor cloud provider maintenance events. This document provides the setup commands for GCP and AWS.

Google Cloud Platform (GCP)

Required IAM Permission

  • logging.logEntries.list - Read Cloud Logging entries for maintenance events

Setup Commands

Replace placeholders:

  • <GCP_SA_NAME> - GCP Service Account name (e.g., csp-health-monitor)
  • <TARGET_PROJECT_ID> - GCP project ID where the cluster runs
  • <GKE_PROJECT_ID> - GCP project ID where the GKE cluster is deployed
  • <NAMESPACE> - Kubernetes namespace (default: nvsentinel)

# 1. Create GCP Service Account
gcloud iam service-accounts create <GCP_SA_NAME> \
  --display-name="CSP Health Monitor Service Account" \
  --project=<TARGET_PROJECT_ID>

# 2. Create custom IAM role with minimal permissions
gcloud iam roles create cspHealthMonitorRole \
  --project=<TARGET_PROJECT_ID> \
  --title="CSP Health Monitor Role" \
  --description="Minimal permissions for CSP Health Monitor" \
  --permissions="logging.logEntries.list"

# 3. Grant role to GCP Service Account
gcloud projects add-iam-policy-binding <TARGET_PROJECT_ID> \
  --member="serviceAccount:<GCP_SA_NAME>@<TARGET_PROJECT_ID>.iam.gserviceaccount.com" \
  --role="projects/<TARGET_PROJECT_ID>/roles/cspHealthMonitorRole"

# 4. Enable Workload Identity binding
gcloud iam service-accounts add-iam-policy-binding \
  <GCP_SA_NAME>@<TARGET_PROJECT_ID>.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:<GKE_PROJECT_ID>.svc.id.goog[<NAMESPACE>/csp-health-monitor]"
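
Steps 3 and 4 above both depend on two derived strings: the service account email and the Workload Identity member. The following is a minimal sketch of how those strings are composed from the placeholders, so they can be printed and checked before running gcloud. The helper names (gcp_sa_email, workload_identity_member) and the example project IDs are hypothetical and not part of NVSentinel; the Kubernetes service account name csp-health-monitor is taken from step 4.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; they are not shipped with NVSentinel.

# Compose the service account email used in steps 3 and 4.
gcp_sa_email() {
  local sa_name="$1" project_id="$2"
  printf '%s@%s.iam.gserviceaccount.com\n' "$sa_name" "$project_id"
}

# Compose the Workload Identity member used in step 4.
# The namespace defaults to nvsentinel, matching the documented default.
workload_identity_member() {
  local gke_project_id="$1" namespace="${2:-nvsentinel}"
  printf 'serviceAccount:%s.svc.id.goog[%s/csp-health-monitor]\n' \
    "$gke_project_id" "$namespace"
}

# Example values are assumptions; substitute your own project IDs.
gcp_sa_email "csp-health-monitor" "my-target-project"
workload_identity_member "my-gke-project"
```

Printing both values first makes it easier to spot a swapped <TARGET_PROJECT_ID> and <GKE_PROJECT_ID> when the two projects differ.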

Helm Configuration

csp-health-monitor:
  cspName: "gcp"
  configToml:
    clusterName: "my-gke-cluster"
    gcp:
      targetProjectId: "<TARGET_PROJECT_ID>"
      gcpServiceAccountName: "<GCP_SA_NAME>"
      apiPollingIntervalSeconds: 60
      logFilter: 'logName="projects/<TARGET_PROJECT_ID>/logs/cloudaudit.googleapis.com%2Fsystem_event" AND protoPayload.methodName="compute.instances.upcomingMaintenance"'
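
The logFilter value above embeds the target project ID inside the log name, with the slash in the log ID URL-encoded as %2F. As a rough sketch, the filter can be generated from the project ID rather than edited by hand. The function name build_log_filter and the example project ID are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the logFilter string shown in the Helm values
# for a given GCP project ID.
build_log_filter() {
  local project_id="$1"
  # The filter is passed as a printf argument, not as the format string,
  # so the literal %2F is not interpreted as a format specifier.
  printf '%s\n' "logName=\"projects/${project_id}/logs/cloudaudit.googleapis.com%2Fsystem_event\" AND protoPayload.methodName=\"compute.instances.upcomingMaintenance\""
}

# Example project ID is an assumption; substitute <TARGET_PROJECT_ID>.
build_log_filter "my-target-project"

# One possible manual check of the filter syntax (requires gcloud credentials):
#   gcloud logging read "$(build_log_filter my-target-project)" \
#     --project=my-target-project --limit=1
```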

Amazon Web Services (AWS)

Required IAM Permissions

  • health:DescribeEvents - Query AWS Health API for maintenance events
  • health:DescribeAffectedEntities - Get affected EC2 instance IDs
  • health:DescribeEventDetails - Get event details and recommended actions

Setup Commands

Replace placeholders:

  • <CLUSTER_NAME> - EKS cluster name
  • <NAMESPACE> - Kubernetes namespace (default: nvsentinel)

# 1. Create IAM policy
aws iam create-policy \
  --policy-name CSPHealthMonitorPolicy \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "health:DescribeEvents",
          "health:DescribeAffectedEntities",
          "health:DescribeEventDetails"
        ],
        "Resource": "*"
      }
    ]
  }'

# 2. Get OIDC provider and Account ID
OIDC_PROVIDER=$(aws eks describe-cluster --name <CLUSTER_NAME> --query "cluster.identity.oidc.issuer" --output text | sed 's|https://||')
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# 3. Create trust policy file
cat > trust-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com",
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:<NAMESPACE>:csp-health-monitor"
        }
      }
    }
  ]
}
EOF

# 4. Create IAM role
aws iam create-role \
  --role-name <CLUSTER_NAME>-nvsentinel-health-monitor-assume-role-policy \
  --assume-role-policy-document file://trust-policy.json

# 5. Attach policy to role
aws iam attach-role-policy \
  --role-name <CLUSTER_NAME>-nvsentinel-health-monitor-assume-role-policy \
  --policy-arn arn:aws:iam::${ACCOUNT_ID}:policy/CSPHealthMonitorPolicy
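
The trust policy in step 3 is keyed on two derived values: the OIDC provider (the issuer URL without its https:// scheme) and the service account subject. The sketch below reproduces those derivations offline so the resulting condition keys can be inspected before the policy file is written. The function names and the example issuer URL are hypothetical; only the sed expression and the subject format come from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# Strip the scheme from an OIDC issuer URL, as step 2 does with sed.
oidc_provider_from_issuer() {
  printf '%s\n' "$1" | sed 's|^https://||'
}

# Build the subject claim that the trust policy matches on.
# The namespace defaults to nvsentinel, matching the documented default.
service_account_subject() {
  local namespace="${1:-nvsentinel}"
  printf 'system:serviceaccount:%s:csp-health-monitor\n' "$namespace"
}

# The issuer below is a made-up example, not a real cluster.
issuer="https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE0123456789"
provider="$(oidc_provider_from_issuer "$issuer")"

echo "Federated principal suffix: oidc-provider/${provider}"
echo "Condition key: ${provider}:sub"
echo "Condition value: $(service_account_subject)"
```

If the namespace in the subject does not match the namespace the chart is installed into, the condition in the trust policy will not match the pod's token.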

Helm Configuration

csp-health-monitor:
  cspName: "aws"
  configToml:
    clusterName: "<CLUSTER_NAME>"
    aws:
      accountId: "<ACCOUNT_ID>"
      region: "<AWS_REGION>"
      pollingIntervalSeconds: 60
      # Optional: override the default IAM role name
      # iamRoleName: "my-custom-nvsentinel-role"

Important (EKS): By default, the IAM role name is constructed as <CLUSTER_NAME>-nvsentinel-health-monitor-assume-role-policy. AWS IAM role names have a 64-character limit, and the default suffix is 45 characters, leaving only 19 characters for the cluster name. If your cluster name exceeds 19 characters, set aws.iamRoleName to a custom role name and use that name in both the create-role and attach-role-policy commands. For example, to create the role:

aws iam create-role \
  --role-name my-custom-nvsentinel-role \
  --assume-role-policy-document file://trust-policy.json

Then in Helm values:

aws:
  iamRoleName: "my-custom-nvsentinel-role"
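
The 64-character limit described above can be checked before any IAM role is created. The following sketch computes the default role name for a cluster and reports whether it fits. The function names and the example cluster names are hypothetical; the suffix and the limit are the ones stated in the note above.

```shell
#!/usr/bin/env bash
# Suffix and limit as described in the note above.
ROLE_SUFFIX="-nvsentinel-health-monitor-assume-role-policy"
MAX_ROLE_NAME_LEN=64

# Hypothetical helper: the role name used when aws.iamRoleName is not set.
default_role_name() {
  printf '%s%s\n' "$1" "$ROLE_SUFFIX"
}

# Hypothetical helper: succeeds when the default role name is within the limit.
role_name_fits() {
  local name
  name="$(default_role_name "$1")"
  [ "${#name}" -le "$MAX_ROLE_NAME_LEN" ]
}

# Example cluster names are assumptions for illustration.
for cluster in "prod-gpu-cluster" "my-very-long-eks-cluster-name"; do
  if role_name_fits "$cluster"; then
    echo "$cluster: default role name fits"
  else
    echo "$cluster: set aws.iamRoleName to a shorter custom name"
  fi
done
```

Running a check like this in a provisioning script surfaces the problem at setup time, before aws iam create-role is called with an over-long name.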

Additional Resources

  • Configuration Guide: See docs/configuration/csp-health-monitor.md for detailed Helm configuration options
  • Troubleshooting: See docs/runbooks/csp-health-monitor-iam.md for common issues and solutions