Runbook: CSP Health Monitor IAM Troubleshooting
Overview
This runbook covers IAM permission issues for the CSP Health Monitor on GCP and AWS.
GCP Issues
Symptom: PERMISSION_DENIED Errors
Logs show:
Verification Steps
- Check GCP Service Account has required role:
The output should include either the custom role projects/<TARGET_PROJECT_ID>/roles/cspHealthMonitorRole or the predefined role roles/logging.viewer.
- Check Workload Identity binding:
Expected output should show roles/iam.workloadIdentityUser with member serviceAccount:<GKE_PROJECT_ID>.svc.id.goog[nvsentinel/csp-health-monitor].
- Check ServiceAccount annotation:
Expected output: <GCP_SA_NAME>@<TARGET_PROJECT_ID>.iam.gserviceaccount.com
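The three checks above could be run roughly as follows. This is a minimal sketch, not the chart's official procedure: it assumes gcloud and kubectl are already authenticated, that the Kubernetes ServiceAccount is csp-health-monitor in the nvsentinel namespace (taken from the member string above), and that the annotation uses the standard GKE Workload Identity key iam.gke.io/gcp-service-account. The function names are hypothetical.

```shell
# Hypothetical helper names; they are illustrative and not part of the Helm chart.

# 1. List the roles granted to the GCP service account on the target project.
#    $1 = target project ID, $2 = GCP service account email
gcp_sa_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:$2" \
    --format="value(bindings.role)"
}

# 2. Show the IAM policy set on the service account itself, which is where
#    the roles/iam.workloadIdentityUser binding should appear.
#    $1 = GCP service account email, $2 = target project ID
gcp_wi_binding() {
  gcloud iam service-accounts get-iam-policy "$1" --project "$2" --format=json
}

# 3. Print the GCP service account annotation on the Kubernetes ServiceAccount.
#    Assumes the standard annotation key iam.gke.io/gcp-service-account.
k8s_gcp_sa_annotation() {
  kubectl get serviceaccount csp-health-monitor -n nvsentinel \
    -o jsonpath='{.metadata.annotations.iam\.gke\.io/gcp-service-account}'
}
```

For example, gcp_sa_roles <TARGET_PROJECT_ID> <GCP_SA_NAME>@<TARGET_PROJECT_ID>.iam.gserviceaccount.com prints one role per line, which can be compared against the expected roles listed above.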
Resolution
If the GCP Service Account is missing the role:
If Workload Identity binding is missing:
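Both fixes above map onto standard gcloud IAM commands. The sketch below is an assumption about how they might be applied, using the role names and Workload Identity member string given under Verification Steps; the function names are hypothetical, and whoever runs them needs permission to modify IAM policy on the target project and on the service account.

```shell
# Hypothetical helper names; adjust the role to match your deployment.

# Grant a role to the GCP service account on the target project.
# $1 = target project ID, $2 = GCP service account email,
# $3 = role (for example roles/logging.viewer or the custom role path)
gcp_grant_role() {
  gcloud projects add-iam-policy-binding "$1" \
    --member="serviceAccount:$2" \
    --role="$3"
}

# Allow the Kubernetes ServiceAccount nvsentinel/csp-health-monitor to
# impersonate the GCP service account through Workload Identity.
# $1 = GCP service account email, $2 = target project ID, $3 = GKE project ID
gcp_bind_workload_identity() {
  gcloud iam service-accounts add-iam-policy-binding "$1" \
    --project "$2" \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:$3.svc.id.goog[nvsentinel/csp-health-monitor]"
}
```

IAM changes can take a short while to propagate, so rerun the verification steps before restarting the monitor.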
Test Permissions Manually
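One way to test the permissions by hand is to read a single log entry while impersonating the service account. This is a sketch under assumptions: it presumes the monitor's access is to Cloud Logging (suggested by roles/logging.viewer above), and it requires that the operator running it holds roles/iam.serviceAccountTokenCreator on the service account. The function name is hypothetical.

```shell
# Hypothetical helper: read one log entry from the target project as the
# monitor's GCP service account.
# $1 = target project ID, $2 = GCP service account email
gcp_test_log_read() {
  gcloud logging read \
    --project "$1" \
    --limit=1 \
    --impersonate-service-account="$2"
}
```

A PERMISSION_DENIED response here points at the project-level role binding rather than at Workload Identity, because the impersonation path bypasses the Kubernetes ServiceAccount entirely.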
AWS Issues
Symptom: AccessDeniedException Errors
Logs show:
Verification Steps
- Check IAM policy is attached to role:
Note: The role name is either the value of aws.iamRoleName (if set) or the default <CLUSTER_NAME>-nvsentinel-health-monitor-assume-role-policy.
Expected output should show CSPHealthMonitorPolicy attached.
- Check IAM role trust policy:
Expected: Trust policy should reference the correct EKS OIDC provider and system:serviceaccount:nvsentinel:csp-health-monitor.
- Check ServiceAccount annotation:
Expected output: arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE_NAME>
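The AWS checks above could be scripted along these lines. This is a minimal sketch: it assumes the AWS CLI and kubectl are configured for the right account and cluster, and that the ServiceAccount annotation uses the standard EKS key eks.amazonaws.com/role-arn. The function names are hypothetical.

```shell
# Hypothetical helper names; they are illustrative and not part of the Helm chart.

# 1. List the managed policies attached to the role.
#    $1 = IAM role name
aws_role_attached_policies() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' --output text
}

# 2. Print the role's trust (assume-role) policy document.
#    $1 = IAM role name
aws_role_trust_policy() {
  aws iam get-role --role-name "$1" \
    --query 'Role.AssumeRolePolicyDocument' --output json
}

# 3. Print the role ARN annotation on the Kubernetes ServiceAccount.
#    Assumes the standard annotation key eks.amazonaws.com/role-arn.
k8s_aws_role_annotation() {
  kubectl get serviceaccount csp-health-monitor -n nvsentinel \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
}
```

If CSPHealthMonitorPolicy was created as an inline policy rather than a managed one, it will not appear in the first command's output; aws iam list-role-policies --role-name <IAM_ROLE_NAME> lists inline policies instead.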
Resolution
If IAM policy is not attached:
If the role ARN doesn’t match Helm values, ensure aws.iamRoleName (or configToml.clusterName if using the default pattern) is correct, and redeploy.
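For the first case above (policy not attached), the attach step might look like the following. It is a sketch that assumes CSPHealthMonitorPolicy is a customer managed policy in the same account at the default path; if your deployment created it elsewhere, substitute the real policy ARN. The function name is hypothetical.

```shell
# Hypothetical helper: attach the monitor policy to the IAM role.
# Assumes the policy ARN follows arn:aws:iam::<ACCOUNT_ID>:policy/CSPHealthMonitorPolicy.
# $1 = IAM role name, $2 = AWS account ID
aws_attach_monitor_policy() {
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn "arn:aws:iam::$2:policy/CSPHealthMonitorPolicy"
}
```

After attaching, rerun the attached-policy check from Verification Steps to confirm the policy is listed.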
Test Permissions Manually
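One way to test the role's permissions without touching the running pod is the IAM policy simulator. The sketch below is an assumption about how that could be done: this runbook does not list the exact actions the monitor needs, so the action names are passed in as arguments and should be taken from the CSPHealthMonitorPolicy document. The function name is hypothetical.

```shell
# Hypothetical helper: evaluate IAM actions against the policies on a role.
# $1 = role ARN; remaining arguments = IAM action names to evaluate
aws_simulate_role_actions() {
  role_arn="$1"
  shift
  aws iam simulate-principal-policy \
    --policy-source-arn "$role_arn" \
    --action-names "$@" \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text
}
```

Each output row pairs an action with its decision; anything other than allowed for an action the monitor calls suggests the policy is missing or incomplete. The simulator evaluates the role's permission policies only, so a trust policy problem still has to be checked separately.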
Node Mapping Failures
Symptom: Events Detected but Nodes Not Quarantined
Logs show:
Verification Steps
- Check nodes have providerID set:
Expected:
- GCP: gce://<project-id>/<zone>/<instance-name>
- AWS: aws:///<availability-zone>/<instance-id>
- Check GCP node annotations (GCP only):
- Check RBAC permissions:
Expected: yes
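The node mapping checks above could be sketched as follows. Several details are assumptions: the runbook does not name the specific GCP annotation to look for, so the helper prints all annotations on a node, and the RBAC check uses the list verb on nodes as a plausible example of what the monitor needs. The function names are hypothetical.

```shell
# Hypothetical helper names; they are illustrative and not part of the Helm chart.

# Show each node's providerID.
k8s_node_provider_ids() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,PROVIDER_ID:.spec.providerID'
}

# Loosely classify a providerID string by the two formats listed above.
# $1 = providerID
provider_id_kind() {
  case "$1" in
    gce://?*/?*/?*) echo gcp ;;
    aws:///?*/?*)   echo aws ;;
    *)              echo unknown ;;
  esac
}

# Print all annotations on one node (GCP only; inspect the output by hand).
# $1 = node name
k8s_node_annotations() {
  kubectl get node "$1" -o jsonpath='{.metadata.annotations}'
}

# Ask the API server whether the monitor's ServiceAccount may list nodes.
# The verb and resource are assumptions; adjust to what the ClusterRole grants.
k8s_can_list_nodes() {
  kubectl auth can-i list nodes \
    --as=system:serviceaccount:nvsentinel:csp-health-monitor
}
```

A node whose providerID is empty or classified as unknown cannot be matched to a cloud instance, which is consistent with events being detected but no node being quarantined.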
Resolution
If nodes are missing providerID, the kubelet configuration may be incorrect. Check node registration and cloud provider integration.
If RBAC is missing, verify the ClusterRole and ClusterRoleBinding were created by the Helm chart:
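A quick way to look for those objects is to list cluster-scoped RBAC resources and filter by name. This sketch assumes the chart names them after the monitor (containing csp-health-monitor); that naming is a guess, so pass a different pattern if your release uses another name. The function name is hypothetical.

```shell
# Hypothetical helper: list ClusterRoles and ClusterRoleBindings whose names
# contain the given pattern (default: csp-health-monitor, an assumed name).
# $1 = optional name pattern
k8s_monitor_rbac_objects() {
  kubectl get clusterrole,clusterrolebinding -o name \
    | grep -i "${1:-csp-health-monitor}"
}
```

If nothing is printed, the RBAC objects were likely not rendered or not applied, and redeploying the Helm chart is the next step.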