Log Collection
Overview
NVSentinel’s log collection feature automatically gathers diagnostic logs from GPU nodes when faults are detected. These logs help with troubleshooting and root cause analysis of GPU hardware and software issues.
When a node is drained due to a fault, NVSentinel can optionally create a log collection job that gathers comprehensive diagnostics and stores them in an in-cluster file server for easy access.
Why Do You Need This?
When GPU nodes fail, you need diagnostic information to understand what went wrong:
- Root cause analysis: Determine if the issue is hardware, driver, or configuration related
- Support requests: Provide comprehensive logs to NVIDIA support or your infrastructure team
- Trend analysis: Build historical data on failure patterns
- Faster resolution: All relevant logs collected automatically instead of manual gathering
Without log collection, you’d need to manually SSH to failed nodes, locate the right log files, and extract them before the node is rebooted or replaced - a time-consuming and error-prone process.
How It Works
When log collection is enabled, NVSentinel automatically collects diagnostics when a fault with a supported remediation action is detected:
1. The Fault Remediation module receives a health event with a supported action
2. It creates a Kubernetes Job on the target node to collect diagnostics
3. The Job runs with privileged access to gather logs (in parallel with node drain/remediation)
4. Logs are uploaded to the in-cluster file server
5. The Job completes and is automatically cleaned up after 1 hour
Note: Log collection is skipped for unsupported remediation actions since no automated remediation is performed and the node remains accessible for manual log collection.
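The Job lifecycle above can be sketched roughly as follows. This is an illustrative manifest only: the names, image, and field values are assumptions, not NVSentinel's actual generated Job.

```yaml
# Illustrative sketch of a log-collection Job -- names and image are assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: log-collection-gpu-node-01
spec:
  ttlSecondsAfterFinished: 3600   # automatically cleaned up 1 hour after completion
  template:
    spec:
      nodeName: gpu-node-01       # pinned to the faulted node
      restartPolicy: Never
      containers:
        - name: collector
          image: example.com/log-collector:latest   # placeholder image
          securityContext:
            privileged: true      # required to reach host-level diagnostic sources
```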
The file server stores logs organized by node name and timestamp, accessible via web browser through port-forwarding.
What Logs Are Collected
NVIDIA Bug Report
Comprehensive NVIDIA driver and GPU diagnostic report:
- GPU configuration and status
- Driver version and details
- GPU error logs
- PCIe information
- DCGM diagnostics
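This report corresponds roughly to running NVIDIA's standard `nvidia-bug-report.sh` tool on the node (the exact flags NVSentinel passes are not documented here):

```shell
# Generates nvidia-bug-report.log.gz in the current directory.
# Requires the NVIDIA driver to be installed on the node.
nvidia-bug-report.sh
```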
GPU Operator Must-Gather (Optional)
Warning: Must-gather is disabled by default because it collects logs from ALL nodes in the cluster, which can be very time-consuming for large clusters (e.g., GB200 clusters with 100+ nodes).
When enabled, collects Kubernetes resources and logs for GPU operator components:
- GPU operator pod logs
- DCGM exporter logs
- Device plugin logs
- GPU feature discovery logs
- Operator configuration
To enable must-gather, set `enableGpuOperatorMustGather: true` and increase the timeout accordingly (see Timeout Configuration below).
Cloud Provider SOS Reports (Optional)
System logs and configuration from GCP or AWS instances when enabled.
Configuration
Configure log collection through Helm values:
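The exact keys depend on your NVSentinel chart version; the following is an illustrative sketch, where key names such as `logCollection.enabled` are assumptions rather than the chart's confirmed schema:

```yaml
# Illustrative Helm values sketch -- key names are assumptions;
# check your chart's values.yaml for the actual schema.
logCollection:
  enabled: true                       # turn log collection on or off
  timeout: 10m                        # maximum duration of the collection Job
  enableGpuOperatorMustGather: false  # disabled by default (cluster-wide, slow)
```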
Timeout Configuration
The default timeout of `10m` is sufficient when collecting only the NVIDIA bug report (must-gather disabled).
Important: If you enable `enableGpuOperatorMustGather`, you MUST increase the timeout! Must-gather collects logs from all nodes in the cluster, taking approximately 2-3 minutes per node.
Recommended timeout formula:
(number of nodes) × 2-3 minutes
Example configuration for a 100-node cluster with must-gather enabled:
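Applying the formula above, 100 nodes × 3 minutes/node = 300 minutes. A sketch of the corresponding values (key names are assumptions, as above):

```yaml
# Illustrative sketch -- key names are assumptions; see your chart's values.yaml.
logCollection:
  enabled: true
  enableGpuOperatorMustGather: true
  # 100 nodes x 3 minutes/node = 300 minutes
  timeout: 300m
```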
File Server Configuration
The in-cluster file server stores collected logs:
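A sketch of file-server settings, using the options listed below (key names and defaults shown here are illustrative assumptions, except the 7-day retention default stated in this document):

```yaml
# Illustrative sketch -- key names are assumptions.
fileServer:
  enabled: true
  storageSize: 50Gi     # persistent volume size; adjust for expected log volume
  logRetentionDays: 7   # cleanup removes logs older than this (default: 7 days)
```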
Configuration Options
- Enable/Disable: Turn log collection on or off per deployment
- Storage Size: Configure persistent volume size based on expected log volume
- Log Retention: Automatically clean up old logs after configurable retention period
- Timeout: Set maximum time for log collection job (increase when enabling must-gather)
- Enable Must-Gather: Enable GPU Operator must-gather collection (disabled by default due to performance impact on large clusters)
- Cloud-Specific SOS: Enable additional cloud provider diagnostics
Key Features
Automatic Collection
Logs are gathered automatically when a supported remediation action is triggered - no manual intervention required.
Privileged Access
Collection job runs with necessary privileges to access all diagnostic sources on the node.
Organized Storage
Logs are organized by node name and timestamp in a browsable directory structure:
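An illustrative layout (node names and timestamps are placeholders):

```text
/logs/
└── gpu-node-01/
    └── 2024-05-01T12-30-00Z/
        ├── nvidia-bug-report.log.gz
        └── must-gather/        # present only if must-gather is enabled
```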
Web Access
Simple port-forward to browse and download logs via web browser:
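A typical invocation might look like the following; the namespace and service name are assumptions, so substitute the ones from your deployment:

```shell
# Forward local port 8080 to the in-cluster file server.
# Namespace and service name are illustrative assumptions.
kubectl port-forward -n nvsentinel svc/nvsentinel-file-server 8080:80
```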
Then access http://localhost:8080 in your browser.
Automatic Cleanup
Built-in log rotation removes old logs based on retention policy to manage disk space.
Job Lifecycle Management
Collection jobs are automatically cleaned up after completion with configurable TTL (default: 1 hour).
Storage and Retention
Storage Architecture
Logs are stored in a persistent volume attached to an NGINX-based file server running in the cluster. The file server is accessible only within the cluster or via port-forward.
Log Retention
Automatic cleanup service runs periodically (default: daily) to remove logs older than the configured retention period (default: 7 days). This prevents disk space issues from accumulating logs.