NVIDIA Network Operator v26.1.0

SOS-Report Collection Script

The Network Operator SOS-report script collects comprehensive diagnostic data from a Kubernetes cluster running the NVIDIA Network Operator. It gathers all relevant configuration, logs, status information, and diagnostic output into a single archive, making it easier to troubleshoot issues and share context with support teams.

The script is fully backward compatible and is designed to work with any version of the Network Operator, including all previous releases. Components or resources that are not present in a given release are gracefully skipped without errors.

The script and full README are available on GitHub at scripts/sosreport.

Copy kubectl-netop_sosreport, generate-report.py, and report-template.html to a directory in your PATH:

Copy
Copied!
            

# Install system-wide sudo cp kubectl-netop_sosreport generate-report.py report-template.html /usr/local/bin/ # Or install for the current user only mkdir -p ~/.local/bin cp kubectl-netop_sosreport generate-report.py report-template.html ~/.local/bin/ export PATH="$HOME/.local/bin:$PATH"

Once installed, the script is available as a kubectl subcommand:

Copy
Copied!
            

kubectl netop-sosreport [OPTIONS]

Note

generate-report.py and report-template.html must be in the same directory as kubectl-netop_sosreport for HTML report generation. If these files are not present locally, the script will attempt to download them from GitHub. If download fails, the collection still works but the HTML report is skipped.

As a Standalone Script

Run the script directly from the repository:

Copy
Copied!
            

./kubectl-netop_sosreport [OPTIONS] # Or use the backward-compatible symlink ./network-operator-sosreport.sh [OPTIONS]

  • kubectl binary installed and in PATH

  • Valid kubeconfig with cluster access

  • Permissions to read cluster resources (cluster-admin recommended)

  • Bash 4.0 or later

  • Python 3.6+ (for HTML report generation)

  • Standard Unix utilities (tar, gzip, sha256sum)

Basic Usage

Copy
Copied!
            

# Run with auto-detection (recommended) ./network-operator-sosreport.sh # Specify kubeconfig explicitly ./network-operator-sosreport.sh --kubeconfig /path/to/kubeconfig # Specify operator namespace ./network-operator-sosreport.sh --namespace nvidia-network-operator

The script automatically detects the Network Operator namespace and the cluster platform (Kubernetes or OpenShift).

Command-Line Options

Option

Description

--kubeconfig PATH Path to kubeconfig file. Default: $KUBECONFIG or ~/.kube/config.
--namespace NAMESPACE Network Operator namespace. Default: auto-detect.
--output-dir PATH Output directory. Default: ./network-operator-sosreport-<timestamp>.
--no-compress Do not create a tarball; leave output as a directory.
--log-lines N Number of log lines to collect per pod. Default: 5000.
--skip-diagnostics Skip running diagnostic commands in OFED pods (lsmod, ibstat, ibv_devinfo, mst status, dmesg, ip commands). Use this for faster collection when driver-level diagnostics are not needed.
--skip-report Skip HTML report generation.
--kubectl-path PATH Path to the kubectl binary. Default: kubectl from PATH.
--verbose Enable verbose output during collection.
--help Show the help message.

Custom Resources

All custom resources managed by the Network Operator are collected, including their definitions and instances. For a full reference of available CRDs, see Customization Options and CRDs.

Operator Resources

  • Deployment, Pods, ConfigMaps

  • Secrets (metadata only, no secret data)

  • RBAC resources (ServiceAccounts, Roles, RoleBindings)

  • Events in the operator namespace

  • Webhook configurations (validating and mutating)

Components

The script collects data from all Network Operator components. Components that are not deployed in the cluster are automatically skipped. For the full list of components, see Network Operator Component Matrix.

For each component, the script collects:

  • DaemonSet or Deployment specifications

  • All pod details and status

  • Current and previous container logs (if the container has restarted)

  • Related ConfigMaps and Services

Node Information

  • All node details with labels and annotations

  • Node conditions and status

  • Allocatable resources (RDMA, SR-IOV, GPU)

  • Node-specific feature discovery labels

Diagnostic Commands

The following commands are executed inside OFED driver pods on each node:

  • lsmod | grep mlx — loaded Mellanox kernel modules

  • ibstat — InfiniBand device status

  • ibv_devinfo — RDMA device information

  • mst status — Mellanox Software Tools status

  • uname -r — kernel version

  • dmesg (last 200 lines) — recent kernel messages

  • ip link / ip addr — network interface information

Note

Use --skip-diagnostics to skip these commands for faster collection when driver-level diagnostics are not needed.

The collection script automatically generates a self-contained HTML report (report.html) that provides an interactive, navigable view of all collected data. The report is included in the output archive alongside the raw files.

sos-report-html.png

The report includes the following sections:

  • Executive dashboard with overall NicClusterPolicy status, node count, pod health, and error summary

  • NicClusterPolicy applied states with color-coded status badges

  • Component health grid showing all components with desired/ready replicas, pod counts, and restart counts

  • Per-component detail panels with workload YAML, pod status, and log viewers with error/warning highlighting

  • OFED diagnostics per node

  • Node overview with summary table, resource allocation, and labels

  • Events timeline with warning highlighting

  • CRD inventory with definitions and instances

  • RBAC overview, network configuration, and webhook configuration

  • Collection errors and warnings

The report can also be generated standalone from an existing sosreport directory:

Copy
Copied!
            

python3 generate-report.py ./network-operator-sosreport-20260218-143000/ --template report-template.html # Custom output path python3 generate-report.py ./network-operator-sosreport-20260218-143000/ --template report-template.html --output /tmp/report.html

To skip report generation during collection, use the --skip-report flag.

The script creates a timestamped directory with the following structure:

Copy
Copied!
            

network-operator-sosreport-<timestamp>/ ├── metadata/ │ ├── collection-info.txt │ ├── cluster-version.yaml │ ├── namespaces.txt │ └── api-resources.txt ├── crds/ │ ├── definitions/ │ └── instances/ ├── operator/ │ ├── namespace.yaml │ ├── configmaps.yaml │ ├── secrets-metadata.txt │ ├── rbac/ │ ├── events.yaml │ ├── validatingwebhookconfigurations.yaml │ ├── mutatingwebhookconfigurations.yaml │ └── components/ │ ├── network-operator/ │ ├── ofed-driver/ │ │ ├── daemonset.yaml │ │ ├── pods/ │ │ └── diagnostics/ │ └── ... ├── nodes/ │ ├── all-nodes.yaml │ ├── nodes-summary.txt │ ├── node-labels.txt │ └── node-resources.txt ├── network/ │ └── services.yaml ├── related-operators/ ├── diagnostic-summary.txt ├── report.html └── collection-errors.log

By default, the output is compressed into a tarball with a SHA256 checksum:

  • network-operator-sosreport-<timestamp>.tar.gz

  • network-operator-sosreport-<timestamp>.tar.gz.sha256

Code

Meaning

0 Success — all data collected.
1 Critical error — kubectl not found or no cluster access.
2 Partial success — some resources failed to collect.
  • Secrets: only metadata (names and types) is collected. Secret data is never included.

  • Logs: may contain IP addresses, hostnames, and other environment-specific information.

  • Review: always review the collected data before sharing it externally.

Troubleshooting Pod Failures

Copy
Copied!
            

# Collect full diagnostics with verbose output ./network-operator-sosreport.sh --verbose # Extract and check the diagnostic summary tar -xzf network-operator-sosreport-*.tar.gz cat network-operator-sosreport-*/diagnostic-summary.txt # Look at specific component logs cat network-operator-sosreport-*/operator/components/ofed-driver/pods/*.log

Quick Health Check

Copy
Copied!
            

# Fast collection without driver diagnostics ./network-operator-sosreport.sh --skip-diagnostics --log-lines 1000 # Extract and check summary tar -xzf network-operator-sosreport-*.tar.gz less network-operator-sosreport-*/diagnostic-summary.txt

Preparing for a Support Case

Copy
Copied!
            

# Comprehensive collection with verbose output ./network-operator-sosreport.sh --verbose --log-lines 10000 # Verify the archive integrity sha256sum -c network-operator-sosreport-*.tar.gz.sha256 # The archive is ready to attach to a support case

Previous Troubleshooting
Next Legal Notices and 3rd Party Licenses
© Copyright 2025-2026, NVIDIA. Last updated on Feb 26, 2026