Integrate NetQ with Grafana

The NetQ integration with Grafana allows you to create customized dashboards and visualize metrics across your network devices. To view data in Grafana, first configure security between NetQ and its OpenTelemetry (OTel) clients, then configure OTel on the devices in your network, and finally configure the data sources in Grafana.

The Grafana integration is in beta and supported for on-premises deployments only.

Requirements and Support

  • Switches must have a Spectrum-2 or later ASIC. The number of supported switches varies based on the deployment model and reflects an environment where each switch is configured with OpenTelemetry and running the NetQ agent.
    • Standalone: 5 switches
    • Cluster: 50 switches
    • 3-node scale cluster: 500 switches
    • 5-node scale cluster: 1,000 switches
  • For switches, you must enable OpenTelemetry to collect and export each metric that you want to monitor, as described in the Cumulus Linux documentation.
    • NetQ does not support OpenTelemetry collection from switches with buffer statistics enabled.
  • DPUs and ConnectX hosts must be running DOCA Telemetry Service (DTS) version 1.18 or later.
  • Before you get started with the steps below, install Grafana and start the Grafana server.
  • NetQ allows you to retrieve data from up to seven days in the past.
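
The DTS version requirement above can be checked with a small version comparison. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes you have already read the DTS version string from your host by whatever means your deployment provides.

```shell
# Succeed (exit status 0) when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils and BusyBox.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Hypothetical helper: pass in the DTS version string you read from the host.
check_dts_version() {
    if version_ge "$1" "1.18"; then
        echo "DTS $1 meets the 1.18 minimum"
    else
        echo "DTS $1 is older than 1.18; upgrade before continuing"
        return 1
    fi
}
```

For example, check_dts_version 1.18.2 reports that the minimum is met, while check_dts_version 1.17.0 returns a non-zero status.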

Secure OpenTelemetry Export

NetQ is configured with OTLP secure mode with TLS by default and expects clients to secure data with a certificate. You can configure NetQ and your client devices to use your own generated CA certificate, NetQ’s self-signed certificate, or set the connections to insecure mode.

TLS with a CA Certificate (Switches)

NVIDIA recommends using your own generated CA certificate. To configure a CA certificate:

  1. Copy your certificate files to the NetQ server in the /mnt/admin directory. For example, copy the certificate and key to /mnt/admin/certs/server.crt and /mnt/admin/certs/server.key.

  2. Import your certificate on your switches using the nv action import system security ca-certificate <cert-id> [data <data> | uri <path>] command. Define the name of the certificate in <cert-id> and either provide the raw PEM string of the certificate as <data> or provide a path to the certificate file containing the public key as <path>.

  3. After importing your certificate, set OTLP insecure mode to disabled on your switches:

    nvidia@switch:~$ nv set system telemetry export otlp grpc insecure disabled
    nvidia@switch:~$ nv config apply
    

TLS with a CA Certificate (DPUs and NICs)

NVIDIA recommends using your own generated CA certificate. To configure a CA certificate:

  1. Copy your certificate files to the NetQ server in the /mnt/admin directory. For example, copy the certificate and key to /mnt/admin/certs/server.crt and /mnt/admin/certs/server.key.

  2. Copy your certificate to your DPU or NIC in the /opt/mellanox/doca/services/telemetry/config/certs/ directory.

  3. Change permissions on the certificate with the chmod 644 /opt/mellanox/doca/services/telemetry/config/certs/ca.pem command to make the certificate readable to all users.

  4. Configure OpenTelemetry on your DPU or NIC and include an additional line referencing the certificate in /opt/mellanox/doca/services/telemetry/config/dts_config.ini:

open-telemetry-ca-file=/config/certs/ca.pem
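
The DPU and NIC steps above (copy the certificate, make it world-readable, reference it in dts_config.ini) can be scripted. The following is a minimal sketch under stated assumptions: install_dts_ca_cert is a hypothetical helper name, the certificate is stored as ca.pem to match the example path, and the reference line is appended to the end of the file, so move it next to your other open-telemetry settings if your file layout requires that.

```shell
# install_dts_ca_cert SRC_CERT CONFIG_DIR
#   SRC_CERT   : path to your CA certificate (PEM)
#   CONFIG_DIR : DTS config directory, typically
#                /opt/mellanox/doca/services/telemetry/config
install_dts_ca_cert() {
    src=$1
    config_dir=$2
    cert_dir="$config_dir/certs"
    ini="$config_dir/dts_config.ini"
    line='open-telemetry-ca-file=/config/certs/ca.pem'

    mkdir -p "$cert_dir" || return 1
    cp "$src" "$cert_dir/ca.pem" || return 1
    # 644 makes the certificate readable to all users.
    chmod 644 "$cert_dir/ca.pem" || return 1

    # Append the reference only if it is not already present (idempotent).
    if ! grep -qxF "$line" "$ini" 2>/dev/null; then
        printf '%s\n' "$line" >> "$ini"
    fi
}
```

On a DPU you might call it as install_dts_ca_cert ./my-ca.pem /opt/mellanox/doca/services/telemetry/config. Running it twice does not duplicate the configuration line.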

TLS with NetQ’s Self-signed Certificate (Switches)

To run on the switch in secure mode with NetQ’s self-signed certificate:

  1. From the NetQ server, display the certificate using the netq show otlp tls-ca-cert dump command. Copy the certificate from the output.

  2. On the switch, import the certificate with the nv action import system security ca-certificate <cert-id> data <data> command. Define the name of the certificate in <cert-id> and replace <data> with the certificate data you copied in the preceding step.

  3. Configure the certificate to secure the OTel connection. Replace ca-certificate with the name of your certificate; this is the <cert-id> from the previous step.

    nvidia@switch:~$ nv set system telemetry export otlp grpc cert-id <ca-certificate>
    nvidia@switch:~$ nv config apply
    
  4. Next, disable insecure mode and apply the change:

    nvidia@switch:~$ nv set system telemetry export otlp grpc insecure disabled
    nvidia@switch:~$ nv config apply
    
  5. Run nv show system telemetry health to display the destination port and IP address, along with connectivity status.
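
When you repeat the switch steps above across many devices, it can help to generate the command sequence once and review it before running it. The following is a minimal sketch: switch_otlp_tls_commands is a hypothetical helper that only prints the NVUE commands shown in this section, and it assumes the certificate is already imported under the given cert-id. It stages both settings before a single nv config apply, which is a design choice of the sketch rather than a requirement.

```shell
# Print the NVUE commands that secure OTLP export with an imported certificate.
#   $1: the <cert-id> you chose when importing the certificate
switch_otlp_tls_commands() {
    cert_id=$1
    if [ -z "$cert_id" ]; then
        echo "usage: switch_otlp_tls_commands <cert-id>" >&2
        return 1
    fi
    printf 'nv set system telemetry export otlp grpc cert-id %s\n' "$cert_id"
    printf 'nv set system telemetry export otlp grpc insecure disabled\n'
    printf 'nv config apply\n'
    # The health command reports the destination and connectivity status.
    printf 'nv show system telemetry health\n'
}
```

For example, switch_otlp_tls_commands netq-ca prints four commands that you can paste into a switch session.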

TLS with NetQ’s Self-signed Certificate (DPUs and NICs)

To run on a DPU or NIC in secure mode with NetQ’s self-signed certificate:

  1. From the NetQ server, display the certificate using the netq show otlp tls-ca-cert dump command. Copy the certificate from the output.

  2. Copy the certificate content from step 1 to a file on your DPU or NIC in the /opt/mellanox/doca/services/telemetry/config/certs/ directory. For example, copy the output content into /opt/mellanox/doca/services/telemetry/config/certs/ca.pem.

  3. Change permissions on the certificate with the chmod 644 /opt/mellanox/doca/services/telemetry/config/certs/ca.pem command to make the certificate readable to all users.

  4. Configure OpenTelemetry on your DPU or NIC and include an additional line referencing the certificate in /opt/mellanox/doca/services/telemetry/config/dts_config.ini:

open-telemetry-ca-file=/config/certs/ca.pem
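
Because the self-signed certificate is copied by hand from CLI output, a truncated paste is an easy mistake to make. The following is a minimal sketch of a sanity check: looks_like_pem_cert is a hypothetical helper that only confirms the file has matching PEM begin and end markers; it does not validate the certificate itself.

```shell
# Succeed when FILE contains at least one certificate and every
# BEGIN CERTIFICATE marker has a matching END CERTIFICATE marker.
looks_like_pem_cert() {
    file=$1
    begin=$(grep -c -- '-----BEGIN CERTIFICATE-----' "$file" 2>/dev/null || true)
    end=$(grep -c -- '-----END CERTIFICATE-----' "$file" 2>/dev/null || true)
    [ "${begin:-0}" -ge 1 ] && [ "${begin:-0}" -eq "${end:-0}" ]
}
```

You might run looks_like_pem_cert /opt/mellanox/doca/services/telemetry/config/certs/ca.pem before restarting DTS. For a stricter check, openssl x509 -in <file> -noout -subject parses the certificate, if OpenSSL is installed on the host.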

Insecure Mode (Switches)

To use insecure mode and disable TLS:

  1. On your NetQ server, run the netq set otlp security-mode insecure command.

  2. On your switches, configure insecure mode:

    nvidia@switch:~$ nv set system telemetry export otlp grpc insecure enabled
    nvidia@switch:~$ nv config apply
    

Insecure Mode (DPUs and NICs)

To use insecure mode and disable TLS:

  1. On your NetQ server, run the netq set otlp security-mode insecure command.

  2. On your DPU or NIC, configure OpenTelemetry but do not include an open-telemetry-ca-file= line in the /opt/mellanox/doca/services/telemetry/config/dts_config.ini configuration file.

Configure and Enable OpenTelemetry on Devices

Configure your client devices to send OpenTelemetry data to NetQ.

On switches, enable OpenTelemetry for each metric that you want to monitor, as described in the Cumulus Linux documentation. Use your NetQ server or cluster’s IP address and port 30008 when configuring the OTLP export destination.

NVIDIA recommends setting the sample-interval option to 10 seconds for each metric that allows you to set a sample interval.

  1. Install DOCA Telemetry Service (DTS) on your ConnectX hosts or DPUs.

  2. Configure OpenTelemetry data export by editing the /opt/mellanox/doca/services/telemetry/config/dts_config.ini file. Add the following lines under the IPC transport section. Replace TS-IP with the IP address of your telemetry receiver.

For HTTP transport:

open-telemetry-transport=http
open-telemetry-receiver=http://<TS-IP>:30009/v1/metrics

For gRPC transport:

open-telemetry-transport=grpc
open-telemetry-receiver=<TS-IP>:30008/v1/metrics
  3. To support telemetry at a scale of up to 4K devices on hosts with ConnectX NICs or DPUs in NIC mode, configure counterset and fieldset files to control the telemetry data exported from DTS.

Download gb200.cset and gb200.fset, then copy the files to /opt/mellanox/doca/services/telemetry/config/prometheus_configs/cset/ on your host.

Add the following lines to the /opt/mellanox/doca/services/telemetry/config/dts_config.ini configuration file:

prometheus-fset-indexes=device_name,device_id,pod_name,^id$,@id
prometheus-cset-dir=/config/prometheus_configs/cset
prometheus-fset-dir=/config/prometheus_configs/cset
open-telemetry-field-set=gb200
open-telemetry-counter-set=gb200

It can take up to a minute for the device to restart and apply the changes. If you manually edit the fieldset file, you must restart DTS for the changes to be reflected.

Read more about OpenTelemetry and DTS configurations in the DOCA Telemetry Service guide.
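
The two transport variants above differ only in the transport name, the port, and the URL scheme. The following is a minimal sketch that renders the matching dts_config.ini lines for a given receiver address: dts_otel_lines is a hypothetical helper, and the ports and paths are taken directly from the examples in this section.

```shell
# Print the OpenTelemetry export lines for dts_config.ini.
#   $1: transport, either http or grpc
#   $2: IP address of the telemetry receiver (your NetQ server or cluster)
dts_otel_lines() {
    transport=$1
    ip=$2
    case "$transport" in
        http)
            printf 'open-telemetry-transport=http\n'
            printf 'open-telemetry-receiver=http://%s:30009/v1/metrics\n' "$ip"
            ;;
        grpc)
            printf 'open-telemetry-transport=grpc\n'
            printf 'open-telemetry-receiver=%s:30008/v1/metrics\n' "$ip"
            ;;
        *)
            echo "unknown transport: $transport (expected http or grpc)" >&2
            return 1
            ;;
    esac
}
```

For example, dts_otel_lines grpc 192.0.2.10 prints the two gRPC lines with the receiver set to 192.0.2.10:30008/v1/metrics. Paste the output under the IPC transport section of the configuration file.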

Configure an External TSDB

OpenTelemetry data is stored in the NetQ TSDB. In addition to NetQ’s local storage, you can configure NetQ to also send the collected data to your own external TSDB:

  1. If the connection to your external TSDB is secured with TLS, copy the certificate to the NetQ server in the /mnt/admin/ directory, and reference the full path to the file with the netq set otlp endpoint-ca-cert tsdb-name <text-tsdb-endpoint> ca-cert <text-path-to-ca-crt> command.

  2. From the NetQ server, add the OTel endpoint of your time-series database (TSDB). Replace text-tsdb-endpoint and text-tsdb-endpoint-url with the name and IP address of your TSDB, respectively. Include the export true option to begin exporting data immediately. Set security-mode to tls if you configured a certificate to secure the connection, otherwise use security-mode insecure.

nvidia@netq-server:~$ netq add otlp endpoint tsdb-name <text-tsdb-endpoint> tsdb-url <text-tsdb-endpoint-url> [export true | export false] [security-mode <text-mode>]
  3. If you set the export option to true in the previous step, the TSDB will begin receiving the time-series data for the metrics that you configured on the switch. Use the netq show otlp endpoint command to view the TSDB endpoint configuration.
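
The choice between tls and insecure in the steps above follows from whether you supplied a CA certificate. The following is a minimal sketch that prints the resulting command sequence: tsdb_endpoint_commands is a hypothetical helper, it always uses export true, and the endpoint name and URL in the usage example are placeholders rather than values from a real deployment.

```shell
# Print the NetQ commands for registering an external TSDB.
#   $1: TSDB endpoint name
#   $2: TSDB endpoint URL
#   $3: (optional) full path to the CA certificate under /mnt/admin/
tsdb_endpoint_commands() {
    name=$1
    url=$2
    ca_cert=${3:-}
    if [ -n "$ca_cert" ]; then
        # A certificate implies a TLS-secured connection to the TSDB.
        printf 'netq set otlp endpoint-ca-cert tsdb-name %s ca-cert %s\n' "$name" "$ca_cert"
        mode=tls
    else
        mode=insecure
    fi
    printf 'netq add otlp endpoint tsdb-name %s tsdb-url %s export true security-mode %s\n' "$name" "$url" "$mode"
    printf 'netq show otlp endpoint\n'
}
```

For example, tsdb_endpoint_commands my-tsdb 192.0.2.50 /mnt/admin/tsdb-ca.crt prints three commands with security-mode tls, while omitting the third argument prints two commands with security-mode insecure.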

Collect Slurm Telemetry

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler used in high-performance computing (HPC) environments. It manages and allocates compute resources, schedules jobs, and distributes workloads across a cluster.

To view and filter Slurm jobs in Grafana, you must have an NVIDIA Base Command Manager deployment running BCM v11 or later.

  1. Authenticate into BCM using either basic authentication (username and password) or certificate-based authentication.

For basic authentication, two versions of the command exist. Specify either the Base Command Manager IP address in ip-address or the domain name in hostname. Replace port-text with the port that BCM uses. You can run this command from any node in your cluster.

nvidia@netq-server:~$ netq add bcm auth-type basic user <username> pass <password> ip <ip-address> port <port-text>  
nvidia@netq-server:~$ netq add bcm auth-type basic user <username> pass <password> hostname <hostname> port <port-text> 

For example:

nvidia@netq-server:~$ netq add bcm auth-type basic user admin pass secretpass123 ip 192.168.1.100 port 8082

For certificate-based authentication, you must run the command from the node that hosts the netq-bcm-gateway pod. To identify this node, run the kubectl get pods -o wide | grep netq-bcm-gateway command. The output of this command displays the correct node.

Two versions of this command exist. Specify either the Base Command Manager IP address in ip-address or the domain name in hostname. Replace port-text with the port that BCM uses. Specify the full path to both the certificate and key files. These are typically located in the /mnt/bcm/ directory.

nvidia@netq-server:~$ netq add bcm auth-type cert cert-file <certificate-path> key-file <key-path> ip <ip-address> port <port-text> 
nvidia@netq-server:~$ netq add bcm auth-type cert cert-file <certificate-path> key-file <key-path> hostname <hostname> port <port-text> 

For example:

nvidia@netq-server:~$ netq add bcm auth-type cert cert-file /mnt/bcm/bcm.crt key-file /mnt/bcm/bcm.key ip 192.168.1.100 port 8082 
  2. Verify that your credentials are correct and check for BCM version compatibility:
nvidia@netq-server:~$ netq show bcm auth-status

You will configure the Slurm data source in the next section using the slurm-nodes-and-jobs-dashboard JSON file.

slurm-nodes-and-jobs-dashboard.json
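
For certificate-based authentication, you need the node that hosts the netq-bcm-gateway pod. The following is a minimal sketch that extracts just the node name instead of reading it from the wide output: bcm_gateway_node is a hypothetical helper, and it expects two-column input (pod name, node name) such as kubectl produces with the custom-columns output format.

```shell
# Read "POD_NAME NODE_NAME" pairs on stdin and print the node of the
# first pod whose name starts with netq-bcm-gateway.
bcm_gateway_node() {
    awk '$1 ~ /^netq-bcm-gateway/ { print $2; exit }'
}
```

To use it, pipe in the pod list: kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName --no-headers | bcm_gateway_node. An empty result means that no matching pod was found in the current namespace.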

Configure Data Sources in Grafana

  1. Generate and copy an authentication token using the NetQ CLI. You can adjust the time at which the token expires with the expiry option. For example, the following command generates a token that expires after 40 days. If you do not set an expiry option, the token expires after 5 days. The maximum expiry is 180 days.
nvidia@netq-server:~$ netq show vm-token expiry 40
  2. Navigate to your Grafana dashboard. From the menu, select Connections and then Data sources. Select Add new data source and add the Prometheus TSDB.
  3. Continue through the steps to configure the data source:
    • In the Name field, enter the name of the data source. The name must start with the data source name and be written in lowercase (for example, slurm_dashboard or kpi-dashboard).
    • In the Connection field, enter the IP address of your NetQ server followed by /api/netq/vm/, for example https://10.255.255.255/api/netq/vm/. In a cluster deployment, enter the virtual IP address in this field (followed by /api/netq/vm/).
    • In the Authentication section, select Forward OAuth identity from the dropdown menu.
      • In TLS settings, select Skip TLS certificate validation.
      • In the HTTP headers section, select Add header. In the Header field, enter Authorization. In the Value field, enter the token that you generated in step 1 of this section.
  4. Select Save & test. If the operation was successful, you will begin to see metrics in your Grafana dashboard.
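
If you would rather script the data source than click through the UI, Grafana's HTTP API accepts a JSON definition of a data source. The following is a minimal sketch that only builds that JSON to mirror the UI settings above. It rests on several assumptions: grafana_datasource_payload is a hypothetical helper, and the field names (oauthPassThru, tlsSkipVerify, httpHeaderName1, httpHeaderValue1) follow Grafana's data source provisioning conventions rather than anything in the NetQ documentation, so verify them against your Grafana version. The token is inserted as given and is assumed to contain no characters that need JSON escaping.

```shell
# Print a JSON data source definition matching the UI steps above.
#   $1: data source name (lowercase, for example kpi-dashboard)
#   $2: NetQ server IP address, or the virtual IP in a cluster deployment
#   $3: token generated with the NetQ CLI
grafana_datasource_payload() {
    name=$1
    netq_ip=$2
    token=$3
    printf '{"name":"%s","type":"prometheus","access":"proxy",' "$name"
    printf '"url":"https://%s/api/netq/vm/",' "$netq_ip"
    # Assumed field names; they correspond to Forward OAuth identity,
    # Skip TLS certificate validation, and the Authorization header.
    printf '"jsonData":{"oauthPassThru":true,"tlsSkipVerify":true,'
    printf '"httpHeaderName1":"Authorization"},'
    printf '"secureJsonData":{"httpHeaderValue1":"%s"}}\n' "$token"
}
```

The output could then be posted to the /api/datasources endpoint of your Grafana server with curl, authenticated with a Grafana service account token. That step is left out here because it depends on how your Grafana instance is secured.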

Import a Dashboard Template

To import a preconfigured dashboard into your Grafana instance:

  1. From the side menu, select Dashboards.

  2. Click New and select Import from the drop-down menu.

  3. Paste the dashboard JSON text into the text area.

  4. (Optional) Change the dashboard name, folder, or UID.

  5. Click Import.

    If the dashboard does not display data, refresh your browser.

Grafana Best Practices

If data retrieval with Grafana is slow, you might need to adjust your dashboard settings. Fabric-wide queries on large networks (over 1,000 switches) can generate millions of data points, which can significantly degrade performance. You can improve performance by optimizing queries, reducing data volume, and simplifying panel rendering.

Avoid plotting all time-series data at once. To visualize the data in different ways:

If Grafana displays "No Data", verify that all VMs in your cluster are operational. You can check the node status using the kubectl get nodes command. A node will show as NotReady if it is down. When the VM is restored, data collection will resume and will be displayed within 20 minutes of restoration.
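
The node check above can be narrowed to only the nodes that need attention. The following is a minimal sketch: not_ready_nodes is a hypothetical helper that filters the standard kubectl get nodes output, where the first column is the node name and the second is its status.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the
# name of every node whose STATUS column contains NotReady.
not_ready_nodes() {
    awk '$2 ~ /NotReady/ { print $1 }'
}
```

To use it, run kubectl get nodes --no-headers | not_ready_nodes. No output means that every node reports Ready.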

Retrieve Metrics with the NetQ API

If you want to view or export the time-series database data without using Grafana, you can use curl commands to directly query the NetQ TSDB. These commands typically complete in fewer than 30 seconds, whereas Grafana can take longer to process and display data.

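  1. Generate an access token. Replace <username> and <password> with your NetQ username and password. The examples in this section use 10.237.212.61 as the NetQ server address; replace it with the address of your own server. Copy the access token generated by this command. You will use it in the next step.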
curl 'https://10.237.212.61/api/netq/auth/v1/login' -H 'content-type: application/json' --data-raw '{"username":"<username>","password":"<password>"}' --insecure 
  2. Generate a JSON Web Token (JWT). Replace <access_token> with the token generated from the previous step. Copy the resulting token generated by this command. You will use it in the next step.
curl -k -X GET "https://10.237.212.61/api/netq/auth/v1/vm-access-token?expiryDays=10" -H "Authorization: Bearer <access_token>" 
  3. Fetch a complete list of metrics. Replace <vm-jwt> with the token generated from the previous step. You can use this list to create queries based on metrics you’re interested in.
export token=<vm-jwt> 
curl -k "https://10.237.212.61/api/netq/vm/api/v1/label/__name__/values" -H "Authorization: Bearer $token" 
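
The three requests above share the same server address, so it can be convenient to build the URLs from a single variable. The following is a minimal sketch: the function names are hypothetical, the URL paths are copied from the commands in this section, and fetch_metric_names simply wraps the final curl call with the same -k option used above.

```shell
# Build the NetQ API URLs used in this section from a server address.
netq_login_url() {
    printf 'https://%s/api/netq/auth/v1/login\n' "$1"
}

#   $1: server address, $2: token lifetime in days
netq_vm_token_url() {
    printf 'https://%s/api/netq/auth/v1/vm-access-token?expiryDays=%s\n' "$1" "$2"
}

netq_metric_names_url() {
    printf 'https://%s/api/netq/vm/api/v1/label/__name__/values\n' "$1"
}

# Fetch the list of metric names.
#   $1: server address, $2: the JWT from the previous step
fetch_metric_names() {
    curl -k "$(netq_metric_names_url "$1")" -H "Authorization: Bearer $2"
}
```

For example, after setting token to your JWT, fetch_metric_names 10.237.212.61 "$token" issues the same request as the last command above.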
Example queries

Additional Commands