Prometheus Monitoring
Overview
dps-server exports operational and business metrics in Prometheus text format on the HTTP service. This page covers two deployment perspectives:
- Prometheus is deployed in the same Kubernetes cluster as DPS – enable the chart-managed ServiceMonitor (recommended).
- Prometheus is external to the cluster – expose the metrics endpoint and configure Prometheus to scrape it from outside.
Scope: This document only covers how to make DPS metrics discoverable and scrapable. It does not cover Prometheus / Prometheus Operator installation, alerting, or dashboarding. For the list of metrics emitted by dps-server, see the /metrics endpoint or the metrics reference (when available).
Metrics Endpoint Reference
| Property | Value |
|---|---|
| Pod label / service selector | app: dps-server |
| Service name | <release>-server-http (e.g. dps-server-http for a release named dps) |
| Service port name | dps-http |
| Service port number | 8000 (configurable via dps.service.http.port) |
| Container port | 8000 (configurable via dps.containerPorts.http.port) |
| Path | /metrics |
| Scheme | https (default; depends on dps.serverTLS.transportSecurity) |
| Authentication | None on /metrics itself; TLS only |
Prerequisites
| Scenario | Requirements |
|---|---|
| In-cluster Prometheus | The Prometheus Operator (e.g. kube-prometheus-stack) must be installed in the cluster, and the monitoring.coreos.com/v1 ServiceMonitor CRD must be available. The Prometheus instance must be configured to discover ServiceMonitor resources in the DPS namespace. |
| External Prometheus | The metrics endpoint must be reachable from the Prometheus host (Ingress, LoadBalancer, NodePort, VPN, etc.). Prometheus must be able to validate the TLS certificate, or be configured with insecure_skip_verify: true. |
Scenario 1 – Prometheus in the Same Cluster
Enable the Chart-Managed ServiceMonitor
The DPS Helm chart ships a ServiceMonitor template that is rendered when dps.serviceMonitor.enabled is set. By default it is disabled.
Add the following to your values.yaml:
```yaml
dps:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    additionalLabels:
      release: kube-prometheus-stack
```

Then install or upgrade the chart:
```bash
helm upgrade --install dps ngc/dps \
  --namespace dps \
  --values values.yaml
```

Note: Many Prometheus Operator installations require a label such as release: <prometheus-release-name> so the Prometheus instance picks up the ServiceMonitor. Check your Prometheus's serviceMonitorSelector to determine which labels to add via dps.serviceMonitor.additionalLabels.
Deploy the ServiceMonitor in a Different Namespace
If your Prometheus Operator is configured to only watch ServiceMonitor objects in a specific namespace (commonly the same namespace as Prometheus itself), deploy the ServiceMonitor there while leaving DPS in its own namespace:
```yaml
dps:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    additionalLabels:
      release: kube-prometheus-stack
```

When serviceMonitor.namespace differs from the DPS release namespace, the chart automatically adds a namespaceSelector pointing back at the release namespace so Prometheus knows where to scrape.
To override the namespace selector explicitly:
```yaml
dps:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    namespaceSelector:
      matchNames:
        - dps
        - dps-prod
```

Configuration Reference
All ServiceMonitor settings live under dps.serviceMonitor in the chart’s values.yaml. Always refer to the chart README and values.yaml for the authoritative, version-specific list of keys, defaults, and inline documentation – the keys highlighted on this page (enabled, namespace, namespaceSelector, additionalLabels, interval, scrapeTimeout, scheme, tlsConfig) are the most commonly tuned ones, but they may not be exhaustive in every release.
To pull the latest values for the version you are deploying:
```bash
helm repo add ngc https://helm.ngc.nvidia.com/nvidia
helm repo update ngc
helm pull ngc/dps --untar
grep -A 30 "^  serviceMonitor:" dps/values.yaml
```

See Configuration Reference in the Deployment Guide for the general procedure.
Verify the ServiceMonitor
```bash
kubectl get servicemonitor -n dps
kubectl describe servicemonitor -n dps dps-server-metrics
```

In the Prometheus UI, navigate to Status -> Targets and confirm a target with job label dps-server-metrics is UP.
You can also scrape the metrics endpoint directly from inside the cluster:
```bash
kubectl run -n dps --rm -it --restart=Never curl --image=curlimages/curl -- \
  curl -sk https://dps-server-http:8000/metrics | head
```

Scenario 2 – Standalone ServiceMonitor (No Helm)
If you cannot or do not want to enable the chart’s ServiceMonitor (for example, monitoring is owned by a different team that manages all ServiceMonitor objects centrally), apply this manifest directly. It targets the Service created by the DPS chart.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dps-server-metrics
  namespace: monitoring
  labels:
    app: dps-server
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: dps-server
  namespaceSelector:
    matchNames:
      - dps
  endpoints:
    - port: dps-http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
```

Adjust metadata.namespace, metadata.labels.release, and spec.namespaceSelector.matchNames for your environment.
Scenario 3 – Prometheus Outside the Cluster
When Prometheus runs outside the DPS cluster (for example, a central observability cluster, a bare-metal Prometheus, or a managed Prometheus service), the chart’s ServiceMonitor cannot be used. You must:
- Expose the dps-server-http Service outside the cluster.
- Configure the external Prometheus to scrape the resulting endpoint.
Step 1: Expose the Metrics Endpoint
Note: The chart's built-in dps.ingress and dps.gateway resources only route to the gRPC service (<release>-server-grpc). They do not expose the HTTP service or /metrics. The HTTP service is ClusterIP only by default, so /metrics is not reachable from outside the cluster unless you add one of the resources below.
Choose one of the following based on your network:
Option A – NodePort
Apply a separate Service of type NodePort so that the existing ClusterIP service used inside the cluster is unaffected:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dps-server-metrics-external
  namespace: dps
  labels:
    app: dps-server
spec:
  type: NodePort
  selector:
    app: dps-server
  ports:
    - name: dps-http
      port: 8000
      targetPort: 8000
      nodePort: 30800
      protocol: TCP
```

Prometheus will scrape https://<any-node-ip>:30800/metrics.
Option B – LoadBalancer
If your cluster supports cloud LoadBalancer services or MetalLB:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dps-server-metrics-external
  namespace: dps
  labels:
    app: dps-server
spec:
  type: LoadBalancer
  selector:
    app: dps-server
  ports:
    - name: dps-http
      port: 8000
      targetPort: 8000
      protocol: TCP
```

Prometheus will scrape https://<loadbalancer-ip>:8000/metrics.
Option C – Ingress
If you already terminate TLS at an ingress controller, you can expose /metrics via an Ingress. Restrict access with the controller’s authentication / IP allow-list features so the endpoint is not publicly exposed.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dps-server-metrics
  namespace: dps
  annotations:
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - dps-metrics.example.com
      secretName: dps-metrics-tls
  rules:
    - host: dps-metrics.example.com
      http:
        paths:
          - path: /metrics
            pathType: Prefix
            backend:
              service:
                name: dps-server-http
                port:
                  name: dps-http
```

Prometheus will scrape https://dps-metrics.example.com/metrics.
Step 2: Configure the External Prometheus
Add a static scrape job to your external Prometheus configuration. The exact form depends on whether Prometheus has Kubernetes service discovery available; the static target form below works in all cases.
```yaml
scrape_configs:
  - job_name: dps-server
    metrics_path: /metrics
    scheme: https
    scrape_interval: 30s
    scrape_timeout: 10s
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - dps-metrics.example.com   # Ingress host
          # - 10.0.0.10:30800         # NodePort example
          # - 10.0.0.20:8000          # LoadBalancer example
        labels:
          service: dps-server
          cluster: my-dps-cluster
```

If you operate Prometheus via the Operator and want to manage this with a CRD instead of static config, use a Probe or an additionalScrapeConfigs Secret – see the Prometheus Operator documentation.
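As a sketch of the Operator-managed alternative, the same scrape job can live in a Secret that the Prometheus custom resource references via spec.additionalScrapeConfigs. The Secret name and key below are illustrative, not chart-provided:

```yaml
# Illustrative Secret holding an extra scrape job for the Prometheus Operator.
# The content is a bare list of scrape_config entries (no scrape_configs: key).
apiVersion: v1
kind: Secret
metadata:
  name: dps-additional-scrape-configs   # illustrative name
  namespace: monitoring
stringData:
  dps-scrape.yaml: |
    - job_name: dps-server
      metrics_path: /metrics
      scheme: https
      tls_config:
        insecure_skip_verify: true
      static_configs:
        - targets:
            - dps-metrics.example.com
```

Wire it up by setting spec.additionalScrapeConfigs on the Prometheus resource to the Secret name and key (in kube-prometheus-stack this is exposed as prometheus.prometheusSpec.additionalScrapeConfigsSecret or similar; check your chart version).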
TLS
dps-server serves /metrics over TLS by default. For Prometheus to validate the certificate, the certificate must include a SAN that matches the address Prometheus is scraping. In most cluster-issued certificates this is an in-cluster DNS name (e.g. dps-server-http.dps.svc.cluster.local), which is not reachable from outside.
Recommended approaches, in order of preference:
- Terminate TLS at an Ingress with a publicly trusted certificate (Option C above) and let Prometheus validate normally (insecure_skip_verify: false).
- Issue a certificate with the externally routable SAN by adding it to dps.serverTLS.extraDNSNames, then trust the issuing CA on the Prometheus host:

  ```yaml
  dps:
    serverTLS:
      extraDNSNames:
        - dps-metrics.example.com
  ```

- Use insecure_skip_verify: true for non-production setups only.
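To make the SAN requirement concrete, the following self-contained sketch generates a throwaway certificate carrying both the in-cluster name and the external SAN, then lists its SANs. The filenames and hostnames are illustrative, and the commands assume OpenSSL 1.1.1 or newer; point the same x509 inspection at the certificate Prometheus actually receives when debugging a TLS validation failure:

```shell
# Generate a throwaway self-signed certificate with both SANs (illustrative).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/dps-metrics.key -out /tmp/dps-metrics.crt \
  -subj "/CN=dps-server" \
  -addext "subjectAltName=DNS:dps-server-http.dps.svc.cluster.local,DNS:dps-metrics.example.com"

# List the SANs; the hostname Prometheus scrapes must appear here.
openssl x509 -in /tmp/dps-metrics.crt -noout -ext subjectAltName
```

To inspect a live endpoint instead, feed the certificate from openssl s_client (e.g. `openssl s_client -connect dps-metrics.example.com:443 </dev/null`) into the same `openssl x509` inspection.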
Authentication
/metrics is unauthenticated at the application layer. Anyone with network access to the endpoint can read the metrics. When exposing the endpoint outside the cluster, restrict access via:
- IP allow-list at the ingress controller / load balancer
- Network policy / firewall
- mTLS at the ingress controller (with Prometheus configured to present a client certificate)
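As one sketch of the network-policy option, a Kubernetes NetworkPolicy can restrict ingress to the metrics port. The CIDR below is a placeholder for the Prometheus host's network, and enforcement requires a CNI plugin that implements NetworkPolicy:

```yaml
# Illustrative policy: only allow the Prometheus network to reach port 8000
# on dps-server pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dps-server-metrics-allow
  namespace: dps
spec:
  podSelector:
    matchLabels:
      app: dps-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # placeholder: Prometheus host network
      ports:
        - protocol: TCP
          port: 8000
```

Note that once a NetworkPolicy selects a pod, all other ingress to it is denied by default, so legitimate traffic to other ports (for example the gRPC port) needs its own allow rules.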
Verify
From the Prometheus host:
```bash
curl -sk https://dps-metrics.example.com/metrics | head
```

In the Prometheus UI, Status -> Targets should list the dps-server job as UP.
Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| ServiceMonitor exists but Prometheus shows no target | serviceMonitorSelector on the Prometheus instance does not match the ServiceMonitor labels. | Add the matching label via dps.serviceMonitor.additionalLabels (commonly release: <prometheus-release>). |
| Target shown but DOWN with a TLS error | Prometheus cannot validate the TLS certificate. | Either set tlsConfig.insecureSkipVerify: true, terminate TLS at an Ingress, or add the scraped hostname to dps.serverTLS.extraDNSNames. |
| Target DOWN with connection refused | dps.serverTLS.transportSecurity is set to insecure but serviceMonitor.scheme is still https (or vice versa). | Align both: set dps.serviceMonitor.scheme to http when running insecurely, or to https when TLS is enabled. |
| 0 series returned from dps-server | The pod has not finished starting, or the app: dps-server label is missing on the Service (custom override). | kubectl get svc -n dps -l app=dps-server should return at least dps-server-http. |
| External Prometheus cannot reach the endpoint | Network path / firewall, or the cluster's load balancer has no external IP yet. | Verify with curl -k https://<endpoint>/metrics from the Prometheus host. |
See Also
- Deployment Guide – chart installation and dps.serverTLS configuration.