Prometheus Monitoring

Overview

dps-server exports operational and business metrics in Prometheus text format on the HTTP service. This page covers two deployment perspectives:

  1. Prometheus is deployed in the same Kubernetes cluster as DPS – enable the chart-managed ServiceMonitor (recommended).
  2. Prometheus is external to the cluster – expose the metrics endpoint and configure Prometheus to scrape it from outside.

Scope: This document only covers how to make DPS metrics discoverable and scrapable. It does not cover Prometheus / Prometheus Operator installation, alerting, or dashboarding. For the list of metrics emitted by dps-server, see the /metrics endpoint or the metrics reference (when available).

Metrics Endpoint Reference

Property                      Value
Pod label / service selector  app: dps-server
Service name                  <release>-server-http (e.g. dps-server-http for a release named dps)
Service port name             dps-http
Service port number           8000 (configurable via dps.service.http.port)
Container port                8000 (configurable via dps.containerPorts.http.port)
Path                          /metrics
Scheme                        https (default; depends on dps.serverTLS.transportSecurity)
Authentication                None on /metrics itself; TLS only

Prerequisites

In-cluster Prometheus – The Prometheus Operator (e.g. kube-prometheus-stack) must be installed in the cluster, and the monitoring.coreos.com/v1 ServiceMonitor CRD must be available. The Prometheus instance must be configured to discover ServiceMonitor resources in the DPS namespace.

External Prometheus – The metrics endpoint must be reachable from the Prometheus host (Ingress, LoadBalancer, NodePort, VPN, etc.), and Prometheus must be able to validate the TLS certificate or be configured with insecure_skip_verify: true.

Scenario 1 – Prometheus in the Same Cluster

Enable the Chart-Managed ServiceMonitor

The DPS Helm chart ships a ServiceMonitor template that is rendered when dps.serviceMonitor.enabled is set. By default it is disabled.

Add the following to your values.yaml:

dps:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    additionalLabels:
      release: kube-prometheus-stack

Then install or upgrade the chart:

helm upgrade --install dps ngc/dps \
  --namespace dps \
  --values values.yaml

Note: Many Prometheus Operator installations require a label such as release: <prometheus-release-name> before the Prometheus instance picks up the ServiceMonitor. Check the serviceMonitorSelector on your Prometheus resource to determine which labels to add via dps.serviceMonitor.additionalLabels.
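
For reference, a kube-prometheus-stack Prometheus resource typically selects ServiceMonitors by the stack's own release label, along these lines (the label value is an example; match it to your actual release name):

```yaml
# Excerpt from a typical kube-prometheus-stack Prometheus resource.
# ServiceMonitors must carry a matching label, here release: kube-prometheus-stack.
spec:
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
```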

Deploy the ServiceMonitor in a Different Namespace

If your Prometheus Operator is configured to only watch ServiceMonitor objects in a specific namespace (commonly the same namespace as Prometheus itself), deploy the ServiceMonitor there while leaving DPS in its own namespace:

dps:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    additionalLabels:
      release: kube-prometheus-stack

When serviceMonitor.namespace differs from the DPS release namespace, the chart automatically adds a namespaceSelector pointing back at the release namespace so Prometheus knows where to scrape.
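
With the values above, the rendered ServiceMonitor would look roughly like this sketch (resource name and namespaces are illustrative for a release called dps):

```yaml
# Sketch of the chart-rendered ServiceMonitor (names illustrative).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dps-server-metrics
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: dps-server
  namespaceSelector:
    matchNames:
      - dps   # the DPS release namespace, added automatically by the chart
```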

To override the namespace selector explicitly:

dps:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    namespaceSelector:
      matchNames:
        - dps
        - dps-prod

Configuration Reference

All ServiceMonitor settings live under dps.serviceMonitor in the chart’s values.yaml. Always refer to the chart README and values.yaml for the authoritative, version-specific list of keys, defaults, and inline documentation – the keys highlighted on this page (enabled, namespace, namespaceSelector, additionalLabels, interval, scrapeTimeout, scheme, tlsConfig) are the most commonly tuned ones, but they may not be exhaustive in every release.

To pull the latest values for the version you are deploying:

helm repo add ngc https://helm.ngc.nvidia.com/nvidia
helm repo update ngc
helm pull ngc/dps --untar
grep -A 30 "^  serviceMonitor:" dps/values.yaml

See Configuration Reference in the Deployment Guide for the general procedure.

Verify the ServiceMonitor

kubectl get servicemonitor -n dps
kubectl describe servicemonitor -n dps dps-server-metrics

In the Prometheus UI, navigate to Status -> Targets and confirm the DPS target is UP. Unless jobLabel is set on the ServiceMonitor, the Prometheus Operator derives the job label from the Service name (e.g. dps-server-http).

You can also scrape the metrics endpoint directly from inside the cluster:

kubectl run -n dps --rm -it --restart=Never curl --image=curlimages/curl -- \
  curl -sk https://dps-server-http:8000/metrics | head

Scenario 2 – Standalone ServiceMonitor (No Helm)

If you cannot or do not want to enable the chart’s ServiceMonitor (for example, monitoring is owned by a different team that manages all ServiceMonitor objects centrally), apply this manifest directly. It targets the Service created by the DPS chart.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dps-server-metrics
  namespace: monitoring
  labels:
    app: dps-server
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: dps-server
  namespaceSelector:
    matchNames:
      - dps
  endpoints:
    - port: dps-http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      scheme: https
      tlsConfig:
        insecureSkipVerify: true

Adjust metadata.namespace, metadata.labels.release, and spec.namespaceSelector.matchNames for your environment.

Scenario 3 – Prometheus Outside the Cluster

When Prometheus runs outside the DPS cluster (for example, a central observability cluster, a bare-metal Prometheus, or a managed Prometheus service), the chart’s ServiceMonitor cannot be used. You must:

  1. Expose the dps-server-http Service outside the cluster.
  2. Configure the external Prometheus to scrape the resulting endpoint.

Step 1: Expose the Metrics Endpoint

Note: The chart’s built-in dps.ingress and dps.gateway resources only route to the gRPC service (<release>-server-grpc). They do not expose the HTTP service or /metrics. The HTTP service is ClusterIP only by default, so /metrics is not reachable from outside the cluster unless you add one of the resources below.

Choose one of the following based on your network:

Option A – NodePort

Apply a separate Service of type NodePort so that the existing ClusterIP service used inside the cluster is unaffected:

apiVersion: v1
kind: Service
metadata:
  name: dps-server-metrics-external
  namespace: dps
  labels:
    app: dps-server
spec:
  type: NodePort
  selector:
    app: dps-server
  ports:
    - name: dps-http
      port: 8000
      targetPort: 8000
      nodePort: 30800
      protocol: TCP

Prometheus will scrape https://<any-node-ip>:30800/metrics.

Option B – LoadBalancer

If your cluster supports cloud LoadBalancer services or MetalLB:

apiVersion: v1
kind: Service
metadata:
  name: dps-server-metrics-external
  namespace: dps
  labels:
    app: dps-server
spec:
  type: LoadBalancer
  selector:
    app: dps-server
  ports:
    - name: dps-http
      port: 8000
      targetPort: 8000
      protocol: TCP

Prometheus will scrape https://<loadbalancer-ip>:8000/metrics.

Option C – Ingress

If you already terminate TLS at an ingress controller, you can expose /metrics via an Ingress. Restrict access with the controller’s authentication / IP allow-list features so the endpoint is not publicly exposed.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dps-server-metrics
  namespace: dps
  annotations:
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - dps-metrics.example.com
      secretName: dps-metrics-tls
  rules:
    - host: dps-metrics.example.com
      http:
        paths:
          - path: /metrics
            pathType: Prefix
            backend:
              service:
                name: dps-server-http
                port:
                  name: dps-http

Prometheus will scrape https://dps-metrics.example.com/metrics.

Step 2: Configure the External Prometheus

Add a static scrape job to your external Prometheus configuration. The exact form depends on whether Prometheus has Kubernetes service discovery available; the static target form below works in all cases.

scrape_configs:
  - job_name: dps-server
    metrics_path: /metrics
    scheme: https
    scrape_interval: 30s
    scrape_timeout: 10s
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - dps-metrics.example.com   # Ingress host
          # - 10.0.0.10:30800         # NodePort example
          # - 10.0.0.20:8000          # LoadBalancer example
        labels:
          service: dps-server
          cluster: my-dps-cluster

If you operate Prometheus via the Operator and want to manage this with a CRD instead of static config, use a Probe or an additionalScrapeConfigs Secret – see the Prometheus Operator documentation.
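
As a sketch of the Secret-based route, the scrape job above would live in a Secret referenced from the Prometheus resource; the Secret name and key below are examples:

```yaml
# Example only: wiring an additionalScrapeConfigs Secret into the Prometheus CR.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  additionalScrapeConfigs:
    name: dps-additional-scrape-configs   # Secret holding the scrape_configs snippet
    key: prometheus-additional.yaml       # key whose value is the scrape_configs YAML
```

Create the Secret from a file containing the scrape_configs entries, e.g. kubectl create secret generic dps-additional-scrape-configs --from-file=prometheus-additional.yaml -n monitoring.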

TLS

dps-server serves /metrics over TLS by default. For Prometheus to validate the certificate, the certificate must include a SAN that matches the address Prometheus is scraping. In most cluster-issued certificates this is an in-cluster DNS name (e.g. dps-server-http.dps.svc.cluster.local), which is not reachable from outside.

Recommended approaches, in order of preference:

  1. Terminate TLS at an Ingress with a publicly trusted certificate (Option C above) and let Prometheus validate normally (insecure_skip_verify: false).

  2. Issue a certificate with the externally-routable SAN by adding it to dps.serverTLS.extraDNSNames, then trust the issuing CA on the Prometheus host:

    dps:
      serverTLS:
        extraDNSNames:
          - dps-metrics.example.com
  3. Use insecure_skip_verify: true for non-production setups only.
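
For approach 2, the corresponding Prometheus tls_config points at the issuing CA instead of skipping verification; the file path and hostname below are examples:

```yaml
# Sketch: validate the DPS certificate against its issuing CA (paths are examples).
scrape_configs:
  - job_name: dps-server
    metrics_path: /metrics
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/certs/dps-ca.crt   # CA that signed the server certificate
      server_name: dps-metrics.example.com        # must match a SAN in the certificate
    static_configs:
      - targets:
          - dps-metrics.example.com
```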

Authentication

/metrics is unauthenticated at the application layer. Anyone with network access to the endpoint can read the metrics. When exposing the endpoint outside the cluster, restrict access via:

  • IP allow-list at the ingress controller / load balancer
  • Network policy / firewall
  • mTLS at the ingress controller (with Prometheus configured to present a client certificate)
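
For the network-policy route, a minimal sketch (the namespace label is an assumption; adjust it to wherever your Prometheus runs):

```yaml
# Illustrative NetworkPolicy: only pods in the monitoring namespace may reach
# port 8000 on dps-server pods. Requires a CNI that enforces NetworkPolicy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dps-metrics-allow-prometheus
  namespace: dps
spec:
  podSelector:
    matchLabels:
      app: dps-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8000
```

Note that once a pod is selected by an Ingress policy, all other inbound traffic to it is denied, so add further rules for gRPC and any other clients as needed.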

Verify

From the Prometheus host:

curl -sk https://dps-metrics.example.com/metrics | head

In the Prometheus UI, Status -> Targets should list the dps-server job as UP.

Troubleshooting

Symptom: ServiceMonitor exists but Prometheus shows no target
Likely cause: The serviceMonitorSelector on the Prometheus instance does not match the ServiceMonitor labels.
Fix: Add the matching label via dps.serviceMonitor.additionalLabels (commonly release: <prometheus-release>).

Symptom: Target shown but DOWN with a TLS error
Likely cause: Prometheus cannot validate the TLS certificate.
Fix: Either set tlsConfig.insecureSkipVerify: true, terminate TLS at an Ingress, or add the scraped hostname to dps.serverTLS.extraDNSNames.

Symptom: Target DOWN with connection refused
Likely cause: dps.serverTLS.transportSecurity is set to insecure but serviceMonitor.scheme is still https (or vice versa).
Fix: Align the two: set dps.serviceMonitor.scheme to http when running insecurely, or to https when TLS is enabled.

Symptom: No series returned from dps-server
Likely cause: The pod has not finished starting, or the app: dps-server label is missing from the Service (custom override).
Fix: kubectl get svc -n dps -l app=dps-server should return at least dps-server-http.

Symptom: External Prometheus cannot reach the endpoint
Likely cause: Network path or firewall issue, or the cluster's load balancer has no external IP yet.
Fix: Verify with curl -k https://<endpoint>/metrics from the Prometheus host.

See Also