Prometheus Monitoring
Overview
dps-server exports operational and business metrics in Prometheus text format on the HTTP service. This page covers two deployment perspectives:
- Prometheus is deployed in the same Kubernetes cluster as DPS – enable the chart-managed ServiceMonitor (recommended).
- Prometheus is external to the cluster – expose the metrics endpoint and configure Prometheus to scrape it from outside.
Scope: This document only covers how to make DPS metrics discoverable and scrapable. It does not cover Prometheus / Prometheus Operator installation, alerting, or dashboarding. For the list of metrics emitted by dps-server, see the /metrics endpoint or the metrics reference (when available).
Metrics Endpoint Reference
| Property | Value |
|---|---|
| Pod label / service selector | app: dps-server |
| Service name | <release>-server-http (e.g. dps-server-http for a release named dps) |
| Service port name | dps-http |
| Service port number | 8000 (configurable via dps.service.http.port) |
| Container port | 8000 (configurable via dps.containerPorts.http.port) |
| Path | /metrics |
| Scheme | https (default; depends on dps.serverTLS.transportSecurity) |
| Authentication | None on /metrics itself; TLS only |
Prerequisites
| Scenario | Requirements |
|---|---|
| In-cluster Prometheus | The Prometheus Operator (e.g. kube-prometheus-stack) must be installed in the cluster, and the monitoring.coreos.com/v1 ServiceMonitor CRD must be available. The Prometheus instance must be configured to discover ServiceMonitor resources in the DPS namespace. |
| External Prometheus | The metrics endpoint must be reachable from the Prometheus host (Ingress, LoadBalancer, NodePort, VPN, etc.). Prometheus must be able to validate the TLS certificate, or be configured with insecure_skip_verify: true. |
Scenario 1 – Prometheus in the Same Cluster
Enable the Chart-Managed ServiceMonitor
The DPS Helm chart ships a ServiceMonitor template that is rendered when dps.serviceMonitor.enabled is set. By default it is disabled.
Add the following to your values.yaml:
```yaml
dps:
  serviceMonitor:
    enabled: true
    interval: 30s
    scrapeTimeout: 10s
    scheme: https
    tlsConfig:
      insecureSkipVerify: true
    additionalLabels:
      release: kube-prometheus-stack
```

Then install or upgrade the chart:
```bash
helm upgrade --install dps ngc/dps \
  --namespace dps \
  --values values.yaml
```

Note: Many Prometheus Operator installations require a label such as release: <prometheus-release-name> so the Prometheus instance picks up the ServiceMonitor. Check your Prometheus's serviceMonitorSelector to determine which labels to add via dps.serviceMonitor.additionalLabels.
Deploy the ServiceMonitor in a Different Namespace
If your Prometheus Operator is configured to only watch ServiceMonitor objects in a specific namespace (commonly the same namespace as Prometheus itself), deploy the ServiceMonitor there while leaving DPS in its own namespace:
```yaml
dps:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    additionalLabels:
      release: kube-prometheus-stack
```

When serviceMonitor.namespace differs from the DPS release namespace, the chart automatically adds a namespaceSelector pointing back at the release namespace so Prometheus knows where to scrape.
To override the namespace selector explicitly:
```yaml
dps:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    namespaceSelector:
      matchNames:
        - dps
        - dps-prod
```

Configuration Reference
All ServiceMonitor settings live under dps.serviceMonitor in the chart’s values.yaml. Always refer to the chart README and values.yaml for the authoritative, version-specific list of keys, defaults, and inline documentation – the keys highlighted on this page (enabled, namespace, namespaceSelector, additionalLabels, interval, scrapeTimeout, scheme, tlsConfig) are the most commonly tuned ones, but they may not be exhaustive in every release.
To pull the latest values for the version you are deploying:
```bash
helm repo add ngc https://helm.ngc.nvidia.com/nvidia
helm repo update ngc
helm pull ngc/dps --untar
grep -A 30 "^  serviceMonitor:" dps/values.yaml
```

See Configuration Reference in the Deployment Guide for the general procedure.
Verify the ServiceMonitor
```bash
kubectl get servicemonitor -n dps
kubectl describe servicemonitor -n dps dps-server-metrics
```

In the Prometheus UI, navigate to Status -> Targets and confirm a target with job label dps-server-metrics is UP.
You can also scrape the metrics endpoint directly from inside the cluster:
```bash
kubectl run -n dps --rm -it --restart=Never curl --image=curlimages/curl -- \
  curl -sk https://dps-server-http:8000/metrics | head
```

Scenario 2 – Standalone ServiceMonitor (No Helm)
If you cannot or do not want to enable the chart’s ServiceMonitor (for example, monitoring is owned by a different team that manages all ServiceMonitor objects centrally), apply this manifest directly. It targets the Service created by the DPS chart.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dps-server-metrics
  namespace: monitoring
  labels:
    app: dps-server
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: dps-server
  namespaceSelector:
    matchNames:
      - dps
  endpoints:
    - port: dps-http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
```

Adjust metadata.namespace, metadata.labels.release, and spec.namespaceSelector.matchNames for your environment.
Scenario 3 – Prometheus Outside the Cluster
When Prometheus runs outside the DPS cluster (for example, a central observability cluster, a bare-metal Prometheus, or a managed Prometheus service), the chart’s ServiceMonitor cannot be used. You must:
- Expose the dps-server-http Service outside the cluster.
- Configure the external Prometheus to scrape the resulting endpoint.
Step 1: Expose the Metrics Endpoint
Note: The chart's built-in dps.ingress and dps.gateway resources only route to the gRPC service (<release>-server-grpc). They do not expose the HTTP service or /metrics. The HTTP service is ClusterIP only by default, so /metrics is not reachable from outside the cluster unless you add one of the resources below.
Choose one of the following based on your network:
Option A – NodePort
Apply a separate Service of type NodePort so that the existing ClusterIP service used inside the cluster is unaffected:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dps-server-metrics-external
  namespace: dps
  labels:
    app: dps-server
spec:
  type: NodePort
  selector:
    app: dps-server
  ports:
    - name: dps-http
      port: 8000
      targetPort: 8000
      nodePort: 30800
      protocol: TCP
```

Prometheus will scrape https://<any-node-ip>:30800/metrics.
Option B – LoadBalancer
If your cluster supports cloud LoadBalancer services or MetalLB:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dps-server-metrics-external
  namespace: dps
  labels:
    app: dps-server
spec:
  type: LoadBalancer
  selector:
    app: dps-server
  ports:
    - name: dps-http
      port: 8000
      targetPort: 8000
      protocol: TCP
```

Prometheus will scrape https://<loadbalancer-ip>:8000/metrics.
Option C – Ingress
If you already terminate TLS at an ingress controller, you can expose /metrics via an Ingress. Restrict access with the controller’s authentication / IP allow-list features so the endpoint is not publicly exposed.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: dps-server-metrics
  namespace: dps
  annotations:
    nginx.ingress.kubernetes.io/whitelist-source-range: "10.0.0.0/8"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - dps-metrics.example.com
      secretName: dps-metrics-tls
  rules:
    - host: dps-metrics.example.com
      http:
        paths:
          - path: /metrics
            pathType: Prefix
            backend:
              service:
                name: dps-server-http
                port:
                  name: dps-http
```

Prometheus will scrape https://dps-metrics.example.com/metrics.
Step 2: Configure the External Prometheus
Add a static scrape job to your external Prometheus configuration. The exact form depends on whether Prometheus has Kubernetes service discovery available; the static target form below works in all cases.
```yaml
scrape_configs:
  - job_name: dps-server
    metrics_path: /metrics
    scheme: https
    scrape_interval: 30s
    scrape_timeout: 10s
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets:
          - dps-metrics.example.com   # Ingress host
          # - 10.0.0.10:30800         # NodePort example
          # - 10.0.0.20:8000          # LoadBalancer example
        labels:
          service: dps-server
          cluster: my-dps-cluster
```

If you operate Prometheus via the Operator and want to manage this with a CRD instead of static config, use a Probe or an additionalScrapeConfigs Secret – see the Prometheus Operator documentation.
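As a sketch of the Operator-managed alternative, the same scrape job can live in a Secret that the Prometheus custom resource references via spec.additionalScrapeConfigs. The Secret name and key below are illustrative, not chart-provided:

```yaml
# Illustrative Secret holding an extra scrape job for the Prometheus Operator.
# The content is a bare list of scrape_config entries (no scrape_configs: key).
apiVersion: v1
kind: Secret
metadata:
  name: dps-additional-scrape-configs   # illustrative name
  namespace: monitoring
stringData:
  dps-scrape.yaml: |
    - job_name: dps-server
      metrics_path: /metrics
      scheme: https
      tls_config:
        insecure_skip_verify: true
      static_configs:
        - targets:
            - dps-metrics.example.com
```

Wire it up by setting spec.additionalScrapeConfigs on the Prometheus resource to the Secret name and key (in kube-prometheus-stack this is exposed as prometheus.prometheusSpec.additionalScrapeConfigsSecret or similar; check your chart version).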
TLS
dps-server serves /metrics over TLS by default. For Prometheus to validate the certificate, the certificate must include a SAN that matches the address Prometheus is scraping. In most cluster-issued certificates this is an in-cluster DNS name (e.g. dps-server-http.dps.svc.cluster.local), which is not reachable from outside.
Recommended approaches, in order of preference:
- Terminate TLS at an Ingress with a publicly trusted certificate (Option C above) and let Prometheus validate normally (insecure_skip_verify: false).
- Issue a certificate with the externally routable SAN by adding it to dps.serverTLS.extraDNSNames, then trust the issuing CA on the Prometheus host:

  ```yaml
  dps:
    serverTLS:
      extraDNSNames:
        - dps-metrics.example.com
  ```

- Use insecure_skip_verify: true for non-production setups only.
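To make the SAN requirement concrete, the following self-contained sketch generates a throwaway certificate carrying both the in-cluster name and the external SAN, then lists its SANs. The filenames and hostnames are illustrative, and the commands assume OpenSSL 1.1.1 or newer; point the same x509 inspection at the certificate Prometheus actually receives when debugging a TLS validation failure:

```shell
# Generate a throwaway self-signed certificate with both SANs (illustrative).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/dps-metrics.key -out /tmp/dps-metrics.crt \
  -subj "/CN=dps-server" \
  -addext "subjectAltName=DNS:dps-server-http.dps.svc.cluster.local,DNS:dps-metrics.example.com"

# List the SANs; the hostname Prometheus scrapes must appear here.
openssl x509 -in /tmp/dps-metrics.crt -noout -ext subjectAltName
```

To inspect a live endpoint instead, feed the certificate from openssl s_client (e.g. `openssl s_client -connect dps-metrics.example.com:443 </dev/null`) into the same `openssl x509` inspection.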
Authentication
/metrics is unauthenticated at the application layer. Anyone with network access to the endpoint can read the metrics. When exposing the endpoint outside the cluster, restrict access via:
- IP allow-list at the ingress controller / load balancer
- Network policy / firewall
- mTLS at the ingress controller (with Prometheus configured to present a client certificate)
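As one sketch of the network-policy option, a Kubernetes NetworkPolicy can restrict ingress to the metrics port. The CIDR below is a placeholder for the Prometheus host's network, and enforcement requires a CNI plugin that implements NetworkPolicy:

```yaml
# Illustrative policy: only allow the Prometheus network to reach port 8000
# on dps-server pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dps-server-metrics-allow
  namespace: dps
spec:
  podSelector:
    matchLabels:
      app: dps-server
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 10.0.0.0/8   # placeholder: Prometheus host network
      ports:
        - protocol: TCP
          port: 8000
```

Note that once a NetworkPolicy selects a pod, all other ingress to it is denied by default, so legitimate traffic to other ports (for example the gRPC port) needs its own allow rules.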
Verify
From the Prometheus host:
```bash
curl -sk https://dps-metrics.example.com/metrics | head
```

In the Prometheus UI, Status -> Targets should list the dps-server job as UP.
Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| ServiceMonitor exists but Prometheus shows no target | serviceMonitorSelector on the Prometheus instance does not match the ServiceMonitor labels. | Add the matching label via dps.serviceMonitor.additionalLabels (commonly release: <prometheus-release>). |
| Target shown but DOWN with a TLS error | Prometheus cannot validate the TLS certificate. | Either set tlsConfig.insecureSkipVerify: true, terminate TLS at an Ingress, or add the scraped hostname to dps.serverTLS.extraDNSNames. |
| Target DOWN with connection refused | dps.serverTLS.transportSecurity is set to insecure but serviceMonitor.scheme is still https (or vice versa). | Align both: set dps.serviceMonitor.scheme to http when running insecurely, or to https when TLS is enabled. |
| 0 series returned from dps-server | The pod has not finished starting, or the app: dps-server label is missing on the Service (custom override). | kubectl get svc -n dps -l app=dps-server should return at least dps-server-http. |
| External Prometheus cannot reach the endpoint | Network path / firewall, or the cluster's load balancer has no external IP yet. | Verify with curl -k https://<endpoint>/metrics from the Prometheus host. |
See Also
- Deployment Guide – chart installation and dps.serverTLS configuration.