SGLang Prometheus Metrics#

📚 Official Documentation: SGLang Production Metrics

This document describes how SGLang Prometheus metrics are exposed in Dynamo.

Overview#

When you run SGLang through Dynamo, SGLang engine metrics are passed through automatically and exposed on Dynamo’s /metrics endpoint (default port 8081). A single scrape of the worker’s endpoint therefore returns both SGLang engine metrics (prefixed with sglang:) and Dynamo runtime metrics (prefixed with dynamo_*).
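As a rough illustration of the two prefixes sharing one endpoint, the sketch below counts sample lines per prefix in a /metrics payload. The payload is hand-written, and the name dynamo_example_requests_total is a made-up placeholder, not a real Dynamo metric.

```python
from collections import Counter

# Hand-written payload for illustration only. The dynamo_ metric name is
# hypothetical; real names depend on your Dynamo version.
SAMPLE = """\
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="demo"} 0.0075
# TYPE dynamo_example_requests_total counter
dynamo_example_requests_total 3
"""


def count_by_prefix(exposition_text: str) -> Counter:
    """Count sample lines per metric prefix in a Prometheus text payload."""
    counts = Counter()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith("sglang:"):
            counts["sglang"] += 1
        elif line.startswith("dynamo_"):
            counts["dynamo"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    print(count_by_prefix(SAMPLE))
```

In practice you would pass the body returned by the worker's /metrics endpoint instead of SAMPLE.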

For the complete and authoritative list of all SGLang metrics, always refer to the official documentation linked above.

Dynamo runtime metrics are documented in docs/observability/metrics.md.

Metric Reference#

The official documentation includes:

  • Complete metric definitions with HELP and TYPE descriptions

  • Example metric output in Prometheus exposition format

  • Counter, Gauge, and Histogram metrics

  • Metric labels (e.g., model_name, engine_type, tp_rank, pp_rank)

  • Setup guide for Prometheus + Grafana monitoring

  • Troubleshooting tips and configuration examples

Metric Categories#

SGLang provides metrics in the following categories (all prefixed with sglang:):

  • Throughput metrics

  • Resource usage

  • Latency metrics

  • Disaggregation metrics (when enabled)

Note: Specific metrics are subject to change between SGLang versions. Always refer to the official documentation or inspect the /metrics endpoint for your SGLang version.
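Because the metric set varies by version, one way to see what your build exposes is to list the metric families and their declared types from the # TYPE lines. The following is a minimal sketch using only the standard library. The function name metric_types is illustrative, and the sample text is copied from the example output in this document.

```python
from typing import Dict

SAMPLE = """\
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
"""


def metric_types(exposition_text: str, prefix: str = "sglang:") -> Dict[str, str]:
    """Map each metric family starting with `prefix` to its declared type."""
    types: Dict[str, str] = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        # TYPE lines have the shape: "# TYPE <metric_name> <type>"
        if (
            len(parts) == 4
            and parts[0] == "#"
            and parts[1] == "TYPE"
            and parts[2].startswith(prefix)
        ):
            types[parts[2]] = parts[3]
    return types


if __name__ == "__main__":
    for name, kind in sorted(metric_types(SAMPLE).items()):
        print(f"{kind:10s} {name}")
```

Running this against the output of two SGLang versions and comparing the resulting dictionaries is a quick way to spot renamed or removed metrics.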

Enabling Metrics in Dynamo#

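SGLang metrics are exposed automatically when SGLang runs through Dynamo with metrics enabled: pass --enable-metrics to the worker and set DYN_SYSTEM_PORT, as shown in the launch step under Inspecting Metrics.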

Inspecting Metrics#

To see the actual metrics available in your SGLang version:

1. Launch SGLang with Metrics Enabled#

# Set system metrics port (automatically enables metrics server)
export DYN_SYSTEM_PORT=8081

# Start SGLang worker with metrics enabled
python -m dynamo.sglang --model <model_name> --enable-metrics

# Wait for the engine to finish initializing; SGLang metrics appear only after that

Metrics will be available at: http://localhost:8081/metrics
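Since SGLang metrics only show up once the engine has initialized, a script that scrapes right after launch may want to poll until the first sglang: line appears. Below is a minimal sketch of such a wait loop. The helper names, the timeout, and the polling interval are illustrative choices, and the URL assumes the port configured above.

```python
import time
import urllib.request
from typing import Callable


def fetch_metrics(url: str = "http://localhost:8081/metrics") -> str:
    """Fetch the raw Prometheus text payload from the worker."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_for_sglang_metrics(
    fetch: Callable[[], str] = fetch_metrics,
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until a sample line starting with 'sglang:' appears.

    Returns True once such a line is seen, or False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            body = fetch()
        except OSError:
            # Connection refused, URLError, and socket timeouts are all
            # OSError subclasses; treat them as "not ready yet".
            body = ""
        if any(line.startswith("sglang:") for line in body.splitlines()):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)

# Example usage (requires a running worker):
#   ready = wait_for_sglang_metrics(timeout_s=300)
```

The fetch and sleep parameters are injectable so the loop can be exercised without a running worker.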

2. Fetch Metrics via curl#

# -s hides curl's progress output so only the metric lines are printed
curl -s http://localhost:8081/metrics | grep "^sglang:"
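Note that the pattern ^sglang: matches sample lines only, so the # HELP and # TYPE comment lines are dropped. If you want those as well, a small filter along the following lines keeps them. This is a sketch over plain text, not Dynamo's own filtering code, and the dynamo_ metric name in the sample is a made-up placeholder.

```python
# Hand-written payload; dynamo_hypothetical_metric is not a real metric name.
SAMPLE = """\
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="demo"} 0.0075
# TYPE dynamo_hypothetical_metric gauge
dynamo_hypothetical_metric 1
"""


def filter_family_lines(exposition_text: str, prefix: str = "sglang:") -> str:
    """Keep sample lines plus HELP/TYPE comments for metrics with `prefix`."""
    kept = []
    for line in exposition_text.splitlines():
        if line.startswith(prefix):
            # A sample line such as: sglang:cache_hit_rate{...} 0.0075
            kept.append(line)
        elif line.startswith("# HELP " + prefix) or line.startswith("# TYPE " + prefix):
            # Metadata comment belonging to a matching metric family.
            kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    print(filter_family_lines(SAMPLE))
```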

3. Example Output#

Note: The specific metrics shown below are examples and may vary depending on your SGLang version. Always inspect your actual /metrics endpoint for the current list.

# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
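To work with output like this programmatically, each sample line can be split into a metric name, a label dictionary, and a numeric value. The sketch below does that with regular expressions from the standard library. It handles the simple cases shown above, leaves escape sequences inside label values unprocessed, and is not a full parser for the Prometheus exposition format.

```python
import re
from typing import Dict, List, Tuple

EXAMPLE = """\
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
"""

# <name>{<labels>} <value>, where the label block is optional.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# key="value" pairs; backslash escapes are matched but left as-is.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_samples(text: str) -> List[Sample]:
    """Parse sample lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_samples(EXAMPLE):
        print(name, labels.get("model_name"), value)
```

For production use, an established parser such as the one shipped with the Prometheus Python client is likely a better fit than hand-written regular expressions.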

Implementation Details#

  • SGLang uses multiprocess metrics collection via prometheus_client.multiprocess.MultiProcessCollector

  • Metrics are filtered by the sglang: prefix before being exposed

  • The integration uses Dynamo’s register_engine_metrics_callback() function

  • Metrics appear after SGLang engine initialization completes

See Also#

SGLang Metrics#

Dynamo Metrics#

  • Dynamo Metrics Guide: See docs/observability/metrics.md for complete documentation on Dynamo runtime metrics

  • Dynamo Runtime Metrics: Metrics prefixed with dynamo_* for runtime, components, endpoints, and namespaces

    • Implementation: lib/runtime/src/metrics.rs (Rust runtime metrics)

    • Metric names: lib/runtime/src/metrics/prometheus_names.rs (metric name constants)

    • Available at the same /metrics endpoint alongside SGLang metrics

  • Integration Code: components/src/dynamo/common/utils/prometheus.py - Prometheus utilities and callback registration