nat.data_models.profiler#
Classes#
DynamoMetricsConfig: Configuration for collecting Dynamo inference stack metrics.
Module Contents#
- class PromptCachingConfig(/, **data: Any)#
Bases:
pydantic.BaseModel
- class BottleneckConfig(/, **data: Any)#
Bases:
pydantic.BaseModel
- class ConcurrencySpikeConfig(/, **data: Any)#
Bases:
pydantic.BaseModel
- class PrefixSpanConfig(/, **data: Any)#
Bases:
pydantic.BaseModel
- class PredictionTrieConfig(/, **data: Any)#
Bases:
pydantic.BaseModel
- class DynamoMetricsConfig(/, **data: Any)#
Bases:
pydantic.BaseModel

Configuration for collecting Dynamo inference stack metrics.
Core Optimization Metrics#
The profiler focuses on three core metrics for Dynamo LLM optimization:
- KV Efficiency (KVE) (collect_kv_cache): Token-agnostic measure of computational work saved via the KV cache. Formula: KVE = cached_tokens / prompt_tokens. A KVE of 0.8 means 80% of prompt tokens were served from cache. Affected by prefix routing hints (prefix_id, nvext_prefix_osl, nvext_prefix_iat).
- Time to First Token (TTFT) (collect_ttft): Latency from request to first token. Lower = faster initial response. Affected by queue depth, worker selection, and KV cache hits.
- Inter-Token Latency (ITL) (collect_itl): Time between tokens during streaming. Lower = smoother streaming. Affected by batch scheduling and GPU utilization.
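The KVE formula above reduces to a one-line ratio; a minimal sketch, assuming the cached and prompt token counts are already available from the response usage stats (the helper name is hypothetical, not part of this module):

```python
def kv_efficiency(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens served from the KV cache (KVE = cached / prompt).

    Returns 0.0 when there are no prompt tokens, to avoid division by zero.
    """
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens

# A KVE of 0.8 means 80% of the prompt was served from cache:
print(kv_efficiency(80, 100))  # → 0.8
```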
To collect only core metrics for optimization, use:
config = DynamoMetricsConfig.core_metrics_only()
Dynamo Endpoints#
- Frontend (:8000/metrics): Latency, throughput, token stats
- Worker (:8081/metrics): KV cache, SGLang stats
- Router (:8082/metrics): Thompson Sampling routing
- Processor (:8083/metrics): Thompson Sampling KVE
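All four endpoints serve the Prometheus text exposition format, so a single parser can cover them. A minimal stdlib-only sketch (a real collector would first fetch each `:port/metrics` URL; metric names with labels are kept verbatim as dictionary keys):

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus exposition lines into {metric: value}, skipping comments."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a numeric sample
    return metrics

sample = """\
# HELP dynamo_frontend_requests_total Total requests
dynamo_frontend_requests_total 42
dynamo_component_kvstats_gpu_cache_usage_percent{component="worker"} 0.35
"""
print(parse_prometheus_text(sample)["dynamo_frontend_requests_total"])  # → 42.0
```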
Adding New Metrics#
To add metrics from any Dynamo endpoint:
1. Identify the metric from the endpoint:

   curl localhost:8081/metrics | grep kv

2. Add to DynamoMetricsResult in src/nat/profiler/inference_optimization/dynamo_metrics.py:
   - Add a new field to the Pydantic model
   - Add the Prometheus query in METRIC_QUERIES

Example - Adding a new metric:

```python
# In dynamo_metrics.py METRIC_QUERIES dict:
"my_new_metric": "rate(dynamo_component_my_metric_total[5m])"

# In DynamoMetricsResult model:
my_new_metric: float | None = Field(default=None, description="My new metric")
```
Metric Reference by Endpoint#
- Frontend (:8000): dynamo_frontend_* (requests, latency, tokens)
- Worker (:8081): dynamo_component_kvstats_*, sglang:* (KV cache, SGLang)
- Router (:8082): dynamo_component_* with dynamo_component="router" label
- Processor (:8083): dynamo_component_thompson_* (Thompson Sampling)
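Since the endpoints use distinct metric-name prefixes, scraped metrics can be grouped by prefix. A hedged sketch (the prefix map mirrors the reference above and is not part of this module; the Router's label-based selection is omitted because it requires label parsing rather than a name prefix):

```python
# Hypothetical prefix map mirroring the per-endpoint reference above
METRIC_PREFIXES: dict[str, tuple[str, ...]] = {
    "frontend": ("dynamo_frontend_",),
    "worker": ("dynamo_component_kvstats_", "sglang:"),
    "processor": ("dynamo_component_thompson_",),
}

def metrics_for_endpoint(metrics: dict[str, float], endpoint: str) -> dict[str, float]:
    """Keep only metrics whose names start with the endpoint's known prefixes."""
    prefixes = METRIC_PREFIXES[endpoint]
    return {name: v for name, v in metrics.items() if name.startswith(prefixes)}

scraped = {
    "dynamo_frontend_requests_total": 42.0,
    "dynamo_component_kvstats_active_blocks": 12.0,
    "sglang:num_running_reqs": 3.0,
}
print(metrics_for_endpoint(scraped, "worker"))
```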
See external/dynamo/monitoring/README.md for the complete metrics reference.

Create a new model by parsing and validating input data from keyword arguments.

Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

- classmethod core_metrics_only() → DynamoMetricsConfig#
Create a config that collects only the three core optimization metrics.
This is optimized for tight optimization loops where you only need:
- KV Cache Efficiency (KVE)
- TTFT (Time to First Token)
- ITL (Inter-Token Latency)
- Args:
  - prometheus_url: Prometheus server URL
  - query_range: Time range for rate calculations
- Returns:
DynamoMetricsConfig with only core metrics enabled
Usage:
```python
config = DynamoMetricsConfig.core_metrics_only()
# Equivalent to:
# DynamoMetricsConfig(
#     enable=True,
#     collect_kv_cache=True,
#     collect_ttft=True,
#     collect_itl=True,
#     collect_inflight_requests=False,
#     collect_throughput=False,
#     collect_token_throughput=False,
# )
```
- class ProfilerConfig(/, **data: Any)#
Bases:
pydantic.BaseModel

- prompt_caching_prefixes: PromptCachingConfig#
- bottleneck_analysis: BottleneckConfig#
- concurrency_spike_analysis: ConcurrencySpikeConfig#
- prefix_span_analysis: PrefixSpanConfig#
- prediction_trie: PredictionTrieConfig#
- dynamo_metrics: DynamoMetricsConfig#