nat.data_models.profiler#

Classes#

Module Contents#

class PromptCachingConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
min_frequency: float = 0.5#
class BottleneckConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable_simple_stack: bool = False#
enable_nested_stack: bool = False#
class ConcurrencySpikeConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
spike_threshold: int = 1#
class PrefixSpanConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
min_support: float = 2#
min_coverage: float = 0#
max_text_len: int = 1000#
top_k: int = 10#
chain_with_common_prefixes: bool = False#
class PredictionTrieConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
output_filename: str = 'prediction_trie.json'#
auto_sensitivity: bool = True#
sensitivity_scale: int = 5#
w_critical: float = 0.5#
w_fanout: float = 0.3#
w_position: float = 0.2#
w_parallel: float = 0.0#
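The four `w_*` weights above form a convex combination (0.5 + 0.3 + 0.2 + 0.0 = 1.0). A hypothetical sketch of how such weights could blend per-node signals into a single sensitivity score; the function name and signal names are illustrative, not the trie's actual implementation:

```python
def sensitivity_score(critical: float, fanout: float, position: float, parallel: float,
                      w_critical: float = 0.5, w_fanout: float = 0.3,
                      w_position: float = 0.2, w_parallel: float = 0.0) -> float:
    """Blend normalized signals (each assumed to lie in [0, 1]) with the
    configured weights. Defaults mirror PredictionTrieConfig and sum to 1.0,
    so the score also stays in [0, 1]."""
    return (w_critical * critical + w_fanout * fanout
            + w_position * position + w_parallel * parallel)
```

With the default weights, `w_parallel` contributes nothing, so a node's parallelism signal is ignored unless the weight is raised.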
class DynamoMetricsConfig(/, **data: Any)#

Bases: pydantic.BaseModel

Configuration for collecting Dynamo inference stack metrics.

Core Optimization Metrics#

The profiler focuses on three core metrics for Dynamo LLM optimization:

  1. KV Efficiency (KVE) (collect_kv_cache): Token-agnostic measure of computational work saved via the KV cache. Formula: KVE = cached_tokens / prompt_tokens. A KVE of 0.8 means 80% of prompt tokens were served from cache. Affected by prefix routing hints (prefix_id, nvext_prefix_osl, nvext_prefix_iat).

  2. Time to First Token - TTFT (collect_ttft): Latency from request to first token. Lower = faster initial response. Affected by queue depth, worker selection, KV cache hits.

  3. Inter-Token Latency - ITL (collect_itl): Time between tokens during streaming. Lower = smoother streaming. Affected by batch scheduling, GPU utilization.
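The three core metrics reduce to simple arithmetic over per-request data. A minimal sketch, assuming access to cached/prompt token counts and token arrival timestamps (the field names here are illustrative, not the profiler's internal schema):

```python
def kv_efficiency(cached_tokens: int, prompt_tokens: int) -> float:
    """KVE = cached_tokens / prompt_tokens.
    A value of 0.8 means 80% of the prompt was served from the KV cache."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens

def ttft(request_ts: float, first_token_ts: float) -> float:
    """Time to First Token: latency from request submission to the first streamed token."""
    return first_token_ts - request_ts

def inter_token_latencies(token_timestamps: list[float]) -> list[float]:
    """ITL: gaps between consecutive tokens during streaming (one fewer than tokens)."""
    return [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
```

In practice the profiler sources these values from Prometheus rather than computing them inline; the sketch only fixes the definitions.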

To collect only core metrics for optimization, use:

config = DynamoMetricsConfig.core_metrics_only()

Dynamo Endpoints#

  • Frontend (:8000/metrics): Latency, throughput, token stats

  • Worker (:8081/metrics): KV cache, SGLang stats

  • Router (:8082/metrics): Thompson Sampling routing

  • Processor (:8083/metrics): Thompson Sampling KVE

Adding New Metrics#

To add metrics from any Dynamo endpoint:

  1. Identify the metric from the endpoint:

    curl localhost:8081/metrics | grep kv
    
  2. Add to DynamoMetricsResult in src/nat/profiler/inference_optimization/dynamo_metrics.py:

    • Add a new field to the Pydantic model

    • Add the Prometheus query in METRIC_QUERIES

  3. Example - Adding a new metric:

    # In dynamo_metrics.py METRIC_QUERIES dict:
    "my_new_metric": "rate(dynamo_component_my_metric_total[5m])"
    
    # In DynamoMetricsResult model:
    my_new_metric: float | None = Field(default=None, description="My new metric")
    

Metric Reference by Endpoint#

  • Frontend (:8000): dynamo_frontend_* (requests, latency, tokens)

  • Worker (:8081): dynamo_component_kvstats_*, sglang:* (KV cache, SGLang)

  • Router (:8082): dynamo_component_* with dynamo_component="router" label

  • Processor (:8083): dynamo_component_thompson_* (Thompson Sampling)

See external/dynamo/monitoring/README.md for the complete metrics reference.
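Collection against any of these endpoints ultimately reduces to a Prometheus HTTP API call. A minimal sketch of building an instant-query URL (`/api/v1/query` is the standard Prometheus API path; the metric name below is illustrative, following the `dynamo_frontend_*` family above):

```python
from urllib.parse import urlencode

def build_instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression,
    URL-encoding the query so brackets and braces survive transport."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

# Illustrative: request rate over a 30s window, matching the default query_range.
url = build_instant_query_url(
    "http://localhost:9090",
    "rate(dynamo_frontend_requests_total[30s])",
)
```

Fetching the URL (e.g. with urllib or httpx) returns a JSON body whose `data.result` entries carry the sampled values.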

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

enable: bool = None#
prometheus_url: str = None#
collect_kv_cache: bool = None#
collect_ttft: bool = None#
collect_itl: bool = None#
collect_inflight_requests: bool = None#
collect_throughput: bool = None#
collect_token_throughput: bool = None#
query_range: str = None#
lookback_seconds: float = None#
workflow_start_timestamp: float | None = None#
workflow_end_timestamp: float | None = None#
classmethod core_metrics_only(prometheus_url: str = 'http://localhost:9090', query_range: str = '30s') → DynamoMetricsConfig#

Create a config that collects only the three core optimization metrics.

This is optimized for tight optimization loops where you only need:

  • KV Cache Efficiency

  • TTFT (Time to First Token)

  • ITL (Inter-Token Latency)

Args:

prometheus_url: Prometheus server URL

query_range: Time range for rate calculations

Returns:

DynamoMetricsConfig with only core metrics enabled

Usage:

config = DynamoMetricsConfig.core_metrics_only()
# Equivalent to:
# DynamoMetricsConfig(
#     enable=True,
#     collect_kv_cache=True,
#     collect_ttft=True,
#     collect_itl=True,
#     collect_inflight_requests=False,
#     collect_throughput=False,
#     collect_token_throughput=False,
# )
class ProfilerConfig(/, **data: Any)#

Bases: pydantic.BaseModel

base_metrics: bool = False#
token_usage_forecast: bool = False#
token_uniqueness_forecast: bool = False#
workflow_runtime_forecast: bool = False#
compute_llm_metrics: bool = False#
csv_exclude_io_text: bool = False#
prompt_caching_prefixes: PromptCachingConfig#
bottleneck_analysis: BottleneckConfig#
concurrency_spike_analysis: ConcurrencySpikeConfig#
prefix_span_analysis: PrefixSpanConfig#
prediction_trie: PredictionTrieConfig#
dynamo_metrics: DynamoMetricsConfig#
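Taken together, the models above nest under ProfilerConfig. A hedged sketch of what enabling a few analyses might look like in a workflow configuration file; the YAML layout and the top-level `profiler` key path are assumptions, while the field names come directly from the models above:

```
profiler:
  base_metrics: true
  compute_llm_metrics: true
  bottleneck_analysis:
    enable_simple_stack: true
  concurrency_spike_analysis:
    enable: true
    spike_threshold: 5
  dynamo_metrics:
    enable: true
    prometheus_url: http://localhost:9090
    collect_kv_cache: true
    collect_ttft: true
    collect_itl: true
```

Fields left out fall back to the defaults shown in each model's reference above.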