nat.data_models.profiler#

Classes#

Module Contents#

class PromptCachingConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
min_frequency: float = 0.5#
class BottleneckConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable_simple_stack: bool = False#
enable_nested_stack: bool = False#
class ConcurrencySpikeConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
spike_threshold: int = 1#
class PrefixSpanConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
min_support: float = 2#
min_coverage: float = 0#
max_text_len: int = 1000#
top_k: int = 10#
chain_with_common_prefixes: bool = False#
class PredictionTrieConfig(/, **data: Any)#

Bases: pydantic.BaseModel

enable: bool = False#
output_filename: str = 'prediction_trie.json'#
auto_sensitivity: bool = True#
sensitivity_scale: int = 5#
w_critical: float = 0.5#
w_fanout: float = 0.3#
w_position: float = 0.2#
w_parallel: float = 0.0#
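The four `w_*` weights above form a convex combination (0.5 + 0.3 + 0.2 + 0.0 = 1.0). A hypothetical sketch of how such weights could blend per-node signals into a single sensitivity score; the function name and signal names are illustrative, not the trie's actual implementation:

```python
def sensitivity_score(critical: float, fanout: float, position: float, parallel: float,
                      w_critical: float = 0.5, w_fanout: float = 0.3,
                      w_position: float = 0.2, w_parallel: float = 0.0) -> float:
    """Blend normalized signals (each assumed to lie in [0, 1]) with the
    configured weights. Defaults mirror PredictionTrieConfig and sum to 1.0,
    so the score also stays in [0, 1]."""
    return (w_critical * critical + w_fanout * fanout
            + w_position * position + w_parallel * parallel)
```

With the default weights, `w_parallel` contributes nothing, so a node's parallelism signal is ignored unless the weight is raised.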
class DynamoMetricsConfig(/, **data: Any)#

Bases: pydantic.BaseModel

Configuration for collecting Dynamo inference stack metrics.

Core Optimization Metrics#

The profiler focuses on three core metrics for Dynamo LLM optimization:

  1. KV Efficiency (KVE) (collect_kv_cache): Token-agnostic measure of computational work saved via the KV cache. Formula: KVE = cached_tokens / prompt_tokens. A KVE of 0.8 means 80% of prompt tokens were served from cache. Affected by prefix routing hints (prefix_id, nvext_prefix_osl, nvext_prefix_iat).

  2. Time to First Token - TTFT (collect_ttft): Latency from request to first token. Lower = faster initial response. Affected by queue depth, worker selection, KV cache hits.

  3. Inter-Token Latency - ITL (collect_itl): Time between tokens during streaming. Lower = smoother streaming. Affected by batch scheduling, GPU utilization.
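The three core metrics reduce to simple arithmetic over per-request data. A minimal sketch, assuming access to cached/prompt token counts and token arrival timestamps (the field names here are illustrative, not the profiler's internal schema):

```python
def kv_efficiency(cached_tokens: int, prompt_tokens: int) -> float:
    """KVE = cached_tokens / prompt_tokens.
    A value of 0.8 means 80% of the prompt was served from the KV cache."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens

def ttft(request_ts: float, first_token_ts: float) -> float:
    """Time to First Token: latency from request submission to the first streamed token."""
    return first_token_ts - request_ts

def inter_token_latencies(token_timestamps: list[float]) -> list[float]:
    """ITL: gaps between consecutive tokens during streaming (one fewer than tokens)."""
    return [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
```

In practice the profiler sources these values from Prometheus rather than computing them inline; the sketch only fixes the definitions.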

To collect only core metrics for optimization, use:

config = DynamoMetricsConfig.core_metrics_only()

Dynamo Endpoints#

  • Frontend (:8000/metrics): Latency, throughput, token stats

  • Worker (:8081/metrics): KV cache, SGLang stats

  • Router (:8082/metrics): Thompson Sampling routing

  • Processor (:8083/metrics): Thompson Sampling KVE

Adding New Metrics#

To add metrics from any Dynamo endpoint:

  1. Identify the metric from the endpoint:

    curl localhost:8081/metrics | grep kv
    
  2. Add to DynamoMetricsResult in src/nat/profiler/inference_optimization/dynamo_metrics.py:

    • Add a new field to the Pydantic model

    • Add the Prometheus query in METRIC_QUERIES

  3. Example - Adding a new metric:

    # In dynamo_metrics.py METRIC_QUERIES dict:
    "my_new_metric": "rate(dynamo_component_my_metric_total[5m])"
    
    # In DynamoMetricsResult model:
    my_new_metric: float | None = Field(default=None, description="My new metric")
    

Metric Reference by Endpoint#

  • Frontend (:8000): dynamo_frontend_* (requests, latency, tokens)

  • Worker (:8081): dynamo_component_kvstats_*, sglang:* (KV cache, SGLang)

  • Router (:8082): dynamo_component_* with dynamo_component="router" label

  • Processor (:8083): dynamo_component_thompson_* (Thompson Sampling)

See external/dynamo/monitoring/README.md for the complete metrics reference.
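Collection against any of these endpoints ultimately reduces to a Prometheus HTTP API call. A minimal sketch of building an instant-query URL (`/api/v1/query` is the standard Prometheus API path; the metric name below is illustrative, following the `dynamo_frontend_*` family above):

```python
from urllib.parse import urlencode

def build_instant_query_url(prometheus_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression,
    URL-encoding the query so brackets and braces survive transport."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

# Illustrative: request rate over a 30s window, matching the default query_range.
url = build_instant_query_url(
    "http://localhost:9090",
    "rate(dynamo_frontend_requests_total[30s])",
)
```

Fetching the URL (e.g. with urllib or httpx) returns a JSON body whose `data.result` entries carry the sampled values.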

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

enable: bool = None#
prometheus_url: str = None#
collect_kv_cache: bool = None#
collect_ttft: bool = None#
collect_itl: bool = None#
collect_inflight_requests: bool = None#
collect_throughput: bool = None#
collect_token_throughput: bool = None#
query_range: str = None#
lookback_seconds: float = None#
workflow_start_timestamp: float | None = None#
workflow_end_timestamp: float | None = None#
classmethod core_metrics_only(prometheus_url: str = 'http://localhost:9090', query_range: str = '30s') → DynamoMetricsConfig#

Create a config that collects only the three core optimization metrics.

This is optimized for tight optimization loops where you only need:

  • KV Cache Efficiency

  • TTFT (Time to First Token)

  • ITL (Inter-Token Latency)

Args:

prometheus_url: Prometheus server URL

query_range: Time range for rate calculations

Returns:

DynamoMetricsConfig with only core metrics enabled

Usage:

config = DynamoMetricsConfig.core_metrics_only()
# Equivalent to:
# DynamoMetricsConfig(
#     enable=True,
#     collect_kv_cache=True,
#     collect_ttft=True,
#     collect_itl=True,
#     collect_inflight_requests=False,
#     collect_throughput=False,
#     collect_token_throughput=False,
# )
class ProfilerConfig(/, **data: Any)#

Bases: pydantic.BaseModel

base_metrics: bool = False#
token_usage_forecast: bool = False#
token_uniqueness_forecast: bool = False#
workflow_runtime_forecast: bool = False#
compute_llm_metrics: bool = False#
csv_exclude_io_text: bool = False#
prompt_caching_prefixes: PromptCachingConfig#
bottleneck_analysis: BottleneckConfig#
concurrency_spike_analysis: ConcurrencySpikeConfig#
prefix_span_analysis: PrefixSpanConfig#
prediction_trie: PredictionTrieConfig#
dynamo_metrics: DynamoMetricsConfig#
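Taken together, the models above nest under ProfilerConfig. A hedged sketch of what enabling a few analyses might look like in a workflow configuration file; the YAML layout and the top-level `profiler` key path are assumptions, while the field names come directly from the models above:

```
profiler:
  base_metrics: true
  compute_llm_metrics: true
  bottleneck_analysis:
    enable_simple_stack: true
  concurrency_spike_analysis:
    enable: true
    spike_threshold: 5
  dynamo_metrics:
    enable: true
    prometheus_url: http://localhost:9090
    collect_kv_cache: true
    collect_ttft: true
    collect_itl: true
```

Fields left out fall back to the defaults shown in each model's reference above.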