Function Autoscaler Observability

View as Markdown

The function autoscaler emits structured logs, Prometheus metrics that explain dependency health statuses and scaling decisions, and OpenTelemetry spans for outbound calls to its dependencies. The Prometheus exporter serves metrics on the address configured in server.metrics.exporters. The local settings file at crates/server/resources/settings-local.yaml uses 0.0.0.0:41338.

Job and namespace labels follow the standard NVCF naming convention for the cluster that runs the function autoscaler.

Metric reference

Metric nameMetric typeDescription
nvcf_autoscaler.autoscaling.statusGaugeScaling status per function, encoded as a reason code.
nvcf_autoscaler.scaling.current_instancesGaugeCurrent instance count per function as read from the timeseries database.
nvcf_autoscaler.scaling.desired_instancesGaugeDesired instance count computed by the scaling decision.
nvcf_autoscaler.scaling.utilizationGaugeUtilization percentage per function used in the scaling decision.
nvcf_autoscaler.requests.queued_totalCounterScaling requests queued for processing.
nvcf_autoscaler.requests.processed_totalCounterScaling requests processed.
nvcf_autoscaler.requests.rejected_totalCounterScaling requests rejected by the policy or guard rails.
nvcf_autoscaler.requests.rate_limited_totalCounterScaling requests rate-limited downstream.
nvcf_autoscaler.queue.sizeGaugeCurrent depth of the scaling work queue.
nvcf_autoscaler.queue.capacityGaugeConfigured capacity of the scaling work queue.
nvcf_autoscaler.function_table_stateGaugeState of the active function table entry per function.
nvcf_autoscaler.function_discovery_duration_secondsHistogramDuration of each discovery loop run.
nvcf_autoscaler.timeseries_db.requests_totalCounterTimeseries database requests, labeled by status.
nvcf_autoscaler.timeseries_db.request_duration_millisecondsHistogramTimeseries database request latency.
nvcf_autoscaler.timeseries_db.auth_failure_totalCounterTimeseries database authentication failures.
nvcf_autoscaler.timeseries_db.server_side_failure_totalCounterTimeseries database server-side query failures.
nvcf_autoscaler.nvcf_api.request_duration_millisecondsHistogramNVCF API request latency.
nvcf_autoscaler.oauth2_api.request_duration_millisecondsHistogramOAuth2 token endpoint request latency.
nvcf_autoscaler.oauth2_client.token_refresh_failure_totalCounterOAuth2 client token refresh failures.
nvcf_autoscaler.cassandra.health_statusGaugeCassandra client health. 1 indicates healthy, 0 indicates unhealthy.
nvcf_autoscaler.health.overall_statusGaugeOverall service health status.
nvcf_autoscaler.health.component_statusGaugePer-component health status.
nvcf_autoscaler.distributed_lockGaugeState of the discovery distributed lock for this replica.
nvcf_autoscaler.distributed_lock.acquisition_failures_totalCounterDiscovery lock acquisition failures.
nvcf_autoscaler.processing.utilization_data_age_millisecondsHistogramAge of the utilization data used in each scaling decision.

Tracing

The function autoscaler emits OpenTelemetry spans for outbound calls to the timeseries database and the NVCF API, with the OTLP endpoint and span filter configurable under server.tracing.

Logging

The function autoscaler writes structured logs to stdout. Set log filter directives in the server.envfilter_directive configuration field. The format follows the tracing_subscriber env filter syntax (Rust ecosystem standard):

1server:
2 envfilter_directive: "server=info,rs_autoscaler=debug,rs_autoscaler::cassandra=warn,info"

The same syntax applies to server.tracing.logging_envfilter_directive if you separate logging and tracing filters.

Useful target prefixes:

TargetCovers
serverBinary entry point: startup, server lifecycle.
rs_autoscalerTop-level function autoscaler library crate.
rs_autoscaler::workScaling loop, discovery loop, bucket reshuffles.
rs_autoscaler::cassandraCassandra client, LWT lock operations.
rs_autoscaler::nvcf_apiOAuth2, NVCF API calls.
rs_autoscaler::timeseries_dbTimeseries database query traces.

See also