Function Monitoring

Troubleshooting

For troubleshooting deployment failures, see Deployment Failures.

For troubleshooting invocation failures, see Statuses and Errors.

See below for how to add logging to your inference container and how to view metrics.

Logging and Metrics

This section gives an overview of the metrics and logs available in the Cloud Functions UI. Note that for full observability of production workloads, it is recommended that you emit logs, metrics, analytics, etc. to third-party monitoring tools from within your container.

Emit and View Inference Container Logs

View inference container logs in the Cloud Functions UI via the “Logs” tab on the function details page. To get there, click any function version in the “Functions” list, then click “View Details” in the side panel on the right.

Function Details Logs Tab

Logs are currently available with up to 48 hours of history. You can view them as expanded rows for scanning, or in a “window” view for easier copying and pasting.

Warning

As a prerequisite, your inference container must be instrumented to emit logs. Doing so is highly recommended.

How to Add Logs to Your Inference Container

Here is an example of adding NVCF-compatible logs. The helper function for logging below, along with other helper functions, can be imported from the Helper Functions repository.

import logging
import sys


def get_logger() -> logging.Logger:
    """
    Gets a Logger that logs in a format compatible with NVCF.
    :return: logging.Logger
    """
    # Make sure stdout accepts non-ASCII characters in log messages.
    sys.stdout.reconfigure(encoding="utf-8")
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s [%(levelname)s] [INFERENCE] %(message)s",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    logger = logging.getLogger(__name__)
    return logger


class MyServer:

    def __init__(self):
        self.logger = get_logger()

    def _infer_fn(self, request):
        self.logger.info("Got a request!")
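To check locally what a line in this format looks like before deploying, the following minimal sketch applies the same format string to an in-memory stream. The helper `render_log_line` and the logger name `format_demo` are hypothetical names used for illustration only; they are not part of the Helper Functions repository.

```python
import io
import logging

# Same format string as used in get_logger().
NVCF_LOG_FORMAT = "%(asctime)s [%(levelname)s] [INFERENCE] %(message)s"


def render_log_line(message: str, level: int = logging.INFO) -> str:
    """Format a single record with NVCF_LOG_FORMAT and return it as a string."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter(NVCF_LOG_FORMAT))

    logger = logging.getLogger("format_demo")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep demo records out of the root logger
    logger.addHandler(handler)
    try:
        logger.log(level, message)
    finally:
        logger.removeHandler(handler)
    return buffer.getvalue().rstrip("\n")


line = render_log_line("Got a request!")
print(line)  # timestamp, then "[INFO] [INFERENCE] Got a request!"
```

Each line starts with a timestamp, followed by the level, the `[INFERENCE]` tag, and the message.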

View Function Metrics

NVCF exposes the following metrics by default.

  • Instance counts (current, min and max)

  • Invocation activity and queue depth

  • Total invocation count, success rate and failure count

  • Average inference time

To view metrics, click any function on the “Functions” list page. The function overview page displays values aggregated across all function versions.

Function Overview Metrics

A function version’s details page shows metrics for that specific function version only.

Function Details Metrics

Warning

There may be up to a 5-minute delay in metric ingestion. Any time-series queries within the page are aggregated at 5-minute intervals with a step set to show 500 data points. All stat queries are based on the total selected period and reduced to show either the latest total value or a mean value.
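To make the aggregation described above more concrete, the sketch below groups raw samples into 5-minute buckets and then reduces the resulting series to a single stat, either the latest value or the mean. It illustrates the general technique only and is not NVCF's actual query implementation; the sample data and all function names are made up for the example.

```python
from statistics import mean

BUCKET_SECONDS = 5 * 60  # 5-minute aggregation interval


def bucket_means(samples):
    """Group (unix_seconds, value) samples into 5-minute buckets.

    Returns a list of (bucket_start_seconds, mean_value), sorted by time.
    """
    buckets = {}
    for ts, value in samples:
        start = ts - (ts % BUCKET_SECONDS)  # round down to the bucket start
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def reduce_latest(series):
    """Stat reduction: value of the most recent bucket."""
    return series[-1][1]


def reduce_mean(series):
    """Stat reduction: mean over all buckets in the selected period."""
    return mean(value for _, value in series)


# Illustrative samples: (seconds since start of period, inference time in ms)
samples = [(0, 10.0), (120, 20.0), (300, 30.0), (540, 50.0), (600, 35.0)]
series = bucket_means(samples)

print(series)
print(reduce_latest(series), reduce_mean(series))
```

Because values are averaged per bucket, a short spike inside a 5-minute interval can be smoothed out in the displayed series.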

Instrument with OpenTelemetry

Users can instrument (or auto-instrument) their container functions with the OpenTelemetry SDK and export the resulting signals (logs, traces, and metrics) to an observability backend such as Grafana Cloud.

See examples of container functions in GitHub.
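As a rough sketch of what auto-instrumentation can look like for a Python container, the example below builds the standard OpenTelemetry environment variables for OTLP export and starts the server under the `opentelemetry-instrument` wrapper. It assumes the `opentelemetry-distro` and `opentelemetry-exporter-otlp` packages are installed in the container image. The service name, the endpoint URL, and the `server.py` entry point are placeholders, not values taken from NVCF or Grafana Cloud; use the endpoint and credentials supplied by your own backend.

```python
import os
import subprocess


def build_otel_env(service_name, otlp_endpoint, otlp_headers=""):
    """Return a copy of the current environment with OTLP export settings added."""
    env = dict(os.environ)
    env.update(
        {
            "OTEL_SERVICE_NAME": service_name,
            "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,
            "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
            # Export all three signal types over OTLP.
            "OTEL_TRACES_EXPORTER": "otlp",
            "OTEL_METRICS_EXPORTER": "otlp",
            "OTEL_LOGS_EXPORTER": "otlp",
        }
    )
    if otlp_headers:
        # Comma-separated key=value pairs, e.g. an Authorization header.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = otlp_headers
    return env


def launch_instrumented(server_script, env):
    """Run the server under the opentelemetry-instrument wrapper (blocks until exit)."""
    return subprocess.run(
        ["opentelemetry-instrument", "python", server_script],
        env=env,
        check=True,
    )


env = build_otel_env(
    service_name="my-inference-function",          # placeholder name
    otlp_endpoint="https://otlp.example.invalid",  # placeholder endpoint
)
# launch_instrumented("server.py", env)  # hypothetical entry point
```

The variables are set before the process starts because the wrapper initializes the SDK ahead of your application code, so setting them from inside the server would be too late.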

Logging and Metrics (Internal)

Internal NVIDIAN users have access to additional logging and metrics. Refer to Function Monitoring & Reliability.

Besides additional logging and metrics, internal NVIDIAN users can use Helm Chart Observability (https://gitlab-master.nvidia.com/nvcf/monitoring/helm-chart-observability#helm-chart-observability) to collect and export metrics and traces from their Helm Chart functions to Kratos and Lightstep.

Documentation for this Helm Chart is available here.