Observability (Legacy Managed)
NVIDIA Cloud Functions provides a comprehensive observability solution through two main approaches:
NGC UI/CLI Observability
See details on using the built-in observability features below.
External Observability Integration
See external-observability for detailed instructions.
The NGC UI provides basic observability through three main tabs:
Overview Tab
Logs Tab
Metrics Tab
You can access these tabs by navigating to your function in the NGC UI. Each tab offers specific insights into your function’s operation and performance.
The Overview tab provides access to your function’s current status and performance metrics, offering real-time insights into your function’s operation and health.
Basic Function or Task Metrics
Instance Counts
Request Statistics
The Logs tab enables monitoring through detailed log access.
NVCF displays logs related to:
Deployment stages
Function or task invocation logs
Real-time logs (listed in the UI under the Live Tail tab)
The Metrics view displays:
Summary Statistics
Time Series Graphs
Use the time range selector (e.g., Past 1 Hour) in the top right to adjust the view period.
NGC UI access is limited to NGC account holders. For broader observability access, work with your account administrator to configure external observability endpoints.
Configure external observability endpoints to monitor your NVIDIA Cloud Functions. By setting up telemetry endpoints, you can stream metrics (see appendix-b), logs, and traces to popular observability platforms like Grafana Cloud and Datadog. This extends beyond the basic metrics in NGC UI, giving you deeper insights into your functions’ performance.
To export function or task telemetry through external observability platforms, your source code must be instrumented using OpenTelemetry. Without proper OpenTelemetry instrumentation, only system-level metrics will be available.
The OpenTelemetry collector uses the following ports:
OTLP (OpenTelemetry Protocol)
Metrics
Health Check
These ports are reserved for the OpenTelemetry collector and should not be used by your functions or tasks.
Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.
A Telemetry Endpoint is a configuration that specifies where telemetry data is sent. Any function or task can be configured with a telemetry endpoint to send its telemetry data to an external observability platform.
Remember that to collect custom metrics, logs, and traces from your function’s or task’s code, you must instrument your application using OpenTelemetry. System-level metrics (CPU, memory, GPU) are collected automatically.
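To make concrete what instrumented code emits, the sketch below builds a single gauge data point in the OTLP/HTTP JSON encoding using only the Python standard library. This is a hand-rolled illustration, not the recommended approach: in practice you would normally use the OpenTelemetry SDK for your language, which produces this structure for you. The function name `build_gauge_payload`, the metric name `inference_queue_depth`, the scope name, and the service name are all hypothetical.

```python
import json
import time


def build_gauge_payload(service_name, metric_name, value, unit="1", now_ns=None):
    """Return one gauge data point as an OTLP/HTTP JSON request body (a dict)."""
    if now_ns is None:
        now_ns = time.time_ns()
    return {
        "resourceMetrics": [{
            # Resource attributes identify the emitting service.
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeMetrics": [{
                "scope": {"name": "example.manual"},  # hypothetical scope name
                "metrics": [{
                    "name": metric_name,
                    "unit": unit,
                    "gauge": {"dataPoints": [{
                        "asDouble": float(value),
                        # OTLP JSON encodes 64-bit integers as strings.
                        "timeUnixNano": str(now_ns),
                    }]},
                }],
            }],
        }]
    }


# Hypothetical custom metric; a fixed timestamp keeps the output reproducible.
payload = build_gauge_payload(
    "my-function", "inference_queue_depth", 3,
    now_ns=1_700_000_000_000_000_000,
)
print(json.dumps(payload, indent=2))
```

System-level metrics need none of this; only application-specific signals such as the hypothetical queue depth above require instrumentation.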
You can configure telemetry endpoints using either the web UI or the NGC CLI:
Web UI Method:

Follow these steps to set up Grafana Cloud integration with NVCF:
Web UI Method:
Access Grafana Cloud
For new users:
Complete the free Grafana Cloud registration process
For existing users:
Log in with your credentials
Configure OpenTelemetry
In the top menu bar, locate “My Account”
Expand the Details section by clicking the icon

Access OpenTelemetry Settings
In your Grafana Cloud stack, locate the OpenTelemetry card
Click “Configure” to access the OpenTelemetry configuration
You will see options for configuring:
Metrics
Logs
Traces

Locate OTLP Configuration Details
The OTLP endpoint section will display:
OTLP Endpoint URL (e.g., https://otlp-gateway-prod-us-west-0.grafana.net/otlp)
Instance ID (a numeric identifier for your instance)
API Token section with option to “Generate now”
Use the “Copy to Clipboard” buttons to easily copy these values into the NVCF Telemetry Endpoint configuration.
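If you want to sanity-check the copied values outside NVCF, the sketch below shows how the Instance ID and API Token are typically combined for Grafana Cloud's OTLP gateway, which accepts HTTP Basic authentication with the instance ID as the user name and the token as the password. Whether NVCF assembles the header in exactly this way is an assumption; the helper name `grafana_otlp_auth_header` and the sample values are hypothetical.

```python
import base64


def grafana_otlp_auth_header(instance_id, api_token):
    """Build the value of an HTTP Basic Authorization header.

    Encodes "<instance_id>:<api_token>" as Base64, the standard Basic scheme.
    """
    raw = f"{instance_id}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Placeholder values only; never hard-code a real token.
header = grafana_otlp_auth_header("123456", "example-token")
print(header.split(" ", 1)[0])
```

Treat the resulting header like the token itself: anyone holding it can write telemetry to your stack.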
Alternative: Create Grafana Telemetry Endpoint via CLI
As an alternative to the web UI, you can create the Grafana Cloud telemetry endpoint using the NGC CLI:
Keep your API Token secure and never share it publicly. If your token is compromised, you can generate a new one and update your configuration.
Follow these steps to set up Datadog integration with NVCF:
Web UI Method:
Sign Up for Datadog
Visit the Datadog Getting Started page
Complete the registration process for a new Datadog account
Configure API Key
Log in to your Datadog account
Navigate to Organization Settings (found in the bottom left corner of the page)
Select API Keys from the left menu
Either click “+New Key” to create a new API key or copy an existing one from the list
Get Telemetry Endpoint
Your endpoint URL will be displayed in the browser address bar
Available endpoints based on your instance location:
datadoghq.com (US1)
us3.datadoghq.com (US3)
us5.datadoghq.com (US5)
datadoghq.eu (EU1)
ddog-gov.com (US1-FED)
For more details on Datadog sites and endpoints, see the Datadog site documentation
Configure in NVCF Web UI
Input the configuration details:
API Key (copied from step 2)
Endpoint URL (selected from step 3)
Select telemetry type(s):
Choose “Logs” to send log data
Choose “Metrics” to send metrics data
You can select both to send both types of telemetry
Save the configuration
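Before saving, it can help to confirm that the API key from step 2 belongs to the site chosen in step 3, since a key from one Datadog site is rejected by another. The sketch below maps the regions listed above to their sites and builds (without sending) a request to Datadog's key validation endpoint. The helper name `build_validate_request` is hypothetical, and this check is an optional extra rather than part of the NVCF workflow.

```python
import urllib.request

# Sites as listed in this guide, keyed by region.
DATADOG_SITES = {
    "US1": "datadoghq.com",
    "US3": "us3.datadoghq.com",
    "US5": "us5.datadoghq.com",
    "EU1": "datadoghq.eu",
    "US1-FED": "ddog-gov.com",
}


def build_validate_request(region, api_key):
    """Return a GET request for the Datadog API key validation endpoint."""
    site = DATADOG_SITES[region]  # raises KeyError for an unknown region
    url = f"https://api.{site}/api/v1/validate"
    return urllib.request.Request(
        url, headers={"DD-API-KEY": api_key}, method="GET"
    )


req = build_validate_request("EU1", "<your-api-key>")
print(req.full_url)  # https://api.datadoghq.eu/api/v1/validate

# To actually send the request (requires network access and a real key):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```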

Alternative: Create Datadog Telemetry Endpoint via CLI
As an alternative to the web UI, you can create the Datadog telemetry endpoint using the NGC CLI:
Make sure to keep your API key secure and never share it publicly. If your key is compromised, you can generate a new one and update your configuration.
CLI Method:
As an alternative to the web UI, you can use the NGC CLI to manage telemetry endpoints. Here are the basic CLI commands:
Add Telemetry Endpoint to Function or Task
Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.
Web UI Method:
When creating a new function or deploying a new version:
If you need to change the telemetry endpoint for an existing function, you must deploy a new version of that function with the updated telemetry configuration.
Verify Deployment
After deploying the function with the telemetry endpoint, verify that the telemetry data is flowing correctly to your observability platform.
If you don’t see your custom metrics, logs, or traces in your observability platform, verify that:
Log in to your Grafana Cloud account
Navigate to the Metrics Explorer
Search for the following metrics to verify data flow:
DCGM_FI_DEV_GPU_UTIL - Shows GPU utilization percentage
container_fs_reads_bytes_total - Shows container filesystem read metrics
container_fs_writes_bytes_total - Shows container filesystem write metrics
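If you export the list of metric names visible in your stack (for example by copying them from the explorer), a small check like the one below reports which of the three metrics above are still missing. It is only a sketch: the helper name `missing_metrics` is hypothetical, and the filesystem metrics may legitimately be absent when the container performs no I/O.

```python
# The three metric names suggested above for verifying data flow.
EXPECTED_METRICS = frozenset({
    "DCGM_FI_DEV_GPU_UTIL",
    "container_fs_reads_bytes_total",
    "container_fs_writes_bytes_total",
})


def missing_metrics(seen_names):
    """Return the expected metric names not present in seen_names, sorted."""
    return sorted(EXPECTED_METRICS - set(seen_names))


seen = ["DCGM_FI_DEV_GPU_UTIL", "container_fs_reads_bytes_total"]
print(missing_metrics(seen))  # ['container_fs_writes_bytes_total']
```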

Log in to your Datadog account
Navigate to the Metrics Explorer
Search for “nvidia.cloud.function” to find your function’s metrics
You can view metrics such as:
GPU utilization
Function or task invocations
Request latency
Resource usage

The OpenTelemetry collector version, image and configuration are managed entirely by NVCF and cannot be modified by users.
Delete a Function or Task and Remove Telemetry Endpoint
To remove a telemetry endpoint, you must first cancel all deployments and remove all functions that use that endpoint. The endpoint cannot be removed while any functions are still using it, even if those functions are not currently deployed.
Web UI Method:
Navigate to the Functions list page
Click on the function you want to delete
Navigate to the Deployments tab
For each deployment:
Navigate to the Settings tab
Click “Delete Function” and confirm
Verify the function is completely removed
After all functions using the telemetry endpoint have been removed:
CLI Method:
All deployments must be fully cancelled before function removal. The function must be completely removed before the endpoint can be removed. Removing a telemetry endpoint will permanently delete the endpoint configuration. Make sure to export any necessary telemetry data before removing endpoints.
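The ordering constraints above can be summarised as two preconditions. The sketch below models them with a hypothetical `Function` record; it does not call any NVCF API and exists only to make the rules explicit, in particular that an undeployed function still blocks removal of its endpoint.

```python
from dataclasses import dataclass


@dataclass
class Function:
    """Hypothetical stand-in for an NVCF function or task."""
    name: str
    endpoint: str
    active_deployments: int = 0


def can_delete_function(fn):
    """A function can be deleted only after all its deployments are cancelled."""
    return fn.active_deployments == 0


def can_remove_endpoint(endpoint, functions):
    """An endpoint can be removed only when no remaining function references it,
    regardless of whether those functions are currently deployed."""
    return not any(fn.endpoint == endpoint for fn in functions)


functions = [
    Function("resize", endpoint="grafana-prod", active_deployments=1),
    Function("caption", endpoint="grafana-prod", active_deployments=0),
]

print(can_delete_function(functions[0]))               # False: still deployed
print(can_remove_endpoint("grafana-prod", functions))  # False: functions remain
print(can_remove_endpoint("grafana-prod", []))         # True: nothing uses it
```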
When you select a telemetry endpoint, NVCF:
In the pod for each function or task, an OpenTelemetry collector is deployed. This collector has automatic memory management and built-in resource protection to ensure reliable telemetry collection without impacting function or task performance. NVCF manages all resource allocation for the collector, so you don’t need to worry about resource configuration.
NVCF handles telemetry securely: credentials are stored in the NGC Encrypted Secrets Store, each collector accesses only its own function’s or task’s data, and authentication is handled automatically. Credentials are rotated securely to maintain security and integrity.
A BYOC cluster registered with NVCA version 2.46.10 or later
Ensure the Bring Your Own Observability cluster feature is enabled. If you are running a cluster agent version older than 2.50.0, refer to the Configuration page for managing feature flags.

If issues occur with telemetry collection:
The collector’s health can be monitored through:
The following metrics are collected through the OpenTelemetry collector deployed with your function when using External Observability and exported through your configured Telemetry Endpoints. The metrics exported depend on the Kubernetes deployment used by the function or task.
Key metrics include:
Metrics are filtered based on deployment type and configuration. Not all metrics may be available for all deployment scenarios.
Only present if the container is performing IO operations:
Only present if the container is performing network operations:
Only present if helm-based function has a deployment k8s object:
Only present if helm-based function has a replicaset k8s object:
Only present if helm-based function has a statefulset k8s object:
Only present if the helm-based function has a job/cronjob k8s object:
Only present if function has a configmap k8s object:
Only present if function has a secret k8s object:
Only present if function has a pod k8s object:
Only present for function/task helm deployments:
Only present if function/task helm defined an init container:
Always present for container and helm:
For detailed information about all available DCGM field IDs and GPU metrics, see the NVIDIA DCGM API Field IDs documentation.
Streaming metrics are only present for streaming functions.
All NVCF metrics include the label origin: nvcf-byoo.
Present for cluster management:
Always present for container and helm. The final list of metrics depends on the telemetry received and exported by the function or task:
All logs and metrics have the following attributes added to their metadata:
The platform metrics have the following attributes when available:
The job attribute is available in Grafana Cloud.
In Datadog, service is used instead of the job attribute.
You can export custom metrics, logs, and traces to your external observability platform by sending them to the OpenTelemetry collector. Refer to the following table for the available environment variables that you can specify:
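The variable names that NVCF provides are not reproduced in this sketch. As an illustration only, the code below assumes the standard OpenTelemetry variable `OTEL_EXPORTER_OTLP_ENDPOINT` and falls back to the OpenTelemetry default for OTLP/HTTP, `http://localhost:4318`; the address and port that the NVCF-managed collector actually listens on may differ. The helper names are hypothetical, and the request is built but not sent.

```python
import os
import urllib.request

# OpenTelemetry's default OTLP/HTTP endpoint; an assumption, not an NVCF value.
DEFAULT_OTLP_HTTP_ENDPOINT = "http://localhost:4318"
SIGNALS = ("metrics", "logs", "traces")


def otlp_signal_url(signal, env=None):
    """Resolve the OTLP/HTTP URL for a signal from the environment."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal: {signal!r}")
    env = os.environ if env is None else env
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_OTLP_HTTP_ENDPOINT)
    # The OTLP/HTTP convention appends /v1/<signal> to the base endpoint.
    return base.rstrip("/") + "/v1/" + signal


def build_export_request(signal, body, env=None):
    """Build (but do not send) a POST carrying an OTLP JSON body."""
    return urllib.request.Request(
        otlp_signal_url(signal, env),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


print(otlp_signal_url("metrics", env={}))  # http://localhost:4318/v1/metrics
```

Most OpenTelemetry SDKs read this variable themselves, so application code rarely needs to resolve the URL by hand.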