Observability Guide#
Overview#
NVIDIA Cloud Functions provides a comprehensive observability solution through two main approaches:
NGC UI/CLI Observability

- Basic metrics in the Overview tab
- Log data with time ranges in the Logs tab
- Limited to NGC account holders
- Enabled by default for all functions or tasks

See details on using the built-in observability features below.

External Observability Integration

- Send telemetry data to your organization’s observability platforms
- Support for logs, metrics, and traces
- Requires explicit configuration of telemetry endpoints

See External Observability for detailed instructions.
NGC UI Observability#
The NGC UI provides basic observability through three main tabs:

- Overview Tab
  - Provides real-time status and performance metrics
  - Shows instance counts and request statistics
  - Displays basic function or task information
- Logs Tab
  - Access to container and event logs
  - Real-time log streaming capabilities
  - Search and filtering functionality
- Metrics Tab
  - Detailed performance indicators
  - Time-series data visualization
  - Resource utilization trends
You can access these tabs by navigating to your function in the NGC UI. Each tab offers specific insights into your function’s operation and performance.
Overview#
The Overview tab provides access to your function’s current status and performance metrics, offering real-time insights into your function’s operation and health.
Key Information#
- Basic Function or Task Metrics
  - Current function or task status (Running, Stopped, Error)
  - Last updated timestamp
  - Function or task version
  - Runtime environment details
- Instance Counts
  - Active instances
  - Pending instances
  - Failed instances
  - Historical instance trends
- Request Statistics
  - Total requests processed
  - Current request rate
  - Success/failure ratios
  - Average response times
How to Access#
1. Navigate to the Functions list page
2. Click on your function
3. The Overview tab is displayed by default
4. Use the refresh button to update the data
Important
- Overview data updates every 30 seconds
- Historical data is available for the last 24 hours
- Some metrics may have a slight delay in reporting
Logs#
The Logs tab enables monitoring through detailed log access.
Log Categories#
NVCF displays logs related to:

- Deployment stages
  - Function or task creation
  - Function or task deployment
- Function or task invocation logs
Telemetry Endpoint Updates#
Telemetry endpoints and their associated secrets can be updated by the NCA admin at any time. When updates occur:

- Previous configurations are replaced, not preserved
- The OTel collector requires a restart to apply new configurations
- Running functions or tasks will continue using the previous configuration until redeployed
Note
Functions or tasks must be redeployed to use updated telemetry configurations.
Metrics#
The complete list of available metrics and their attributes is maintained in the NVCF telemetry configuration repository. The metrics exported depend on the Kubernetes deployment used by the function or task.
Key metrics include:

- Function or task invocation metrics
- Resource utilization metrics
- Platform metrics related to the function or task
Note
Metrics are filtered based on deployment type and configuration. Not all metrics may be available for all deployment scenarios.
Viewing Metrics#
1. Navigate to the Functions list page
2. Click on your function
3. Select the Metrics tab

The Metrics view displays:

- Summary Statistics
  - Total Invocations - Number of function calls in the selected time period
  - Average Inference Time - Mean processing time for function calls
  - Total Instance Count - Current number of running instances
  - Failures - Count of failed executions
- Time Series Graphs
  - Invocation Activity and Queue Depth - Shows request patterns and queued requests
  - Average Inference Time - Processing duration trends
  - Instances Over Time - Shows scaling behavior
  - Success Rate - Function reliability metrics
Use the time range selector (e.g., Past 1 Hour) in the top right to adjust the view period.
Important
- Minor discrepancies may occur in aggregated invocation counts due to rounding, especially with smaller values
- The most recent metrics may be delayed, as metrics have a 5-minute ingestion delay
Note
NGC UI access is limited to NGC account holders. For broader observability access, work with your account administrator to configure external observability endpoints.
External Observability#
Important
External Observability is a beta feature. We encourage users to try it out and submit feedback to help us improve the experience. Please use the feedback form on the right side of the screen to share your thoughts and suggestions.
Configure external observability endpoints to monitor your NVIDIA Cloud Functions. By setting up telemetry endpoints, you can stream metrics (see Appendix B: Available Metrics), logs, and traces to popular observability platforms like Grafana Cloud and Datadog. This extends beyond the basic metrics in NGC UI, giving you deeper insights into your functions’ performance.
Important
To export function or task telemetry through external observability platforms (BYOO), your source code must be instrumented using OpenTelemetry. Without proper OpenTelemetry instrumentation, only system-level metrics will be available.
Ports#
The OpenTelemetry collector uses the following ports:

- OTLP (OpenTelemetry Protocol)
  - OTLP gRPC: Port 14357
  - OTLP HTTP: Port 14358
- Metrics
  - Port 18888 - Used for collector metrics
- Health Check
  - Port 13133 - Used for health check endpoint
Note
These ports are reserved for the OpenTelemetry collector and should not be used by your functions or tasks.
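In practice, OpenTelemetry SDK exporters inside your container point at the collector's local OTLP ports. The sketch below is a minimal, hypothetical Python example that exports a trace span over gRPC to port 14357; the service name and span name are illustrative, and it assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed in your function image.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# The OTel collector deployed alongside the function listens on localhost;
# 14357 is the OTLP gRPC port listed above.
provider = TracerProvider(
    resource=Resource.create({"service.name": "my-function"})  # illustrative name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:14357", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("inference"):  # hypothetical span around your handler
    pass  # your function's work goes here
```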
Getting Started#
Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.
A Telemetry Endpoint is a configuration that specifies where telemetry data is sent. Any function or task can be configured to send telemetry data to an external observability platform.
Configure External Telemetry Endpoints
Note
Remember that to collect custom metrics, logs, and traces from your function’s or task’s code, you must instrument your application using OpenTelemetry. System-level metrics (CPU, memory, GPU) are collected automatically.
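As a hedged illustration of such instrumentation, the following Python sketch records a custom counter and exports it to the collector's OTLP HTTP port (14358, listed above). The metric and attribute names are hypothetical, and the example assumes the opentelemetry-sdk and opentelemetry-exporter-otlp packages are available in your image.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Export metrics periodically to the collector's OTLP HTTP endpoint.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:14358/v1/metrics")
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)
# Hypothetical custom metric; system metrics (CPU, memory, GPU) need no instrumentation.
request_counter = meter.create_counter(
    "inference_requests", description="Number of inference requests handled"
)
request_counter.add(1, {"model": "example-model"})  # illustrative attribute
```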
You can configure telemetry endpoints using either the web UI or the NGC CLI:
Web UI Method:

1. Navigate to your NGC organization settings
2. Select “Settings” in your Cloud Functions NGC organization
3. Scroll to the bottom of the page
4. Click “Add Telemetry Endpoint”
5. Select your desired endpoint type (Grafana Cloud or Datadog)
6. Configure the endpoint with the required credentials
Grafana Cloud
Follow these steps to set up Grafana Cloud integration with NVCF:
Web UI Method:

1. Access Grafana Cloud
   - For new users: complete the free Grafana Cloud registration process
   - For existing users: log in with your credentials
2. Configure OpenTelemetry
   - In the top menu bar, locate “My Account”
   - Expand the Details section by clicking the icon
3. Access OpenTelemetry Settings
   - In your Grafana Cloud stack, locate the OpenTelemetry card
   - Click “Configure” to access the OpenTelemetry configuration
   - You will see options for configuring metrics, logs, and traces
4. Locate OTLP Configuration Details
   - The OTLP endpoint section will display:
     - OTLP Endpoint URL (e.g., https://otlp-gateway-prod-us-west-0.grafana.net/otlp)
     - Instance ID (a numeric identifier for your instance)
     - API Token section with an option to “Generate now”
   - Use the “Copy to Clipboard” buttons to copy these values into the NVCF Telemetry Endpoint configuration
Alternative: Create Grafana Telemetry Endpoint via CLI
As an alternative to the web UI, you can create the Grafana Cloud telemetry endpoint using the NGC CLI:
```bash
ngc cloud-function telemetry-endpoint create --name grafana-cloud-metrics \
    --type METRICS \
    --provider GRAFANA_CLOUD \
    --protocol HTTP \
    --endpoint https://otlp-gateway-prod-us-west-0.grafana.net/otlp \
    --key your-grafana-api-token
```
Warning
Keep your API Token secure and never share it publicly. If your token is compromised, you can generate a new one and update your configuration.
Datadog
Follow these steps to set up Datadog integration with NVCF:
Web UI Method:
1. Sign Up for Datadog
   - Visit the Datadog Getting Started page
   - Complete the registration process for a new Datadog account
2. Configure API Key
   - Log in to your Datadog account
   - Navigate to Organization Settings (found in the bottom left corner of the page)
   - Select API Keys from the left menu
   - Either click “+New Key” to create a new API key or copy an existing one from the list
3. Get Telemetry Endpoint
   - Your endpoint URL will be displayed in the browser address bar
   - Available endpoints based on your instance location:
     - datadoghq.com (US1)
     - us3.datadoghq.com (US3)
     - us5.datadoghq.com (US5)
     - datadoghq.eu (EU1)
     - ddog-gov.com (US1-FED)
   - For more details on Datadog sites and endpoints, see the Datadog site documentation
4. Configure in NVCF Web UI
   - Input the configuration details:
     - API Key (copied from step 2)
     - Endpoint URL (selected from step 3)
   - Select telemetry type(s):
     - Choose “Logs” to send log data
     - Choose “Metrics” to send metrics data
     - You can select both to send both types of telemetry
   - Save the configuration
Alternative: Create Datadog Telemetry Endpoint via CLI
As an alternative to the web UI, you can create the Datadog telemetry endpoint using the NGC CLI:
```bash
# Example
ngc cloud-function telemetry-endpoint create --name datadog-metrics \
    --type METRICS \
    --provider DATADOG \
    --protocol HTTP \
    --endpoint datadoghq.com \
    --key your-datadog-api-key
```
Note
Make sure to keep your API key secure and never share it publicly. If your key is compromised, you can generate a new one and update your configuration.
CLI Method:
As an alternative to the web UI, you can use the NGC CLI to manage telemetry endpoints. Here are the basic CLI commands:
```bash
# List existing telemetry endpoints
ngc cloud-function telemetry-endpoint list

# Create a new telemetry endpoint
ngc cloud-function telemetry-endpoint create --name <endpoint-name> \
    --type <LOGS|METRICS> \
    --provider <GRAFANA_CLOUD|DATADOG> \
    --protocol <GRPC|HTTP> \
    --endpoint <endpoint-url> \
    --key <api-key>

# Remove a telemetry endpoint
ngc cloud-function telemetry-endpoint remove <endpoint-name>
```
Note
- Endpoint names must be unique within your NGC organization
- API tokens and keys are stored securely in the NGC Encrypted Secrets Store and can be updated if needed
- Endpoint configurations cannot be updated - delete and recreate an endpoint to change its settings
Add Telemetry Endpoint to Function or Task
Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.
Web UI Method:
When creating a new function or deploying a new version:

1. In the function creation/deployment form, look for the Telemetry Endpoints section
2. Select the desired telemetry endpoint from the dropdown
3. Complete the rest of the function creation/deployment process
Note
If you need to change the telemetry endpoint for an existing function, you must deploy a new version of that function with the updated telemetry configuration.
Verify Deployment
After deploying the function with the telemetry endpoint, verify that the telemetry data is flowing correctly to your observability platform.
Important
If you don’t see your custom metrics, logs, or traces in your observability platform, verify that:

- Your function’s or task’s code is properly instrumented with OpenTelemetry
- The telemetry endpoint is correctly configured
- The function or task deployment is active and running
Grafana Cloud
1. Log in to your Grafana Cloud account
2. Navigate to the Metrics Explorer
3. Search for the following metrics to verify data flow:
   - DCGM_FI_DEV_GPU_UTIL - Shows GPU utilization percentage
   - container_fs_reads_bytes_total - Shows container filesystem read metrics
   - container_fs_writes_bytes_total - Shows container filesystem write metrics
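If you prefer to verify from a script rather than the UI, metrics in a Grafana Cloud stack can typically also be queried through its Prometheus-compatible API. The Python sketch below is a hypothetical check; GRAFANA_PROM_URL, GRAFANA_INSTANCE_ID, and GRAFANA_API_TOKEN are assumed environment variables standing in for your stack's Prometheus endpoint details, which are not covered by this guide.

```python
import os

import requests

# Hypothetical check against the Prometheus-compatible query API backing
# Grafana Cloud metrics; base URL and credentials come from your stack's
# Prometheus details page.
base_url = os.environ["GRAFANA_PROM_URL"]       # e.g. https://prometheus-<stack>.grafana.net/api/prom
auth = (
    os.environ["GRAFANA_INSTANCE_ID"],          # numeric instance ID
    os.environ["GRAFANA_API_TOKEN"],            # API token generated earlier
)

resp = requests.get(
    f"{base_url}/api/v1/query",
    params={"query": "DCGM_FI_DEV_GPU_UTIL"},   # metric name from the list above
    auth=auth,
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["data"]["result"])  # a non-empty result indicates data is flowing
```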
Datadog
1. Log in to your Datadog account
2. Navigate to the Metrics Explorer
3. Search for “nvidia.cloud.function” to find your function’s metrics

You can view metrics such as:

- GPU utilization
- Function or task invocations
- Request latency
- Resource usage
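For a programmatic verification, Datadog's v1 metrics query API can be polled for recent datapoints. The Python sketch below is a hypothetical check, assuming DD_API_KEY and DD_APP_KEY environment variables hold a valid API key and application key; the query string is illustrative, since exact metric names depend on how your telemetry is ingested.

```python
import os
import time

import requests

# Hypothetical verification against Datadog's v1 metrics query API.
site = os.environ.get("DD_SITE", "datadoghq.com")  # your Datadog site from step 3
now = int(time.time())

resp = requests.get(
    f"https://api.{site}/api/v1/query",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    # Illustrative query; adjust the metric name to what appears in Metrics Explorer.
    params={"from": now - 3600, "to": now, "query": "avg:container_memory_usage_bytes{*}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json().get("series", []))  # a non-empty series indicates data is flowing
```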
Note
The OpenTelemetry collector version, image and configuration are managed entirely by NVCF and cannot be modified by users.
Delete a Function or Task and Remove Telemetry Endpoint
To remove a telemetry endpoint, you must first cancel all deployments and remove all functions that use that endpoint. The endpoint cannot be removed while any functions are still using it, even if those functions are not currently deployed.
Web UI method:
1. Navigate to the Functions list page
2. Click on the function you want to delete
3. Navigate to the Deployments tab
4. For each deployment, click “Cancel Deployment” and confirm
5. Wait for all deployments to be fully cancelled
6. Navigate to the Settings tab
7. Click “Delete Function” and confirm
8. Verify the function is completely removed

After all functions using the telemetry endpoint have been removed:

1. Navigate to your NGC organization settings
2. Select “Settings” in your Cloud Functions NGC organization
3. Scroll to the Telemetry Endpoints section
4. Find the endpoint you want to remove
5. Click the delete icon next to the endpoint
6. Confirm the deletion
CLI method:
```bash
# First, cancel all deployments for a function version
ngc cloud-function function deploy remove <function-id>:<function-version-id>

# Wait for deployments to be fully cancelled, then remove the function
ngc cloud-function function remove <function-id>

# After all functions using the telemetry endpoint have been removed, delete the endpoint
ngc cloud-function telemetry-endpoint remove <endpoint-name>
```
Warning
All deployments must be fully cancelled before function removal. The function must be completely removed before the endpoint can be removed. Removing a telemetry endpoint will permanently delete the endpoint configuration. Make sure to export any necessary telemetry data before removing endpoints.
When you select a telemetry endpoint, NVCF:

- Deploys a dedicated OpenTelemetry collector with your function or task
- Automatically configures authentication and endpoint connections
- Enables collection of metrics, logs, and traces from your function or task
- Directs telemetry data to your organization’s observability platform
Resource Management#
In the pod for each function or task, an OpenTelemetry collector is deployed. This collector has automatic memory management and built-in resource protection to ensure reliable telemetry collection without impacting function or task performance. NVCF manages all resource allocation for the collector, so you don’t need to worry about resource configuration.
Security#
NVCF ensures secure telemetry handling by storing credentials securely in the NGC Encrypted Secrets Store, as outlined in the Secret Management section. Each collector only accesses its own function’s or task’s data, and authentication is handled automatically. Credentials are rotated securely to maintain security and integrity.
Error Handling#
If issues occur with telemetry collection:

- Your function or task continues to run normally
- Error messages are logged for troubleshooting
- Health status is monitored and reported
- Automatic retry logic handles temporary failures

The collector’s health can be monitored through:

- Status checks in the NGC UI
- Metrics in your observability platform
- Built-in health endpoints
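As a minimal sketch of the last option, the collector's health check endpoint (port 13133, listed under Ports) can be probed from inside the instance; the root path and response format are assumptions based on the default OpenTelemetry health check extension.

```python
import requests

# Probe the collector's health check endpoint (port 13133 per the Ports section).
# A 200 response indicates the collector reports itself healthy; the exact body
# depends on the collector's health check extension configuration.
resp = requests.get("http://localhost:13133", timeout=5)
print(resp.status_code, resp.text)
```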
Appendix A: Terminology#
| Term | Definition |
| --- | --- |
| NGC | NVIDIA GPU Cloud which provides a way for users to set up and manage access to NVIDIA cloud services |
| NVCF | NVIDIA Cloud Functions and Tasks |
| OpenTelemetry | An open source standard for telemetry data collection and transmission |
| OTLP | OpenTelemetry Protocol - the data transfer protocol used by OpenTelemetry for sending telemetry data |
| OTel Collector | The OpenTelemetry Collector component that receives, processes, and exports telemetry data |
| Telemetry Endpoint | A configuration that specifies where telemetry data (metrics, logs, and traces) is sent for external observability platforms |
Appendix B: Available Metrics#
The OpenTelemetry collector deployed with your function collects and exports the following metrics:
CPU Metrics#
| Metric | Description |
| --- | --- |
| container_cpu_cfs_throttled_periods_total | Number of periods the container was throttled (only present if container was throttled) |
| container_cpu_cfs_throttled_seconds_total | Total time the container was throttled in seconds (only present if container was throttled) |
| container_cpu_usage_seconds_total | Total CPU time used by the container in seconds |
Memory Metrics#
| Metric | Description |
| --- | --- |
| container_memory_cache | Memory used by the page cache in bytes |
| container_memory_rss | Resident Set Size: total memory allocated for the container |
| container_memory_swap | Swap memory used by the container in bytes |
| container_memory_usage_bytes | Total memory usage of the container in bytes |
| container_memory_working_set_bytes | Memory working set: memory actively used by the container |
Filesystem Metrics#
Only present if the container is performing IO operations:
| Metric | Description |
| --- | --- |
| container_fs_limit_bytes | Total filesystem limit in bytes |
| container_fs_usage_bytes | Total filesystem usage in bytes |
| container_fs_reads_total | Total number of filesystem read operations |
| container_fs_writes_total | Total number of filesystem write operations |
| container_fs_writes_bytes_total | Total bytes written to the filesystem |
| container_fs_reads_bytes_total | Total bytes read from the filesystem |
Network Metrics#
Only present if the container is performing network operations:
| Metric | Description |
| --- | --- |
| container_network_receive_bytes_total | Total bytes received over the network |
| container_network_receive_errors_total | Total number of network receive errors |
| container_network_receive_packets_dropped_total | Total number of received packets dropped |
| container_network_receive_packets_total | Total number of packets received |
| container_network_transmit_bytes_total | Total bytes transmitted over the network |
| container_network_transmit_errors_total | Total number of network transmit errors |
| container_network_transmit_packets_dropped_total | Total number of transmitted packets dropped |
| container_network_transmit_packets_total | Total number of packets transmitted |
Kubernetes State Metrics#
Only present if a helm-based function has a deployment k8s object:

| Metric | Description |
| --- | --- |
| kube_deployment_status_replicas | Total number of replicas in the deployment |
| kube_deployment_status_replicas_available | Number of available replicas in the deployment |
| kube_deployment_status_replicas_unavailable | Number of unavailable replicas in the deployment |
| kube_deployment_status_replicas_updated | Number of updated replicas in the deployment |
| kube_deployment_status_replicas_ready | Number of ready replicas in the deployment |
| kube_service_created | Timestamp when the service was created |
Only present if a helm-based function has a replicaset k8s object:

| Metric | Description |
| --- | --- |
| kube_replicaset_status_replicas | Total number of replicas in the replicaset |
| kube_replicaset_status_ready_replicas | Number of ready replicas in the replicaset |
Only present if a helm-based function has a statefulset k8s object:

| Metric | Description |
| --- | --- |
| kube_statefulset_status_replicas | Total number of replicas in the statefulset |
| kube_statefulset_status_replicas_ready | Number of ready replicas in the statefulset |
Only present if the helm-based function has a job/cronjob k8s object:
| Metric | Description |
| --- | --- |
| kube_job_status_active | Number of active jobs |
| kube_job_status_failed | Number of failed jobs |
| kube_job_status_succeeded | Number of succeeded jobs |
| kube_cronjob_status_active | Number of active cronjobs |
Only present if the function has a configmap k8s object:

| Metric | Description |
| --- | --- |
| kube_configmap_created | Timestamp when the configmap was created |
Only present if the function has a secret k8s object:

| Metric | Description |
| --- | --- |
| kube_secret_created | Timestamp when the secret was created |
Only present if the function has a pod k8s object:

| Metric | Description |
| --- | --- |
| kube_pod_container_info | Information about the container in the pod |
| kube_pod_container_resource_limits | Resource limits for the container |
| kube_pod_container_resource_requests | Resource requests for the container (only present if resources were requested) |
| kube_pod_container_status_last_terminated_exitcode | Exit code of the last terminated container (only present if an error happened) |
| kube_pod_container_status_last_terminated_reason | Reason for the last container termination (only present if an error happened) |
| kube_pod_container_status_restarts_total | Total number of container restarts |
| kube_pod_container_status_running | Whether the container is running |
| kube_pod_container_status_terminated | Whether the container has terminated (only present if terminated) |
| kube_pod_container_status_terminated_reason | Reason for container termination (only present if terminated) |
| kube_pod_container_status_waiting | Whether the container is waiting (only present if pod is waiting) |
| kube_pod_container_status_waiting_reason | Reason for container waiting (only present if pod is waiting) |
Only present for function/task helm deployments:

| Metric | Description |
| --- | --- |
| kube_pod_info | Information about the pod |
| kube_pod_status_reason | Reason for the pod status |
Only present if the function/task helm chart defines an init container:

| Metric | Description |
| --- | --- |
| kube_pod_init_container_info | Information about the init container |
| kube_pod_init_container_status_ready | Whether the init container is ready |
| kube_pod_init_container_status_restarts_total | Total number of init container restarts |
| kube_pod_init_container_status_running | Whether the init container is running |
| kube_pod_init_container_last_status_terminated_reason | Reason for the last init container termination |
| kube_pod_init_container_status_waiting_reason | Reason for init container waiting |
GPU Metrics#
Always present for container and helm:
| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization percentage |
OpenTelemetry Collector Metrics#
Always present for container and helm. The final list of metrics depends on the telemetry received and exported by the function or task:

| Metric | Description |
| --- | --- |
| otelcol_receiver_refused_metric_points_total | Total number of metric points refused by the receiver |
| otelcol_receiver_refused_log_records_total | Total number of log records refused by the receiver |
| otelcol_receiver_refused_spans_total | Total number of spans refused by the receiver |
| otelcol_receiver_accepted_metric_points_total | Total number of metric points accepted by the receiver |
| otelcol_receiver_accepted_log_records_total | Total number of log records accepted by the receiver |
| otelcol_receiver_accepted_spans_total | Total number of spans accepted by the receiver |
| otelcol_exporter_sent_metric_points_total | Total number of metric points sent by the exporter |
| otelcol_exporter_sent_spans_total | Total number of spans sent by the exporter |
| otelcol_exporter_sent_log_records_total | Total number of log records sent by the exporter |
| otelcol_exporter_send_failed_metric_points_total | Total number of metric points that failed to send |
| otelcol_exporter_send_failed_spans_total | Total number of spans that failed to send |
| otelcol_exporter_send_failed_log_records_total | Total number of log records that failed to send |
| otelcol_processor_outgoing_items_total | Total number of items processed and sent out |
| otelcol_processor_incoming_items_total | Total number of items received for processing |
Resource Attributes#
All logs and metrics have the following attributes added to their metadata:
| Attribute | Description |
| --- | --- |
| function_id | Unique identifier for the function or task |
| function_version_id | Version identifier for the function or task |
| instance_id | Unique identifier for the function or task instance |
| nca_id | NVIDIA Cloud Account identifier |
| cloud_region | Cloud region where the function or task is deployed (non-GFN) |
| zone_name | Zone name where the function or task is deployed (GFN) |
| cloud_provider | Cloud provider where the function or task is deployed |
The platform metrics have the following attributes when available:
| Source | Attributes |
| --- | --- |
| cadvisor | container, cpu, device, image, job, service, interface, pod |
| kube state metrics | container, job, service, pod, reason, condition, configmap, created_by_kind, created_by_name, deployment, host_network, image, phase, qos_class, replicaset, resource, secret, statefulset, status and unit |
| DCGM | container, DCGM_FI_DRIVER_VERSION, device, job, service, modelName, pci_bus_id and pod |
Note
The job attribute is available in Grafana Cloud; Datadog uses the service attribute instead of job.