Observability Guide#

Overview#

NVIDIA Cloud Functions provides a comprehensive observability solution through two main approaches:

  1. NGC UI/CLI Observability

    1. Basic metrics in the Overview tab

    2. Log data with time ranges in the Logs tab

    3. Limited to NGC account holders

    4. Enabled by default for all functions or tasks

    See details on using the built-in observability features below.

  2. External Observability Integration

    1. Send telemetry data to your organization’s observability platforms

    2. Support for logs, metrics, and traces

    3. Requires explicit configuration of telemetry endpoints

    See External Observability for detailed instructions.

NGC UI Observability#

The NGC UI provides basic observability through three main tabs:

  1. Overview Tab

    1. Provides real-time status and performance metrics

    2. Shows instance counts and request statistics

    3. Displays basic function or task information

  2. Logs Tab

    1. Access to container and event logs

    2. Real-time log streaming capabilities

    3. Search and filtering functionality

  3. Metrics Tab

    1. Detailed performance indicators

    2. Time-series data visualization

    3. Resource utilization trends

You can access these tabs by navigating to your function in the NGC UI. Each tab offers specific insights into your function’s operation and performance.

Overview#

The Overview tab provides access to your function’s current status and performance metrics, offering real-time insights into your function’s operation and health.

Key Information#

  1. Basic Function or Task Metrics

    1. Current function or task status (Running, Stopped, Error)

    2. Last updated timestamp

    3. Function or task version

    4. Runtime environment details

  2. Instance Counts

    1. Active instances

    2. Pending instances

    3. Failed instances

    4. Historical instance trends

  3. Request Statistics

    1. Total requests processed

    2. Current request rate

    3. Success/failure ratios

    4. Average response times

How to Access#

  1. Navigate to the Functions list page

  2. Click on your function

  3. The Overview tab is displayed by default

  4. Use the refresh button to update the data

Important

  • Overview data updates every 30 seconds

  • Historical data is available for the last 24 hours

  • Some metrics may have a slight delay in reporting

Logs#

The Logs tab enables monitoring through detailed log access.

Log Categories#

NVCF displays logs related to:

  • Deployment stages

    • Function or Task Creation

    • Function or Task Deployment

  • Function or task invocation logs

Telemetry Endpoint Updates#

Telemetry endpoints and their associated secrets can be updated by the NCA admin at any time. When updates occur:

  • Previous configurations are replaced, not preserved

  • The OTel collector requires a restart to apply new configurations

  • Running functions or tasks will continue using the previous configuration until redeployed

Note

Functions or tasks must be redeployed to use updated telemetry configurations.

Metrics#

The complete list of available metrics and their attributes is maintained in the NVCF telemetry configuration repository. The metrics exported depend on the Kubernetes deployment used by the function or task.

Key metrics include:

  • Function or task invocation metrics

  • Resource utilization metrics

  • Platform metrics related to the function or task

Note

Metrics are filtered based on deployment type and configuration. Not all metrics may be available for all deployment scenarios.

Viewing Metrics#

  1. Navigate to the Functions list page

  2. Click on your function

  3. Select the Metrics tab

The Metrics view displays:

Summary Statistics

  • Total Invocations - Number of function calls in the selected time period

  • Average Inference Time - Mean processing time for function calls

  • Total Instance Count - Current number of running instances

  • Failures - Count of failed executions

Time Series Graphs

  • Invocation Activity and Queue Depth - Shows request patterns and queued requests

  • Average Inference Time - Processing duration trends

  • Instances Over Time - Shows scaling behavior

  • Success Rate - Function reliability metrics

Use the time range selector (e.g., Past 1 Hour) in the top right to adjust the view period.

Important

  • Minor discrepancies may occur in aggregated invocations due to rounding, especially with smaller values

  • The most recent metrics may be delayed, as there is a 5-minute ingestion delay

Note

NGC UI access is limited to NGC account holders. For broader observability access, work with your account administrator to configure external observability endpoints.

External Observability#

Important

External Observability is a beta feature. We encourage users to try it out and submit feedback to help us improve the experience. Please use the feedback form on the right side of the screen to share your thoughts and suggestions.

Configure external observability endpoints to monitor your NVIDIA Cloud Functions. By setting up telemetry endpoints, you can stream metrics (see Appendix B: Available Metrics), logs, and traces to popular observability platforms like Grafana Cloud and Datadog. This extends beyond the basic metrics in NGC UI, giving you deeper insights into your functions’ performance.

Important

To export function or task telemetry through external observability platforms (BYOO), your source code must be instrumented using OpenTelemetry. Without proper OpenTelemetry instrumentation, only system-level metrics will be available.

Ports#

The OpenTelemetry collector uses the following ports:

  • OTLP (OpenTelemetry Protocol)

    • OTLP gRPC: Port 14357

    • OTLP HTTP: Port 14358

  • Metrics

    • Port 18888 - Used for collector metrics

  • Health Check

    • Port 13133 - Used for health check endpoint

Note

These ports are reserved for the OpenTelemetry collector and should not be used by your functions or tasks.
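
For example, application code running in the function container can point its OpenTelemetry exporters at the collector on these ports. Below is a minimal Python sketch, assuming the collector is reachable on localhost inside the pod and that the opentelemetry-sdk and opentelemetry-exporter-otlp packages are installed; the tracer, meter, and metric names are hypothetical:

    # Minimal sketch: export traces and metrics to the NVCF-managed collector
    # over OTLP gRPC (port 14357 per the list above; 14358 serves OTLP HTTP).
    # Assumes localhost reachability inside the pod; names are examples only.
    from opentelemetry import metrics, trace
    from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    COLLECTOR = "http://localhost:14357"  # collector's OTLP gRPC port

    # Traces: batch spans and ship them to the collector.
    tracer_provider = TracerProvider()
    tracer_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=COLLECTOR, insecure=True))
    )
    trace.set_tracer_provider(tracer_provider)

    # Metrics: periodically export readings to the collector.
    reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint=COLLECTOR, insecure=True)
    )
    metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

    tracer = trace.get_tracer("example-function")  # hypothetical instrumentation name
    request_counter = metrics.get_meter("example-function").create_counter(
        "inference_requests_total"  # hypothetical custom metric
    )

    with tracer.start_as_current_span("inference"):
        request_counter.add(1)

Custom telemetry emitted this way is forwarded to whichever telemetry endpoints are attached to the function or task (see Getting Started below).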

Getting Started#

Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.

A Telemetry Endpoint is a configuration that specifies where telemetry data is sent. Any function or task can be configured to send its telemetry data to an external observability platform.

  1. Configure External Telemetry Endpoints

    Note

    Remember that to collect custom metrics, logs, and traces from your function’s or task’s code, you must instrument your application using OpenTelemetry. System-level metrics (CPU, memory, GPU) are collected automatically.

    You can configure telemetry endpoints using either the web UI or the NGC CLI:

    Web UI Method:

    • Navigate to your NGC organization settings

    • Select “Settings” in your Cloud Functions NGC organization

    • Scroll to the bottom of the page

    • Click “Add Telemetry Endpoint”

      NVCF interface showing the Add Telemetry Endpoint button
    • Select your desired endpoint type (Grafana Cloud or Datadog)

    • Configure the endpoint with the required credentials

    Grafana Cloud

    Follow these steps to set up Grafana Cloud integration with NVCF:

    Web UI Method:

    1. Access Grafana Cloud

      1. For new users:

        1. Visit https://grafana.com/auth/sign-up/create-user

        2. Complete the free Grafana Cloud registration process

      2. For existing users:

        1. Visit https://grafana.com/auth/sign-in

        2. Log in with your credentials

    2. Configure OpenTelemetry

      1. In the top menu bar, locate “My Account”

      2. Expand the Details section by clicking the icon

        Grafana Cloud Portal interface showing organization settings and stack management
    3. Access OpenTelemetry Settings

      1. In your Grafana Cloud stack, locate the OpenTelemetry card

      2. Click “Configure” to access the OpenTelemetry configuration

      3. You will see options for configuring:

        1. Metrics

        2. Logs

        3. Traces

        Grafana Cloud stack interface showing OpenTelemetry configuration options
    4. Locate OTLP Configuration Details

      1. The OTLP endpoint section will display:

        1. OTLP Endpoint URL (e.g., https://otlp-gateway-prod-us-west-0.grafana.net/otlp)

        2. Instance ID (a numeric identifier for your instance)

        3. API Token section with option to “Generate now”

      2. Use the “Copy to Clipboard” buttons to easily copy these values into the NVCF Telemetry Endpoint configuration.

    Alternative: Create Grafana Telemetry Endpoint via CLI

    As an alternative to the web UI, you can create the Grafana Cloud telemetry endpoint using the NGC CLI:

    ngc cloud-function telemetry-endpoint create --name grafana-cloud-metrics \
      --type METRICS \
      --provider GRAFANA_CLOUD \
      --protocol HTTP \
      --endpoint https://otlp-gateway-prod-us-west-0.grafana.net/otlp \
      --key your-grafana-api-token
    

    Warning

    Keep your API Token secure and never share it publicly. If your token is compromised, you can generate a new one and update your configuration.

    Datadog

    Follow these steps to set up Datadog integration with NVCF:

    Web UI Method:

    1. Sign Up for Datadog

      1. Visit the Datadog Getting Started page

      2. Complete the registration process for a new Datadog account

    2. Configure API Key

      1. Log in to your Datadog account

      2. Navigate to Organization Settings (found in the bottom left corner of the page)

      3. Select API Keys from the left menu

      4. Either click “+New Key” to create a new API key or copy an existing one from the list

    3. Get Telemetry Endpoint

      1. Your endpoint URL will be displayed in the browser address bar

      2. Available endpoints based on your instance location:

        1. datadoghq.com (US1)

        2. us3.datadoghq.com (US3)

        3. us5.datadoghq.com (US5)

        4. datadoghq.eu (EU1)

        5. ddog-gov.com (US1-FED)

      3. For more details on Datadog sites and endpoints, see the Datadog site documentation

    4. Configure in NVCF Web UI

      1. Input the configuration details:

        1. API Key (copied from step 2)

        2. Endpoint URL (selected from step 3)

        3. Select telemetry type(s):

          1. Choose “Logs” to send log data

          2. Choose “Metrics” to send metrics data

          3. You can select both to send both types of telemetry

      2. Save the configuration

        NVCF interface showing Datadog endpoint configuration

    Alternative: Create Datadog Telemetry Endpoint via CLI

    As an alternative to the web UI, you can create the Datadog telemetry endpoint using the NGC CLI:

    # Example
    ngc cloud-function telemetry-endpoint create --name datadog-metrics \
      --type METRICS \
      --provider DATADOG \
      --protocol HTTP \
      --endpoint datadoghq.com \
      --key your-datadog-api-key
    

    Note

    Make sure to keep your API key secure and never share it publicly. If your key is compromised, you can generate a new one and update your configuration.

    CLI Method:

    As an alternative to the web UI, you can use the NGC CLI to manage telemetry endpoints. Here are the basic CLI commands:

    # List existing telemetry endpoints
    ngc cloud-function telemetry-endpoint list
    
    # Create a new telemetry endpoint
    ngc cloud-function telemetry-endpoint create --name <endpoint-name> \
      --type <LOGS|METRICS> \
      --provider <GRAFANA_CLOUD|DATADOG> \
      --protocol <GRPC|HTTP> \
      --endpoint <endpoint-url> \
      --key <api-key>
    
    # Remove a telemetry endpoint
    ngc cloud-function telemetry-endpoint remove <endpoint-name>
    

    Note

    • Endpoint names must be unique within your NGC organization

    • API tokens and keys are stored securely in NGC Encrypted Secrets Store and can be updated if needed

    • Endpoint configurations cannot be updated - delete and recreate to change settings

  2. Add Telemetry Endpoint to Function or Task

    Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.

    Web UI Method:

    When creating a new function or deploying a new version:

    • In the function creation/deployment form

    • Look for the Telemetry Endpoints section

    • Select the desired telemetry endpoint from the dropdown

    • Complete the rest of the function creation/deployment process

    Note

    If you need to change the telemetry endpoint for an existing function, you must deploy a new version of that function with the updated telemetry configuration.

  3. Verify Deployment

    After deploying the function with the telemetry endpoint, verify that the telemetry data is flowing correctly to your observability platform.

    Important

    If you don’t see your custom metrics, logs, or traces in your observability platform, verify that:

    1. Your function’s or task’s code is properly instrumented with OpenTelemetry

    2. The telemetry endpoint is correctly configured

    3. The function or task deployment is active and running

    Grafana Cloud
    1. Log in to your Grafana Cloud account

    2. Navigate to the Metrics Explorer

    3. Search for the following metrics to verify data flow:

      1. DCGM_FI_DEV_GPU_UTIL - Shows GPU utilization percentage

      2. container_fs_reads_bytes_total - Shows container filesystem read metrics

      3. container_fs_writes_bytes_total - Shows container filesystem write metrics

    Grafana Metrics Explorer showing GPU utilization and container filesystem metrics over time
    Datadog
    1. Log in to your Datadog account

    2. Navigate to the Metrics Explorer

    3. Search for “nvidia.cloud.function” to find your function’s metrics

    4. You can view metrics such as:

      1. GPU utilization

      2. Function or task invocations

      3. Request latency

      4. Resource usage

    Datadog Metrics Explorer showing NVIDIA Cloud Function metrics

    Note

    The OpenTelemetry collector version, image and configuration are managed entirely by NVCF and cannot be modified by users.

  4. Delete a Function or Task and Remove Telemetry Endpoint

    To remove a telemetry endpoint, you must first cancel all deployments and remove all functions that use that endpoint. The endpoint cannot be removed while any functions are still using it, even if those functions are not currently deployed.

    Web UI method:

    1. Navigate to the Functions list page

    2. Click on the function you want to delete

    3. Navigate to the Deployments tab

    4. For each deployment:

      1. Click “Cancel Deployment” and confirm

      2. Wait for all deployments to be fully cancelled

    5. Navigate to the Settings tab

    6. Click “Delete Function” and confirm

    7. Verify the function is completely removed

    8. After all functions using the telemetry endpoint have been removed:

      1. Navigate to your NGC organization settings

      2. Select “Settings” in your Cloud Functions NGC organization

      3. Scroll to the Telemetry Endpoints section

      4. Find the endpoint you want to remove

      5. Click the delete icon next to the endpoint

      6. Confirm the deletion

    CLI method:

    # First, cancel all deployments for a function version
    ngc cloud-function function deploy remove <function-id>:<function-version-id>
    
    # Wait for deployments to be fully cancelled, then remove the function
    ngc cloud-function function remove <function-id>
    
    # After all functions using the telemetry endpoint have been removed, delete the endpoint
    ngc cloud-function telemetry-endpoint remove <endpoint-name>
    

    Warning

    All deployments must be fully cancelled before function removal. The function must be completely removed before the endpoint can be removed. Removing a telemetry endpoint will permanently delete the endpoint configuration. Make sure to export any necessary telemetry data before removing endpoints.

When you select a telemetry endpoint, NVCF:

  • Deploys a dedicated OpenTelemetry collector with your function or task

  • Automatically configures authentication and endpoint connections

  • Enables collection of metrics, logs, and traces from your function or task

  • Directs telemetry data to your organization’s observability platform

Resource Management#

In the pod for each function or task, an OpenTelemetry collector is deployed. This collector has automatic memory management and built-in resource protection to ensure reliable telemetry collection without impacting function or task performance. NVCF manages all resource allocation for the collector, so you don’t need to worry about resource configuration.

Security#

NVCF ensures secure telemetry handling by storing credentials in the NGC Encrypted Secrets Store, as outlined in the Secret Management section. Each collector accesses only its own function’s or task’s data, and authentication is handled automatically. Credentials are rotated to maintain security and integrity.

Error Handling#

If issues occur with telemetry collection:

  • Your function or task continues to run normally

  • Error messages are logged for troubleshooting

  • Health status is monitored and reported

  • Automatic retry logic handles temporary failures

The collector’s health can be monitored through:

  • Status checks in the NGC UI

  • Metrics in your observability platform

  • Built-in health endpoints
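
As a quick sanity check, the collector’s health endpoint can be probed from inside the function container. The following is a minimal sketch, assuming the health check on port 13133 answers an HTTP GET at the root path:

    # Minimal sketch: probe the collector health-check port (13133, per the
    # Ports section). Assumes the endpoint returns HTTP 200 when healthy.
    import urllib.request

    def collector_healthy(host: str = "localhost", port: int = 13133) -> bool:
        try:
            with urllib.request.urlopen(f"http://{host}:{port}/", timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    print("collector healthy:", collector_healthy())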

Appendix A: Terminology#

| Term | Definition |
| --- | --- |
| NGC | NVIDIA GPU Cloud, which provides a way for users to set up and manage access to NVIDIA cloud services |
| NVCF | NVIDIA Cloud Functions and Tasks |
| OpenTelemetry | An open-source standard for telemetry data collection and transmission |
| OTLP | OpenTelemetry Protocol; the data transfer protocol used by OpenTelemetry for sending telemetry data |
| OTel Collector | The OpenTelemetry Collector component that receives, processes, and exports telemetry data |
| Telemetry Endpoint | A configuration that specifies where telemetry data (metrics, logs, and traces) is sent for external observability platforms |

Appendix B: Available Metrics#

The OpenTelemetry collector deployed with your function collects and exports the following metrics:

CPU Metrics#

| Metric | Description |
| --- | --- |
| container_cpu_cfs_throttled_periods_total | Number of periods the container was throttled (only present if the container was throttled) |
| container_cpu_cfs_throttled_seconds_total | Total time the container was throttled, in seconds (only present if the container was throttled) |
| container_cpu_usage_seconds_total | Total CPU time used by the container, in seconds |

Memory Metrics#

| Metric | Description |
| --- | --- |
| container_memory_cache | Memory used by the page cache, in bytes |
| container_memory_rss | Resident set size: total memory allocated for the container |
| container_memory_swap | Swap memory used by the container, in bytes |
| container_memory_usage_bytes | Total memory usage of the container, in bytes |
| container_memory_working_set_bytes | Memory working set: memory actively used by the container |

Filesystem Metrics#

Only present if the container is performing IO operations:

| Metric | Description |
| --- | --- |
| container_fs_limit_bytes | Total filesystem limit, in bytes |
| container_fs_usage_bytes | Total filesystem usage, in bytes |
| container_fs_reads_total | Total number of filesystem read operations |
| container_fs_writes_total | Total number of filesystem write operations |
| container_fs_writes_bytes_total | Total bytes written to the filesystem |
| container_fs_reads_bytes_total | Total bytes read from the filesystem |

Network Metrics#

Only present if the container is performing network operations:

| Metric | Description |
| --- | --- |
| container_network_receive_bytes_total | Total bytes received over the network |
| container_network_receive_errors_total | Total number of network receive errors |
| container_network_receive_packets_dropped_total | Total number of received packets dropped |
| container_network_receive_packets_total | Total number of packets received |
| container_network_transmit_bytes_total | Total bytes transmitted over the network |
| container_network_transmit_errors_total | Total number of network transmit errors |
| container_network_transmit_packets_dropped_total | Total number of transmitted packets dropped |
| container_network_transmit_packets_total | Total number of packets transmitted |

Kubernetes State Metrics#

Only present if a Helm-based function has a Deployment Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_deployment_status_replicas | Total number of replicas in the deployment |
| kube_deployment_status_replicas_available | Number of available replicas in the deployment |
| kube_deployment_status_replicas_unavailable | Number of unavailable replicas in the deployment |
| kube_deployment_status_replicas_updated | Number of updated replicas in the deployment |
| kube_deployment_status_replicas_ready | Number of ready replicas in the deployment |
| kube_service_created | Timestamp when the service was created |

Only present if a Helm-based function has a ReplicaSet Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_replicaset_status_replicas | Total number of replicas in the replica set |
| kube_replicaset_status_ready_replicas | Number of ready replicas in the replica set |

Only present if a Helm-based function has a StatefulSet Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_statefulset_status_replicas | Total number of replicas in the stateful set |
| kube_statefulset_status_replicas_ready | Number of ready replicas in the stateful set |

Only present if a Helm-based function has a Job or CronJob Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_job_status_active | Number of active jobs |
| kube_job_status_failed | Number of failed jobs |
| kube_job_status_succeeded | Number of succeeded jobs |
| kube_cronjob_status_active | Number of active cron jobs |

Only present if the function has a ConfigMap Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_configmap_created | Timestamp when the config map was created |

Only present if the function has a Secret Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_secret_created | Timestamp when the secret was created |

Only present if the function has a Pod Kubernetes object:

| Metric | Description |
| --- | --- |
| kube_pod_container_info | Information about the container in the pod |
| kube_pod_container_resource_limits | Resource limits for the container |
| kube_pod_container_resource_requests | Resource requests for the container (only present if resources were requested) |
| kube_pod_container_status_last_terminated_exitcode | Exit code of the last terminated container (only present if an error occurred) |
| kube_pod_container_status_last_terminated_reason | Reason for the last container termination (only present if an error occurred) |
| kube_pod_container_status_restarts_total | Total number of container restarts |
| kube_pod_container_status_running | Whether the container is running |
| kube_pod_container_status_terminated | Whether the container has terminated (only present if terminated) |
| kube_pod_container_status_terminated_reason | Reason for container termination (only present if terminated) |
| kube_pod_container_status_waiting | Whether the container is waiting (only present if the pod is waiting) |
| kube_pod_container_status_waiting_reason | Reason for container waiting (only present if the pod is waiting) |

Only present for Helm-based function or task deployments:

| Metric | Description |
| --- | --- |
| kube_pod_info | Information about the pod |
| kube_pod_status_reason | Reason for the pod status |

Only present if the function or task Helm chart defines an init container:

| Metric | Description |
| --- | --- |
| kube_pod_init_container_info | Information about the init container |
| kube_pod_init_container_status_ready | Whether the init container is ready |
| kube_pod_init_container_status_restarts_total | Total number of init container restarts |
| kube_pod_init_container_status_running | Whether the init container is running |
| kube_pod_init_container_last_status_terminated_reason | Reason for the last init container termination |
| kube_pod_init_container_status_waiting_reason | Reason for init container waiting |

GPU Metrics#

Always present for container-based and Helm-based deployments:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization percentage |

OpenTelemetry Collector Metrics#

Always present for container-based and Helm-based deployments. The final list depends on the telemetry received and exported by the function or task:

| Metric | Description |
| --- | --- |
| otelcol_receiver_refused_metric_points_total | Total number of metric points refused by the receiver |
| otelcol_receiver_refused_log_records_total | Total number of log records refused by the receiver |
| otelcol_receiver_refused_spans_total | Total number of spans refused by the receiver |
| otelcol_receiver_accepted_metric_points_total | Total number of metric points accepted by the receiver |
| otelcol_receiver_accepted_log_records_total | Total number of log records accepted by the receiver |
| otelcol_receiver_accepted_spans_total | Total number of spans accepted by the receiver |
| otelcol_exporter_sent_metric_points_total | Total number of metric points sent by the exporter |
| otelcol_exporter_sent_spans_total | Total number of spans sent by the exporter |
| otelcol_exporter_sent_log_records_total | Total number of log records sent by the exporter |
| otelcol_exporter_send_failed_metric_points_total | Total number of metric points that failed to send |
| otelcol_exporter_send_failed_spans_total | Total number of spans that failed to send |
| otelcol_exporter_send_failed_log_records_total | Total number of log records that failed to send |
| otelcol_processor_outgoing_items_total | Total number of items processed and sent out |
| otelcol_processor_incoming_items_total | Total number of items received for processing |

Resource Attributes#

All logs and metrics have the following attributes added to their metadata:

| Attribute | Description |
| --- | --- |
| function_id | Unique identifier for the function or task |
| function_version_id | Version identifier for the function or task |
| instance_id | Unique identifier for the function or task instance |
| nca_id | NVIDIA Cloud Account identifier |
| cloud_region | Cloud region where the function or task is deployed (non-GFN) |
| zone_name | Zone name where the function or task is deployed (GFN) |
| cloud_provider | Cloud provider where the function or task is deployed |

The platform metrics have the following attributes when available:

| Source | Attributes |
| --- | --- |
| cAdvisor | container, cpu, device, image, job, service, interface, pod |
| kube-state-metrics | container, job, service, pod, reason, condition, configmap, created_by_kind, created_by_name, deployment, host_network, image, phase, qos_class, replicaset, resource, secret, statefulset, status, unit |
| DCGM | container, DCGM_FI_DRIVER_VERSION, device, job, service, modelName, pci_bus_id, pod |

Note

  • The job attribute is available in Grafana Cloud

  • Datadog uses the service attribute instead of job
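
These resource attributes let you scope queries to a single function or task. As an illustrative sketch, assuming your metrics land in a Prometheus-compatible datasource (as Grafana Cloud provides), a query such as the one below could filter GPU utilization by function; the base URL and function ID are placeholders, and authentication is omitted:

    # Sketch: filter GPU utilization for one function via a Prometheus-compatible
    # HTTP API. The base URL and function ID are placeholders; real datasources
    # will also require authentication headers.
    import urllib.parse
    import urllib.request

    BASE_URL = "https://prometheus.example.net/api/v1/query"  # hypothetical URL
    QUERY = 'DCGM_FI_DEV_GPU_UTIL{function_id="<your-function-id>"}'

    url = f"{BASE_URL}?{urllib.parse.urlencode({'query': QUERY})}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(resp.read().decode())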