
Copyright © 2026, NVIDIA Corporation.


Observability (Legacy Managed)


Overview

NVIDIA Cloud Functions provides a comprehensive observability solution through two main approaches:

  1. NGC UI/CLI Observability

    1. Basic metrics in the Overview tab
    2. Log data with time ranges in the Logs tab
    3. Limited to NGC account holders
    4. Enabled by default for all functions or tasks

    See details on using the built-in observability features below.

  2. External Observability Integration

    1. Send telemetry data to your organization’s observability platforms
    2. Support for logs, metrics, and traces
    3. Requires explicit configuration of telemetry endpoints

    See the External Observability section below for detailed instructions.

NGC UI Observability

The NGC UI provides basic observability through three main tabs:

  1. Overview Tab

    1. Provides real-time status and performance metrics
    2. Shows instance counts and request statistics
    3. Displays basic function or task information
  2. Logs Tab

    1. Access to container and event logs
    2. Real-time log streaming capabilities
    3. Search and filtering functionality
  3. Metrics Tab

    1. Detailed performance indicators
    2. Time-series data visualization
    3. Resource utilization trends

You can access these tabs by navigating to your function in the NGC UI. Each tab offers specific insights into your function’s operation and performance.

Overview

The Overview tab provides access to your function’s current status and performance metrics, offering real-time insights into your function’s operation and health.

  1. Basic Function or Task Metrics

    1. Current function or task status (Running, Stopped, Error)
    2. Last updated timestamp
    3. Function or task version
    4. Runtime environment details
  2. Instance Counts

    1. Active instances
    2. Pending instances
    3. Failed instances
    4. Historical instance trends
  3. Request Statistics

    1. Total requests processed
    2. Current request rate
    3. Success/failure ratios
    4. Average response times

How to Access

  1. Navigate to the Functions list page
  2. Click on your function
  3. The Overview tab is displayed by default
  4. Use the refresh button to update the data
  • Overview data updates every 30 seconds
  • Historical data is available for the last 24 hours
  • Some metrics may have a slight delay in reporting

Logs

The Logs tab enables monitoring through detailed log access.

NVCF displays logs related to:

  • Deployment stages

    • Function or Task Creation
    • Function or Task Deployment
  • Function or task invocation logs

  • Real-time logs (listed in the UI under the Live Tail tab)

    • For detailed information about real-time logging capabilities, see the NGC UI Logs tab described above.

Viewing Metrics

  1. Navigate to the Functions list page
  2. Click on your function
  3. Select the Metrics tab

The Metrics view displays:

Summary Statistics

  • Total Invocations - Number of function calls in the selected time period
  • Average Inference Time - Mean processing time for function calls
  • Total Instance Count - Current number of running instances
  • Failures - Count of failed executions

Time Series Graphs

  • Invocation Activity and Queue Depth - Shows request patterns and queued requests
  • Average Inference Time - Processing duration trends
  • Instances Over Time - Shows scaling behavior
  • Success Rate - Function reliability metrics

Use the time range selector (e.g., Past 1 Hour) in the top right to adjust the view period.

  • Minor discrepancies may occur in aggregated invocations due to rounding, especially with smaller values
  • Metrics have a 5-minute ingestion delay, so the most recent data points may not appear immediately

NGC UI access is limited to NGC account holders. For broader observability access, work with your account administrator to configure external observability endpoints.

External Observability

Configure external observability endpoints to monitor your NVIDIA Cloud Functions. By setting up telemetry endpoints, you can stream metrics (see Appendix B: Available Metrics), logs, and traces to observability platforms such as Grafana Cloud and Datadog. This extends beyond the basic metrics in the NGC UI, giving you deeper insight into your functions’ performance.

To export function or task telemetry through external observability platforms, your source code must be instrumented using OpenTelemetry. Without proper OpenTelemetry instrumentation, only system-level metrics will be available.

Ports

The OpenTelemetry collector uses the following ports:

  • OTLP (OpenTelemetry Protocol)

    • OTLP gRPC: Port 14357
    • OTLP HTTP: Port 14358
  • Metrics

    • Port 18888 - Used for collector metrics
  • Health Check

    • Port 13133 - Used for health check endpoint

These ports are reserved for the OpenTelemetry collector and should not be used by your functions or tasks.
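
As an illustration of the reservation above, the following shell sketch checks a candidate application port against the four collector ports listed in this section. The helper name is_reserved_port and the APP_PORT variable are hypothetical and are not part of NVCF or the NGC CLI.

```shell
# Ports reserved for the OpenTelemetry collector, as listed above.
RESERVED_PORTS="14357 14358 18888 13133"

# is_reserved_port PORT
# Hypothetical helper: returns 0 if PORT is reserved for the collector, 1 otherwise.
is_reserved_port() {
  for p in $RESERVED_PORTS; do
    if [ "$1" = "$p" ]; then
      return 0
    fi
  done
  return 1
}

# Example: check the port an application intends to listen on.
# APP_PORT is an assumed variable name; 8000 is an arbitrary default.
APP_PORT="${APP_PORT:-8000}"
if is_reserved_port "$APP_PORT"; then
  echo "port $APP_PORT is reserved for the OpenTelemetry collector"
else
  echo "port $APP_PORT does not conflict with the collector"
fi
```

A check like this could run in a container entrypoint before the application binds its port.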

Configuration

Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.

A Telemetry Endpoint is a configuration that specifies where telemetry data is sent. Any function or task can be configured to send telemetry data to an external observability platform through a telemetry endpoint.

  1. Configure External Telemetry Endpoints

Remember that to collect custom metrics, logs, and traces from your function’s or task’s code, you must instrument your application using OpenTelemetry. System-level metrics (CPU, memory, GPU) are collected automatically.

You can configure telemetry endpoints using either the web UI or the NGC CLI:

Web UI Method:

  • Navigate to your NGC organization settings
  • Select “Settings” in your Cloud Functions NGC organization
  • Scroll to the bottom of the page
  • Click “Add Telemetry Endpoint”

nvcf_add_telemetry_endpoint.png

  • Select your desired endpoint type (Grafana Cloud or Datadog)
  • Configure the endpoint with the required credentials
Grafana Cloud

Follow these steps to set up Grafana Cloud integration with NVCF:

Web UI Method:

  1. Access Grafana Cloud

    • For new users:

      • Visit https://grafana.com/auth/sign-up/create-user
      • Complete the free Grafana Cloud registration process
    • For existing users:

      • Visit https://grafana.com/auth/sign-in
      • Log in with your credentials
  2. Configure OpenTelemetry

    • In the top menu bar, locate “My Account”
    • Expand the Details section by clicking the icon

grafana_cloud_portal.png

  3. Access OpenTelemetry Settings

    • In your Grafana Cloud stack, locate the OpenTelemetry card
    • Click “Configure” to access the OpenTelemetry configuration
    • You will see options for configuring:

      • Metrics
      • Logs
      • Traces

grafana_cloud_stack.png

  4. Locate OTLP Configuration Details

    • The OTLP endpoint section will display:

      • OTLP Endpoint URL (e.g., https://otlp-gateway-prod-us-west-0.grafana.net/otlp)
      • Instance ID (a numeric identifier for your instance)
      • API Token section with option to “Generate now”
    • Use the “Copy to Clipboard” buttons to easily copy these values into the NVCF Telemetry Endpoint configuration.

Alternative: Create Grafana Telemetry Endpoint via CLI

As an alternative to the web UI, you can create the Grafana Cloud telemetry endpoint using the NGC CLI:

$ ngc cloud-function telemetry-endpoint create --name grafana-cloud-metrics \
> --type METRICS \
> --provider GRAFANA_CLOUD \
> --protocol HTTP \
> --endpoint https://otlp-gateway-prod-us-west-0.grafana.net/otlp \
> --key your-grafana-api-token

Keep your API Token secure and never share it publicly. If your token is compromised, you can generate a new one and update your configuration.

Datadog

Follow these steps to set up Datadog integration with NVCF:

Web UI Method:

  1. Sign Up for Datadog

    • Visit the Datadog Getting Started page
    • Complete the registration process for a new Datadog account
  2. Configure API Key

    • Log in to your Datadog account
    • Navigate to Organization Settings (found in the bottom left corner of the page)
    • Select API Keys from the left menu
    • Either click “+New Key” to create a new API key or copy an existing one from the list
  3. Get Telemetry Endpoint

    • Your endpoint URL will be displayed in the browser address bar
    • Available endpoints based on your instance location:

      • datadoghq.com (US1)
      • us3.datadoghq.com (US3)
      • us5.datadoghq.com (US5)
      • datadoghq.eu (EU1)
      • ddog-gov.com (US1-FED)
    • For more details on Datadog sites and endpoints, see the Datadog site documentation
  4. Configure in NVCF Web UI

    • Input the configuration details:

      • API Key (copied from step 2)
      • Endpoint URL (selected from step 3)
    • Select telemetry type(s):

      • Choose “Logs” to send log data
      • Choose “Metrics” to send metrics data
      • You can select both to send both types of telemetry
    • Save the configuration

nvcf_datadog_endpoint.png
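
The site-to-endpoint list in the steps above can be captured in a small lookup. This is a minimal sketch: the helper name datadog_endpoint is hypothetical, and only the five sites listed above are covered.

```shell
# datadog_endpoint SITE
# Hypothetical helper: maps a Datadog site name to the endpoint listed above.
# Prints the endpoint on stdout, or an error on stderr for an unlisted site.
datadog_endpoint() {
  case "$1" in
    US1)     echo "datadoghq.com" ;;
    US3)     echo "us3.datadoghq.com" ;;
    US5)     echo "us5.datadoghq.com" ;;
    EU1)     echo "datadoghq.eu" ;;
    US1-FED) echo "ddog-gov.com" ;;
    *)       echo "unknown Datadog site: $1" >&2; return 1 ;;
  esac
}

datadog_endpoint US5   # prints us5.datadoghq.com
```

The printed value is what you would enter as the Endpoint URL in the NVCF configuration form.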

Alternative: Create Datadog Telemetry Endpoint via CLI

As an alternative to the web UI, you can create the Datadog telemetry endpoint using the NGC CLI:

$ # Example
$ ngc cloud-function telemetry-endpoint create --name datadog-metrics \
> --type METRICS \
> --provider DATADOG \
> --protocol HTTP \
> --endpoint datadoghq.com \
> --key your-datadog-api-key

Make sure to keep your API key secure and never share it publicly. If your key is compromised, you can generate a new one and update your configuration.

CLI Method:

As an alternative to the web UI, you can use the NGC CLI to manage telemetry endpoints. Here are the basic CLI commands:

$ # List existing telemetry endpoints
$ ngc cloud-function telemetry-endpoint list
$
$ # Create a new telemetry endpoint
$ ngc cloud-function telemetry-endpoint create --name <endpoint-name> \
> --type <LOGS|METRICS> \
> --provider <GRAFANA_CLOUD|DATADOG> \
> --protocol <GRPC|HTTP> \
> --endpoint <endpoint-url> \
> --key <api-key>
$
$ # Remove a telemetry endpoint
$ ngc cloud-function telemetry-endpoint remove <endpoint-name>
  • Endpoint names must be unique within your NGC organization
  • API tokens and keys are stored securely in NGC Encrypted Secrets Store and can be updated if needed
  • Endpoint configurations cannot be updated - delete and recreate to change settings
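
Because endpoint configurations cannot be updated after creation, it can help to validate the arguments before running the create command. The sketch below checks the enumerated values from the command template above and prints the resulting command for review instead of executing it. The helper name build_endpoint_create is hypothetical, and the accepted values are assumed to be exactly those shown in the template.

```shell
# build_endpoint_create NAME TYPE PROVIDER PROTOCOL ENDPOINT
# Hypothetical helper: validates the enumerated values shown in the command
# template above, then prints (does not run) the create command.
# The API key stays a placeholder so it never appears in the output.
build_endpoint_create() {
  name="$1"; kind="$2"; provider="$3"; protocol="$4"; endpoint="$5"

  if [ -z "$name" ] || [ -z "$endpoint" ]; then
    echo "name and endpoint are required" >&2
    return 1
  fi
  case "$kind" in
    LOGS|METRICS) ;;
    *) echo "invalid --type: $kind" >&2; return 1 ;;
  esac
  case "$provider" in
    GRAFANA_CLOUD|DATADOG) ;;
    *) echo "invalid --provider: $provider" >&2; return 1 ;;
  esac
  case "$protocol" in
    GRPC|HTTP) ;;
    *) echo "invalid --protocol: $protocol" >&2; return 1 ;;
  esac

  echo "ngc cloud-function telemetry-endpoint create --name $name --type $kind --provider $provider --protocol $protocol --endpoint $endpoint --key <api-key>"
}

build_endpoint_create datadog-metrics METRICS DATADOG HTTP datadoghq.com
```

Review the printed command, substitute the real key, and run it yourself.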
  2. Add Telemetry Endpoint to Function or Task

    Telemetry endpoints can only be configured when creating a new function or deploying a new version. You cannot add a telemetry endpoint to an existing function deployment.

    Web UI Method:

    When creating a new function or deploying a new version:

    • In the function creation/deployment form
    • Look for the Telemetry Endpoints section
    • Select the desired telemetry endpoint from the dropdown
    • Complete the rest of the function creation/deployment process

If you need to change the telemetry endpoint for an existing function, you must deploy a new version of that function with the updated telemetry configuration.

  3. Verify Deployment

    After deploying the function with the telemetry endpoint, verify that the telemetry data is flowing correctly to your observability platform.

If you don’t see your custom metrics, logs, or traces in your observability platform, verify that:

  1. Your function’s or task’s code is properly instrumented with OpenTelemetry
  2. The telemetry endpoint is correctly configured
  3. The function or task deployment is active and running
Grafana Cloud
  1. Log in to your Grafana Cloud account

  2. Navigate to the Metrics Explorer

  3. Search for the following metrics to verify data flow:

    • DCGM_FI_DEV_GPU_UTIL - Shows GPU utilization percentage
    • container_fs_reads_bytes_total - Shows container filesystem read metrics
    • container_fs_writes_bytes_total - Shows container filesystem write metrics

grafana_verify_metrics.png

Datadog
  1. Log in to your Datadog account

  2. Navigate to the Metrics Explorer

  3. Search for “nvidia.cloud.function” to find your function’s metrics

  4. You can view metrics such as:

    • GPU utilization
    • Function or task invocations
    • Request latency
    • Resource usage

datadog_metrics.png

The OpenTelemetry collector version, image and configuration are managed entirely by NVCF and cannot be modified by users.

  4. Delete a Function or Task and Remove Telemetry Endpoint

    To remove a telemetry endpoint, you must first cancel all deployments and remove all functions that use that endpoint. The endpoint cannot be removed while any functions are still using it, even if those functions are not currently deployed.

    Web UI method:

    1. Navigate to the Functions list page

    2. Click on the function you want to delete

    3. Navigate to the Deployments tab

    4. For each deployment:

      1. Click “Cancel Deployment” and confirm
      2. Wait for all deployments to be fully cancelled
    5. Navigate to the Settings tab

    6. Click “Delete Function” and confirm

    7. Verify the function is completely removed

    8. After all functions using the telemetry endpoint have been removed:

      1. Navigate to your NGC organization settings
      2. Select “Settings” in your Cloud Functions NGC organization
      3. Scroll to the Telemetry Endpoints section
      4. Find the endpoint you want to remove
      5. Click the delete icon next to the endpoint
      6. Confirm the deletion

    CLI method:

$ # First, cancel all deployments for a function version
$ ngc cloud-function function deploy remove <function-id>:<function-version-id>
$
$ # Wait for deployments to be fully cancelled, then remove the function
$ ngc cloud-function function remove <function-id>
$
$ # After all functions using the telemetry endpoint have been removed, delete the endpoint
$ ngc cloud-function telemetry-endpoint remove <endpoint-name>

All deployments must be fully cancelled before function removal. The function must be completely removed before the endpoint can be removed. Removing a telemetry endpoint will permanently delete the endpoint configuration. Make sure to export any necessary telemetry data before removing endpoints.
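
The ordering requirement above (deployments, then functions, then the endpoint) can be sketched as a dry-run plan that prints the commands from the CLI method without executing them. The helper name print_teardown_plan and the sample identifiers are hypothetical.

```shell
# print_teardown_plan ENDPOINT_NAME FUNCTION_ID:VERSION_ID...
# Hypothetical helper: prints, without executing, the commands from the CLI
# method above in the required order.
print_teardown_plan() {
  endpoint="$1"; shift

  # 1. Cancel the deployment of every listed function version.
  for fv in "$@"; do
    echo "ngc cloud-function function deploy remove $fv"
  done

  # 2. Remove each function once (strip the :version suffix, drop duplicates).
  for fv in "$@"; do
    echo "${fv%%:*}"
  done | sort -u | while read -r fid; do
    echo "ngc cloud-function function remove $fid"
  done

  # 3. Remove the telemetry endpoint last.
  echo "ngc cloud-function telemetry-endpoint remove $endpoint"
}

# Placeholder identifiers for illustration only.
print_teardown_plan grafana-cloud-metrics fn-a:v1 fn-a:v2 fn-b:v1
```

The plan does not wait between steps. When running the commands for real, confirm that every deployment is fully cancelled before removing the functions.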

When you select a telemetry endpoint, NVCF:

  • Deploys a dedicated OpenTelemetry collector with your function or task
  • Automatically configures authentication and endpoint connections
  • Enables collection of metrics, logs, and traces from your function or task
  • Directs telemetry data to your organization’s observability platform

Resource Management

In the pod for each function or task, an OpenTelemetry collector is deployed. This collector has automatic memory management and built-in resource protection to ensure reliable telemetry collection without impacting function or task performance. NVCF manages all resource allocation for the collector, so you don’t need to worry about resource configuration.

Security

NVCF ensures secure telemetry handling by storing credentials in the NGC Encrypted Secrets Store. Each collector only accesses its own function’s or task’s data, and authentication is handled automatically. Credentials are rotated securely to maintain security and integrity.

How to Set Up External Observability on a BYOC Cluster

BYOC Steps

  • Register the BYOC cluster with NVCA version 2.46.10 or later

    • See the NGC-Managed Clusters page for upgrade instructions.
  • Ensure the Bring Your Own Observability cluster feature is enabled. If you are running a cluster agent version older than 2.50.0, refer to the Configuration page for managing feature flags.

cluster-byoo-flag.png

Error Handling

If issues occur with telemetry collection:

  • Your function or task continues to run normally
  • Error messages are logged for troubleshooting
  • Health status is monitored and reported
  • Automatic retry logic handles temporary failures

The collector’s health can be monitored through:

  • Status checks in the NGC UI
  • Metrics in your observability platform
  • Built-in health endpoints

Appendix A: Terminology

| Term | Definition |
| --- | --- |
| NGC | NVIDIA GPU Cloud which provides a way for users to set up and manage access to NVIDIA cloud services |
| NVCF | NVIDIA Cloud Functions and Tasks |
| OpenTelemetry | An open source standard for telemetry data collection and transmission |
| OTLP | OpenTelemetry Protocol - the data transfer protocol used by OpenTelemetry for sending telemetry data |
| OTel Collector | The OpenTelemetry Collector component that receives, processes, and exports telemetry data |
| Telemetry Endpoint | A configuration that specifies where telemetry data (metrics, logs, and traces) is sent for external observability platforms |

Appendix B: Available Metrics

The following metrics are collected through the OpenTelemetry collector deployed with your function when using External Observability and exported through your configured Telemetry Endpoints. The metrics exported depend on the Kubernetes deployment used by the function or task.

Key metrics include:

  • Function or task invocation metrics
  • Resource utilization metrics
  • Platform metrics related to the function or task

Metrics are filtered based on deployment type and configuration. Not all metrics may be available for all deployment scenarios.

CPU Metrics

| Metric | Description |
| --- | --- |
| container_cpu_cfs_throttled_periods_total | Number of periods the container was throttled (only present if container was throttled) |
| container_cpu_cfs_throttled_seconds_total | Total time the container was throttled in seconds (only present if container was throttled) |
| container_cpu_usage_seconds_total | Total CPU time used by the container in seconds |

Memory Metrics

| Metric | Description |
| --- | --- |
| container_memory_cache | Memory used by the page cache in bytes |
| container_memory_rss | Resident Set Size: total memory allocated for the container |
| container_memory_swap | Swap memory used by the container in bytes |
| container_memory_usage_bytes | Total memory usage of the container in bytes |
| container_memory_working_set_bytes | Memory working set: memory actively used by the container |

Filesystem Metrics

Only present if the container is performing IO operations:

| Metric | Description |
| --- | --- |
| container_fs_limit_bytes | Total filesystem limit in bytes |
| container_fs_usage_bytes | Total filesystem usage in bytes |
| container_fs_reads_total | Total number of filesystem read operations |
| container_fs_writes_total | Total number of filesystem write operations |
| container_fs_writes_bytes_total | Total bytes written to the filesystem |
| container_fs_reads_bytes_total | Total bytes read from the filesystem |

Network Metrics

Only present if the container is performing network operations:

| Metric | Description |
| --- | --- |
| container_network_receive_bytes_total | Total bytes received over the network |
| container_network_receive_errors_total | Total number of network receive errors |
| container_network_receive_packets_dropped_total | Total number of received packets dropped |
| container_network_receive_packets_total | Total number of packets received |
| container_network_transmit_bytes_total | Total bytes transmitted over the network |
| container_network_transmit_errors_total | Total number of network transmit errors |
| container_network_transmit_packets_dropped_total | Total number of transmitted packets dropped |
| container_network_transmit_packets_total | Total number of packets transmitted |

Kubernetes State Metrics

Only present if helm-based function has a deployment k8s object:

| Metric | Description |
| --- | --- |
| kube_deployment_status_replicas | Total number of replicas in the deployment |
| kube_deployment_status_replicas_available | Number of available replicas in the deployment |
| kube_deployment_status_replicas_unavailable | Number of unavailable replicas in the deployment |
| kube_deployment_status_replicas_updated | Number of updated replicas in the deployment |
| kube_deployment_status_replicas_ready | Number of ready replicas in the deployment |
| kube_service_created | Timestamp when the service was created |

Only present if helm-based function has a replicaset k8s object:

| Metric | Description |
| --- | --- |
| kube_replicaset_status_replicas | Total number of replicas in the replicaset |
| kube_replicaset_status_ready_replicas | Number of ready replicas in the replicaset |

Only present if helm-based function has a statefulset k8s object:

| Metric | Description |
| --- | --- |
| kube_statefulset_status_replicas | Total number of replicas in the statefulset |
| kube_statefulset_status_replicas_ready | Number of ready replicas in the statefulset |

Only present if the helm-based function has a job/cronjob k8s object:

| Metric | Description |
| --- | --- |
| kube_job_status_active | Number of active jobs |
| kube_job_status_failed | Number of failed jobs |
| kube_job_status_succeeded | Number of succeeded jobs |
| kube_cronjob_status_active | Number of active cronjobs |

Only present if function has a configmap k8s object:

| Metric | Description |
| --- | --- |
| kube_configmap_created | Timestamp when the configmap was created |

Only present if function has a secret k8s object:

| Metric | Description |
| --- | --- |
| kube_secret_created | Timestamp when the secret was created |

Only present if function has a pod k8s object:

| Metric | Description |
| --- | --- |
| kube_pod_container_info | Information about the container in the pod |
| kube_pod_container_resource_limits | Resource limits for the container |
| kube_pod_container_resource_requests | Resource requests for the container (only present if resources were requested) |
| kube_pod_container_status_last_terminated_exitcode | Exit code of the last terminated container (only present if an error happened) |
| kube_pod_container_status_last_terminated_reason | Reason for the last container termination (only present if an error happened) |
| kube_pod_container_status_restarts_total | Total number of container restarts |
| kube_pod_container_status_running | Whether the container is running |
| kube_pod_container_status_terminated | Whether the container has terminated (only present if terminated) |
| kube_pod_container_status_terminated_reason | Reason for container termination (only present if terminated) |
| kube_pod_container_status_waiting | Whether the container is waiting (only present if pod is waiting) |
| kube_pod_container_status_waiting_reason | Reason for container waiting (only present if pod is waiting) |

Only present for function/task helm deployments:

| Metric | Description |
| --- | --- |
| kube_pod_info | Information about the pod |
| kube_pod_status_reason | Reason for the pod status |

Only present if function/task helm defined an init container:

| Metric | Description |
| --- | --- |
| kube_pod_init_container_info | Information about the init container |
| kube_pod_init_container_status_ready | Whether the init container is ready |
| kube_pod_init_container_status_restarts_total | Total number of init container restarts |
| kube_pod_init_container_status_running | Whether the init container is running |
| kube_pod_init_container_last_status_terminated_reason | Reason for the last init container termination |
| kube_pod_init_container_status_waiting_reason | Reason for init container waiting |

GPU Metrics

Always present for container and helm:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization percentage |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Tensor core active percentage - time over the past sample period during which tensor cores were active |
| DCGM_FI_PROF_DRAM_ACTIVE | DRAM active percentage - time over the past sample period during which device memory was being read or written |
| DCGM_FI_PROF_SM_ACTIVE | Streaming multiprocessor (SM) active percentage - time over the past sample period during which SMs were active |
| DCGM_FI_PROF_SM_OCCUPANCY | SM occupancy - average percentage of active warps per scheduler over the past sample period |
| DCGM_FI_PROF_PCIE_TX_BYTES | PCIe transmit bytes - number of bytes transmitted over PCIe from GPU to host during the past sample period |
| DCGM_FI_PROF_PCIE_RX_BYTES | PCIe receive bytes - number of bytes received over PCIe by GPU from host during the past sample period |
| DCGM_FI_PROF_NVLINK_TX_BYTES | NVLink transmit bytes - number of bytes transmitted over NVLink from GPU to peer during the past sample period |
| DCGM_FI_PROF_NVLINK_RX_BYTES | NVLink receive bytes - number of bytes received over NVLink by GPU from peer during the past sample period |
| DCGM_FI_DEV_POWER_USAGE | Power usage - current power consumption of the GPU in watts |
| DCGM_FI_DEV_VGPU_MEMORY_USAGE | vGPU memory usage - amount of framebuffer memory used by the virtual GPU instance |

For detailed information about all available DCGM field IDs and GPU metrics, see the NVIDIA DCGM API Field IDs documentation.

NVCF Worker Service Metrics

Streaming metrics are only present for streaming functions.

| Metric | Description |
| --- | --- |
| nvcf_worker_service_request_total | Total number of service requests processed |
| nvcf_worker_service_response_total | Total number of service responses processed including error code as a label |
| nvcf_worker_service_stream_latency_seconds_bucket | Histogram buckets for stream request latency in seconds |
| nvcf_worker_service_stream_latency_seconds_count | Total count of stream latency measurements |
| nvcf_worker_service_stream_latency_seconds_sum | Total sum of stream latency measurements in seconds |
| nvcf_worker_service_stream_session_duration_seconds_bucket | Histogram buckets for streaming session duration in seconds |
| nvcf_worker_service_stream_session_duration_seconds_count | Total count of streaming session duration measurements |
| nvcf_worker_service_stream_session_duration_seconds_sum | Total sum of streaming session duration measurements in seconds |
| nvcf_worker_service_stream_streaming_app_ready | Indicates whether the streaming application is ready (1) or not (0) |

All NVCF metrics include the label origin: nvcf-byoo.
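
The _sum and _count series above are cumulative, so an average over a time window comes from the increase in each series across that window. The sketch below shows the arithmetic for stream latency. The helper name avg_latency and the sample values are made up for illustration, and in practice the division would usually be done in your observability platform's query language.

```shell
# avg_latency SUM_START SUM_END COUNT_START COUNT_END
# Hypothetical helper: average latency in seconds over a window, computed as
# (increase in _sum) / (increase in _count). Prints "n/a" when no new
# measurements were recorded in the window.
avg_latency() {
  awk -v s0="$1" -v s1="$2" -v c0="$3" -v c1="$4" 'BEGIN {
    dc = c1 - c0
    if (dc <= 0) { print "n/a"; exit }
    printf "%.3f\n", (s1 - s0) / dc
  }'
}

# Made-up sample: _sum grew from 12.5 to 20.0 while _count grew from 100 to 150.
avg_latency 12.5 20.0 100 150   # prints 0.150
```

The same calculation applies to the session duration _sum and _count series.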

NVCA Instance Type Metrics

Present for cluster management:

| Metric | Description |
| --- | --- |
| nvca_instance_type_capacity | Count of instances that could be deployed on schedulable node resources by instance type |
| nvca_instance_type_allocatable | Count of instances that can be deployed on available schedulable node resources by instance type |
| nvca_instance_type_unschedulable | Count of instances that could be deployed on unschedulable node resources by instance type |

OpenTelemetry Collector Metrics

Always present for container and helm. The final list of metrics depends on the telemetry received and exported by the function or task:

| Metric | Description |
| --- | --- |
| otelcol_receiver_refused_metric_points_total | Total number of metric points refused by the receiver |
| otelcol_receiver_refused_log_records_total | Total number of log records refused by the receiver |
| otelcol_receiver_refused_spans_total | Total number of spans refused by the receiver |
| otelcol_receiver_accepted_metric_points_total | Total number of metric points accepted by the receiver |
| otelcol_receiver_accepted_log_records_total | Total number of log records accepted by the receiver |
| otelcol_receiver_accepted_spans_total | Total number of spans accepted by the receiver |
| otelcol_exporter_sent_metric_points_total | Total number of metric points sent by the exporter |
| otelcol_exporter_sent_spans_total | Total number of spans sent by the exporter |
| otelcol_exporter_sent_log_records_total | Total number of log records sent by the exporter |
| otelcol_exporter_send_failed_metric_points_total | Total number of metric points that failed to send |
| otelcol_exporter_send_failed_spans_total | Total number of spans that failed to send |
| otelcol_exporter_send_failed_log_records_total | Total number of log records that failed to send |
| otelcol_processor_outgoing_items_total | Total number of items processed and sent out |
| otelcol_processor_incoming_items_total | Total number of items received for processing |

Resource Attributes

All logs and metrics have the following attributes added to their metadata:

| Attribute | Description |
| --- | --- |
| function_id | Unique identifier for the function or task |
| function_version_id | Version identifier for the function or task |
| instance_id | Unique identifier for the function or task instance |
| nca_id | NVIDIA Cloud Account identifier |
| cloud_region | Cloud region where the function or task is deployed (non-GFN) |
| zone_name | Zone name where the function or task is deployed (GFN) |
| cloud_provider | Cloud provider where the function or task is deployed |

The platform metrics have the following attributes when available:

| Source | Attributes |
| --- | --- |
| cadvisor | container, cpu, device, image, job, service, interface, pod |
| kube state metrics | container, job, service, pod, reason, condition, configmap, created_by_kind, created_by_name, deployment, host_network, image, phase, qos_class, replicaset, resource, secret, statefulset, status and unit |
| DCGM | container, DCGM_FI_DRIVER_VERSION, device, job, service, modelName, pci_bus_id and pod |

  • The job attribute is available in Grafana Cloud
  • In Datadog, the service attribute is used instead of job

Appendix C: Adding Custom Application Metrics/Logs/Traces

You can export custom metrics/logs/traces to your external observability platform by sending them to the OpenTelemetry collector. Refer to the following table for the available environment variables that you can specify:

| Variable | Definition | Example |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_LOGS_ENDPOINT | OpenTelemetry Protocol (OTLP) endpoint for exporting log data | http://127.0.0.1:14358/v1/logs |
| OTEL_EXPORTER_OTLP_TRACES_ENDPOINT | OpenTelemetry Protocol (OTLP) endpoint for exporting trace data | http://127.0.0.1:14358/v1/traces |
| OTEL_EXPORTER_OTLP_METRICS_ENDPOINT | OpenTelemetry Protocol (OTLP) endpoint for exporting metrics data | http://127.0.0.1:14358/v1/metrics |
| OTEL_EXPORTER_OTLP_LOGS_PROTOCOL | Protocol used for exporting logs to OTLP endpoints | http |
| OTEL_EXPORTER_OTLP_TRACES_PROTOCOL | Protocol used for exporting traces to OTLP endpoints | http |
| OTEL_EXPORTER_OTLP_METRICS_PROTOCOL | Protocol used for exporting metrics to OTLP endpoints | http |
| OTEL_HEALTH_CHECK_ENDPOINT | Health check endpoint for OpenTelemetry collector | http://127.0.0.1:13133/health |
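
A minimal sketch of how a container entrypoint might apply these variables before starting an instrumented application. It uses the values from the Example column as defaults and leaves any value already present in the environment untouched. The helper name collector_is_healthy is hypothetical, and it assumes curl is available in the image. The protocol variables are left unset here.

```shell
# Defaults taken from the Example column above. The ":=" form only assigns
# when the variable is unset or empty.
: "${OTEL_EXPORTER_OTLP_LOGS_ENDPOINT:=http://127.0.0.1:14358/v1/logs}"
: "${OTEL_EXPORTER_OTLP_TRACES_ENDPOINT:=http://127.0.0.1:14358/v1/traces}"
: "${OTEL_EXPORTER_OTLP_METRICS_ENDPOINT:=http://127.0.0.1:14358/v1/metrics}"
: "${OTEL_HEALTH_CHECK_ENDPOINT:=http://127.0.0.1:13133/health}"
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT OTEL_HEALTH_CHECK_ENDPOINT

# collector_is_healthy
# Hypothetical helper: returns 0 if the health endpoint answers with an HTTP
# success status within 2 seconds, non-zero otherwise. Requires curl.
collector_is_healthy() {
  curl --silent --fail --max-time 2 --output /dev/null "$OTEL_HEALTH_CHECK_ENDPOINT"
}

echo "logs    -> $OTEL_EXPORTER_OTLP_LOGS_ENDPOINT"
echo "traces  -> $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
echo "metrics -> $OTEL_EXPORTER_OTLP_METRICS_ENDPOINT"
```

An entrypoint could call collector_is_healthy in a short loop before launching the application, so early telemetry is not dropped while the collector starts.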