
Copyright © 2026, NVIDIA Corporation.

NVCF Architecture


NVCF is built around three independently scalable planes connected through NATS JetStream as the shared messaging layer. A single control plane manages multiple GPU worker clusters, each registered via the NVIDIA Cluster Agent (NVCA).

Three-Plane Overview

The diagram below shows the primary components and how requests flow across planes. The control plane contains additional services beyond those shown. See the manifest page for the full component list.

[Figure: NVCF Three-Plane Architecture]

Component Responsibilities

For a full breakdown of component responsibilities by plane, see the manifest page.

Single Request Lifecycle

A synchronous HTTP function invocation flows through all three planes.

Scale-to-Zero

NVCF uses NATS JetStream as a durable request buffer, enabling true scale-to-zero without dropping requests:

  1. Autoscaler detects zero utilization and drives desired instance count to 0.
  2. No function pods are running.
  3. A new request arrives and is published to the NATS JetStream stream. The stream persists it durably.
  4. Autoscaler detects queue depth > 0 and sets desired instances to 1 or more.
  5. NVCA receives a creation message and launches the pod.
  6. Pod connects via WorkerService gRPC and pulls the buffered message.
  7. Response is returned to the caller through the still-open Invocation Service connection.
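
The sequence above can be sketched as a toy simulation. This is not NVCF code: the `ScaleToZeroSim` class and its method names are hypothetical, a `deque` stands in for the JetStream stream, and the autoscaling rule is reduced to "one instance while the queue is non-empty", which ignores in-flight work and real utilization metrics.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ScaleToZeroSim:
    """Toy model of the scale-to-zero loop (hypothetical, for illustration)."""
    queue: deque = field(default_factory=deque)    # stands in for the durable stream
    desired: int = 0                               # autoscaler's desired instance count
    running: int = 0                               # pods actually running
    responses: list = field(default_factory=list)  # responses returned to callers

    def publish(self, request: str) -> None:
        # Step 3: the request is buffered even though no pod is running.
        self.queue.append(request)

    def autoscale(self) -> None:
        # Steps 1 and 4 (simplified): 0 when idle, 1 when queue depth > 0.
        self.desired = 1 if self.queue else 0

    def reconcile(self) -> None:
        # Step 5: the cluster agent brings running pods in line with desired.
        self.running = self.desired

    def work(self) -> None:
        # Steps 6 and 7: a running pod pulls buffered messages and responds.
        while self.running > 0 and self.queue:
            self.responses.append(f"handled:{self.queue.popleft()}")


sim = ScaleToZeroSim()
sim.autoscale()
sim.reconcile()            # idle: desired == running == 0
sim.publish("req-1")       # arrives while nothing is running
sim.work()                 # no pod yet, so the request stays buffered
buffered = len(sim.queue)
sim.autoscale()
sim.reconcile()
sim.work()                 # pod is up and drains the buffer
print(buffered, sim.running, sim.responses)
```

The point of the sketch is that the request published at zero instances is still present after the first `work()` call and is only consumed once an instance exists, which is the property the durable buffer provides.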

Multi-Cluster Routing

Each GPU cluster runs its own NVCA instance. NATS JetStream subjects are scoped per cluster:

Creation stream: Create.NVCA.*.{clusterID}.*.*
Termination: Terminate.NVCA.{clusterID}
Consumer name: {streamName}-{clusterID} (durable, per cluster)

Creation messages are only delivered to the consumer of the addressed cluster. The invocation plane selects the target cluster based on the function deployment specification (GPU type, region, cluster group).
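
To illustrate why a creation message reaches only the addressed cluster, the sketch below builds the per-cluster names listed above and applies standard NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). The helper names, the example cluster IDs, the stream name `nvca-create`, and the tokens `a`, `b`, `c` in the wildcard positions are all hypothetical; the source does not say what those positions carry. Treating the creation pattern as a per-cluster subject filter is also an assumption.

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject match: '*' is one token, '>' is the remaining tokens."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one token to match.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if p != "*" and p != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def creation_filter(cluster_id: str) -> str:
    # Pattern taken from the list above; cluster_id fills the {clusterID} slot.
    return f"Create.NVCA.*.{cluster_id}.*.*"


def termination_subject(cluster_id: str) -> str:
    return f"Terminate.NVCA.{cluster_id}"


def consumer_name(stream_name: str, cluster_id: str) -> str:
    return f"{stream_name}-{cluster_id}"


# Hypothetical message addressed to cluster-east; a, b, c are placeholder tokens.
msg_subject = "Create.NVCA.a.cluster-east.b.c"
east_gets_it = subject_matches(creation_filter("cluster-east"), msg_subject)
west_gets_it = subject_matches(creation_filter("cluster-west"), msg_subject)
print(east_gets_it, west_gets_it)
print(termination_subject("cluster-east"))
print(consumer_name("nvca-create", "cluster-east"))
```

Because the cluster ID is a literal token in the filter, a consumer for `cluster-west` never matches a subject addressed to `cluster-east`, regardless of what the wildcard positions contain.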

Function Workload Types

NVCF supports four workload types on the compute plane:

| Type | Packaging | Invocation | Use case |
| --- | --- | --- | --- |
| Container function | Docker image | HTTP / gRPC / streaming | Long-running inference service |
| Helm function | Helm chart | HTTP | Multi-container or operator-based workload |
| Container task | Docker image | Async (run-to-completion) | Batch inference, fine-tuning |
| Helm task | Helm chart | Async (run-to-completion) | Distributed batch jobs |
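
The four types differ along two axes: how the workload is packaged and whether it serves requests or runs to completion. The sketch below encodes the table as data and looks a type up by those two properties. The `WorkloadType` class and `pick_workload_type` function are hypothetical names used for illustration only and do not correspond to an NVCF API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkloadType:
    name: str
    packaging: str
    invocation: tuple          # invocation modes listed in the table
    run_to_completion: bool    # True for tasks, False for functions


# One entry per row of the workload types table.
WORKLOAD_TYPES = (
    WorkloadType("Container function", "Docker image", ("HTTP", "gRPC", "streaming"), False),
    WorkloadType("Helm function", "Helm chart", ("HTTP",), False),
    WorkloadType("Container task", "Docker image", ("async",), True),
    WorkloadType("Helm task", "Helm chart", ("async",), True),
)


def pick_workload_type(packaging: str, run_to_completion: bool) -> WorkloadType:
    """Return the table row matching the packaging and execution style."""
    for wt in WORKLOAD_TYPES:
        if wt.packaging == packaging and wt.run_to_completion == run_to_completion:
            return wt
    raise ValueError(f"no workload type for {packaging!r}, run_to_completion={run_to_completion}")


chosen = pick_workload_type("Helm chart", run_to_completion=True)
print(chosen.name)
```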

Key Custom Resource Definitions

| CRD | API Group | Purpose |
| --- | --- | --- |
| NVCFBackend | nvcf/v1 | Cluster-level spec (NVCA image, GPU discovery, feature flags) |
| ICMSRequest | nvca/v2beta1 | Per-request state machine (Pending -> Completed/Failed) |
| MiniService | nvca/v1alpha1 | Per-Helm-function lifecycle (Installing -> Running -> Failed) |

Per-Component Documentation

Each component ships an AGENTS.md with detailed internals, data flows, and API contracts:

  • src/compute-plane-services/nvca/AGENTS.md
  • src/compute-plane-services/ess-agent/AGENTS.md
  • src/compute-plane-services/byoo-otel-collector/AGENTS.md
  • src/invocation-plane-services/http-invocation/AGENTS.md
  • src/invocation-plane-services/grpc-proxy/AGENTS.md
  • src/invocation-plane-services/llm-api-gateway/AGENTS.md
  • src/invocation-plane-services/ratelimiter/AGENTS.md
  • src/control-plane-services/nats-auth-callout/AGENTS.md
  • src/control-plane-services/function-autoscaler/AGENTS.md
  • examples/AGENTS.md