Domain: https://api.nvcf.nvidia.com

NGC Domain (required for some APIs): https://api.ngc.nvidia.com

OpenAPI Specification

This page is a brief overview of using the NVCF API and does not cover all endpoints.

Please refer to the OpenAPI Spec for the latest API information. The quickest way to begin using NVCF APIs is via the Postman Collection.

The NVCF API is divided into the following sets of APIs:



Function Invocation

Execution of a function that runs on a worker node. Usually an inference call.

Asset Management

Used to manage large files: uploading inputs for a request and downloading the results of a function.

Cluster Groups & GPUs

Defines endpoints to list Cluster Groups and GPUs as targets for function deployment.

Queue Details

Used to view information about your environment such as queues & GPUs.

Function Management

The creation, modification, and deletion of functions.

Function Deployment

Endpoints for creating and managing function deployments.

API Versioning

All API endpoints include versioning in the path prefix.



The NVCF API supports NGC API key-based authorization for calling the API directly, or indirectly via the NGC CLI and NGC SDK.

The generated NGC Personal API Key will also be used for pushing and pulling containers, models, resources and helm charts to the NGC Private Registry to use during function creation.

Generate an NGC Personal API Key

The API Key can be generated via your account in the Personal Keys Page.


It’s recommended that the API Key that you generate includes both Cloud Functions and Private Registry scopes for seamless usage with the NGC CLI.


API Key scopes are static. This means if the key is lost, it must be destroyed and recreated.

For more information about NGC API Key management see the NGC API Key documentation.

API Key Usage

The API Key is passed in the Authorization header.

Authorization: Bearer $API_KEY

API Key Scopes and Domains


There are multiple API key types within NGC. We strongly recommend using the NGC Personal API Key for complete NVCF API compatibility.

Required domain names are documented below and are also pre-filled within our Postman Collection.

Our OpenAPI Spec also describes the scopes required for each endpoint.

Scope Name

Domain Name

API Category



Function Management



Function Management



Queue Details



Function Management



Cluster Groups and GPUs



Function Invocation and Asset Management



Function Deployment



Function Management

JWT Based Authorization

The NVCF API also supports JWT-based authorization for all endpoints. This is managed via the creation of a Service Account Client associated with your organization. A signed JWT issued by the NVCF Service Account authorizes against the Cloud Functions API. This token is obtained by posting the Client’s clientId and secret along with the required scopes to the Service Account.

When the token expires, the client application will need to generate a new token. Secrets also expire on a regular cadence and require rotation.


This type of authorization requires additional maintenance due to token generation and key rotation.

Speak with your Account Manager if you need access to a Service Account Client for JWT-based authorization.

JWT Token Generation

When generating a token, the client application can specify only the scopes that are needed to perform the desired operations, limiting the “blast radius” if the token is leaked. The token will expire by default within 15 minutes.

The token generation endpoint jwtTokenProvider, clientId, and secret will be shared with you during the Service Account Client setup.


It’s a best practice to only request the scopes your client needs each time when generating a token.

See API Key Scopes and Domains for all available scopes. The Authorization header value is the word Basic followed by the base64 encoding of a string in the format clientId:secret. Here is an example of generating a token:

curl --location 'https://{jwtTokenProvider}.ssa.nvidia.com/token' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--header 'Authorization: Basic <Base64 encoded "{clientId}:{secret}">' \
--data-urlencode 'scope=register_function' \
--data-urlencode 'grant_type=client_credentials'

{
    "access_token": "<generated $JWT_Token>",
    "token_type": "bearer",
    "expires_in": 3600,
    "scope": "register_function"
}

JWT Token Usage

The JWT token is passed in the Authorization header.

Authorization: Bearer $JWT_Token
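As a rough Python counterpart to the curl token request above, the sketch below builds only the headers and the form-encoded body and sends nothing. The function names are illustrative, and joining multiple scopes with spaces is an assumption based on common OAuth 2.0 practice, not something this page states.

```python
import base64
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def basic_auth_header(client_id: str, secret: str) -> str:
    """Return 'Basic ' followed by base64('clientId:secret')."""
    raw = f"{client_id}:{secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def token_request(client_id: str, secret: str, scopes: List[str]) -> Tuple[Dict[str, str], str]:
    """Build headers and form body for the token request (nothing is sent)."""
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Authorization": basic_auth_header(client_id, secret),
    }
    # Assumption: multiple scopes are joined with spaces (common OAuth 2.0 convention).
    body = urlencode({"scope": " ".join(scopes), "grant_type": "client_credentials"})
    return headers, body


# Placeholder credentials; the real values come from the Service Account Client setup.
headers, body = token_request("my-client-id", "my-secret", ["register_function"])
print(body)
```

The access_token field of the response would then be used as the Bearer value in the Authorization header, as described above.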

Using the NVCF Invocation API

Invocation refers to the execution of an inference call to a function deployed into a cluster.


  • The body of your request must be valid JSON and at most 5MB in size.

  • If calling the invocation API with no function version ID specified, and multiple function version IDs are deployed, then the inference call may go to instances hosting either function version.

  • Cloud Functions use HTTP/2 persistent connections. For best performance, it is expected that clients will not close connections until it is determined that no further communication with a server is necessary.

Cloud Functions invocation supports the following use cases:

  • HTTP Streaming: Uses HTTP/2 persistent connections for continuous data transmission, maintaining open connections for optimal performance until no further communication with the server is necessary.

  • HTTP (Polling): NVCF responds with either an HTTP Status 200 for a completed result, or an HTTP Status 202 for a response that will require polling for the result on the client.

  • gRPC: Allows users to invoke functions with authentication and function ID in the gRPC metadata, utilizing generic data based on Protobuf messages.


HTTP Invocation Example using Function ID

curl --request POST \
    --url https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/eb1100de-60bf-4e9a-8617-b7d4652e0c37 \
    --header "Authorization: Bearer $API_KEY" \
    --header 'Accept: application/json' \
    --header 'Content-Type: application/json' \
    --data '{
        "messages": [
            {
                "role": "user",
                "content": "Hello"
            }
        ],
        "temperature": 0.2,
        "top_p": 0.7,
        "max_tokens": 512
    }'
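For clients written in Python, the same request could be assembled with the standard library roughly as follows. This is a sketch that builds the request without sending it; the helper name is illustrative, the API key is a placeholder, and reading the 5MB limit as decimal megabytes is an assumption.

```python
import json
import urllib.request

NVCF_BASE_URL = "https://api.nvcf.nvidia.com"
# Assumption: "5MB" is read as decimal megabytes; adjust if the service counts differently.
MAX_BODY_BYTES = 5 * 1000 * 1000


def build_invocation_request(function_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the pexec invocation endpoint."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError("request body exceeds the 5MB limit")
    return urllib.request.Request(
        url=f"{NVCF_BASE_URL}/v2/nvcf/pexec/functions/{function_id}",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_invocation_request(
    "eb1100de-60bf-4e9a-8617-b7d4652e0c37",
    "example-api-key",  # placeholder, not a real key
    {"messages": [{"role": "user", "content": "Hello"}],
     "temperature": 0.2, "top_p": 0.7, "max_tokens": 512},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it.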

HTTP (Polling)

NVCF employs long polling for function invocation and result retrieval. However, the invocation API can be used as a synchronous, blocking API up to the max timeout of 20 minutes.

The polling response timeout is set to 1 minute by default and is configurable up to 20 minutes by setting the HTTP header NVCF-POLL-SECONDS on the request. Refer to the API documentation for details.

When you make a function invocation request, NVCF will hold your request open for the polling response period before returning with either:

  • HTTP Status 200 completed result

  • HTTP Status 202 polling response

    • On receipt of a polling response your client should immediately poll NVCF to retrieve your result.


Below is an invocation of the “echo” function built from any of the “echo” containers from the examples repository.

curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/{functionId}' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $API_KEY" \
--data '{
    "inputs": [
        {
            "name": "message",
            "shape": [
                1
            ],
            "datatype": "BYTES",
            "data": [
                "Hello"
            ]
        },
        {
            "name": "response_delay_in_seconds",
            "shape": [
                1
            ],
            "datatype": "FP32",
            "data": [
                0.1
            ]
        }
    ],
    "outputs": [
        {
            "name": "echo",
            "datatype": "BYTES",
            "shape": [
                1
            ]
        }
    ]
}'


In the event that NVCF responds erroneously, your client should (but is not required to) protect itself by ensuring it does not make more than one polling request per second.

This can be achieved by keeping a start_time when you make a polling request and sleeping for up to 1 second from that start_time before making another request.

This does not mean your client should always be adding a sleep.

  • For example, if 0.1 seconds have passed since making your last polling request your client should sleep for 0.9 seconds.

  • If 5 seconds have passed since making your last polling request your client should not sleep.
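The rule above can be sketched as a small helper that returns how long to sleep before the next polling request. The function name and the use of a monotonic clock in the usage note are illustrative choices, not part of the NVCF API.

```python
def seconds_to_sleep(start_time: float, now: float, min_interval: float = 1.0) -> float:
    """Return how long to wait so polling requests start at least min_interval apart."""
    elapsed = now - start_time
    # If more than min_interval has already passed, do not sleep at all.
    return max(0.0, min_interval - elapsed)


print(round(seconds_to_sleep(100.0, 100.1), 3))  # 0.1 s elapsed
print(seconds_to_sleep(100.0, 105.0))            # 5 s elapsed
```

In a polling loop, start_time would be recorded with time.monotonic() just before each request, and time.sleep(seconds_to_sleep(start_time, time.monotonic())) called before the next one.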

Polling After Initial Invocation

When making a function invocation request using the pexec endpoint, and an HTTP Status 202 is returned, the following headers will be included in the response:

  • NVCF-REQID: Invocation request ID, referred to as requestId

  • NVCF-STATUS: Invocation status

  • NVCF-PERCENT-COMPLETE: Percentage complete

The client is then expected to poll for a response using the requestId.


curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{requestId}' \
--header "Authorization: Bearer $API_KEY"

Endpoint: GET /v2/nvcf/pexec/status/{requestId}

Headers

  • NVCF-POLL-SECONDS (optional): HTTP polling response timeout in seconds, if other than the default

Parameters

  • requestId (path, required): Function invocation request ID, string($uuid)

Responses

  • 200: Invocation is fulfilled. The response body will be a passthrough of the response returned from your container.

  • 202: Result is pending. The client should continue to poll using the returned request ID.

  • 302: The result is in a different region or is a large response. The client should use the fully-qualified endpoint specified in the Location response header to fetch the result. The client can use the same API Key in the Authorization header when retrieving the result from the redirected region.
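A client might dispatch on these status codes along the following lines. This is a sketch only: the function name and the returned action labels are made up for illustration, and the caller is assumed to supply the status code and response headers from whatever HTTP library it uses.

```python
from typing import Dict, Optional, Tuple

STATUS_URL = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{request_id}"


def next_step(status_code: int, headers: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Map a pexec response to (action, url); action labels are illustrative."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status_code == 200:
        return ("done", None)  # the response body holds the result
    if status_code == 202:
        # Keep polling the status endpoint with the returned request ID.
        return ("poll", STATUS_URL.format(request_id=lowered["nvcf-reqid"]))
    if status_code == 302:
        # Fetch the result from the fully-qualified URL in the Location header.
        return ("fetch", lowered["location"])
    return ("error", None)


print(next_step(202, {"NVCF-REQID": "1234", "NVCF-STATUS": "pending-evaluation"}))
```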

Large Responses (302 Status Code)

The result payload size may not exceed 5GB. If your payload exceeds 5MB, i.e. 5MB < result size < 5GB, you will receive a reference in the response to download the payload.

When using the pexec invocation API, either during the initial invocation API call or when polling (GET /v2/nvcf/pexec/status/{requestId}), this will be indicated by a response with the HTTP 302 status code. The Location response header will contain the fully-qualified endpoint, and there will be no response body.

Your client should be configured to make a new HTTP request to the URL given in the Location response header. The new HTTP request must include an Authorization request header.

The result retrieval URL’s Time-To-Live (TTL) is 24 hours. To read more about assets, refer to the Assets API and the asset flow example.

HTTP Streaming

This feature allows clients to receive data as an event stream, eliminating the need for polling, or making repeated requests to check for new data. The server sends events to the client over a long-lived connection, allowing the client to receive updates in real-time. Note that HTTP streaming requests use the same invocation API endpoint.


Prerequisites

  • A Cloud Function deployed on NVCF

  • Familiarity with the basic pexec invocation API usage documented under HTTP (Polling) above

  • Understanding of Server-Sent Events (SSE)

Client Configuration

  1. The client initiates a connection by making a POST request to the NVCF pexec invocation API endpoint, including the header Accept: text/event-stream.

     curl --request POST \
         --url https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/eb1100de-60bf-4e9a-8617-b7d4652e0c37 \
         --header "Authorization: Bearer $API_KEY" \
         --header 'Accept: text/event-stream' \
         --header 'Content-Type: application/json' \
         --data '{
             "messages": [
                 {
                     "role": "user",
                     "content": "Hello"
                 }
             ],
             "temperature": 0.2,
             "top_p": 0.7,
             "max_tokens": 512
         }'
  2. Upon receiving this request with the appropriate header, NVCF knows the client is prepared to receive streamed data.

Handling Server Responses

  1. If the response from the inference container includes the header Content-Type: text/event-stream, the client keeps the connection open to the API and listens for data.


    The NVCF worker will read events from the inference container for up to a default of 20 minutes or until the inference container closes the connection, whichever is earlier. Do not create an infinite event stream. Even if the client disconnects, the worker will still read events, which can tie up your function until the worker stops reading as described above, or the request times out.

  2. Data read from the inference container’s response is buffered one event at a time, and each event is sent as an in-progress response to the NVCF API. Smaller events are more interactive, but they shouldn’t be too small: if they are, the size of the event stream wrapper can exceed the actual data, increasing the amount of data transmitted to your clients. The maximum event size allowed is 4MB.

  3. This process continues until the stream is complete.


See the streaming container and client in our example containers repository.
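On the client side, reading the event stream amounts to splitting the response into Server-Sent Events. The sketch below is a minimal parser for the data field only, following the general SSE framing rules; the function name and the JSON payloads in the example are hypothetical, and what each event contains depends on your inference container.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event from decoded text lines."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comment lines (":...") are ignored here.
    # An event not terminated by a blank line is incomplete and is dropped.


# Hypothetical stream contents, for illustration only.
stream = ['data: {"token": "Hel"}\n', "\n", ": keep-alive\n", 'data: {"token": "lo"}\n', "\n"]
for payload in iter_sse_data(stream):
    print(payload)
```

Any iterable of decoded lines works as input, so the same function could be fed from a streaming HTTP response read line by line.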

Shutdown Behavior

  • During a graceful shutdown, the NVCF API waits for all ongoing event stream requests to complete.

  • There’s a 5-minute global request timeout for these event stream requests.

Advantages of HTTP Streaming

  • Reduces latency: Clients receive data as it becomes available.

  • Lowers overhead: Eliminates the need for repeated polling requests.

  • Flexibility: The inference container controls if the response will be streamed or not, allowing the client-side implementation to remain consistent regardless of server-side changes.


This feature introduces the possibility of long-lived “blocking” client requests, which the system must manage efficiently, especially during a shutdown sequence.


gRPC

Users can invoke functions by including their authentication information and a specific function ID in the gRPC metadata.

  • The data being transmitted is generic and based on Protobuf messages.

  • Each model or container will have its own unique API, defined by the Protobuf messages it implements.

  • gRPC connections will be kept alive for 30 seconds if idle. This is not configurable.

  • gRPC functions have no input request size limit.

Proxy Host & Endpoint

  • The gRPC proxy host is grpc.nvcf.nvidia.com:443.

  • Use this host when calling your gRPC endpoint.

  • The Cloud Functions gRPC proxy will attempt to open a connection to your function instance for 30 seconds before timing out.

API Key & Metadata Keys

  • Set your API Key as Call Credentials. Either use gRPC’s support for Call Credentials (sometimes called Per RPC Credentials) to pass the API Key, or manually set the authorization metadata to Bearer $API_KEY.

  • Set the function-id metadata key.

  • Optionally, set the function-version-id metadata key.

  • When the client is finished making gRPC calls, close the gRPC client connection so that you do not tie up your function’s workers longer than needed.
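The metadata keys listed above can be assembled with a small helper like the one below. This is a sketch using only the standard library; the helper name is illustrative, and the resulting list is what a caller would pass as the metadata argument of a gRPC stub call.

```python
from typing import List, Optional, Tuple


def build_grpc_metadata(
    function_id: str,
    api_key: str,
    function_version_id: Optional[str] = None,
) -> List[Tuple[str, str]]:
    """Return the metadata pairs for an NVCF gRPC call."""
    metadata = [
        ("function-id", function_id),            # required
        ("authorization", f"Bearer {api_key}"),  # required
    ]
    if function_version_id is not None:
        # Optional: pin the call to one function version.
        metadata.append(("function-version-id", function_version_id))
    return metadata


# Placeholder values, for illustration only.
print(build_grpc_metadata("my-function-id", "example-api-key"))
```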


See a complete gRPC server and client example in our example containers repository.

import logging
import os

import grpc
import grpc_service_pb2_grpc  # client stubs generated from your .proto file

ITERATIONS = 100  # number of unary calls to make
# MODEL_INFER_REQUEST is assumed to be defined elsewhere as a request
# message built from your generated proto classes.


def call_grpc(
        create_grpc_function: "CreateFunctionResponse",  # function def info
) -> None:
    function_id = create_grpc_function.function.id
    function_version_id = create_grpc_function.function.version_id

    api_key = os.environ["API_KEY"]  # read the API Key from the environment
    metadata = [("function-id", function_id),  # required
                ("function-version-id", function_version_id),  # optional
                ("authorization", "Bearer " + api_key)]  # required

    # The with block closes the channel when done, so the function's
    # workers are not tied up longer than needed.
    with grpc.secure_channel("grpc.nvcf.nvidia.com:443",
                             grpc.ssl_channel_credentials()) as channel:
        # proto generated grpc client
        grpc_client = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)
        # make ITERATIONS unary inferences in a row
        for _ in range(ITERATIONS):
            # this would be your client, request, and body.
            # it does not have any proto def restriction.
            grpc_client.ModelInfer(MODEL_INFER_REQUEST, metadata=metadata)
    logging.info("finished invoking %d times", ITERATIONS)


The official term for authorization handling using gRPC is “Call Credentials”. More details can be found at grpc.io documentation on credential types. The Python example provided does not showcase this. Instead, it demonstrates manually setting the “authorization” with an API Key. Using call credentials would implicitly handle this.

Statuses and Errors

Below is the list of statuses and error codes the API can produce.

Function Invocation Response Status

If the client receives an HTTP Status code 202 from a pexec invocation API call, the client is expected to poll by issuing a GET request using the NVCF-REQID value returned in the response header.

The NVCF-STATUS header can have the following values:

  • pending-evaluation - The worker has not yet accepted the request.

  • fulfilled - The process has been completed with results.

  • rejected - The request was rejected by the service.

  • errored - An error occurred during worker processing.

  • in-progress - A worker is processing the request.

Statuses fulfilled, rejected and errored are completed states, and you should not continue to poll.
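A polling client can encode this rule as a simple lookup. The sketch below uses the status values listed above; the function name and the choice to raise on an unknown value are illustrative.

```python
# Status values taken from the list above.
COMPLETED_STATUSES = frozenset({"fulfilled", "rejected", "errored"})
PENDING_STATUSES = frozenset({"pending-evaluation", "in-progress"})


def should_keep_polling(nvcf_status: str) -> bool:
    """Return True while the NVCF-STATUS header indicates an unfinished request."""
    if nvcf_status in COMPLETED_STATUSES:
        return False
    if nvcf_status in PENDING_STATUSES:
        return True
    # Unknown values are surfaced rather than polled on indefinitely.
    raise ValueError(f"unexpected NVCF-STATUS value: {nvcf_status!r}")


print(should_keep_polling("in-progress"))
print(should_keep_polling("fulfilled"))
```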

Inference Container Status Codes and Responses

Error messages generated in your inference endpoint response are propagated from your inference container. Here is an example:

{
  "type": "urn:inference-service:problem-details:bad-request",
  "title": "Bad Request",
  "status": 400,
  "detail": "invalid datatype for input message",
  "instance": "/v2/nvcf/pexec/functions/{functionId}",
  "requestId": "{requestId}"
}

Error responses are formed as follows:

  • The type field in an error response will always include inference-service if the error is originating from your inference container.

  • The response status code will be set to the status code your inference container returns. This is what is returned in the status and title fields as well.

  • The instance and requestId fields are autofilled by the worker.

  • The detail field includes the error message body that your inference container returns.

Setting the Error Detail Field

Your inference container’s error response must be JSON and must set the error field:

{
  "error": "put your error here"
}

If this field is not set, the detail field in any error responses will be set to a generic Inference error string.


It’s highly encouraged to emit logs from your inference container. See Logging and Metrics for setting and viewing logs within the Cloud Functions UI.

NVCF API Status Codes

Please refer to the OpenAPI Docs for other possible status code failure reasons in cases where they are not generated from your inference container.

For Function States, see Function Lifecycle.


To easily differentiate between errors originating from NVCF’s API or control plane and errors from your own inference container, check whether the type field includes inference-service (indicating the error is from your inference container).
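In client code, that check could look like the following sketch. The function name and the returned labels are illustrative; the sample body mirrors the error example shown earlier in this section.

```python
import json


def error_origin(problem: dict) -> str:
    """Classify an error body by whether its type field mentions inference-service."""
    if "inference-service" in problem.get("type", ""):
        return "inference-container"
    return "nvcf"


body = json.loads("""
{
  "type": "urn:inference-service:problem-details:bad-request",
  "title": "Bad Request",
  "status": 400,
  "detail": "invalid datatype for input message"
}
""")
print(error_origin(body))
```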

Common Function Invocation Errors

Failure Type


Invocation response returning 4xx or 5xx status code

Check the “type” of the error message response. If the type includes inference-service, the error is coming from your inference container. Check the OpenAPI Specification for other possible status code failure reasons in cases where they are not generated from your inference container.

Invocation request taking a long time to get a result

Check the capacity of your function using the Function Metrics UI or API to see if your function is queuing. Consider instrumenting your container with additional metrics for your chosen monitoring solution for further debugging; NVCF containers allow public egress. To rule out errors in your client’s polling logic, set the NVCF-POLL-SECONDS header to its maximum so that NVCF waits up to 20 minutes for a synchronous response.

Invocation response returning 401 or 403

This indicates that the caller is unauthorized. Ensure the Authorization header is set correctly with a valid API Key.

Container OOM

This is difficult to detect without instrumenting your container with additional metrics unless your container is emitting logs that indicate out of memory. We recommend profiling the memory usage locally. For testing locally and in the function, you can look at a profile of the memory allocation using this guide.