Domain: https://api.nvcf.nvidia.com

The API is divided into the following sets of APIs:



  • Function Lifecycle: The creation, modification and deployment of a function.
  • Function Invocation: Execution of a function that runs on a Worker node, usually an inference call.
  • Asset Management: Used to manage large files, both uploading inputs for a request and downloading the results of a function.
  • Visibility: Used to view information about your environment, such as queues and GPUs.
  • Account Management: Used to associate NVIDIA Cloud Accounts with Cloud Functions.

Please access the Open API Docs and a Postman Collection.

Always refer to the Open API Docs for the latest API information.

API Versioning

All API endpoints include versioning in the path prefix.



API Endpoints

The below provides a brief overview only. Refer to the Open API docs for the latest API and detailed information.

The API supports two types of authorization, API Key and Service Account, with some key differences.

API Key has limited capabilities compared to Service Account.

Available Scopes for API keys and Service Accounts

Auth Supported             Scope               Description
Service Account, API Key   invoke_function     Invoke Functions
Service Account, API Key   list_functions      List Functions
Service Account, API Key   queue_details       Check Queue Depth
Service Account            delete_function     Delete a Function
Service Account            update_function     Modify a Function
Service Account            register_function   Create a Function
Service Account            authorize_clients   Authorize other parties to invoke your function


API Key

The API Key flow is the simpler of the two authorization flows, but it has static scopes assigned to it. If the key is lost, it must be destroyed and re-created.

The API Key is passed in the Authorization header as a base64 encoded string bearer token.

The API Key can be generated via your account at https://ngc.nvidia.com/.


`Authorization: Bearer <API Key>`

Service Account

A signed JWT issued by the Service Account is used to authorize against the Cloud Functions API. This token is obtained by posting the “Client Id” and “Secret”, along with the requested scopes, to the Service Account token endpoint.

Speak with your Account Manager if you need access to Service Accounts for authorization.



curl --include \
  --user <client_id>:<client_secret> \
  --request POST https://bvuehlajqrvfbmcxe2idnqs61ld1lk.ssa.nvidia.com/token \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  --data-urlencode 'grant_type=client_credentials' \
  --data-urlencode 'scope=<space-separated-list-of-scopes>'

Once you have the token, you can pass it in the ‘Authorization’ header on subsequent requests to Cloud Functions.

The JWT has a limited time to live (TTL). Check when it expires, and request a new token when the expiry time is close.

When using a JWT, you can limit the scopes to what is only needed, thus reducing the blast radius if your token is leaked.
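The refresh rule above can be sketched as a small cache that asks for a new token shortly before the current one expires. This is a minimal sketch under stated assumptions, not an NVCF client: `TokenCache`, `fetch_token`, and the 60-second margin are hypothetical, and the `expires_in` field is assumed to be present in the token response, as is common for OAuth2 client-credentials flows. `fetch_token` stands in for the POST to the token endpoint shown earlier.

```python
import time
from typing import Callable, Dict, Optional


class TokenCache:
    """Caches a bearer token and refreshes it shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Dict[str, object]],  # hypothetical: performs the token POST
        refresh_margin_s: float = 60.0,                # assumption: refresh 60 s before expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_token = fetch_token
        self._refresh_margin_s = refresh_margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a token, requesting a new one if the cached one is close to expiring."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._refresh_margin_s:
            payload = self._fetch_token()
            self._token = str(payload["access_token"])
            # Assumption: the response carries its lifetime in seconds as "expires_in".
            self._expires_at = now + float(payload["expires_in"])
        return self._token
```

Requesting only the scopes a client needs fits naturally here: the `fetch_token` callable can be built with a narrow scope list, so a leaked token exposes less.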

Invocation refers to the execution of the inference call in Cloud Functions.

Please note that the body of your request must be valid JSON and at most 200KB in size. If you call the invocation API without specifying a function version ID, and multiple function versions are deployed, the inference call may go to instances hosting any of those versions.

Cloud Functions Invocation has the following features:

  • HTTP Streaming: Uses HTTP/2 persistent connections for continuous data transmission, maintaining open connections for optimal performance until no further communication with the server is necessary.

  • HTTP Polling: NVCF responds with either an HTTP Status 200 for a completed result, or an HTTP Status 202 for a response that will require polling for the result on the client.

  • gRPC: Allows users to invoke functions with authentication and functionID in the gRPC metadata, utilizing generic data based on Protobuf messages.


Invocation Example with Function ID


curl --request POST \
  --url https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/eb1100de-60bf-4e9a-8617-b7d4652e0c37 \
  --header "Authorization: Bearer $API_KEY" \
  --header 'accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [ { "role": "user", "content": "Hello" } ],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 512
  }'
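The same request can be assembled in Python with only the standard library. The sketch below builds the request without sending it and checks the body size first. `build_invocation_request` is a hypothetical helper, and reading "200KB" as 200 * 1024 bytes is an assumption.

```python
import json
import urllib.request
from typing import Any, Optional

NVCF_BASE_URL = "https://api.nvcf.nvidia.com"
MAX_BODY_BYTES = 200 * 1024  # assumption: "200KB" interpreted as 200 * 1024 bytes


def build_invocation_request(
    function_id: str,
    token: str,
    payload: Any,
    poll_seconds: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the pexec invocation endpoint."""
    # json.dumps raises TypeError if the payload is not JSON-serializable.
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"request body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    if poll_seconds is not None:
        # Optional: shorten the polling response timeout.
        headers["NVCF-POLL-SECONDS"] = str(poll_seconds)
    url = f"{NVCF_BASE_URL}/v2/nvcf/pexec/functions/{function_id}"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The returned object can be passed to `urllib.request.urlopen` to perform the call.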

HTTP Polling

NVCF employs long polling <https://www.pubnub.com/blog/http-long-polling/> for Function invocation and result retrieval.

When you make a Function invocation request, NVCF will hold your request open for a period of time before returning with either:

  • HTTP Status 200 completed result

  • HTTP Status 202 polling response

    • On receipt of a polling response your client should immediately poll NVCF to retrieve your result.

On each polling request, NVCF again holds the request open while it waits for the invocation result, and returns either an HTTP Status 200 or an HTTP Status 202.

By default, the HTTP polling response timeout is 5 minutes. It can be shortened by setting the HTTP header NVCF-POLL-SECONDS on the request.

In case NVCF responds erroneously, your client should (but is not required to) protect itself by making no more than one polling request per second.

This can be achieved by keeping a start_time when you make a polling request and sleeping for up to 1 second from that start_time before making another request.

This does not mean your client should always be adding a sleep.

  • For example if 0.1 seconds have passed since making your last polling request your client should sleep for 0.9 seconds.

  • If 5 seconds have passed since making your last polling request your client should not sleep.
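The two cases above reduce to one small calculation. The sketch below is illustrative only; `seconds_to_sleep` is a hypothetical helper name, and the 1-second interval comes from the guidance above.

```python
def seconds_to_sleep(start_time: float, now: float, min_interval: float = 1.0) -> float:
    """Return how long to wait so polls start at least `min_interval` seconds apart.

    `start_time` is when the previous polling request was made and `now` is the
    current time, both in seconds on the same clock (for example time.monotonic()).
    """
    elapsed = now - start_time
    return max(0.0, min_interval - elapsed)
```

A client would record `start_time = time.monotonic()` before each polling request and call `time.sleep(seconds_to_sleep(start_time, time.monotonic()))` before the next one. Because a long poll usually takes more than a second, the sleep is normally zero.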

Polling with PEXEC Endpoint

When calling the pexec endpoint, properties such as reqId, status and percentComplete are presented as HTTP response headers.

Endpoint: GET /v2/nvcf/pexec/status/{requestId}

Request headers:

  • NVCF-POLL-SECONDS (optional): HTTP polling response timeout in seconds, if other than the default

Path parameters:

  • requestId (path, required): Function invocation request id, string($uuid)

Responses:

  • 200: Invocation is fulfilled. Response headers include NVCF-REQID (Invocation Request Id), NVCF-PERCENT-COMPLETE (Percentage complete), NVCF-STATUS (Invocation status).

  • 202: Result is pending. Client should poll using the requestId. Response headers include NVCF-REQID (Invocation Request Id), NVCF-PERCENT-COMPLETE (Percentage complete), NVCF-STATUS (Invocation status).

  • 302: The result is in a different region or is a large response. The client should use the fully-qualified endpoint specified in the ‘Location’ response header to fetch the result. The client can use the same Bearer token in the ‘Authorization’ header when retrieving the result from the redirected region.


{ "reqId": "ef4c1967-d543-467c-af7e-8c7899d75be8", "status": "pending-evaluation" }


curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{requestId}' \
  --header 'Authorization: Bearer <JWT>'
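The three documented responses suggest a simple client loop: keep polling on 202, stop on 200, and hand back the Location on 302. The following is a sketch under assumptions, not a reference client. `poll_for_result` and the `send` callable are hypothetical; `send` is expected to perform the GET against the status endpoint and return a `(status_code, headers, body)` tuple whose header names are spelled as shown above.

```python
from typing import Callable, Dict, Tuple

# (HTTP status code, response headers, response body)
Response = Tuple[int, Dict[str, str], bytes]


def poll_for_result(
    request_id: str,
    send: Callable[[str], Response],  # hypothetical transport: GET .../pexec/status/{requestId}
    max_polls: int = 100,             # assumption: an arbitrary safety limit
) -> Tuple[str, object]:
    """Poll until the invocation is fulfilled or the result is redirected."""
    for _ in range(max_polls):
        status, headers, body = send(request_id)
        if status == 200:
            return ("fulfilled", body)
        if status == 202:
            # Result is pending: poll again with the request id NVCF reports.
            request_id = headers.get("NVCF-REQID", request_id)
            continue
        if status == 302:
            # Result lives elsewhere: the caller fetches it from this URL.
            return ("redirect", headers["Location"])
        raise RuntimeError(f"unexpected HTTP status {status}")
    raise TimeoutError(f"no result after {max_polls} polling requests")
```

Injecting `send` keeps the loop independent of any HTTP library and makes it easy to exercise with canned responses.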

Cloud Functions use HTTP/2 persistent connections.

For best performance, it is expected that clients will not close connections until it is determined that no further communication with a server is necessary.

200 Response Fulfilled

Your invocation has been successfully fulfilled if the client receives a 200 response.


{
  "reqId": "cd3c48c8-29f7-4f42-8d5e-db78a711b8e3",
  "status": "fulfilled",
  "response": {
    "choices": [
      { "index": 0, "message": { "role": "assistant", "content": "response" } }
    ]
  },
  "percentComplete": 100,
  "errorCode": 0
}

HTTP Streaming

This feature allows clients to receive data as an event stream, eliminating the need for polling, or making repeated requests to check for new data. The server sends events to the client over a long-lived connection, allowing the client to receive updates in real-time.


Prerequisites

  • Cloud Functions environment set up.

  • Familiarity with the Cloud Functions POST pexec API (documentation coming)

  • Understanding of Server-Sent Events (SSE)

Client Configuration

  1. The client initiates a connection by making a POST request to the NVCF pexec API endpoint, including the header Accept: text/event-stream.


    POST /v2/nvcf/pexec/...
    Accept: text/event-stream

  2. Upon receiving this request with the appropriate header, NVCF knows the client is prepared to receive streamed data.

Handling Server Responses

  1. If the response from the inference container includes the header Content-Type: text/event-stream, the client keeps the connection open to the API and listens for data. The worker sends back each event to the API.


    The worker will read events from the inference container for up to 5 minutes or until the inference container closes the connection, whichever is earlier. Do not create an infinite event stream. Even if the client disconnects, the worker will still read events, which can tie up your function until the worker stops reading as described above, or until the request times out (the default timeout is 5 minutes).

  2. Data read from the “worker” container’s response is buffered by event and sent as an in-progress response to the NVCF API. Smaller events are more interactive, but they shouldn’t be too small. If they are, the size of the event stream wrapper might exceed the actual data, causing increased data transmission for your clients. The maximum event size allowed is 4MB.

  3. This process continues until the stream is complete. The completion of the event stream is marked by an empty “complete” response from the “utils” container.


    The NVCF API doesn’t send an empty event to the client; it uses this as a sentinel value to close the stream.




curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/e8229433-c3d2-4730-9642-7b73e6df9477' \
  --header 'Accept: text/event-stream' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer <>' \
  --data '{"your-body": true}'



data: {"id":"d8f3c7ae-4452-44c6-a757-3e4aba3200e6","object":"text_completion","created":4636450,"model":"gpt3","choices":[{"index":0,"text":"this","logprobs":null,"finish_reason":null}]}

data: {"id":"d8f3c7ae-4452-44c6-a757-3e4aba3200e6","object":"text_completion","created":4636450,"model":"gpt3","choices":[{"index":0,"text":" is","logprobs":null,"finish_reason":null}]}

data: {"id":"d8f3c7ae-4452-44c6-a757-3e4aba3200e6","object":"text_completion","created":4636450,"model":"gpt3","choices":[{"index":0,"text":" a.","logprobs":null,"finish_reason":null}]}

data: {"id":"d8f3c7ae-4452-44c6-a757-3e4aba3200e6","object":"text_completion","created":4636450,"model":"gpt3","choices":[{"index":0,"text":"start","logprobs":null,"finish_reason":null}]}
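On the client side, a response like the one above can be consumed line by line. The sketch below is a minimal Server-Sent Events reader that handles only the `data:` field and assumes every event carries a JSON document, as in the example. `iter_sse_data` is a hypothetical helper, and other SSE fields (`event:`, `id:`, `retry:`) are ignored.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per Server-Sent Event found in `lines`."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield json.loads("\n".join(data_lines))
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # SSE comment line
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    # Convenience (not required by SSE): flush an event that lacks a final blank line.
    if data_lines:
        yield json.loads("\n".join(data_lines))
```

Any iterable of text lines works as input, for example the decoded lines of a streamed HTTP response body.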

Shutdown Behavior

  • During a graceful shutdown, the NVCF API waits for all ongoing event stream requests to complete.

  • There’s a 5-minute global request timeout for these event stream requests.

Advantages of HTTP Streaming

  • Reduces latency: Clients receive data as it becomes available.

  • Lowers overhead: Eliminates the need for repeated polling requests.

  • Flexibility: The “worker” container controls if the response will be streamed or not, allowing the client-side implementation to remain consistent regardless of server-side changes.


  • The “utils” container invoking the “worker” container must support reading an event stream.

  • This feature introduces the possibility of long-lived “blocking” client requests, which the system must manage efficiently, especially during a shutdown sequence.



gRPC

Users can invoke functions by including their authentication information and a specific functionID in the gRPC metadata.

  • The data being transmitted is generic and based on Protobuf messages.

  • Each model or container will have its own unique API, defined by the Protobuf messages it employs.

  • The example below targets Staging because gRPC is not yet deployed to Production. This guide will be updated once it is.

  • gRPC connections will be kept alive for 30 seconds if idle.

  1. Setting Up:

    • Ensure you have the Go Worker in use (see next step) since it is necessary for using gRPC.

    • Until the Go worker becomes the default, make sure you use the API to set up the function. Do not use the UI.

    • This is covered in the next step where we specify the use of the Go Worker.

  2. Creating the Function:


There might be an addition of an explicit inferenceType in the future, which you would set to GRPC. For now, it uses the inference path implicitly.

  3. Proxy Host & Endpoint:

    • The proxy host is grpc.nvcf.nvidia.com:443.

    • Use this host when calling your gRPC endpoint.

  4. Bearer Token & Metadata Keys:

    • See Authorization for details on auth.

    • Set your bearer token as Call Credentials. Either use gRPC’s support for Call Credentials, sometimes called Per RPC Credentials, to pass the token or manually set the authorization metadata as “Bearer ” + token.

    • Set the function-id metadata key.

    • Optionally, set the function-version-id metadata key.

    • When you are finished making your gRPC calls, close the gRPC client connection so that you do not tie up your workers longer than needed.

    • The other processes will remain the same as in your regular gRPC client.

  5. Python Example:

    Here’s a basic Python example based on functional tests:


    import logging

    import grpc

    # nvcf, CreateFunctionResponse, grpc_service_pb2_grpc, MODEL_INFER_REQUEST and
    # ITERATIONS are defined elsewhere (functional tests and the proto-generated
    # client) and are not shown here.


    def test_worker_grpc(
        nvcf_client: nvcf.NVCFClient,  # oauth2 client to provide SSA tokens
        create_grpc_function: CreateFunctionResponse,  # function def info
    ) -> None:
        channel = grpc.secure_channel(
            "stg.grpc.nvcf.nvidia.com:443", grpc.ssl_channel_credentials()
        )
        # proto generated grpc client
        grpc_client = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)
        function_id = create_grpc_function.function.id
        function_version_id = create_grpc_function.function.version_id
        token = nvcf_client.get_token().get("access_token")
        metadata = [
            ("function-id", function_id),  # required
            ("function-version-id", function_version_id),  # optional
            ("authorization", "Bearer " + token),  # required
        ]
        # make 100 unary inferences in a row
        for i in range(ITERATIONS):
            # this would be your client, request, and body.
            # it does not have any proto def restriction.
            infer = grpc_client.ModelInfer(MODEL_INFER_REQUEST, metadata=metadata)
            _ = infer
        logging.info(f"finished invoking {ITERATIONS} times")
        # close the connection so workers are not tied up longer than needed
        channel.close()


The official gRPC term for passing a token with each call is “Call Credentials”. More details can be found in the grpc.io documentation on credential types. The Python example above does not showcase this. Instead, it demonstrates manually setting the “authorization” metadata key with a bearer token. Using Call Credentials would handle this implicitly.

  6. Running the Function:

    • Once set up, you can run the test_worker_grpc function to make inferences using the gRPC client.

    • Ensure you have the necessary dependencies and configurations in place before execution.

  7. Handling Responses:

    • After making the inference calls, handle or process the results as needed in your application.
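As a complement to the manual-metadata example, the Call Credentials approach can be sketched with the standard `grpcio` helpers `grpc.access_token_call_credentials` and `grpc.composite_channel_credentials`. The two function names below are hypothetical, the target host is whatever applies to your environment, and `grpc` is imported lazily so the metadata helper can be used without it.

```python
from typing import List, Optional, Tuple


def build_function_metadata(
    function_id: str, function_version_id: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Metadata keys NVCF expects; authorization is left to call credentials."""
    metadata = [("function-id", function_id)]  # required
    if function_version_id:
        metadata.append(("function-version-id", function_version_id))  # optional
    return metadata


def open_authenticated_channel(target: str, token: str):
    """Open a TLS channel that attaches 'authorization: Bearer <token>' to every call."""
    import grpc  # requires the grpcio package

    call_credentials = grpc.access_token_call_credentials(token)
    channel_credentials = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(), call_credentials
    )
    return grpc.secure_channel(target, channel_credentials)
```

With this in place, each stub call passes only `metadata=build_function_metadata(...)`, and the channel should still be closed with `channel.close()` when the calls are finished. The token is fixed when the channel is created, so a client whose JWT expires would need a new channel or a different credentials mechanism.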

Large Responses

If your payload exceeds 5MB, you will receive a reference in your results to download the payload. The URL’s Time-To-Live (TTL) is 24 hours. To understand more about assets, refer to the Assets API.

When using pexec, a result stored as a large response returns an HTTP 302 status code. The Location response header contains the fully-qualified endpoint, and there is no response body.

The statuses and error codes the API can produce are listed below.

Function Invocation Response Status

If the client receives an HTTP Status code 202 with the “reqId” in the body from the Invoke Function, the client is expected to poll or issue a GET request specifying this ID.

The Function Response Status can have the following values:

  • pending-evaluation - Worker node has not yet accepted the request.

  • fulfilled - The process has been completed with results.

  • rejected - The request was rejected by the service.

  • errored - An error occurred during Worker node processing.

  • in-progress - A Worker node is processing the request.

Statuses fulfilled, rejected and errored are completed states, and you should not continue to poll.
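In client code, the rule above can be expressed as a small check. This is an illustrative sketch; `should_keep_polling` is a hypothetical helper, and raising on an unknown status is a design choice, not documented behavior.

```python
# Completed states: stop polling once one of these is seen.
TERMINAL_STATUSES = frozenset({"fulfilled", "rejected", "errored"})
# States in which the result is not yet available.
PENDING_STATUSES = frozenset({"pending-evaluation", "in-progress"})


def should_keep_polling(status: str) -> bool:
    """Return True while the invocation is still pending, False once it is complete."""
    if status in TERMINAL_STATUSES:
        return False
    if status in PENDING_STATUSES:
        return True
    # Choice made for this sketch: fail loudly rather than poll forever on an unknown status.
    raise ValueError(f"unknown invocation status: {status!r}")
```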

302 HTTP Status Code

  1. What Does a 302 Imply?

    • A 302 HTTP status code, when received from the Cloud Functions API’s polling endpoint (GET /v2/nvcf/pexec/status/{requestId}), indicates that the client must follow the URL in the Location header of the response.

  2. Handling of the 302 Response:

    • On encountering a 302 status code, inspect the Location header. This header will point to the DNS of the appropriate region that holds the result.

    • Your client should be configured to make a new HTTP request to the URL given in the Location response header. The new HTTP request must include the Authorization request header.
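Some HTTP clients drop the Authorization header when a redirect points to a different host, so building the follow-up request explicitly can be safer. The sketch below only constructs the request. `build_redirect_request` is a hypothetical helper, and refusing non-HTTPS locations is a precaution added for this sketch rather than documented NVCF behavior.

```python
import urllib.request


def build_redirect_request(location: str, token: str) -> urllib.request.Request:
    """Build the GET request for the URL taken from a 302 'Location' header."""
    # Precaution (an assumption of this sketch): never send the token over plain HTTP.
    if not location.lower().startswith("https://"):
        raise ValueError(f"refusing to send credentials to non-HTTPS URL: {location}")
    return urllib.request.Request(
        location,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```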

API Errors

The following error codes may be returned from the Utils container with errors outside of what the inference container may return:

  • 900 - Unknown/unexpected error

  • 901 - No request ID in message

  • 902 - No inference URL in message

  • 903 - No response queue in message

  • 904 - No function ID in message

  • 905 - No function name in message

  • 906 - No nca ID in message

  • 907 - No sub in message

  • 910 - Asset download failure

  • 911 - Large response upload failure
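For logging or error handling, the codes above can be kept in a lookup table. This is a sketch for illustration; `describe_error_code` is a hypothetical helper, and treating 0 as "no error" is an assumption based on the `"errorCode": 0` value in the fulfilled response example.

```python
# Error codes listed above, keyed by numeric code.
UTILS_ERROR_CODES = {
    900: "Unknown/unexpected error",
    901: "No request ID in message",
    902: "No inference URL in message",
    903: "No response queue in message",
    904: "No function ID in message",
    905: "No function name in message",
    906: "No nca ID in message",
    907: "No sub in message",
    910: "Asset download failure",
    911: "Large response upload failure",
}


def describe_error_code(code: int) -> str:
    """Return a human-readable description for an error code."""
    if code == 0:
        return "No error"  # assumption: 0 accompanies a fulfilled response
    return UTILS_ERROR_CODES.get(code, f"Unrecognized error code {code}")
```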

Client Error Codes

  • 400 Bad Request

  • 401 Unauthorized

  • 403 Forbidden

Server Error Codes

  • 5XX Server Error

For Function States, see Function Lifecycle.

© Copyright 2023-2024, NVIDIA. Last updated on Feb 16, 2024.