Domain: https://api.nvcf.nvidia.com
NGC Domain (required for some APIs): https://api.ngc.nvidia.com
This page is a brief overview of using the NVCF API, and does not cover all endpoints.
Please refer to the OpenAPI Spec for the latest API information. The quickest way to begin using NVCF APIs is via the Postman Collection.
The NVCF API is divided into the following sets of APIs:
APIs | Usage |
---|---|
Function Invocation | Execution of a function that runs on a worker node. Usually an inference call. |
Asset Management | Used to manage large files: uploading inputs for a request and downloading the results of a function. |
Cluster Groups & GPUs | Defines endpoints to list Cluster Groups and GPUs as targets for function deployment. |
Queue Details | Used to view information about your environment such as queues & GPUs. |
Function Management | The creation, modification and deletion of functions |
Function Deployment | Endpoints for creating and managing function deployments. |
API Versioning
All API endpoints include versioning in the path prefix.
/v2/nvcf
The NVCF API supports NGC API key based authorization for calling the API directly, or indirectly via the NGC CLI and NGC SDK.
The generated NGC Personal API Key will also be used for pushing and pulling containers, models, resources and helm charts to the NGC Private Registry to use during function creation.
Generate an NGC Personal API Key
The API Key can be generated via your account in the Personal Keys Page.
![ngc_setup-personal-key.png](https://docscontent.nvidia.com/dims4/default/6006679/2147483647/strip/true/crop/2672x802+0+0/resize/1440x432!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fngc_setup-personal-key.png)
![ngc_setup-personal-key-2.png](https://docscontent.nvidia.com/dims4/default/324b659/2147483647/strip/true/crop/1032x974+0+0/resize/1032x974!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fngc_setup-personal-key-2.png)
It’s recommended that the API Key that you generate includes both Cloud Functions and Private Registry scopes for seamless usage with the NGC CLI.
API Key scopes are static. This means if the key is lost, it must be destroyed and recreated.
For more information about NGC API Key management see the NGC API Key documentation.
API Key Usage
The API Key is passed in the `Authorization` header.
Authorization: Bearer $API_KEY
API Key Scopes and Domains
There are multiple API key types within NGC. We strongly recommend using the NGC Personal API Key for complete NVCF API compatibility.
Required domain names are documented below and are also pre-filled within our Postman Collection.
Our OpenAPI Spec also describes the scopes required for each endpoint.
Scope Name | Domain Name | API Category |
---|---|---|
update_function | https://api.ngc.nvidia.com | Function Management |
register_function | https://api.ngc.nvidia.com | Function Management |
queue_details | https://api.nvcf.nvidia.com | Queue Details |
list_functions | https://api.nvcf.nvidia.com | Function Management |
list_cluster_groups | https://api.ngc.nvidia.com | Cluster Groups and GPUs |
invoke_function | https://api.nvcf.nvidia.com | Function Invocation and Asset Management |
deploy_function | https://api.ngc.nvidia.com | Function Deployment |
delete_function | https://api.ngc.nvidia.com | Function Management |
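
The table above can be captured as a simple lookup so a client picks the right domain for each scope. This is a minimal sketch: `SCOPE_DOMAINS` and `domain_for_scope` are hypothetical names used for illustration and are not part of any NVIDIA SDK. The mapping itself is copied from the table.

```python
NGC_DOMAIN = "https://api.ngc.nvidia.com"
NVCF_DOMAIN = "https://api.nvcf.nvidia.com"

# Scope name -> domain name, copied from the scopes table.
SCOPE_DOMAINS = {
    "update_function": NGC_DOMAIN,
    "register_function": NGC_DOMAIN,
    "queue_details": NVCF_DOMAIN,
    "list_functions": NVCF_DOMAIN,
    "list_cluster_groups": NGC_DOMAIN,
    "invoke_function": NVCF_DOMAIN,
    "deploy_function": NGC_DOMAIN,
    "delete_function": NGC_DOMAIN,
}


def domain_for_scope(scope: str) -> str:
    """Return the base URL for a scope (hypothetical helper)."""
    try:
        return SCOPE_DOMAINS[scope]
    except KeyError:
        raise ValueError(f"unknown scope: {scope!r}") from None
```

For example, `domain_for_scope("invoke_function")` returns the NVCF domain, while `domain_for_scope("deploy_function")` returns the NGC domain.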
JWT Based Authorization
The NVCF API also supports JWT based authorization for all endpoints. This is managed by creating a Service Account Client associated with your organization. A signed JWT issued by the NVCF Service Account authorizes against the Cloud Functions API. This token is obtained by posting the Client’s `clientId` and `secret` along with the required scopes to the Service Account.
When the token expires, the client application will need to generate a new token. Secrets also expire on a regular cadence and require rotation.
This type of authorization requires additional maintenance due to token generation and key rotation.
Speak with your Account Manager if you need access to a Service Account Client for JWT based authorization.
JWT Token Generation
When generating a token, the client application can specify only the scopes that are needed to perform the desired operations, limiting “blast radius” if the token is leaked. The token will expire by default within 15 minutes.
The token generation endpoint `jwtTokenProvider`, `clientId`, and `secret` will be shared with you during Service Account Client setup.
It’s a best practice to only request the scopes your client needs each time when generating a token.
See API Key Scopes and Domains for all available scopes. The `Authorization` header value is `Basic` followed by the base64 encoding of a string in the following format: `clientId:secret`
Here is an example of generating a token:
curl --location 'https://{jwtTokenProvider}.ssa.nvidia.com/token' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--header 'Authorization: Basic <Base64 encoded "{clientId}:{secret}">' \
--data-urlencode 'scope=register_function' \
--data-urlencode 'grant_type=client_credentials'
Response:
{
"access_token": "<generated $JWT_Token>",
"token_type": "bearer",
"expires_in": 3600,
"scope": "register_function"
}
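The same token request could be assembled in Python roughly as follows. This is a sketch using only the standard library: `build_token_request` is a hypothetical helper, and the token URL, client ID, and secret are placeholders for the values you receive during Service Account Client setup. The request is built but not sent.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url: str, client_id: str, secret: str,
                        scope: str) -> urllib.request.Request:
    """Build (but do not send) the client-credentials token request."""
    # Basic auth value is base64("clientId:secret").
    basic = base64.b64encode(f"{client_id}:{secret}".encode("utf-8")).decode("ascii")
    # Form-encoded body, matching the curl --data-urlencode arguments.
    form = urllib.parse.urlencode({
        "scope": scope,
        "grant_type": "client_credentials",
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=form,
        method="POST",
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": f"Basic {basic}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The `access_token` field of the JSON response is then the value to use as the bearer token.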
JWT Token Usage
The JWT token is passed in the `Authorization` header.
Authorization: Bearer $JWT_Token
Invocation refers to the execution of an inference call to a function deployed into a cluster.
Considerations
The body of your request must be valid JSON and no larger than 200KB.
If calling the invocation API with no function version ID specified, and multiple function version IDs are deployed, then the inference call may go to instances hosting either function version.
Cloud Functions use HTTP/2 persistent connections. For best performance, it is expected that clients will not close connections until it is determined that no further communication with a server is necessary.
Cloud Functions invocation supports the following use cases:
HTTP Streaming: Uses HTTP/2 persistent connections for continuous data transmission, maintaining open connections for optimal performance until no further communication with the server is necessary.
HTTP (Polling): NVCF responds with either an HTTP Status 200 for a completed result, or an HTTP Status 202 for a response that will require polling for the result on the client.
gRPC: Allows users to invoke functions with authentication and function ID in the gRPC metadata, utilizing generic data based on Protobuf messages.
![papi-invoke-sak.png](https://docscontent.nvidia.com/dims4/default/eed1d77/2147483647/strip/true/crop/1395x1094+0+0/resize/1395x1094!/quality/90/?url=https%3A%2F%2Fk3-prod-nvidia-docs.s3.us-west-2.amazonaws.com%2Fbrightspot%2Fsphinx%2F0000018f-f319-d1d5-a3bf-f3b9b89f0000%2Fcloud-functions%2Fuser-guide%2Flatest%2F_images%2Fpapi-invoke-sak.png)
HTTP Invocation Example using Function ID
curl --request POST \
--url https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/eb1100de-60bf-4e9a-8617-b7d4652e0c37 \
--header "Authorization: Bearer $API_KEY" \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": "Hello"
}
],
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 512
}'
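A Python equivalent of this request might look like the sketch below. `build_invoke_request` is a hypothetical helper, and reading the 200KB body limit as 200 * 1024 bytes is an assumption. The request is only constructed here, so nothing is sent until it is passed to an HTTP client.

```python
import json
import urllib.request

NVCF_BASE = "https://api.nvcf.nvidia.com/v2/nvcf"
MAX_BODY_BYTES = 200 * 1024  # assumption: "200KB" interpreted as 200 * 1024 bytes


def build_invoke_request(function_id: str, payload: dict,
                         api_key: str) -> urllib.request.Request:
    """Build (but do not send) a pexec invocation request."""
    body = json.dumps(payload).encode("utf-8")
    # Fail early on the client rather than sending an oversized body.
    if len(body) > MAX_BODY_BYTES:
        raise ValueError(f"request body is {len(body)} bytes; limit is {MAX_BODY_BYTES}")
    return urllib.request.Request(
        f"{NVCF_BASE}/pexec/functions/{function_id}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
    )
```

Note that `urllib` speaks HTTP/1.1. Since Cloud Functions use HTTP/2 persistent connections, a client library with HTTP/2 support may be a better fit for production traffic.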
HTTP (Polling)
NVCF employs long polling for function invocation and result retrieval. However, the invocation API can be used as a synchronous, blocking API up to the max timeout of 5 minutes.
The polling response timeout is set to 1 minute by default and is configurable up to 5 minutes by setting the HTTP header `NVCF-POLL-SECONDS` on the request. Refer to the API documentation for details.
When you make a function invocation request, NVCF will hold your request open for the polling response time period before returning with either:
HTTP Status 200: completed result
HTTP Status 202: polling response
On receipt of a polling response, your client should immediately poll NVCF to retrieve your result.
Example
Below is an invocation of the “echo” function built from any of the “echo” containers from the examples repository.
curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/{functionId}' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer $API_KEY" \
--data '{
"inputs": [
{
"name": "message",
"shape": [
1
],
"datatype": "BYTES",
"data": [
"Hello"
]
},
{
"name": "response_delay_in_seconds",
"shape": [
1
],
"datatype": "FP32",
"data": [
0.1
]
}
],
"outputs": [
{
"name": "echo",
"datatype": "BYTES",
"shape": [
1
]
}
]
}'
If NVCF responds erroneously, your client should (but is not required to) protect itself by making no more than one polling request per second.
This can be achieved by keeping a start_time when you make a polling request and sleeping for up to 1 second from that start_time before making another request.
This does not mean your client should always be adding a sleep.
For example if 0.1 seconds have passed since making your last polling request your client should sleep for 0.9 seconds.
If 5 seconds have passed since making your last polling request your client should not sleep.
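The throttling rule described above can be sketched as a small helper. The names `remaining_delay` and `throttle` are hypothetical. The one-second interval comes from the text, and `time.monotonic()` is used so the calculation is not affected by wall-clock adjustments.

```python
import time

MIN_POLL_INTERVAL = 1.0  # seconds between the starts of consecutive polling requests


def remaining_delay(start_time: float, now: float,
                    min_interval: float = MIN_POLL_INTERVAL) -> float:
    """Seconds still to wait so that polls start at least min_interval apart."""
    elapsed = now - start_time
    return max(0.0, min_interval - elapsed)


def throttle(start_time: float) -> None:
    """Sleep only if the previous poll started less than MIN_POLL_INTERVAL ago."""
    delay = remaining_delay(start_time, time.monotonic())
    if delay > 0:
        time.sleep(delay)
```

A client would record `start_time = time.monotonic()` just before each polling request and call `throttle(start_time)` before issuing the next one. When a long poll has already taken more than a second, no sleep is added.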
Polling After Initial Invocation
When a function invocation request to the `pexec` endpoint returns HTTP Status 202, the following headers are included in the response:
`NVCF-REQID`: Invocation request ID, referred to as `requestId`
`NVCF-STATUS`: Invocation status
`NVCF-PERCENT-COMPLETE`: Percentage complete
The client is then expected to poll for a response using the `requestId`.
Example
curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{requestId}' \
--header "Authorization: Bearer $API_KEY"
Endpoint: GET /v2/nvcf/pexec/status/{requestId}
Headers:
`NVCF-POLL-SECONDS` (optional): HTTP polling response timeout, if other than default, in seconds
Parameters:
`requestId` (path, required): Function invocation request id, string($uuid)
Responses:
`200`: Invocation is fulfilled. The response body will be a passthrough of the response returned from your container.
`202`: Result is pending. Client should continue to poll using the returned request ID.
`302`: The result is in a different region or is a large response. Client should use the fully-qualified endpoint specified in the `Location` response header to fetch the result. The client can use the same API Key in the `Authorization` header when retrieving the result from the redirected region.
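One way to handle these three responses in a client is sketched below. The transport is injected as a `fetch` callable so the control flow can be shown without tying it to a particular HTTP library. `poll_for_result`, `fetch`, and `max_polls` are hypothetical names, and the sketch assumes `fetch` performs a single GET without following redirects and returns header names exactly as written here. No rate limiting is applied in this loop.

```python
from typing import Callable, Mapping, Optional, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Mapping[str, str], bytes]
# Hypothetical transport: performs one GET and does not follow redirects.
Fetch = Callable[[str, Mapping[str, str]], Response]

STATUS_URL = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{request_id}"


def poll_for_result(request_id: str, api_key: str, fetch: Fetch,
                    poll_seconds: Optional[int] = None,
                    max_polls: int = 100) -> bytes:
    """Poll the status endpoint until a 200 response arrives."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if poll_seconds is not None:
        headers["NVCF-POLL-SECONDS"] = str(poll_seconds)
    url = STATUS_URL.format(request_id=request_id)
    for _ in range(max_polls):
        status, resp_headers, body = fetch(url, headers)
        if status == 200:
            return body  # fulfilled: passthrough of the container response
        if status == 202:
            # Pending: keep polling with the returned request ID.
            request_id = resp_headers.get("NVCF-REQID", request_id)
            url = STATUS_URL.format(request_id=request_id)
        elif status == 302:
            # Result lives elsewhere: fetch it with the same Authorization header.
            url = resp_headers["Location"]
        else:
            raise RuntimeError(f"unexpected HTTP status {status}")
    raise TimeoutError(f"no result after {max_polls} polls")
```

Capping the number of polls is a design choice of this sketch, not an NVCF requirement. It simply keeps a misbehaving loop from running forever.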
Large Responses (302 Status Code)
The result payload size may not exceed 5GB. If your payload exceeds 5MB, i.e. 5MB < result size < 5GB, you will receive a reference in the response to download the payload.
When using the `pexec` invocation API, either during the initial invocation API call or when polling (`GET /v2/nvcf/pexec/status/{requestId}`), this will be indicated by a response with the HTTP `302` status code. The `Location` response header will contain the fully-qualified endpoint; there will be no response body.
Your client should be configured to make a new HTTP request to the URL given in the `Location` response header. The new HTTP request must include the `Authorization` request header.
The result retrieval URL’s Time-To-Live (TTL) is 24 hours. To read more about assets, refer to the Assets API and the asset flow example.
HTTP Streaming
This feature allows clients to receive data as an event stream, eliminating the need for polling, or making repeated requests to check for new data. The server sends events to the client over a long-lived connection, allowing the client to receive updates in real-time. Note that HTTP streaming requests use the same invocation API endpoint.
Prerequisites
Cloud function deployed on NVCF
Familiarity with the basic HTTP pexec invocation API usage documented above under HTTP (Polling)
Understanding of Server-Sent Events (SSE)
Client Configuration
The client initiates a connection by making a POST request to the NVCF pexec invocation API endpoint, including the header `Accept: text/event-stream`.
curl --request POST \
--url https://api.nvcf.nvidia.com/v2/nvcf/pexec/functions/eb1100de-60bf-4e9a-8617-b7d4652e0c37 \
--header "Authorization: Bearer $API_KEY" \
--header 'Accept: text/event-stream' \
--header 'Content-Type: application/json' \
--data '{
"messages": [
{
"role": "user",
"content": "Hello"
}
],
"temperature": 0.2,
"top_p": 0.7,
"max_tokens": 512
}'
Upon receiving this request with the appropriate header, NVCF knows the client is prepared to receive streamed data.
Handling Server Responses
If the response from the inference container includes the header `Content-Type: text/event-stream`, the client keeps the connection open to the API and listens for data.
Note: The NVCF worker will read events from the inference container for up to a default of 5 minutes or until the inference container closes the connection, whichever is earlier. Do not create an infinite event stream. Even if the client disconnects, the worker will still read events, which can tie up your function until the worker stops reading as described above, or the request times out.
Data read from the inference container’s response is buffered by event and sent as an in-progress response to the NVCF API. Smaller events are more interactive, but they shouldn’t be too small. If they are, the size of the event stream wrapper might exceed the actual data, causing increased data transmission for your clients. The maximum event size allowed is 4MB.
This process continues until the stream is complete.
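On the client side, the streamed body can be split into events with a small parser. The sketch below assumes the inference container emits standard Server-Sent Events with `data:` fields. The actual event payloads are defined by your container, and `iter_sse_data` is a hypothetical helper. Following the SSE specification, a blank line ends an event and an unterminated final event is discarded.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in a text/event-stream body."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "data":
            data_lines.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.
```

The function accepts any iterable of decoded text lines, so it can be fed from whichever streaming HTTP client you use.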
Example
See the streaming container and client in our example containers repository.
Shutdown Behavior
During a graceful shutdown, the NVCF API waits for all ongoing event stream requests to complete.
There’s a 5-minute global request timeout for these event stream requests.
Advantages of HTTP Streaming
Reduces latency: Clients receive data as it becomes available.
Lowers overhead: Eliminates the need for repeated polling requests.
Flexibility: The inference container controls if the response will be streamed or not, allowing the client-side implementation to remain consistent regardless of server-side changes.
This feature introduces the possibility of long-lived “blocking” client requests, which the system must manage efficiently, especially during a shutdown sequence.
gRPC
Users can invoke functions by including their authentication information and a specific function ID in the gRPC metadata.
The data being transmitted is generic and based on Protobuf messages.
Each model or container will have its own unique API, defined by the Protobuf messages it implements.
gRPC connections will be kept alive for 30 seconds if idle. This is not configurable.
gRPC functions have no input request size limit.
Proxy Host & Endpoint
The gRPC proxy host is `grpc.nvcf.nvidia.com:443`. Use this host when calling your gRPC endpoint.
The Cloud Functions gRPC proxy will attempt to open a connection to your function instance for 30 seconds before timeout.
API Key & Metadata Keys
Set your API Key as Call Credentials. Either use gRPC’s support for Call Credentials, sometimes called Per RPC Credentials, to pass the API Key or manually set the `authorization` metadata as `Bearer $API_Key`.
Set the `function-id` metadata key.
Optionally, set the `function-version-id` metadata key.
When the client is finished making gRPC calls, close the gRPC client connection so that you do not tie up your function’s workers longer than needed.
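The metadata keys listed above can be assembled with a small helper when setting the `authorization` entry manually. This is a sketch: `build_metadata` is a hypothetical name, and the resulting list of pairs is the shape typically passed as the `metadata` argument of a generated Python stub call.

```python
from typing import List, Optional, Tuple


def build_metadata(api_key: str, function_id: str,
                   function_version_id: Optional[str] = None) -> List[Tuple[str, str]]:
    """Build the per-call gRPC metadata for an NVCF function invocation."""
    metadata = [
        ("authorization", f"Bearer {api_key}"),  # required
        ("function-id", function_id),            # required
    ]
    if function_version_id is not None:
        # Optional: target one specific deployed version.
        metadata.append(("function-version-id", function_version_id))
    return metadata
```

Leaving out `function_version_id` omits that key entirely rather than sending an empty value.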
Example
See a complete grpc server and client example in our example containers repository.
import logging

import grpc

# grpc_service_pb2_grpc is the client module generated from your .proto file.
# CreateFunctionResponse, ITERATIONS and MODEL_INFER_REQUEST are assumed to be
# defined elsewhere, as in the example containers repository.


def call_grpc(
    create_grpc_function: CreateFunctionResponse,  # function def info
) -> None:
    channel = grpc.secure_channel("grpc.nvcf.nvidia.com:443",
                                  grpc.ssl_channel_credentials())
    # proto generated grpc client
    grpc_client = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)

    function_id = create_grpc_function.function.id
    function_version_id = create_grpc_function.function.version_id

    api_key = "$API_KEY"
    metadata = [("function-id", function_id),  # required
                ("function-version-id", function_version_id),  # optional
                ("authorization", "Bearer " + api_key)]  # required

    # make ITERATIONS unary inferences in a row
    for _ in range(ITERATIONS):
        # this would be your client, request, and body.
        # it does not have any proto def restriction.
        grpc_client.ModelInfer(MODEL_INFER_REQUEST, metadata=metadata)

    logging.info(f"finished invoking {ITERATIONS} times")
    # close the connection so the function's workers are not tied up
    channel.close()
The official term for authorization handling using gRPC is “Call Credentials”. More details can be found at grpc.io documentation on credential types. The Python example provided does not showcase this. Instead, it demonstrates manually setting the “authorization” with an API Key. Using call credentials would implicitly handle this.
The statuses and error codes the API can produce are listed below.
Function Invocation Response Status
If the client receives an HTTP Status code `202` from a `pexec` invocation API call, the client is expected to poll by issuing a GET request using the `NVCF-REQID` returned in the response header.
The `NVCF-STATUS` header can have the following values:
`pending-evaluation`: The worker has not yet accepted the request.
`fulfilled`: The process has been completed with results.
`rejected`: The request was rejected by the service.
`errored`: An error occurred during worker processing.
`in-progress`: A worker is processing the request.
Statuses `fulfilled`, `rejected` and `errored` are completed states, and you should not continue to poll.
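A client can encode these states directly when deciding whether to poll again. The following is a minimal sketch with hypothetical names (`should_keep_polling`, `TERMINAL_STATUSES`, `PENDING_STATUSES`). Treating an unrecognized status as an error is a choice made for this example.

```python
# Completed states: polling should stop.
TERMINAL_STATUSES = frozenset({"fulfilled", "rejected", "errored"})
# States in which a result may still arrive.
PENDING_STATUSES = frozenset({"pending-evaluation", "in-progress"})


def should_keep_polling(nvcf_status: str) -> bool:
    """Decide from the NVCF-STATUS header value whether to poll again."""
    if nvcf_status in TERMINAL_STATUSES:
        return False
    if nvcf_status in PENDING_STATUSES:
        return True
    raise ValueError(f"unrecognized NVCF-STATUS value: {nvcf_status!r}")
```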
Inference Container Status Codes and Responses
Error messages generated in your inference endpoint response are propagated from your inference container. Here is an example:
{
"type": "urn:inference-service:problem-details:bad-request",
"title": "Bad Request",
"status": 400,
"detail": "invalid datatype for input message",
"instance": "/v2/nvcf/pexec/functions/{functionId}",
"requestId": "{requestId}"
}
Error responses are formed as follows:
The `type` field in an error response will always include `inference-service` if the error is originating from your inference container.
The response status code will be set to the status code your inference container returns. This is what is returned in the `status` and `title` fields as well.
The `instance` and `requestId` fields are autofilled by the worker.
The `detail` field includes the error message body that your inference container returns.
Setting the Error Detail Field
Your inference container’s error response must be JSON and must set the `error` field:
{
"error": "put your error here"
}
If this field is not set, the `detail` field in any error responses will be set to a generic `Inference error` string.
It’s highly encouraged to emit logs from your inference container. See Logging and Metrics for setting and viewing logs within the Cloud Functions UI.
NVCF API Status Codes
Please refer to the OpenAPI Docs for other possible status code failure reasons in cases where they are not generated from your inference container.
For Function States, see Function Lifecycle.
To differentiate between errors originating from NVCF’s API or control plane and errors from your own inference container, check whether the `type` field includes `inference-service`, which indicates the error is from your inference container.
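That check is straightforward to automate on the client. The sketch below parses an error response body and reports whether it came from the inference container. `is_inference_container_error` is a hypothetical helper, and it assumes the error body is a JSON object shaped like the example shown earlier.

```python
import json
from typing import Union


def is_inference_container_error(body: Union[str, bytes]) -> bool:
    """Return True if the error's type field includes 'inference-service'."""
    problem = json.loads(body)
    # A missing or non-string type is treated as "not from the container".
    return "inference-service" in str(problem.get("type", ""))
```

Errors for which this returns `False` are the ones to look up in the OpenAPI documentation rather than in your container logs.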
Common Function Invocation Errors
Failure Type | Description |
---|---|
Invocation response returning 4xx or 5xx status code | Check the “type” field of the error response. If it includes inference-service, the error is coming from your inference container. Otherwise, check the OpenAPI Specification for other possible status code failure reasons. |
Invocation request taking long to get a result | Check the capacity of your function using the Function Metrics UI or API, to see if your function is queuing. Consider instrumenting your container with additional metrics to your chosen monitoring solution for further debugging - NVCF containers allow public egress. Set NVCF-POLL-SECONDS header to 300 (maximum) to wait for a sync response for up to 5 min to rule out errors in your client’s polling logic. |
Invocation response returning 401 or 403 | This indicates that the caller is unauthorized. Ensure the Authorization header is set correctly with a valid API Key. |
Container OOM | This is difficult to detect without instrumenting your container with additional metrics, unless your container is emitting logs that indicate out of memory. We recommend profiling the memory usage locally. For testing locally and in the function, you can look at a profile of the memory allocation using this guide. |