Function Invocation#
Using the NVCF Invocation API#
Invocation refers to the execution of an inference call to a function deployed into a cluster.
Considerations
If you call the invocation API without specifying a function version ID and multiple function versions are deployed, the inference call may be routed to instances hosting any of the deployed versions.
Cloud Functions uses HTTP/2 persistent connections. For best performance, clients should not close connections until they determine that no further communication with the server is necessary, as in the sketch below.
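As an illustration of connection reuse, the following minimal Python sketch keeps one HTTP/2 connection pool open across several invocations. It assumes the httpx library (with its http2 extra installed) and placeholder values for the function ID and API key.

```python
import httpx

FUNCTION_ID = "your-function-id"  # placeholder, not a real function ID
API_KEY = "nvapi-your-token"      # placeholder, not a real API key

# One client holds a persistent HTTP/2 connection pool that is reused
# across calls; it is closed only when the client itself is closed.
with httpx.Client(http2=True) as client:
    for message in ["hello", "world"]:
        response = client.post(
            f"https://{FUNCTION_ID}.invocation.api.nvcf.nvidia.com/echo",
            headers={
                "Authorization": f"Bearer {API_KEY}",
                "Content-Type": "application/json",
            },
            json={"message": message},
        )
        print(response.status_code, response.text)
```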
Cloud Functions invocation supports the following use cases:
HTTP Streaming: Uses HTTP/2 persistent connections for continuous data transmission, maintaining open connections for optimal performance until no further communication with the server is necessary.
gRPC: Allows users to invoke functions with authentication and function ID in the gRPC metadata, utilizing generic data based on Protobuf messages.
Examples
Function invocation example containers and Helm charts can be found in this repository.
The NVCF API supports routing requests to different endpoints. To do so, include the function ID as part of the URL, as shown below:
```bash
curl --request POST \
  --url https://<function-id>.invocation.api.nvcf.nvidia.com/echo \
  --header 'Authorization: Bearer nvapi-<token>' \
  --header 'Content-Type: application/json' \
  --data '{"message": "hello"}'
```
You can also pass query parameters as part of the request:
```bash
curl --request POST \
  --url 'https://<function-id>.invocation.api.nvcf.nvidia.com/echo?name=John' \
  --header 'Authorization: Bearer nvapi-<token>' \
  --header 'Content-Type: application/json' \
  --data '{"name": "John"}'
```
You can also stream arbitrary data to the function:
```bash
curl --location --request PUT \
  --url 'https://<function-id>.invocation.api.nvcf.nvidia.com/my-cool-endpoint?abc=123' \
  --header 'Accept: application/octet-stream' \
  --header 'Authorization: Bearer nvapi-<token>' \
  --form 'asdf="123"' \
  --form '=@"/Users/username/Downloads/cool-file.zip"'
```
HTTP Streaming#
This feature allows clients to receive data as an event stream, eliminating the need for polling (making repeated requests to check for new data). The server sends events to the client over a long-lived connection, allowing the client to receive updates in real time. Note that HTTP streaming requests use the same invocation API endpoint.
Prerequisites
A cloud function deployed on NVCF
Familiarity with the basic HTTP invocation API usage documented above
An understanding of Server-Sent Events (SSE)
Client Configuration
The client initiates a connection by making a POST request that includes the header Accept: text/event-stream.

```bash
curl --request POST \
  --url https://<function-id>.invocation.api.nvcf.nvidia.com/echo \
  --header "Authorization: Bearer $API_KEY" \
  --header 'Accept: text/event-stream' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {
        "role": "user",
        "content": "Hello"
      }
    ],
    "temperature": 0.2,
    "top_p": 0.7,
    "max_tokens": 512
  }'
```
Upon receiving this request with the appropriate header, NVCF knows the client is prepared to receive streamed data.
Handling Server Responses
If the response from the inference container includes the header Content-Type: text/event-stream, the client keeps the connection open to the API and listens for data.

Note
The NVCF worker will read events from the inference container for up to 20 minutes by default, or until the inference container closes the connection, whichever comes first. Do not create an infinite event stream. If the client disconnects, the worker will eventually time out and close the request.
Data read from the inference container’s response is forwarded and replayed to the client.
This process continues until the stream is complete.
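For illustration, here is a minimal Python client sketch that consumes the event stream. It assumes the requests library and placeholder values for the function ID and API key; a production client would add error handling and full SSE parsing (for example, via an SSE client library).

```python
import requests

FUNCTION_ID = "your-function-id"  # placeholder, not a real function ID
API_KEY = "nvapi-your-token"      # placeholder, not a real API key

response = requests.post(
    f"https://{FUNCTION_ID}.invocation.api.nvcf.nvidia.com/echo",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Accept": "text/event-stream",
        "Content-Type": "application/json",
    },
    json={"messages": [{"role": "user", "content": "Hello"}]},
    stream=True,  # keep the connection open and read events as they arrive
)

# Each SSE event arrives as lines of the form "data: <payload>",
# separated by blank lines.
for line in response.iter_lines(decode_unicode=True):
    if line.startswith("data: "):
        print(line[len("data: "):])

response.close()
```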
Example
See the streaming container and client in our example containers repository.
Shutdown Behavior
During a graceful shutdown, the NVCF API waits for all ongoing event stream requests to complete.
There’s a 20-minute global request timeout for these event stream requests.
Advantages of HTTP Streaming#
Reduces latency: Clients receive data as it becomes available.
Lowers overhead: Eliminates the need for repeated polling requests.
Flexibility: The inference container controls whether the response is streamed, allowing the client-side implementation to remain consistent regardless of server-side changes.
gRPC#
Users can invoke functions by including their authentication information and a specific function ID in the gRPC metadata.
The data being transmitted is generic and based on Protobuf messages.
Each model or container will have its own unique API, defined by the Protobuf messages it implements.
gRPC connections are kept alive for 30 seconds when idle; this is not configurable.
gRPC functions have no input request size limit.
Proxy Host & Endpoint
The gRPC proxy host is grpc.nvcf.nvidia.com:443. Use this host when calling your gRPC endpoint.
The Cloud Functions gRPC proxy will attempt to open a connection to your function instance for 30 seconds before timing out.
API Key & Metadata Keys
Set your API Key as Call Credentials. Either use gRPC's support for Call Credentials, sometimes called Per-RPC Credentials, to pass the API Key, or manually set the authorization metadata to Bearer $API_KEY.
Set the function-id metadata key.
Optionally, set the function-version-id metadata key.
When the client is finished making gRPC calls, close the gRPC client connection so that you do not tie up your function's workers longer than needed.
Example
See a complete gRPC server and client example in our example containers repository.
```python
import logging

import grpc

# Generated from your service's .proto definition
import grpc_service_pb2_grpc

ITERATIONS = 100           # number of unary calls to make
MODEL_INFER_REQUEST = ...  # your proto request message


def call_grpc(
    create_grpc_function: CreateFunctionResponse,  # function def info
) -> None:
    channel = grpc.secure_channel("grpc.nvcf.nvidia.com:443",
                                  grpc.ssl_channel_credentials())
    # proto generated grpc client
    grpc_client = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)

    function_id = create_grpc_function.function.id
    function_version_id = create_grpc_function.function.version_id

    api_key = "$API_KEY"
    metadata = [("function-id", function_id),                  # required
                ("function-version-id", function_version_id),  # optional
                ("authorization", "Bearer " + api_key)]        # required

    # make 100 unary inferences in a row
    for i in range(ITERATIONS):
        # this would be your client, request, and body;
        # it does not have any proto def restriction.
        infer = grpc_client.ModelInfer(MODEL_INFER_REQUEST,
                                       metadata=metadata)
        _ = infer
    logging.info(f"finished invoking {ITERATIONS} times")
```
Note
The official term for authorization handling using gRPC is “Call Credentials”. More details can be found at grpc.io documentation on credential types. The Python example provided does not showcase this. Instead, it demonstrates manually setting the “authorization” with an API Key. Using call credentials would implicitly handle this.
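As a hedged sketch of the Call Credentials approach in Python, grpc.access_token_call_credentials() can be composed with TLS channel credentials so the authorization: Bearer metadata is attached to every RPC automatically; the function ID still travels as per-call metadata. The stub and request names below are assumptions carried over from the example above.

```python
import grpc

import grpc_service_pb2_grpc  # generated from your .proto (assumed)

API_KEY = "nvapi-your-token"      # placeholder, not a real API key
FUNCTION_ID = "your-function-id"  # placeholder, not a real function ID
MODEL_INFER_REQUEST = ...         # your proto request message (assumed)

# Compose TLS credentials with call credentials so every RPC automatically
# carries "authorization: Bearer <API_KEY>" metadata.
channel_credentials = grpc.composite_channel_credentials(
    grpc.ssl_channel_credentials(),
    grpc.access_token_call_credentials(API_KEY),
)

# The context manager closes the channel when done, freeing the
# function's workers as recommended above.
with grpc.secure_channel("grpc.nvcf.nvidia.com:443",
                         channel_credentials) as channel:
    grpc_client = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)
    response = grpc_client.ModelInfer(
        MODEL_INFER_REQUEST,
        metadata=[("function-id", FUNCTION_ID)],  # still required per call
    )
```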
Statuses and Errors#
Below is the list of statuses and error codes the API can produce.
Inference Container Status Codes and Responses#
Error messages generated in your inference endpoint response are propagated from your inference container as-is.
If you are using the pexec endpoint rather than the invocation endpoint, the error message is instead processed by the NVCF API. Here is an example:
```json
{
  "type": "urn:inference-service:problem-details:bad-request",
  "title": "Bad Request",
  "status": 400,
  "detail": "invalid datatype for input message",
  "instance": "/v2/nvcf/pexec/functions/{functionId}",
  "requestId": "{requestId}"
}
```
Error responses are formed as follows:
The type field in an error response will always include inference-service if the error originates from your inference container.
The response status code will be set to the status code your inference container returns; this is what is returned in the status and title fields as well.
The instance and requestId fields are autofilled by the worker.
The detail field includes the error message body that your inference container returns.
Setting the Error Detail Field
Your inference container's error response must be JSON and must set the error field:
```json
{
  "error": "put your error here"
}
```
If this field is not set, the detail field in any error responses will be set to a generic Inference error string.
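For illustration only, assuming a Flask-based inference container, an error response that satisfies this contract might look like the following sketch.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/echo", methods=["POST"])
def echo():
    body = request.get_json(silent=True)
    if body is None or "message" not in body:
        # Return JSON with an "error" field so NVCF can propagate it
        # into the "detail" field of its error response.
        return jsonify({"error": "invalid datatype for input message"}), 400
    return jsonify({"message": body["message"]})
```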
Warning
It’s highly encouraged to emit logs from your inference container. See NGC UI Observability for setting and viewing logs within the Cloud Functions UI.
Common Function Invocation Errors#
| Failure Type | Description |
|---|---|
| Invocation response returning a 4xx or 5xx status code | Check the type field of the error message response; if it includes inference-service, the error originates from your inference container. |
| Invocation request taking a long time to return a result | Check the capacity of your function using the Function Metrics UI or API to see if your function is queuing. Consider instrumenting your container with additional metrics to your chosen monitoring solution for further debugging (NVCF containers allow public egress). |
| Invocation response returning 401 or 403 | This indicates that the caller is unauthorized; ensure the Authorization header is set to Bearer with a valid API key. |
| Container OOM | This is difficult to detect without instrumenting your container with additional metrics, unless your container is emitting logs that indicate out-of-memory conditions. We recommend profiling memory usage locally. For testing locally and in the function, you can look at a profile of the memory allocation using this guide. |