Function Invocation#

Using the NVCF Invocation API#

Invocation refers to the execution of an inference call to a function deployed into a cluster.

Considerations

  • If you call the invocation API without specifying a function version ID and multiple function versions are deployed, the inference call may go to instances hosting any of the deployed versions.

  • Cloud Functions use HTTP/2 persistent connections. For best performance, clients should keep connections open until no further communication with the server is necessary.

Cloud Functions invocation supports the following use cases:

  • HTTP Streaming: Uses HTTP/2 persistent connections for continuous data transmission, maintaining open connections for optimal performance until no further communication with the server is necessary.

  • HTTP (Polling): NVCF responds with either an HTTP Status 200 for a completed result, or an HTTP Status 202 for a response that will require polling for the result on the client.

  • gRPC: Allows users to invoke functions with authentication and function ID in the gRPC metadata, utilizing generic data based on Protobuf messages.


Examples

Find function invocation example containers and Helm charts in this repository.

The NVCF API supports routing requests to different endpoints.

You can use the function ID as part of the URL, or pass the function ID and version ID as headers:

curl --request POST \
--url https://<function-id>.invocation.api.nvcf.nvidia.com/echo \
--header 'Authorization: Bearer nvapi-<token>' \
--header 'Content-Type: application/json' \
--data '{"message": "hello"}'

You can also pass query parameters as part of the request:

curl --request POST \
--url 'https://<function-id>.invocation.api.nvcf.nvidia.com/echo?name=John' \
--header 'Authorization: Bearer nvapi-<token>' \
--header 'Content-Type: application/json' \
--data '{"name": "John"}'

You can also stream arbitrary data to the function:

curl --location --request PUT \
--url 'https://<function-id>.invocation.api.nvcf.nvidia.com/my-cool-endpoint?abc=123' \
--header 'Accept: application/octet-stream' \
--header 'Authorization: Bearer nvapi-<token>' \
--form 'asdf="123"' \
--form '=@"/Users/username/Downloads/cool-file.zip"'

HTTP Streaming#

This feature allows clients to receive data as an event stream, eliminating the need for polling (making repeated requests to check for new data). The server sends events to the client over a long-lived connection, so the client receives updates in real time. Note that HTTP streaming requests use the same invocation API endpoint.

Prerequisites

  • Cloud function deployed on NVCF

  • Familiarity with basic invocation API usage via the pexec endpoint, documented under HTTP (Polling) below

  • Understanding of Server-Sent Events (SSE)

Client Configuration

  1. The client initiates a connection by making a POST request including the header Accept: text/event-stream.

     curl --request POST \
         --url https://<function-id>.invocation.api.nvcf.nvidia.com/echo \
         --header "Authorization: Bearer $API_KEY" \
         --header 'Accept: text/event-stream' \
         --header 'Content-Type: application/json' \
         --data '{
             "messages": [
                 {
                 "role": "user",
                 "content": "Hello"
                 }
             ],
             "temperature": 0.2,
             "top_p": 0.7,
             "max_tokens": 512
         }'
    
  2. Upon receiving this request with the appropriate header, NVCF knows the client is prepared to receive streamed data.

Handling Server Responses

  1. If the response from the inference container includes the header Content-Type: text/event-stream, the client keeps the connection open to the API and listens for data.

    Note

    The NVCF worker will read events from the inference container for up to 20 minutes by default, or until the inference container closes the connection, whichever comes first. Do not create an infinite event stream. Even if the client disconnects, the worker will continue reading events, which can tie up your function until the worker stops reading as described above or the request times out.

  2. Data read from the inference container’s response is buffered per event and sent as in-progress responses to the NVCF API. Smaller events are more interactive, but they shouldn’t be too small: if they are, the size of the event stream wrapper might exceed the actual data, increasing data transmission for your clients. The maximum event size allowed is 4 MB.

  3. This process continues until the stream is complete.
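For illustration, here is a minimal client-side sketch of consuming such a stream, assuming the requests Python library and the echo function from the earlier examples (the URL and token are placeholders):

import requests  # third-party HTTP client, one possible choice

url = "https://<function-id>.invocation.api.nvcf.nvidia.com/echo"
headers = {
    "Authorization": "Bearer nvapi-<token>",
    "Accept": "text/event-stream",
    "Content-Type": "application/json",
}
payload = {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 512}

# stream=True keeps the connection open so events can be read as they arrive
with requests.post(url, headers=headers, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        # SSE data lines are prefixed with "data: "; blank lines separate events
        if line.startswith("data: "):
            print(line[len("data: "):])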

Example

See the streaming container and client in our example containers repository.

Shutdown Behavior

  • During a graceful shutdown, the NVCF API waits for all ongoing event stream requests to complete.

  • There’s a 5-minute global request timeout for these event stream requests.

Advantages of HTTP Streaming#

  • Reduces latency: Clients receive data as it becomes available.

  • Lowers overhead: Eliminates the need for repeated polling requests.

  • Flexibility: The inference container controls whether the response is streamed, allowing the client-side implementation to remain consistent regardless of server-side changes.

Warning

This feature introduces the possibility of long-lived “blocking” client requests, which the system must manage efficiently, especially during a shutdown sequence.

HTTP (Polling)#

NVCF employs long polling for function invocation and result retrieval. However, the invocation API can be used as a synchronous, blocking API up to the max timeout of 20 minutes.

The polling response timeout is 1 minute by default and is configurable up to 60 minutes by setting the HTTP header NVCF-POLL-SECONDS on the request; refer to the API documentation.

When you make a function invocation request, NVCF will hold your request open for the polling response period before returning with either:

  • HTTP Status 200 completed result

  • HTTP Status 202 polling response

    • On receipt of a polling response, your client should immediately poll NVCF to retrieve your result.

Note

In the event that NVCF responds erroneously, your client should (though is not required to) protect itself by ensuring it does not make more than one polling request per second.

This can be achieved by recording a start_time when you make a polling request and sleeping for up to 1 second from that start_time before making another request.

This does not mean your client should always add a sleep.

  • For example, if 0.1 seconds have passed since making your last polling request your client should sleep for 0.9 seconds.

  • If 5 seconds have passed since making your last polling request your client should not sleep.
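A small sketch of that timing logic (plain Python, independent of any NVCF specifics):

import time

MIN_POLL_INTERVAL = 1.0  # at most one polling request per second

def wait_before_next_poll(start_time: float) -> None:
    # Sleep only for the remainder of the 1-second window, if any
    elapsed = time.monotonic() - start_time
    if elapsed < MIN_POLL_INTERVAL:
        time.sleep(MIN_POLL_INTERVAL - elapsed)

# Usage: record start_time = time.monotonic() when issuing a polling request,
# then call wait_before_next_poll(start_time) before the next one.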

Polling After Initial Invocation#

When making a function invocation request using the pexec endpoint, if an HTTP Status 202 is returned, the following header will be included in the response:

  • NVCF-REQID: Invocation request ID, referred to as requestId

The client is then expected to poll for a response using the requestId.


Clients can manage their queue timeout explicitly by setting the header NVCF-POLL-SECONDS to a longer duration (up to 1 hour) if they want their request to stay in the queue for extended periods of time. If NVCF-POLL-SECONDS is not set, the polling response timeout defaults to 1 minute.

Example

curl --location 'https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{requestId}' \
--header "Authorization: Bearer $API_KEY"

Endpoint: GET /v2/nvcf/pexec/status/{requestId}

Headers:

  • NVCF-POLL-SECONDS (optional): HTTP polling response timeout in seconds, if other than the default

Parameters:

  • requestId (path, required): Function invocation request id, string($uuid)

Responses:

  • 200: Invocation is fulfilled. The response body will be a passthrough of the response returned from your container.

  • 202: A worker has picked up the request and the result is pending. The client should continue to poll using the returned request ID.

  • 302: The result is in a different region or is a large response. The client should use the fully-qualified endpoint specified in the Location response header to fetch the result, and can use the same API Key in the Authorization header when retrieving the result from the redirected region.

  • 504: The request has timed out without a worker picking up the request. The client should retry the request.
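Putting these status codes together, a hedged polling loop might look like the following sketch (assuming the requests Python library; poll_for_result and its error handling are illustrative, not a prescribed client):

import time

import requests  # third-party HTTP client, one possible choice

STATUS_URL = "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/{request_id}"

def poll_for_result(request_id: str, api_key: str) -> requests.Response:
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        start_time = time.monotonic()
        # allow_redirects=False keeps a 302 large-response redirect visible
        resp = requests.get(STATUS_URL.format(request_id=request_id),
                            headers=headers, allow_redirects=False)
        if resp.status_code in (200, 302):
            # 200: fulfilled; 302: fetch the result from the Location
            # header (see Large Responses below)
            return resp
        if resp.status_code in (202, 504):
            # 202: result still pending; 504: no worker picked it up, retry.
            # Respect the one-request-per-second guidance from above.
            elapsed = time.monotonic() - start_time
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            continue
        resp.raise_for_status()  # any other status is unexpected here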

Large Responses (302 Status Code)#

The result payload size may not exceed 5 GB. If your payload exceeds 5 MB (that is, 5 MB < result size < 5 GB), you will receive a reference in the response to download the payload.

When using the pexec invocation API, either during the initial invocation call or when polling (GET /v2/nvcf/pexec/status/{requestId}), this is indicated by a response with HTTP status code 302. The Location response header will contain the fully-qualified endpoint; there will be no response body.

Your client should be configured to make a new HTTP request to the URL given in the Location response header. The new HTTP request must include an Authorization request header.

The result retrieval URL’s Time-To-Live (TTL) is 30 minutes. To read more about assets, refer to the Assets API and the asset flow example.
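As a sketch of that redirect handling (again assuming the requests library; note that many HTTP clients drop the Authorization header when auto-following cross-host redirects, so the Location URL is fetched explicitly here):

import requests  # third-party HTTP client, one possible choice

def fetch_large_result(resp: "requests.Response", api_key: str) -> bytes:
    # resp is a 302 response from the invocation or status endpoint
    location = resp.headers["Location"]  # fully-qualified result URL
    result = requests.get(location,
                          headers={"Authorization": f"Bearer {api_key}"})
    result.raise_for_status()
    return result.content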

gRPC#

Users can invoke functions by including their authentication information and a specific function ID in the gRPC metadata.

  • The data being transmitted is generic and based on Protobuf messages.

  • Each model or container will have its own unique API, defined by the Protobuf messages it implements.

  • gRPC connections are kept alive for 30 seconds when idle; this is not configurable.

  • gRPC functions have no input request size limit.

Proxy Host & Endpoint

  • The gRPC proxy host is grpc.nvcf.nvidia.com:443.

  • Use this host when calling your gRPC endpoint.

  • The Cloud Functions gRPC proxy will attempt to open a connection to your function instance for 30 seconds before timing out.

API Key & Metadata Keys

  • Set your API Key as Call Credentials: either use gRPC’s support for Call Credentials (sometimes called Per-RPC Credentials) to pass the API Key, or manually set the authorization metadata to Bearer $API_KEY.

  • Set the function-id metadata key.

  • Optionally, set the function-version-id metadata key.

  • When the client is finished making gRPC calls, close the gRPC client connection so that you do not tie up your function’s workers longer than needed.

Example

See a complete gRPC server and client example in our example containers repository.

def call_grpc(
        create_grpc_function: CreateFunctionResponse,  # function def info
) -> None:
    channel = grpc.secure_channel("grpc.nvcf.nvidia.com:443",
                                  grpc.ssl_channel_credentials())
    # proto generated grpc client
    grpc_client = grpc_service_pb2_grpc.GRPCInferenceServiceStub(channel)

    function_id = create_grpc_function.function.id
    function_version_id = create_grpc_function.function.version_id

    api_key = "$API_KEY"
    metadata = [("function-id", function_id),  # required
                ("function-version-id", function_version_id),  # optional
                ("authorization", "Bearer " + api_key)]  # required

    # make 100 unary inferences in a row
    for i in range(ITERATIONS):
        # this would be your client, request, and body.
        # it does not have any proto def restriction.
        infer = grpc_client.ModelInfer(MODEL_INFER_REQUEST,
                                       metadata=metadata)
        _ = infer
    logging.info(f"finished invoking {ITERATIONS} times")

Note

The official term for authorization handling in gRPC is “Call Credentials”; more details can be found in the grpc.io documentation on credential types. The Python example above does not showcase this; instead, it demonstrates manually setting the authorization metadata with an API Key. Using Call Credentials would handle this implicitly, as sketched below.
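For reference, a minimal sketch of the Call Credentials approach in Python gRPC (the token and function ID values are placeholders):

import grpc

# Compose TLS channel credentials with per-RPC access-token credentials so
# that gRPC attaches "authorization: Bearer <token>" to every call.
call_creds = grpc.access_token_call_credentials("nvapi-<token>")
channel_creds = grpc.composite_channel_credentials(
    grpc.ssl_channel_credentials(), call_creds)
channel = grpc.secure_channel("grpc.nvcf.nvidia.com:443", channel_creds)

# function-id (and optionally function-version-id) must still be passed as
# metadata on each call:
metadata = [("function-id", "<function-id>")]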

Statuses and Errors#

Below is the list of statuses and error codes the API can produce.

Function Invocation Response Status

If the client receives an HTTP Status code 202 from a pexec invocation API call, the client is expected to poll by issuing a GET request using the request ID returned in the NVCF-REQID response header.

The NVCF-STATUS header can have the following values:

  • pending-evaluation - The worker has not yet accepted the request.

  • fulfilled - The process has been completed with results.

  • rejected - The request was rejected by the service.

  • errored - An error occurred during worker processing.

  • in-progress - A worker is processing the request.

Statuses fulfilled, rejected, and errored are completed states; once one is received, you should not continue to poll.
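In client code, that check can be as simple as this hypothetical helper (resp being any HTTP response object exposing headers):

TERMINAL_STATUSES = {"fulfilled", "rejected", "errored"}

def is_terminal(resp) -> bool:
    # Stop polling once NVCF-STATUS reports a completed state
    return resp.headers.get("NVCF-STATUS") in TERMINAL_STATUSES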

Inference Container Status Codes and Responses#

Error messages generated in your inference endpoint response are propagated from your inference container. Here is an example:

{
  "type": "urn:inference-service:problem-details:bad-request",
  "title": "Bad Request",
  "status": 400,
  "detail": "invalid datatype for input message",
  "instance": "/v2/nvcf/pexec/functions/{functionId}",
  "requestId": "{requestId}"
}

Error responses are formed as follows:

  • The type field in an error response will always include inference-service if the error originates from your inference container.

  • The response status code is set to the status code your inference container returns; the same code is reflected in the status and title fields.

  • The instance and requestId fields are autofilled by the worker.

  • The detail field includes the error message body that your inference container returns.

Setting the Error Detail Field

Your inference container’s error responses must be JSON and must set the error field:

{
  "error": "put your error here"
}

If this field is not set, the detail field in any error response is set to a generic Inference error string.
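As an illustration, a minimal sketch of an inference container endpoint that sets the error field, assuming a FastAPI-based container (any framework works as long as the JSON body is shaped this way):

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

@app.post("/echo")
async def echo(body: dict):
    if "message" not in body:
        # The worker copies this "error" value into the detail field and
        # propagates the 400 status code to the caller
        return JSONResponse(status_code=400,
                            content={"error": "invalid datatype for input message"})
    return body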

Warning

It’s highly encouraged to emit logs from your inference container. See Logging and Metrics for setting up and viewing logs within the Cloud Functions UI.

Common Function Invocation Errors#

  • Invocation response returning a 4xx or 5xx status code: Check the type field of the error message response; if it includes inference-service, the error is coming from your inference container. If the error is not generated by your inference container, check the OpenAPI Specification for other possible status code failure reasons.

  • Invocation request taking a long time to return a result: Check the capacity of your function using the Function Metrics UI or API to see if your function is queuing. Consider instrumenting your container with additional metrics to your chosen monitoring solution for further debugging (NVCF containers allow public egress). You can also set the NVCF-POLL-SECONDS header to a longer duration to wait synchronously for the result and rule out errors in your client’s polling logic.

  • Invocation response returning 401 or 403: This indicates that the caller is unauthorized; ensure the Authorization header is set correctly with a valid API Key.

  • Container OOM: This is difficult to detect without instrumenting your container with additional metrics, unless your container emits logs that indicate out of memory. We recommend profiling the memory usage locally. For testing locally and in the function, you can look at a profile of the memory allocation using this guide.