Function Management

A Function can be in any of the following states during its lifecycle:

  • ACTIVE - At least one Worker node is active. A Function can only be invoked while it is ACTIVE.

  • ERROR - If all the Worker nodes associated with the Function are in an ERROR state.

  • INACTIVE - When a Function is created, it is INACTIVE. Also, when a Function is undeployed, the state is changed from ACTIVE to INACTIVE.

  • DEPLOYING - When the Function is being deployed and the instances or Workers are still coming up.
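These transitions can be sketched as a small state machine (illustrative only; NVCF manages the actual state for you, and the transitions shown are those described above):

```python
from enum import Enum


class FunctionState(Enum):
    INACTIVE = "INACTIVE"
    DEPLOYING = "DEPLOYING"
    ACTIVE = "ACTIVE"
    ERROR = "ERROR"


# A new Function starts INACTIVE; deploying moves it to DEPLOYING, then to
# ACTIVE (workers up) or ERROR (all workers failed); undeploying an ACTIVE
# Function returns it to INACTIVE.
VALID_TRANSITIONS = {
    FunctionState.INACTIVE: {FunctionState.DEPLOYING},
    FunctionState.DEPLOYING: {FunctionState.ACTIVE, FunctionState.ERROR},
    FunctionState.ACTIVE: {FunctionState.INACTIVE},
}


def can_invoke(state: FunctionState) -> bool:
    """Only an ACTIVE Function can be invoked."""
    return state is FunctionState.ACTIVE
```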


Functions can be created in one of three ways:

  1. A Triton Inference Server compatible Model repository

  2. A Docker Image

  3. A Helm chart

Function Creation with your own Model

Bring Your Own Model Overview

This guide enables model owners to use Triton Inference Server to serve a model within NVIDIA Cloud Functions and the NGC Private Registry.

A single model can be easily deployed by leveraging Triton’s Auto-Generated Model Configuration functionality: simply upload the model, and the inputs will be automatically discovered and deployed as part of the endpoint. More complicated configurations are supported as well, by manually specifying the configuration, by controlling multiple models inside of an overall “ensemble”, or through Business Logic Scripting (BLS).

When using BYOM within Cloud Functions, you will need to take into consideration the following limitations:

  • Only a HTTP interface is exposed

  • The service is not stateful

  • Models must be loaded at start time

Creating the Model

As described above, a single model can be deployed and the configuration will be autogenerated. However, if a more complicated setup is required a complete model configuration will need to be created. Please see the Triton Model Configuration documentation to learn more.

Typically we recommend using Triton’s Python model backend for most, if not all, of your models, for several reasons:

  • For roughly 90% of use cases, it has no impact on performance

  • Faster development

  • You can call out to other backends

  • Pre- or post-processing can be done within the model

Refer to the example to develop your model, then use the config.pbtxt example to create the file with the necessary configuration for your model.

If you are deploying on NVCF and using a model-only function, the input sent should not exceed 5MB. For container-based functions using assets as input, the asset should be read from a specific path. The Python library NVCF Container Helpers is available to assist with common tasks like these.
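As a rough illustration, a container can locate an uploaded asset by combining the NVCF-ASSET-DIR invocation header with the asset id. The helper name and directory layout below are assumptions for this sketch; the NVCF Container Helpers library provides supported equivalents:

```python
import os


def resolve_asset_path(headers: dict, asset_id: str) -> str:
    """Resolve the on-disk path of an uploaded asset.

    Hypothetical helper: it assumes assets land in the directory named by the
    NVCF-ASSET-DIR header and are stored under their asset id.
    """
    # Header keys may arrive lowercased depending on the HTTP stack.
    normalized = {key.upper(): value for key, value in headers.items()}
    asset_dir = normalized["NVCF-ASSET-DIR"]
    return os.path.join(asset_dir, asset_id)
```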

There is also a Stable Diffusion based example available under examples/byom/sd_txt2img that demonstrates these concepts end to end.


Uploading the Model

For detailed instructions on uploading a model to NGC, please see Uploading a New NGC Model Version Using the NGC CLI.


To upload your model to the NVIDIA GPU Cloud (NGC) registry, follow these steps:

  1. Confirm that the model file is present in the correct directory.

  2. Upload the model to your NGC private registry by running the following command (note rnwu0zzwflg6 is the org ID found under Organization -> Profile):


ngc registry model upload-version rnwu0zzwflg6/bis-test-2:v1 --gpu-model 'GV-100' --source ./kngo-test-customization2_v1/gpt2b_ptuning.llmservice.nemo

  3. After the upload is complete, you will receive an upload summary confirming the status:


Model ID: bis-test-2[version=v1] Upload status: Completed ...

Creating a Function with Models Only

Note that the “containerImage” is omitted from the request. This Function will use the Triton Inference Server. A model-only Function uses the Predict Protocol - Version 2 for the “requestBody” (input & output).


curl -X 'POST' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "simple_int8_1",
    "inferenceUrl": "v2/models/simple_int8/infer",
    "models": [
      {
        "name": "simple_int8",
        "version": "1",
        "uri": "v2/org/cf/team/myteam/models/simple_int8/versions/1/zip"
      }
    ],
    "apiBodyFormat": "PREDICT_V2"
  }'
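A minimal Predict Protocol - Version 2 request body can be built as below. This is a sketch: the [1, N] shape and the datatype are assumptions that must match your model's configuration:

```python
import json


def predict_v2_body(name: str, datatype: str, data: list) -> dict:
    """Build a minimal Predict Protocol - Version 2 inference request body."""
    return {
        "inputs": [
            {
                "name": name,
                "shape": [1, len(data)],  # assumed shape; must match the model config
                "datatype": datatype,
                "data": data,
            }
        ]
    }


# Serialize for use as the body of the invocation request.
payload = json.dumps(predict_v2_body("INPUT0", "INT8", [1, 2, 3]))
```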

Function Creation with your Docker Container

Bring Your Own Container Overview

This is a guide to enable users to build a Docker image that will work within NVIDIA Cloud Functions. It provides examples of two tested methods for doing so.

First Method: Building a Triton Inference Server Container

NVCF is designed to work natively with Triton Inference Server based containers, including leveraging metrics and health checks from the server.

A working sample is provided under examples/byoc/triton.

The pre-built Triton Docker images can be found within NGC’s Container catalog. A minimum version of 23.04 (2.33.0) is required.

When setting the Docker image’s run command to start tritonserver, the following command options are mandatory:


CMD tritonserver --model-repository=${YOUR_PATH_HERE} --http-header-forward-pattern NVCF-.*

Once the Docker image is built and ready, it can be uploaded to NGC:

  1. Tag the Docker image:


    docker tag my_model_image nvcr.io/[ngc_org]/[ngc_team]/my_model_image:latest

  2. Log in to NGC:


    docker login nvcr.io

    • Enter your credentials

  3. Push the Docker image to NGC:


    docker push nvcr.io/[ngc_org]/[ngc_team]/my_model_image:latest

Second Method: PyTriton

NVIDIA’s PyTriton is a Python-native solution for Triton Inference Server that works natively with NVCF. A minimum version of 0.3.0 is required.

Create the requirements.txt file:

  • This file should list the Python dependencies required for your model.

  • Add nvidia-pytriton to your requirements.txt file.

Create the Python model file:

  1. Your file (or similar python file) needs to define a PyTriton model.

  2. This involves importing your model dependencies and creating a PyTritonServer class with an __init__ function, an _infer_fn function, and a run function that serves the inference function. You also define the model name, the inputs, and the outputs, along with any optional configuration.

Here is an example of such a file:


import time

import numpy as np
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton, TritonConfig

....


class PyTritonServer:
    """triton server for timed_sleeper"""

    def __init__(self):
        self.model_name = "timed_sleeper"

    def _infer_fn(self, requests):
        responses = []
        for req in requests:
            req_data = req.data
            sleep_duration = numpy_array_to_variable(req_data.get("sleep_duration"))
            # deal with header dict keys being lowercase
            request_parameters_dict = uppercase_keys(req.parameters)
            time.sleep(sleep_duration)
            responses.append({"sleep_duration": np.array([sleep_duration])})
        return responses

    def run(self):
        """run triton server"""
        with Triton(
            config=TritonConfig(
                http_header_forward_pattern="NVCF-*",  # this is required
                http_port=8000,
                grpc_port=8001,
                metrics_port=8002,
            )
        ) as triton:
            triton.bind(
                model_name="timed_sleeper",
                infer_func=self._infer_fn,
                inputs=[
                    Tensor(name="sleep_duration", dtype=np.uint32, shape=(1,)),
                ],
                outputs=[Tensor(name="sleep_duration", dtype=np.uint32, shape=(1,))],
                config=ModelConfig(batching=False),
            )
            triton.serve()


if __name__ == "__main__":
    server = PyTritonServer()
    server.run()

Build the Dockerfile

  1. Create a file named Dockerfile in your model directory.

  2. You can use containers like NVIDIA CUDA, PyTorch or TensorRT as your base container. They can be downloaded from the NGC Catalog.

  3. Make sure to install your Python requirements in your Dockerfile.

  4. Copy in your model source code, and model weights unless you plan to host them in NGC Registry.

Here is a sample Dockerfile and requirements.txt:



FROM

RUN apt-get update && apt-get install -y \
    git \
    python3 \
    python3-pip \
    python-is-python3 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /workspace/

# install requirements file
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt

ENV DEBIAN_FRONTEND=noninteractive

# copy model weights
COPY model_weights /models
COPY model_source .
COPY .

CMD python3



--extra-index-url
opencv-python-headless
pycocotools
matplotlib
torch==2.1.0
nvidia-pytriton==0.3.0
numpy

This Dockerfile will do the following:

  • Copy in the model source code

  • Copy in the model weights

  • Install the requirements.txt Python dependencies, including nvidia-pytriton

  • Set the run command to start PyTriton to serve the model

Build the Docker image

  1. Open a terminal or command prompt.

  2. Navigate to the my_model directory.

  3. Run the following command to build the Docker image:


docker build -t my_model_image .

Replace my_model_image with the desired name for your Docker image.

Use the Docker image and upload to NGC

For detailed instructions on uploading a container to NGC, please see Uploading an NVIDIA Container Image.

  1. Tag the Docker image:


    docker tag my_model_image nvcr.io/[ngc_org]/[ngc_team]/my_model_image:latest

  2. Log in to NGC:


    docker login nvcr.io

    • Enter your credentials

  3. Push the Docker image to NGC:


    docker push nvcr.io/[ngc_org]/[ngc_team]/my_model_image:latest

Creating a Function with a Custom Container

Note that the “models” array is omitted from the request.

If no port is specified for the container, it will default to 8000. If you want to override the port, add the parameter inferencePort with the desired port number.


curl -X 'POST' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "echo_function",
    "inferenceUrl": "/echo",
    "containerImage": "",
    "apiBodyFormat": "CUSTOM"
  }'

Function Creation with a Helm Chart


  • A service must be included as part of the Helm chart. The name of this service in your Helm chart should be supplied by setting helmChartServiceName during function definition, see below. This allows the util container to communicate with and make inference requests to the mini service entry point.

  • This feature is only available to internal NVIDIA customers.

  • Example: Use the helm chart example.

How To

  • Upload your Helm Chart to NGC

  • Ensure adherence to the helm chart prerequisite (listed above).

  • Create your function
    • Include the following additional parameter in the function definition
      • helmChart

      • helmChartServiceName

    • The helmChart property in the function definition is an optional field. It should be a URL, hosted by the NGC model registry, pointing to the Helm chart that will deploy the mini service. The helmChartServiceName field is required only when the helmChart property is supplied during function definition.

    • Here is an example:


      curl -X 'POST' \
        '' \
        -H 'Authorization: Bearer <Token>' \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{
          "name": "echo_function",
          "inferenceUrl": "/echo",
          "helmChart": "",
          "helmChartServiceName": "echo",
          "apiBodyFormat": "CUSTOM"
        }'

  • Proceed with function deployment and invocation.

Example JSON Function Definition

Example Function Definition for Mini Service Deployment with Helm Chart


{
  "name": function_name,
  "inferenceUrl": "v2/models/model_name/versions/model_version/infer",
  "inferencePort": 8001,
  "helmChart": "",
  "helmChartServiceName": service_name
}

In this example, the helmChart property is set to the URL of the Helm chart stored in the NGC model registry. This URL is used to deploy the associated mini service using the specified Helm chart. The function creation payload will be deemed invalid if it contains helmChart along with containerImage and/or models properties.
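The mutual-exclusivity rule above can be sketched as a client-side pre-flight check. This is illustrative only; the authoritative validation happens server-side:

```python
def validate_function_payload(payload: dict) -> None:
    """Reject payloads that mix helmChart with containerImage or models,
    and require helmChartServiceName whenever helmChart is present."""
    if "helmChart" not in payload:
        return
    conflicting = {"containerImage", "models"} & payload.keys()
    if conflicting:
        raise ValueError(f"helmChart cannot be combined with: {sorted(conflicting)}")
    if not payload.get("helmChartServiceName"):
        raise ValueError("helmChartServiceName is required when helmChart is set")
```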

When using Helm charts, the following limitations need to be taken into consideration:

  • For model and asset downloads, you can leverage an init container or download them as part of the Helm chart, but the download size is limited by the disk space on the VM (approximately 100GB for GFN; for BYOC this limit will vary)

  • Large response reporting is not supported; if your responses exceed 5MB, you can opt for HTTP streaming or gRPC bidirectional support instead.

Helm Chart Overrides

To override keys in your Helm chart’s values.yml, provide the ‘configuration’ parameter and supply the corresponding key-value pairs, in JSON format, that you would like overridden when the function is deployed.

Example helm chart override


curl -X 'POST' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "deploymentSpecifications": [
      {
        "gpu": "L40",
        "backend": "OCI",
        "maxInstances": 2,
        "minInstances": 1,
        "configuration": {
          "key_one": "<value>",
          "key_two": {
            "key_two_subkey_one": "<value>",
            "key_two_subkey_two": "<value>"
          }
          ...
        }
      },
      {
        "gpu": "T10",
        "backend": "GFN",
        "maxInstances": 2,
        "minInstances": 1
      }
    ]
  }'
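The override semantics amount to a recursive merge of the configuration object into the chart's values: a nested override replaces only the sub-keys it names. A minimal sketch of those semantics (illustrative only; the authoritative merge is performed at deployment time):

```python
def apply_overrides(values: dict, configuration: dict) -> dict:
    """Recursively merge 'configuration' overrides into a copy of the
    chart's values.yml data, leaving the original dict untouched."""
    merged = dict(values)
    for key, override in configuration.items():
        if isinstance(override, dict) and isinstance(merged.get(key), dict):
            # Descend so sibling sub-keys keep their original values.
            merged[key] = apply_overrides(merged[key], override)
        else:
            merged[key] = override
    return merged
```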

When you first register a function, it will have an initial version ID created. You can create additional versions of this function by specifying other models/containers/helm charts to use. Here is a sample API call:


curl -X 'POST' \
  '' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <Token>' \
  -d '{
    "name": "echo_function",
    "inferenceUrl": "/echo",
    "containerImage": "",
    "apiBodyFormat": "CUSTOM"
  }'

Multiple function versions allow for different deployment configurations for each version while still being accessible through a single function endpoint. Deployments are discussed in the following section. Multiple function versions can also be deployed to support A/B testing.

NOTE: Function versioning should only be used if the APIs between the various versions are compatible. Different APIs should be created as new Functions.

Listing Functions

This is used to list the available functions that can be run.


curl -X 'GET' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json'

Listing Function Versions

This is used to list the versions of a specific Function ID.


curl -X 'GET' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json'

Retrieve Function Version Details

This is used to list details of a specific Function version.


curl -X 'GET' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json'

Public Functions

  • Functions marked as public are visible in the list_functions response for all Cloud Functions users.

  • You can filter these out if you wish using your NCAID.

Deleting a Function Version

Use both the Function ID and Function Version ID to delete a Function version.


curl -X 'DELETE' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json'

Creating a Function Version Deployment

To activate the function, it must be deployed as a Function version. This action requests the creation of worker pods to process requests for the invocation of the function. Once worker pods are successfully created, the status of the Function will transition to ACTIVE. If all worker pods fail to launch, the status will change to ERROR.


curl -X 'POST' \
  '' \
  -H 'Authorization: Bearer <Token>' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "deploymentSpecifications": [
      { "gpu": "L40", "backend": "OCI", "maxInstances": 2, "minInstances": 1 },
      { "gpu": "T10", "backend": "GFN", "maxInstances": 2, "minInstances": 1 }
    ]
  }'

Each function version can have a different deployment configuration, allowing heterogeneous computing infrastructure to be used across a single function endpoint.

Delete Function Version Deployment

To delete a Function version deployment, you supply the Function ID and version ID.


curl -X 'DELETE' \
  ''

My function is stuck in the deploying state. What do I do?

Depending on the size of your containers and models, it usually takes up to 30 minutes for your function to deploy, although durations up to 2 hours are permitted. If you believe your function should have deployed already, or if it has entered an error state, review the logs to understand what happened.

See below for information on how to view and use logs to troubleshoot issues with your functions.

There may not be enough capacity available to fulfill your deployment. Try reducing the number of instances you are requesting or changing the GPU/instance type used by your function.

How do I view the logs for my functions?

This is coming soon.

I’m getting errors when invoking my function. What do I do?

Please review the error message and update your container or model as required.

Function Lifecycle API

The table below provides an overview of the Function lifecycle API endpoints and their respective usages.





Operation | Endpoint | Description
Register Function | POST /v2/nvcf/functions | Creates a new Function.
Register Function Version | POST /v2/nvcf/functions/{functionId}/versions | Creates a new version of a Function.
Delete Function Version | DELETE /v2/nvcf/functions/{functionId}/versions/{functionVersionId} | Deletes a Function version specified by its Function ID and version ID.
List Functions | GET /v2/nvcf/functions | Retrieves a list of Functions associated with the account.
List Function Versions | GET /v2/nvcf/functions/{functionId}/versions | Retrieves a list of versions for a specific Function.
Retrieve Function Details | GET /v2/nvcf/functions/{functionId}/versions/{functionVersionId} | Retrieves details of a specific Function version.
Create Function Version Deployment | POST /v2/nvcf/deployments/functions/{functionId}/versions/{functionVersionId} | Initiates the deployment process for a Function version on worker nodes.
Delete Function Version Deployment | DELETE /v2/nvcf/deployments/functions/{functionId}/versions/{functionVersionId} | Initiates the undeployment process for a Function version.
Retrieve Function Version Deployment | GET /v2/nvcf/deployments/functions/{functionId}/versions/{functionVersionId} | Retrieves details of a specific Function version deployment.
Update Function Version Deployment | PUT /v2/nvcf/deployments/functions/{functionId}/versions/{functionVersionId} | Updates the configuration of a Function version deployment.

Function Invocation

The table below provides an overview of the Function invocation API endpoints and their respective usages.





Operation | Endpoint | Description
Invoke Function | POST /v2/nvcf/pexec/functions/{functionId} | Invokes the specified Function to execute the job and returns the results, if available. NVCF randomly selects one of the active versions of the specified Function to execute the submitted job. Avoid making a GET request to obtain the result if the original POST request returns a 200 response.
Invoke Function Version | POST /v2/nvcf/pexec/functions/{functionId}/versions/{functionVersionId} | Invokes the specified version under the specified Function to execute the job and returns the results. Avoid making a GET request to obtain the result if the original POST request returns a 200 response.
Get Function Invocation Status | GET /v2/nvcf/pexec/status/{invocationRequestId} | Used to poll for results of a job when 202 is returned. Avoid making this request to obtain the result if the original POST request returns a 200 response.

The result can be obtained just once, either via the original POST request or via a subsequent GET request.

If the result was included in the original POST response, then the status will be 200. In that case, any subsequent attempts to obtain the result using the GET request will result in 404.

If the original POST request responds with 202 (i.e. the result is pending), then the result should be obtained using the GET request. The GET request can respond with either 202 (result pending) or 200 (result ready). Once the GET request responds with 200, any subsequent attempts to obtain the result using the GET request will result in 404.
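Putting the 200/202 rules together, a polling client can be sketched as below. The post/get callables and the "reqId" response field are assumptions of this sketch; consult the API reference for the exact response schema:

```python
import time


def invoke_and_wait(post, get, function_id, body, poll_interval=1.0, timeout=300.0):
    """Invoke a Function, then poll for the result while 202 is returned.

    `post` and `get` are caller-supplied callables (e.g. thin wrappers around
    an HTTP client that add the Bearer token) returning (status_code, json)
    tuples.
    """
    status, result = post(f"/v2/nvcf/pexec/functions/{function_id}", body)
    if status == 200:
        return result  # result delivered inline; do NOT issue a GET afterwards
    if status != 202:
        raise RuntimeError(f"invocation failed with HTTP {status}")
    request_id = result["reqId"]  # hypothetical field name for the request id
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status, result = get(f"/v2/nvcf/pexec/status/{request_id}")
        if status == 200:
            return result  # result is ready; any further GET would return 404
        if status != 202:
            raise RuntimeError(f"polling failed with HTTP {status}")
        time.sleep(poll_interval)
    raise TimeoutError("result not ready before timeout")
```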

Asset Management





Operation | Endpoint | Description
Create Asset | POST /v2/nvcf/assets | Creates an asset id and a corresponding pre-signed URL for uploading the file.
List Assets | GET /v2/nvcf/assets | Returns a list of assets associated with the account.
Delete Asset | DELETE /v2/nvcf/assets/{nvcf_asset_id} | Deletes an asset using the specified asset id.






Operation | Endpoint | Description
Get Queue Length for Function ID | GET /v2/nvcf/queues/functions/{functionId} | Returns a list containing a single element with the corresponding queue length for the specified Function.
Get Queue Length for Version ID | GET /v2/nvcf/queues/functions/{functionId}/versions/{functionVersionId} | Returns a list containing a single element with the corresponding queue length for the specified Function version.
Get Available GPUs | GET /v2/nvcf/supportedGpus | Returns a list of GPU types you have access to.
Get Queue Position for Request ID | GET /v2/nvcf/queues/{requestId}/position | Returns the estimated position in the queue, up to 1000, for a specific request id of a function invocation request.

The following is a reference of available variables via the headers of the invocation message (auto-populated by NVCF), accessible within the container.

For examples of how to extract and use some of these variables, see NVCF Container Helper Functions.



Header | Description
NVCF-REQID | Request ID for this request.
NVCF-SUB | Message subject.
NVCF-NCAID | Function’s organization’s NCA ID.
NVCF-ASSET-DIR | Asset directory path.
NVCF-LARGE-OUTPUT-DIR | Large output directory path.
NVCF-MAX-RESPONSE-SIZE-BYTES | Max response size in bytes for the function.
NVCF-NSPECTID | NVIDIA reserved variable.
NVCF-BACKEND | Backend or “Cluster Group” the function is deployed on.
NVCF-INSTANCETYPE | Instance type the function is deployed on.
NVCF-REGION | Region or zone the function is deployed in.
NVCF-ENV | Spot environment if deployed on spot instances.


Below are instructions on setting up output directories and efficiently tracking and communicating progress using the utils container.

Setting Up the Output Directory

The utils container automatically configures the output directory for you. To access the path, simply read the NVCF-LARGE-OUTPUT-DIR header. NVCF-LARGE-OUTPUT-DIR points to the directory for that particular requestId.

Writing Large Outputs

When your Custom BLS generates large outputs, save them temporarily with the “*.partial” extension inside the NVCF-LARGE-OUTPUT-DIR directory. For instance, if you’re writing an image, name it image1.partial.

Finalizing Outputs

Once the writing of the output file is complete, rename it from “*.partial” to its appropriate extension. Continuing with our example, rename image1.partial to image1.jpg.
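The write-then-rename convention above can be sketched as a small helper (a minimal illustration; your container can implement this however it likes):

```python
import os


def write_large_output(output_dir: str, final_name: str, data: bytes) -> str:
    """Write a large output for the utils container.

    The file is first written with the '.partial' extension so the watcher
    never picks up incomplete data, then renamed to its final name once the
    write is complete.
    """
    base, _ = os.path.splitext(final_name)
    partial_path = os.path.join(output_dir, base + ".partial")
    final_path = os.path.join(output_dir, final_name)
    with open(partial_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure bytes hit disk before the rename
    os.rename(partial_path, final_path)
    return final_path
```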

Handling Progress Messages

The utils container actively observes the output directory for a file named ‘progress’. This file can be used to communicate progress and partial responses back to the caller.

Structure of the Progress File

This file should contain well-formed JSON data.

Structure the JSON content as follows:


{
  "id": "<requestId>",
  "progress": 50,
  "partialResponse": {
    "exampleKey": "Insert any well-formed JSON here, but ensure its size is less than 250K"
  }
}

Replace <requestId> with the actual request id if it’s present. Modify the progress integer as needed, ranging from 0 (just started) to 100 (fully complete). Within partialResponse, insert any JSON content you want to send as a partial response, making sure it’s smaller than 250KB.
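A minimal sketch of writing the progress file follows. It writes to a temporary name and renames, so the watcher never reads a half-written JSON document; the temporary file name is an assumption of this sketch, mirroring the ".partial" convention above:

```python
import json
import os

MAX_PROGRESS_FILE_BYTES = 250 * 1024  # 250KB limit from the docs


def write_progress(output_dir, request_id, progress, partial_response=None):
    """Write the 'progress' file the utils container watches."""
    doc = {"id": request_id, "progress": int(progress)}
    if partial_response is not None:
        doc["partialResponse"] = partial_response
    payload = json.dumps(doc).encode("utf-8")
    if len(payload) > MAX_PROGRESS_FILE_BYTES:
        raise ValueError("progress file exceeds the 250KB limit")
    # Write under a temporary name, then atomically rename to 'progress'.
    tmp_path = os.path.join(output_dir, "progress.tmp")
    with open(tmp_path, "wb") as f:
        f.write(payload)
    os.replace(tmp_path, os.path.join(output_dir, "progress"))
```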

Transferring Data to the Utils Container

Once the output files and progress file are correctly set up in the output directory under the correct request id, the utils container will automatically detect them. The utils container will then send these as a “progress” message.

Best Practices

  • Always use the “.partial” extension to avoid sending partial or incomplete data.

  • Rename to the final extension only when the writing process is fully complete.

  • Ensure your progress file remains under 250KB to maintain efficiency and avoid errors.

© Copyright 2023-2024, NVIDIA. Last updated on Feb 16, 2024.