Function Deployment

Developer Guide (Latest)

This page describes deployment concepts and steps to deploy a function using Cloud Functions.

A function deployment refers to one or more function instances running on a cluster.

See Function Lifecycle for more key terminology and diagrams.

If your function is container-based, we strongly recommend running the local Deployment Validator before deploying. It catches common configuration issues and shortens development cycles.

  1. Clone the helper repository and install the validator:


> git clone
> cd nv-cloud-function-helpers/local_deployment_test/
> pip install -r requirements.txt

  2. Run the validator on your container. Supported validation arguments:

  • --protocol

  • --health-endpoint

  • --inference-endpoint

  • --container-port


> python3 --image-name $CONTAINER_IMAGE_NAME --protocol http --health-endpoint v2/health/live --inference-endpoint /v2/models/echo/infer --container-port 8000

  3. Once the checks pass, you are prompted to test your container’s inference endpoint. Paste the JSON request body that your function’s inference endpoint expects into the generated temporary file. If everything succeeds, proceed to deploying your function.
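The validator arguments above describe where the container listens and which paths it serves. As a rough illustration of how they combine, the sketch below joins them into the URLs a local check would probe. It assumes the container is reachable on `localhost`; the `endpoint_url` helper is hypothetical and is not part of the validator.

```python
def endpoint_url(protocol: str, port: int, endpoint: str, host: str = "localhost") -> str:
    """Combine validator-style arguments into a full URL.

    The leading slash is normalized, so "v2/health/live" and
    "/v2/health/live" resolve to the same URL.
    """
    path = "/" + endpoint.lstrip("/")
    return f"{protocol}://{host}:{port}{path}"


if __name__ == "__main__":
    # Values taken from the example validator command.
    print(endpoint_url("http", 8000, "v2/health/live"))
    print(endpoint_url("http", 8000, "/v2/models/echo/infer"))
```

Printing both URLs before a run is a quick way to spot a wrong port or a mistyped path.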

You must create a function before you can deploy it. A newly created function is listed as INACTIVE. See Function Creation.

Your Cloud Functions account has access to various GPU clusters, instance types, and configurations up to a set limit. Your NVIDIA Account Manager determines this limit.

Each function version can have a different deployment configuration, allowing heterogeneous computing infrastructure to be used behind a single function endpoint. Once a deployment is created, it can be updated at any time, for example to change the min or max instance counts.

Key Concepts



  • Cluster Group (Backend): A collection of one or more (though usually one) clusters to deploy on, for example a CSP such as Azure, OCI, or GCP, or an NVIDIA-specific cluster such as GFN.

  • Instance Type: Each GPU type can support one or more instance types, which are different configurations, such as the number of CPU cores and the number of GPUs per node.

  • Min Instances: The minimum number of instances your function should be deployed on.

  • Max Instances: The maximum number of instances your function is allowed to autoscale to.

  • Max Concurrency: The number of simultaneous invocations your container can handle at any given time.

If a deployment’s minimum instance count is set to 0, the function status will be ACTIVE, but an instance is deployed only upon the first invocation of the function.

Invocations made while the instance is starting up will be queued until the instance is ready. Refer to Function States for understanding what each status means.
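The relationships between these settings can be sketched in a few lines of Python. This is a minimal illustration, not the service’s own validation: the `DeploymentSettings` class and its rules (for example, that max instances must be at least 1) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DeploymentSettings:
    """Hypothetical container for the scaling-related deployment settings."""
    min_instances: int
    max_instances: int
    max_concurrency: int

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the values are consistent."""
        issues = []
        if self.min_instances < 0:
            issues.append("min_instances must be 0 or greater")
        if self.max_instances < 1:
            issues.append("max_instances must be at least 1")
        if self.min_instances > self.max_instances:
            issues.append("min_instances must not exceed max_instances")
        if self.max_concurrency < 1:
            issues.append("max_concurrency must be at least 1")
        return issues

    def peak_concurrent_invocations(self) -> int:
        """Invocations that can be in flight when fully scaled out."""
        return self.max_instances * self.max_concurrency

    def deploys_on_first_invocation(self) -> bool:
        """With min_instances == 0, no instance exists until the first call."""
        return self.min_instances == 0


if __name__ == "__main__":
    settings = DeploymentSettings(min_instances=0, max_instances=2, max_concurrency=4)
    print(settings.problems())
    print(settings.peak_concurrent_invocations())
    print(settings.deploys_on_first_invocation())
```

With a minimum of 0, expect the first invocations to wait in the queue while the instance starts.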

Deploy via the UI

  1. Once you’ve created a function, click on the kebab menu on the right to configure a deployment.

  2. First, choose the target cluster, GPU type and instance type.


  3. Next, set the max concurrency and the instance counts.


Your function will be occupying GPUs up to the min instance count, even if it’s not necessarily performing work.

By default, autoscaling is enabled for all functions. It is therefore most cost-effective to set the minimum instance count as low as possible and let Cloud Functions autoscale as needed.

  4. Optionally, set additional deployment specifications. For example, you can do this if the function is compatible across multiple GPU types, or if there are multiple regions you’d like to deploy the same function to.

  5. Choose “Deploy Function”. Deployment times will vary depending on the cluster selected and available capacity.
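To make the cost note in the steps above concrete, the sketch below estimates how many GPUs a deployment holds while idle. It assumes each instance occupies a fixed number of GPUs determined by its instance type; `reserved_gpus` is a hypothetical helper, and the GPU counts are supplied by the caller rather than looked up from the service.

```python
def reserved_gpus(min_instances: int, gpus_per_instance: int) -> int:
    """GPUs held even when the function receives no traffic."""
    if min_instances < 0:
        raise ValueError("min_instances must be 0 or greater")
    if gpus_per_instance < 1:
        raise ValueError("gpus_per_instance must be at least 1")
    return min_instances * gpus_per_instance


if __name__ == "__main__":
    # Two always-on instances of an eight-GPU instance type.
    print(reserved_gpus(min_instances=2, gpus_per_instance=8))
    # Scale-to-zero: nothing is held until the first invocation.
    print(reserved_gpus(min_instances=0, gpus_per_instance=8))
```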

Deploy via API

  1. Ensure you have created an API key. See Generate an NGC Personal API Key.

  2. First, list the available GPU clusters, types and configurations.


curl --location '' \
  --header 'Accept: application/json' \
  --header "Authorization: Bearer $API_KEY"

See an example response below:


{
  "clusterGroups": [
    {
      "id": "...",
      "name": "GCP-ASIASE1-A",
      "ncaId": "...",
      "authorizedNcaIds": [ "*" ],
      "gpus": [
        {
          "name": "H100",
          "instanceTypes": [
            { "name": "a3-highgpu-8g_1x", "description": "Single H100 GPU", "default": true },
            { "name": "a3-highgpu-8g_4x", "description": "Four 80 GB H100 GPU", "default": false },
            { "name": "a3-highgpu-8g_2x", "description": "Two 80 GB H100 GPU", "default": false },
            { "name": "a3-highgpu-8g_8x", "description": "Eight 80 GB H100 GPU", "default": false }
          ]
        }
      ],
      "clusters": [
        {
          "k8sVersion": "v1.29.2-gke.1060000",
          "id": "...",
          "name": "nvcf-gcp-prod-asiase1-a"
        }
      ]
    }
    ...
  ]
}

In this example (which has some data omitted), the account is authorized to deploy on the GCP-ASIASE1-A cluster, which has the H100 GPU type in four different instance type configurations.
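A response of this shape can be walked programmatically to pick a backend, GPU, and instance type. The sketch below uses only the fields visible in the example response; the helper names are hypothetical, and the embedded sample is a trimmed copy of that example, not live data.

```python
import json
from typing import List, Optional, Tuple

# Trimmed copy of the example response, with placeholder IDs kept as "...".
SAMPLE_RESPONSE = """
{
  "clusterGroups": [
    {
      "id": "...",
      "name": "GCP-ASIASE1-A",
      "gpus": [
        {
          "name": "H100",
          "instanceTypes": [
            {"name": "a3-highgpu-8g_1x", "description": "Single H100 GPU", "default": true},
            {"name": "a3-highgpu-8g_4x", "description": "Four 80 GB H100 GPU", "default": false},
            {"name": "a3-highgpu-8g_2x", "description": "Two 80 GB H100 GPU", "default": false},
            {"name": "a3-highgpu-8g_8x", "description": "Eight 80 GB H100 GPU", "default": false}
          ]
        }
      ]
    }
  ]
}
"""


def list_instance_types(response: dict) -> List[Tuple[str, str, str, bool]]:
    """Flatten the response into (backend, gpu, instance type, is_default) rows."""
    rows = []
    for group in response.get("clusterGroups", []):
        for gpu in group.get("gpus", []):
            for itype in gpu.get("instanceTypes", []):
                rows.append((group["name"], gpu["name"], itype["name"],
                             bool(itype.get("default", False))))
    return rows


def default_instance_type(response: dict, backend: str, gpu: str) -> Optional[str]:
    """Return the instance type flagged as default for a backend/GPU pair, if any."""
    for group_name, gpu_name, itype_name, is_default in list_instance_types(response):
        if group_name == backend and gpu_name == gpu and is_default:
            return itype_name
    return None


if __name__ == "__main__":
    data = json.loads(SAMPLE_RESPONSE)
    for row in list_instance_types(data):
        print(row)
    print(default_instance_type(data, "GCP-ASIASE1-A", "H100"))
```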

  3. Deploy the function via API by creating a deployment specification.


curl --location "$FUNCTION_ID/versions/$FUNCTION_VERSION_ID" \
  --header 'Content-Type: application/json' \
  --header 'Accept: application/json' \
  --header "Authorization: Bearer $API_KEY" \
  --data '{
    "deploymentSpecifications": [
      {
        "backend": "GCP-ASIASE1-A",
        "gpu": "H100",
        "minInstances": 1,
        "maxInstances": 2,
        "maxRequestConcurrency": 1
      }
    ]
  }'

  4. Refer to the OpenAPI Specification for further API documentation.
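Hand-written JSON is easy to get wrong: a missing or trailing comma makes the request body invalid. One way to avoid this is to generate the body with a JSON library, as in the sketch below. The field names come from the example request; `build_deployment_body` is a hypothetical helper, and sending the instance counts as integers is an assumption, so check the OpenAPI Specification for the expected types.

```python
import json


def build_deployment_body(backend: str, gpu: str, min_instances: int,
                          max_instances: int, max_request_concurrency: int) -> str:
    """Serialize a single deployment specification as a JSON request body."""
    if min_instances > max_instances:
        raise ValueError("min_instances must not exceed max_instances")
    spec = {
        "backend": backend,
        "gpu": gpu,
        "minInstances": min_instances,          # assumed to be an integer
        "maxInstances": max_instances,          # assumed to be an integer
        "maxRequestConcurrency": max_request_concurrency,
    }
    # json.dumps always emits syntactically valid JSON.
    return json.dumps({"deploymentSpecifications": [spec]}, indent=2)


if __name__ == "__main__":
    print(build_deployment_body("GCP-ASIASE1-A", "H100", 1, 2, 1))
```

The resulting string can be written to a file and passed to curl with `--data @file.json`.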

Deploy via CLI

  1. Ensure you have created an API key. See Generate an NGC Personal API Key.

  2. Ensure you have the NGC CLI configured.

  3. First, list the available GPU clusters, types and configurations.


ngc cloud-function available-gpus

  4. Deploy the function via CLI by creating a deployment specification.



  5. See the NGC CLI Documentation for further commands.

Delete a Deployment

To delete a function version deployment, supply the function ID and version ID.

Via UI, choose “Disable Function Version” in the Functions List Page for any deployed function and version.

Via API:



Via CLI:


ngc cloud-function function deploy remove $FUNCTION_ID:$FUNCTION_VERSION_ID


Set the graceful parameter to true to require active function instances to complete any in-flight inference requests and drain all queued requests before terminating.

When a deployment is deleted, the function’s status will immediately become INACTIVE indicating it can no longer serve invocations.

Depending on the size of your containers and models, it can take anywhere from 2 minutes to 30 minutes for your function to deploy, although durations up to 2 hours are permitted. This is also dependent on whether your function is deploying from a cold start, or whether it’s scaling up or down (often much faster due to caching in place). Monitor your function’s instance count and scaling via the Function Metrics Page.
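If you script deployments, a polling loop with a timeout matching the 2-hour ceiling is a reasonable pattern. The sketch below is illustrative only: `get_status` is a caller-supplied function (for example, a wrapper around an API or CLI call), and the status names other than ACTIVE, such as "DEPLOYING" and "ERROR", are assumptions for the example.

```python
import time
from typing import Callable

# Statuses that end the wait. "ERROR" is an assumed name for the error state.
TERMINAL_STATUSES = ("ACTIVE", "ERROR")


def wait_for_deployment(get_status: Callable[[], str],
                        timeout_s: float = 2 * 60 * 60,
                        poll_interval_s: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> str:
    """Poll get_status() until a terminal status is seen or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"function still {status} after {timeout_s} seconds")
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated status sequence standing in for a real status lookup.
    fake = iter(["DEPLOYING", "DEPLOYING", "ACTIVE"])
    print(wait_for_deployment(lambda: next(fake), poll_interval_s=0.0))
```

The `sleep` and `clock` parameters exist so the loop can be tested without waiting in real time.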

If you believe your function should have deployed already, or if it has entered an error state, review the logs to understand what happened, or reach out to your NVCF Support Team.

Below are some common deployment failures:

  • Function configuration problems: These occur when the inference or health endpoints or ports are defined incorrectly, causing the container to be marked unhealthy. Try running Deployment Validation on the container locally to rule out configuration issues.

  • Inadequate capacity for the chosen cluster: This is usually indicated in the deployment failure error message in the UI. Try reducing the number of instances you are requesting, or change the GPU or instance type used by your function.

  • Container in restart loop: This is indicated in the inference container logs (if your container is configured to emit logs) and is fixed by debugging and updating your inference container code.

  • Model file not found: This error typically occurs when the inference container expects a model file in a specified location, but the file is not present. Ensure the path for your model files is correct and that the necessary files, such as config.json, are available at that location. The config.json file should be located under the /config/models/$model-name directory.
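For the missing model file case, a small check that runs at container startup can make the failure obvious in the logs. The sketch below looks for `config.json` under `/config/models/$model-name`, as described above; the `missing_model_files` helper and its parameters are hypothetical, and the list of required files is an assumption you should adapt to your model.

```python
from pathlib import Path
from typing import List, Sequence


def missing_model_files(model_name: str,
                        required: Sequence[str] = ("config.json",),
                        models_root: str = "/config/models") -> List[str]:
    """Return the required file names that are absent from the model directory."""
    model_dir = Path(models_root) / model_name
    return [name for name in required if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # "echo" is the model name used in the validator example.
    missing = missing_model_files("echo")
    if missing:
        print(f"Missing under /config/models/echo: {', '.join(missing)}")
    else:
        print("All required model files are present.")
```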
© Copyright © 2024, NVIDIA Corporation. Last updated on Jun 7, 2024.