Function Deployment#

This page describes deployment concepts and steps to deploy a function using Cloud Functions.

A function deployment refers to one or more function instances running on a cluster.

See Function Lifecycle for more key terminology and diagrams.

Deployment Validation#

If your function is container-based, it is strongly recommended to run the local Deployment Validator before deploying, to catch common configuration issues and shorten development cycles.

  1. Clone the helper repository and install the validator:

git clone https://github.com/NVIDIA/nv-cloud-function-helpers.git
cd nv-cloud-function-helpers/local_deployment_test/
pip install -r requirements.txt
  2. Run the validator on your container. Supported validation arguments:

  • --protocol

  • --health-endpoint

  • --inference-endpoint

  • --container-port

python3 test_container.py --image-name $CONTAINER_IMAGE_NAME --protocol http --health-endpoint v2/health/live --inference-endpoint /v2/models/echo/infer --container-port 8000
  3. Once the checks have passed, you'll be prompted to test your container's inference endpoint. Paste your function's expected inference request body (JSON) into the temporary file that the validator generates (see the example sketch below). If everything succeeds, proceed to deploy your function.
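For reference, a minimal request body for the Triton-style echo model used in the example command above could look like the following. This is only a sketch: the validator generates its own temporary file, the file name below is illustrative, and the input names, shapes, and datatypes must match whatever your container actually expects.

# Hypothetical request body for a KServe v2 / Triton-style "echo" model.
# The file name is illustrative; the validator prompts you with its own temporary file.
cat > request_body.json <<'EOF'
{
  "inputs": [
    {
      "name": "INPUT_TEXT",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["hello world"]
    }
  ]
}
EOF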

Deploying a Function#

Before deploying a function, you must first create it. Once created, it is listed as INACTIVE. See Function Creation.

Your Cloud Functions account has access to various GPU clusters, instance types, and configurations, up to a quota determined by your NVIDIA Account Manager.

Each function version can have a different deployment configuration, allowing for heterogeneous computing infrastructure to be used across a single function endpoint. Once a deployment is created, it can be updated at any time, for example, to change min or max instance counts.
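For example, the instance counts of an existing deployment can be changed without redeploying the container. The sketch below assumes the update uses the same deployments path shown in Deploy via API, with an HTTP PUT; confirm the exact method and request shape in the OpenAPI Specification.

# Sketch: update an existing deployment's min/max instance counts (PUT on the deployments path is assumed).
curl -X PUT "https://api.ngc.nvidia.com/v2/nvcf/deployments/functions/$FUNCTION_ID/versions/$FUNCTION_VERSION_ID" \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $API_KEY" \
--data '{
    "deploymentSpecifications": [
        {
            "backend": "GCP-ASIASE1-A",
            "gpu": "H100",
            "minInstances": "1",
            "maxInstances": "4",
            "maxRequestConcurrency": 1
        }
    ]
}'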

Key Concepts#

| Term | Description |
| --- | --- |
| Cluster Group (Backend) | A collection of one or more (though usually one) clusters to deploy on, for example a CSP such as Azure, OCI, or GCP, or an NVIDIA-specific cluster like GFN. |
| Instance Type | Each GPU type can support one or more instance types, which are different configurations, such as the number of CPU cores and the number of GPUs per node. |
| Min Instances | The minimum number of instances your function should be deployed on. |
| Max Instances | The maximum number of instances your function is allowed to autoscale to. |
| Max Concurrency | The number of simultaneous invocations your container can handle at any given time. |
| Function Request Queue | A first-in, first-out queue, created during function version deployment, that buffers incoming requests based on function "worker" instance availability. |
| Autoscaling | Automatic scaling of instances up or down, from the minimum to the maximum instance count, based on utilization heuristics and queue depth. |

Function Queueing#

The following describes the cases that trigger request queuing. Cloud Functions maintains one queue per function version ID.

For synchronous HTTP requests - queuing is triggered when the function reaches its max concurrency limit of requests currently in progress.

Example

  • A single function instance is deployed, with max concurrency set to 2 and min and max instance counts of 1.

  • 3 invocation requests hit the Cloud Functions API via the /pexec endpoint for the function.

  • The Cloud Functions API forwards 2 of the invocation requests to the instance.

  • The remaining request is queued; the call returns a 202 status and must be handled by HTTP polling (see the polling sketch below).
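A minimal polling sketch for the queued request is shown below. It assumes the 202 response carries the request ID in an NVCF-REQID header and that status can be fetched from the /v2/nvcf/pexec/status/<request-id> path; confirm the exact header and endpoint in the invocation documentation.

# Sketch: poll a queued (202) invocation.
# $REQUEST_ID is assumed to come from the NVCF-REQID header of the 202 response.
curl --location "https://api.nvcf.nvidia.com/v2/nvcf/pexec/status/$REQUEST_ID" \
--header 'Accept: application/json' \
--header "Authorization: Bearer $API_KEY"
# Another 202 means the request is still queued or in progress; a 200 carries the result.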

For streaming requests such as gRPC - queuing is triggered when the function reaches its max concurrency limit of current connections.

Example

  • A gRPC function is deployed with max concurrency set to 2 and min and max instance counts of 1.

  • 3 connection requests hit the Cloud Functions gRPC endpoint for the function.

  • 2 connections will be created to the function.

  • The remaining connection request will wait to connect until one of the 2 current connections is closed.

Autoscaling and Instance Counts#

Autoscaling of function instances will only occur when the maximum instance count is above the minimum instance count. Scale up or down is determined based on proprietary utilization heuristics and the function’s queue depth.

If the deployment's minimum instance count is set to 0, the function status will be ACTIVE, but instances will only be deployed upon the first invocation of the function. After an extended idle period with no requests, the function scales back down to 0. Setting the minimum instance count to 0 is therefore generally a best practice for saving on hardware costs, with the trade-off of added deployment time on the first invocation after an idle period. It's especially useful for infrequently used functions where longer response times are acceptable.

Invocations made while the instance is starting up will be queued until the instance is ready. Refer to Function States for understanding what each status means.

Deploy via the UI#

  1. Once you’ve created a function, click on the kebab menu on the right to configure a deployment.

  2. First, choose the target cluster, GPU type and instance type.

  3. Next, set max concurrency and instance counts.

Note

Your function will occupy GPUs up to the minimum instance count, even when it is not performing work.

By default, autoscaling is enabled for all functions. Therefore, it's most cost-effective to set the lowest minimum instance count possible and allow Cloud Functions to autoscale as needed.

  4. Optionally, add additional deployment specifications, for example if the function is compatible with multiple GPU types or if you'd like to deploy the same function to multiple regions.

  5. Choose "Deploy Function". Deployment times will vary depending on the cluster selected and available capacity.

Deploy via API#

  1. Ensure you have an API key created, see Generate an NGC Personal API Key.

  2. First, list the available GPU clusters, types and configurations.

curl --location 'https://api.ngc.nvidia.com/v2/nvcf/clusterGroups' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $API_KEY"

See an example response below:

{
    "clusterGroups": [
        {
            "id": "...",
            "name": "GCP-ASIASE1-A",
            "ncaId": "...",
            "authorizedNcaIds": [
                "*"
            ],
            "gpus": [
                {
                    "name": "H100",
                    "instanceTypes": [
                        {
                            "name": "a3-highgpu-8g_1x",
                            "description": "Single H100 GPU",
                            "default": true
                        },
                        {
                            "name": "a3-highgpu-8g_4x",
                            "description": "Four 80 GB H100 GPU",
                            "default": false
                        },
                        {
                            "name": "a3-highgpu-8g_2x",
                            "description": "Two 80 GB H100 GPU",
                            "default": false
                        },
                        {
                            "name": "a3-highgpu-8g_8x",
                            "description": "Eight 80 GB H100 GPU",
                            "default": false
                        }
                    ]
                }
            ],
            "clusters": [
                {
                    "k8sVersion": "v1.29.2-gke.1060000",
                    "id": "...",
                    "name": "nvcf-gcp-prod-asiase1-a"
                }
            ]
        }
        ...
    ]
}

In this example (which has some data omitted), the account is authorized to deploy on the GCP-ASIASE1-A cluster, which has the H100 GPU type in four different instance type configurations.
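To pick a backend, GPU, and instance type programmatically, you can filter the clusterGroups response with a tool such as jq; a minimal sketch:

# List the backend, GPU, and instance type combinations your account can deploy to.
curl -s --location 'https://api.ngc.nvidia.com/v2/nvcf/clusterGroups' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $API_KEY" \
| jq -r '.clusterGroups[] | .name as $backend | .gpus[] | .name as $gpu | .instanceTypes[] | "\($backend)\t\($gpu)\t\(.name)"'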

  3. Deploy the function via API by creating a deployment specification.

curl --location "https://api.ngc.nvidia.com/v2/nvcf/deployments/functions/$FUNCTION_ID/versions/$FUNCTION_VERSION_ID" \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--header "Authorization: Bearer $API_KEY" \
--data '{
    "deploymentSpecifications": [
        {
            "backend": "GCP-ASIASE1-A",
            "gpu": "H100",
            "minInstances": "1",
            "maxInstances": "2",
            "maxRequestConcurrency": 1
        }
    ]
}'
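Deployment is asynchronous, so the call above returns before the version is serving traffic. One way to watch for the version to become ACTIVE is to poll the function version details; the sketch below assumes the details endpoint follows the functions/<function-id>/versions/<version-id> pattern and exposes a status field as described in Function States.

# Sketch: check the function version's status until it reports ACTIVE (endpoint pattern and status field assumed).
curl -s --location "https://api.ngc.nvidia.com/v2/nvcf/functions/$FUNCTION_ID/versions/$FUNCTION_VERSION_ID" \
--header 'Accept: application/json' \
--header "Authorization: Bearer $API_KEY"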
  4. Refer to the OpenAPI Specification for further API documentation.

Deploy via CLI#

  1. Ensure you have an API key created, see Generate an NGC Personal API Key.

  2. Ensure you have the NGC CLI configured.

  3. First, list the available GPU clusters, types and configurations.

ngc cloud-function available-gpus
  4. Deploy the function via CLI by creating a deployment specification.

ngc cf function deploy create --deployment-specification $CLUSTER_BACKEND:$GPU_TYPE:$INSTANCE_TYPE:$MIN_INSTANCES:$MAX_INSTANCES $FUNCTION_ID:$FUNCTION_VERSION_ID
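For example, using values from the clusterGroups response shown in the API section (substitute your own backend, GPU type, and instance type):

# Example values borrowed from the clusterGroups response earlier on this page; substitute your own.
CLUSTER_BACKEND=GCP-ASIASE1-A
GPU_TYPE=H100
INSTANCE_TYPE=a3-highgpu-8g_1x
MIN_INSTANCES=1
MAX_INSTANCES=2
ngc cf function deploy create \
    --deployment-specification $CLUSTER_BACKEND:$GPU_TYPE:$INSTANCE_TYPE:$MIN_INSTANCES:$MAX_INSTANCES \
    $FUNCTION_ID:$FUNCTION_VERSION_ID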
  5. See the NGC CLI Documentation for further commands.

Delete a Deployment#

To delete a function version deployment, supply the function ID and version ID.

Via UI, choose “Disable Function Version” in the Functions List Page for any deployed function and version.

Via API:

curl -X 'DELETE' \
  "https://api.ngc.nvidia.com/v2/nvcf/deployments/functions/$FUNCTION_ID/versions/$FUNCTION_VERSION_ID" \
  --header "Authorization: Bearer $API_KEY"

Via CLI:

ngc cloud-function function deploy remove $FUNCTION_ID:$FUNCTION_VERSION_ID

Tip

Set the graceful parameter to true to require active function instances to fulfill any in-flight inference requests and drain all queued requests before terminating.
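With the API call above, graceful removal might look like the following; the query-parameter form is an assumption, so confirm the exact parameter name and placement in the OpenAPI Specification.

# Sketch: graceful deployment removal (query-parameter form assumed; verify against the OpenAPI spec).
curl -X 'DELETE' \
  "https://api.ngc.nvidia.com/v2/nvcf/deployments/functions/$FUNCTION_ID/versions/$FUNCTION_VERSION_ID?graceful=true" \
  --header "Authorization: Bearer $API_KEY"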

When a deployment is deleted, the function’s status will immediately become INACTIVE indicating it can no longer serve invocations.

Deployment Failures#

Depending on the size of your containers and models, it can take anywhere from 2 minutes to 30 minutes for your function to deploy, although durations of up to 2 hours are permitted. Deployment time also depends on whether your function is deploying from a cold start or scaling up or down (scaling is often much faster because of caching). Monitor your function's instance count and scaling via the Function Metrics Page.

If you believe your function should have deployed already, or if it has entered an error state, review the logs to understand what happened, or reach out to your NVCF Support Team.

Below are some common deployment failures:

| Failure Type | Description |
| --- | --- |
| Function configuration problems | Occurs when incorrect inference or health endpoints or ports are defined, causing the container to be marked unhealthy. Try running Deployment Validation on the container locally to rule out configuration issues. |
| Inadequate capacity for the chosen cluster | Usually indicated in the deployment failure error message in the UI. Try reducing the number of instances you are requesting, or changing the GPU/instance type used by your function. |
| Container in restart loop | Indicated in the inference container logs (if your container is configured to emit logs); fixed by debugging and updating your inference container code. |
| Model file not found | Typically occurs when the inference container expects a model file in a specified location, but the file is not present. Ensure the path for your model files is correct and the necessary files, like config.json, are available at that location. The config.json file should be located under the /config/models/$model-name directory. |
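To rule out the last failure before deploying, you can check the model layout inside your container image locally; a minimal sketch, where the image name and model name are placeholders, and which only applies if the model files are baked into the image rather than mounted at runtime:

# Verify that the expected model layout exists inside the container image.
# $CONTAINER_IMAGE_NAME and "my-model" are placeholders; adjust to your own setup.
docker run --rm --entrypoint ls $CONTAINER_IMAGE_NAME -R /config/models/my-model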