Tasks#

Introduction#

NVCF now supports one-and-done workloads such as fine-tuning models or building TensorRT engines that can run for extended periods where latency is not critical. These workloads require executing a container with appropriate input and configuration on a GPU-powered worker. The output or result of executing such a container can be a set of results, such as checkpoints. This functionality is supported through Tasks. The container that will be wrapped in a Task and executed on a worker is referred to as the Task Container.

Unlike Functions, a Task is not interactive or invoked multiple times. Launching a Task is analogous to deploying a Function, with the key difference being that the Task Container runs to completion on a GPU-powered worker. Upon completion, the worker shuts down.

Tasks can be used for:

  • Fine-tuning existing models using training and validation data to produce checkpoints with weights.

  • Generating TensorRT engine builds.

  • Other operations that require execution on a GPU-powered worker.

A sample task container can be found on GitHub.

Task Management and Execution#

Tasks belong to a specific NVIDIA Cloud Account (NCA). The following sections detail the REST endpoints available for Task management and execution.

Lifecycle of a Task#

As a Task progresses through its lifecycle, its status updates accordingly:

  • QUEUED: The Task is created and waiting to be scheduled.

  • LAUNCHED: The Task has been scheduled to run.

  • RUNNING: The Task is currently executing.

  • COMPLETED: The Task has finished successfully.

  • CANCELED: The Task has been canceled by the user.

  • ERRORED: An error occurred during Task execution.

  • EXCEEDED_MAX_RUNTIME_DURATION: The Task exceeded the specified maximum runtime duration.

  • EXCEEDED_MAX_QUEUED_DURATION: The Task could not be launched within the specified maximum queued duration.
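
When polling a Task, it helps to know which statuses can still change. The sketch below groups the statuses listed above; the split into "active" and "terminal" groups is an assumption inferred from the descriptions, and the names ACTIVE_STATUSES, TERMINAL_STATUSES, and is_terminal are illustrative rather than part of the API.

```python
# Status names are taken from the lifecycle list above. The split into
# "active" and "terminal" groups is an assumption, not an NVCF API concept.
ACTIVE_STATUSES = {"QUEUED", "LAUNCHED", "RUNNING"}
TERMINAL_STATUSES = {
    "COMPLETED",
    "CANCELED",
    "ERRORED",
    "EXCEEDED_MAX_RUNTIME_DURATION",
    "EXCEEDED_MAX_QUEUED_DURATION",
}


def is_terminal(status: str) -> bool:
    """Return True if a Task in this status is not expected to change again."""
    if status in TERMINAL_STATUSES:
        return True
    if status in ACTIVE_STATUSES:
        return False
    raise ValueError(f"unknown Task status: {status!r}")
```

A polling loop could stop as soon as is_terminal(task["status"]) returns True.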

Environments#

Prod: https://api.nvct.nvidia.com

REST API Endpoints#

Create Task#

Endpoint:

POST /v1/nvct/tasks
Authorization: Bearer <SSA-JWT> with 'launch_task' scope

Request Payload:

{
  "name": "gpt-3.5-turbo-fine-tuning-task",
  "containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
  "containerArgs": "python3 main.py", # optional container command
  "containerEnvironment": [ # optional environment variables
    {"key": "MODEL_NAME", "value": "gpt-3.5-turbo-fine-tune"}
  ],
  "models": [  # Optional model
    {
        "name": "gpt-3.5-turbo",     # Input model
        "uri": "nvcr.io/zq9tgr/model/gpt-3.5-turbo/versions/1/files",
        "version": "2.0.0"
    }
  ],
  "gpuSpecification": {
    "gpu": "T10",
    "instanceType": "g6.full",
    "backend": "GFN"
  },
  "secrets": [
    {
      "name": "NGC_API_KEY",    # Well-known secret name for uploading results to NGC
      "value": "<personal api-key with PRIVATE_REGISTRY scope to upload results to NGC org zq9tgr >"
    }
  ],
  "maxRuntimeDuration": "PT7H",
  "maxQueuedDuration": "PT6H",
  "terminationGracePeriodDuration": "PT15M",
  "resultHandlingStrategy": "UPLOAD",
  "resultsLocation": "zq9tgr/finetuned-gpt-3.5-turbo"
}

Response:

Status 200

{
  "id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
  "ncaId": "test-nca-id",
  "name": "gpt-3.5-turbo-fine-tuning-task"
}

Note

  • For the full response schema, see the OpenAPI specs.

  • If the resultHandlingStrategy is UPLOAD, you must include NGC_API_KEY as a secret to allow results to be uploaded to your NGC Private Registry.

  • If maxRuntimeDuration is not specified and GFN is the backend, the request will be rejected with status 400. Additionally, if maxRuntimeDuration exceeds 8 hours for a Task on the GFN backend, the request will be rejected.

  • When this endpoint is invoked, a new Task is created/queued, and appropriate entries are added to Tasks DB to manage the Task’s lifecycle.
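
The rules in the notes above can be checked on the client before sending the request. The following is a minimal sketch, not the service's own validation: parse_duration and check_create_request are hypothetical helper names, and the duration parser handles only the time-based ISO 8601 subset that appears in this document (such as PT7H or PT15M).

```python
import re
from datetime import timedelta

# Handles only the time-based subset of ISO 8601 durations used in this
# document (for example "PT7H", "PT15M", "PT1H30M"); an assumption for brevity.
_DURATION_RE = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(value: str) -> timedelta:
    """Convert a duration such as "PT1H30M" into a timedelta."""
    match = _DURATION_RE.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def check_create_request(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if payload.get("resultHandlingStrategy") == "UPLOAD":
        secret_names = {s.get("name") for s in payload.get("secrets", [])}
        if "NGC_API_KEY" not in secret_names:
            problems.append("UPLOAD requires an NGC_API_KEY secret")
    if payload.get("gpuSpecification", {}).get("backend") == "GFN":
        runtime = payload.get("maxRuntimeDuration")
        if runtime is None:
            problems.append("GFN requires maxRuntimeDuration")
        elif parse_duration(runtime) > timedelta(hours=8):
            problems.append("maxRuntimeDuration exceeds 8 hours on GFN")
    return problems
```

Running check_create_request on the sample payload above should return an empty list, since it names an NGC_API_KEY secret and sets maxRuntimeDuration to PT7H.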

List Tasks#

Endpoint:

GET /v1/nvct/tasks?limit=N&cursor=<cursor-id>  # N defaults to 10
Authorization: Bearer <SSA-JWT> with 'list_tasks' scope

Response:

Status 200

{
  "tasks": [
    {
      "id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
      "ncaId": "test-nca-id",
      "name": "gpt-3.5-turbo-fine-tuning-task",
      "tags": ["gpt", "llm", "3.5"],
      "description": "GPT 3.5 Fine-Tuning Task",
      "status": "COMPLETED",
      "containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
      "secretNames": ["secret-t1"],
      "createdAt": "2024-04-15T17:57:39.716Z",
      "lastUpdatedAt": "2024-04-15T18:50:01.716Z",
      "lastHeartbeatAt": "2024-04-15T18:50:01.716Z"
    },
    {
      "id": "679ad430-34b9-4a6e-9537-a060db4a9e6c",
      "ncaId": "test-nca-id",
      "name": "trt-engine-build-task",
      "tags": ["gpt", "llm", "4.0"],
      "description": "TRT Engine Build Task",
      "status": "RUNNING",
      "containerImage": "nvcr.io/zq9tgr/trt-engine-build:1.0.0",
      "secretNames": ["secret-t2"],
      "createdAt": "2024-04-15T18:57:39.716Z",
      "lastUpdatedAt": "2024-04-15T18:50:01.716Z",
      "lastHeartbeatAt": "2024-04-15T18:50:01.716Z"
    }
  ],
  "cursor": "<UUID>"  # To be passed in as query param with the next pagination request
}

Note

The response does not include instance details or secret values for the listed Tasks.
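
Cursor-based pagination can be wrapped in a generator. In this sketch, fetch_page is a hypothetical callable supplied by the caller: it should issue GET /v1/nvct/tasks (adding the cursor query parameter when one is given) and return the decoded JSON body. Treating a missing or empty cursor as the end of the listing is an assumption; the text above does not state how the last page is signaled.

```python
from typing import Callable, Iterator, Optional


def iter_tasks(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every Task by following the `cursor` field page by page."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("tasks", [])
        cursor = page.get("cursor")
        if not cursor:  # Assumption: no cursor means this was the last page.
            break
```

Keeping the HTTP call outside the generator lets the same loop be reused for the events and results endpoints, which follow the same limit and cursor pattern.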

Retrieve Task Details#

Endpoint:

GET /v1/nvct/tasks/{taskId}
Authorization: Bearer <SSA-JWT> with 'task_details' scope

Response:

Status 200

{
  "id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
  "ncaId": "test-nca-id",
  "name": "gpt-3.5-fine-tuning-task",
  "tags": ["gpt", "llm", "3.5"],
  "description": "GPT 3.5 Fine-Tuning Task",
  "status": "QUEUED | LAUNCHED | RUNNING | ERRORED | COMPLETED | CANCELED | EXCEEDED_MAX_RUNTIME_DURATION | EXCEEDED_MAX_QUEUED_DURATION",
  "containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
  "createdAt": "2024-04-15T17:57:39.716Z",
  "lastUpdatedAt": "2024-04-15T18:01:21.650Z",
  "lastHeartbeatAt": "2024-04-15T18:01:21.650Z",
  "percentComplete": 50,
  "secretNames": ["secret-t1"],
  "healthInfo": {
    "instanceType": "g6.full",
    "gpu": "T10",
    "backend": "GFN",
    "error": "Failed to start the container"
  },
  "gpuSpecification": {
    "gpu": "T10",
    "instanceType": "g6.full",
    "backend": "GFN"
  },
  "maxRuntimeDuration": "PT1H",
  "activeInstances": [  # Optional - Typically only one worker
    {
      "instanceId": "bbe8690a-9cc1-4973-bfcb-d5ff64221bb3",
      "taskId": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
      "instanceType": "g6.full",
      "instanceStatus": "ACTIVE",
      "ncaId": "test-nca-id",
      "gpu": "T10",
      "backend": "GFN",
      "location": "ND-SJC6M-03",
      "instanceCreatedAt": "2024-04-15T17:57:50.245Z",
      "instanceUpdatedAt": "2024-04-15T18:01:21.650Z"
    }
  ]
}

Delete Task#

Endpoint:

DELETE /v1/nvct/tasks/{taskId}
Authorization: Bearer <SSA-JWT> with 'delete_task' scope

Response:

Status 204

Note

  • Deleting a Task also removes associated secrets and terminates the GPU-powered instance.

  • If the Task is currently executing, it will be terminated gracefully, allowing the Task Container to save results within the specified terminationGracePeriodDuration.

Cancel Task#

Endpoint:

POST /v1/nvct/tasks/{taskId}/cancel
Authorization: Bearer <SSA-JWT> with 'cancel_task' scope

Response:

Status 200

{
  "id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
  "ncaId": "test-nca-id",
  "name": "example-task",
  "tags": ["gpt", "llm", "3.5"],
  "description": "GPT 3.5 Fine-Tuning Task",
  "status": "CANCELED",
  "containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
  "createdAt": "2024-04-15T17:57:39.716Z",
  "lastUpdatedAt": "2024-04-15T18:01:21.650Z",
  "lastHeartbeatAt": "2024-04-15T18:01:21.650Z"
}

Note

  • Canceling a Task also terminates the associated GPU-powered instance.

  • Canceling a Task stops its execution gracefully, allowing it to save any results within the specified terminationGracePeriodDuration.

List Task Events#

Endpoint:

GET /v1/nvct/tasks/{taskId}/events?limit=N&cursor=<cursor-pos-from-prev-response>
Authorization: Bearer <SSA-JWT> with 'list_task_events' scope

Response:

Status 200

{
  "id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
  "cursor": "cursor-position-for-next-set-of-events",
  "limit": N,
  "ncaId": "test-nca-id",
  "name": "example-task",
  "events": [
    {
      "id": "679ad430-44b9-5a6e-0537-b060db4a9e6c",
      "message": "Status changed from QUEUED to LAUNCHED",
      "createdAt": "2024-04-15T17:57:50.245Z"
    },
    {
      "id": "779ad430-54b9-6a6e-1537-c060db4a9e6c",
      "message": "Status changed from LAUNCHED to RUNNING",
      "createdAt": "2024-04-15T17:58:50.245Z"
    },
    {
      "id": "879ad430-64b9-7a6e-2537-d060db4a9e6c",
      "message": "Status changed from RUNNING to COMPLETED",
      "createdAt": "2024-04-16T17:57:50.245Z"
    }
  ]
}

List Task Results#

Endpoint:

GET /v1/nvct/tasks/{taskId}/results?limit=N&cursor=<result-id>
Authorization: Bearer <SSA-JWT> with 'list_task_results' scope

Response:

{
  "taskId": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
  "ncaId": "test-nca-id",
  "taskName": "fine-tune-gpt-3.5",
  "results": [
    {
      "id": "679ad430-44b9-5a6e-0537-b060db4a9e6c",
      "createdAt": "2024-04-15T17:57:50.245Z",
      "name": "ckpt-step-2000",
      "uri": "nvcr.io/zq9tgr/model/gpt-3.5-ckpt-step-2000/versions/1/files"
    },
    {
      "id": "779ad430-54b9-6a6e-1537-c060db4a9e6c",
      "createdAt": "2024-04-15T16:57:50.245Z",
      "name": "ckpt-step-1000",
      "uri": "nvcr.io/zq9tgr/model/gpt-3.5-ckpt-step-1000/versions/1/files"
    }
  ],
  "cursor": "<cursor-position-for-next-set-of-results>"
}

Update Task Secrets#

Endpoint:

PUT /v1/nvct/secrets/tasks/{taskId}
Authorization: Bearer <SSA-JWT> with 'update_secrets' scope

Request Payload:

{
    "secrets": [
        {
            "name": "AWS_ACCESS_KEY_ID",
            "value": "shhh!"
        },
        {
            "name": "AWS_SECRET_ACCESS_KEY",
            "value": "shhh!shhh!"
        }
    ]
}

Response:

Status 204

Environment Variables Available in Task Container#

In addition to the environment variables that are specified during Task creation, the following environment variables will be made available to the Task Container:

Environment Variable | Description

NVCT_TASK_ID | Unique Task ID. It can be included in the logs.

NVCT_TASK_NAME | Task name

NVCT_NCA_ID | NVIDIA Cloud Account that owns the Task.

NVCT_PROGRESS_FILE_PATH | Path to the progress file.

NVCT_RESULTS_DIR | Location where the result subdirectories should be created.

Models and Resources#

If models are associated with a Task, the Task Container can find the models downloaded from NGC Private Registry under the /config/models/{modelName} folder.

Similarly, if resources are associated with a Task, the Task Container can find the resources downloaded from NGC Private Registry under the /config/resources/{resourceName} folder.

Task Secrets#

If a Task is created with secrets, they are available to the Task Container at the fixed path /var/secrets/secrets.json on the Worker node. Task authors can update or rotate secrets through the Update Task Secrets endpoint, which causes /var/secrets/secrets.json to be updated automatically. The Task Container can watch this file for changes and use the latest secrets as they are rotated. The file has the following form:

{
  "AWS_ACCESS_KEY_ID": "key-id",
  "AWS_SECRET_ACCESS_KEY": "access-key",
  ..
}
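
One simple way to pick up rotated secrets is to reload the file whenever its modification time changes. This is a minimal sketch under that assumption; SecretsCache is a hypothetical helper, and polling the modification time is only one option (a file-watching library would work as well).

```python
import json
import os

SECRETS_PATH = "/var/secrets/secrets.json"  # Fixed path given above.


class SecretsCache:
    """Reload the secrets file when its modification time changes."""

    def __init__(self, path: str = SECRETS_PATH):
        self.path = path
        self._mtime_ns = None
        self._secrets = {}

    def get(self) -> dict:
        """Return the current secrets, re-reading the file only if it changed."""
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as f:
                self._secrets = json.load(f)
            self._mtime_ns = mtime_ns
        return self._secrets
```

Calling get() right before each use of a credential, for example SecretsCache().get()["AWS_ACCESS_KEY_ID"], keeps long-running Tasks on the latest values without restarting the container.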

Task Results#

Handling Task Results#

Tasks can produce multiple results during execution. For example, when fine-tuning a model, multiple checkpoints may be created. Each Task has a result handling strategy specified at creation time:

  • UPLOAD: The system uploads results to a specified location.

  • NONE: The Task Container handles result management.

Result Handling Strategy: UPLOAD#

Under the UPLOAD strategy:

  • The Task Container must create result files in a result-specific subfolder on the disk under NVCT_RESULTS_DIR.

  • The Task Container or Helm Chart must update the lastUpdatedAt field in the progress file every 3 minutes as a heartbeat. The value of lastUpdatedAt should be in RFC3339Nano format. If lastUpdatedAt is more than 5 minutes behind the current time and percentComplete is not 100, the Task is marked ERRORED.

  • The Task Container must update the progress file with percentComplete (between 1 and 100), name, and optional result metadata.

  • The name property in the progress file must match the name of the result-specific subfolder under the NVCT_RESULTS_DIR folder.

  • The system uploads the newly generated result to your NGC Private Registry using the NGC_API_KEY secret.

  • Results are uploaded to the path specified in the resultsLocation property.

  • The Task Container must use NVCT_PROGRESS_FILE_PATH environment variable to create/update the progress file on the Worker node. Similarly, the Task Container must use the NVCT_RESULTS_DIR environment variable to create a result-specific subfolder.
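
The UPLOAD requirements above can be combined into one helper that writes a result and then records it in the progress file. This is a minimal sketch: publish_result and its parameters are hypothetical names, and the progress fields follow the format shown elsewhere in this document.

```python
import json
import os
from datetime import datetime, timezone


def publish_result(name: str, files: dict, percent_complete: int, metadata=None) -> str:
    """Write result files to a result-specific subfolder, then update the progress file.

    `files` maps file names to bytes. Returns the path of the result subfolder.
    """
    results_dir = os.environ["NVCT_RESULTS_DIR"]
    progress_path = os.environ["NVCT_PROGRESS_FILE_PATH"]

    # The subfolder name must match the `name` written to the progress file.
    result_dir = os.path.join(results_dir, name)
    os.makedirs(result_dir, exist_ok=True)
    for filename, data in files.items():
        with open(os.path.join(result_dir, filename), "wb") as f:
            f.write(data)

    progress = {
        "taskId": os.environ["NVCT_TASK_ID"],
        "percentComplete": percent_complete,
        "name": name,
        "metadata": metadata or {},
        "lastUpdatedAt": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    }
    # Write to a temporary file first and rename it into place, so a reader
    # never sees a half-written progress file.
    tmp_path = progress_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(progress, f)
    os.replace(tmp_path, progress_path)
    return result_dir
```

The result files are written before the progress file is updated, so the name in the progress file never points at a subfolder that is still being filled.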

Result Handling Strategy: NONE#

Under the NONE strategy:

  • The Task Container is responsible for creating and uploading results.

  • The Task Container or Helm Chart must update the lastUpdatedAt field in the progress file every 3 minutes as a heartbeat. The value of lastUpdatedAt should be in RFC3339Nano format. If lastUpdatedAt is more than 5 minutes behind the current time and percentComplete is not 100, the Task is marked ERRORED.

  • The Task Container must create/update the progress file with percentComplete set to 100 at the end to indicate that it exited gracefully for the system to mark the Task as COMPLETED.

  • Credentials for external storage can be provided as Task secrets.

  • The system updates the Task’s progress based on the progress file but does not handle result uploading.

  • The Task Container must use NVCT_PROGRESS_FILE_PATH environment variable to create/update the progress file on the Worker node. Similarly, the Task Container should use the NVCT_RESULTS_DIR environment variable to create a result-specific subfolder for storing results.
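
Because the heartbeat is required even when no new result is ready, it can run in a background thread. The sketch below is one possible approach, not a prescribed one: refresh_heartbeat, start_heartbeat, and progress_lock are hypothetical names. It also assumes that a progress file holding only lastUpdatedAt is acceptable before the first result is written, which the text above does not confirm.

```python
import json
import os
import threading
from datetime import datetime, timezone

HEARTBEAT_INTERVAL_SECONDS = 180  # "every 3 minutes" from the text above.
progress_lock = threading.Lock()  # Hold this in any other code that writes the file.


def refresh_heartbeat(progress_path: str) -> dict:
    """Rewrite the progress file with a fresh lastUpdatedAt, keeping other fields."""
    with progress_lock:
        try:
            with open(progress_path, encoding="utf-8") as f:
                progress = json.load(f)
        except FileNotFoundError:
            progress = {}  # Assumption: a heartbeat-only file is acceptable.
        progress["lastUpdatedAt"] = (
            datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
        )
        tmp_path = progress_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(progress, f)
        os.replace(tmp_path, progress_path)  # Rename into place atomically.
        return progress


def start_heartbeat(progress_path: str, stop: threading.Event) -> threading.Thread:
    """Refresh the heartbeat in a daemon thread until `stop` is set."""
    def loop():
        while not stop.is_set():
            refresh_heartbeat(progress_path)
            stop.wait(HEARTBEAT_INTERVAL_SECONDS)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

A Task Container would call start_heartbeat(os.environ["NVCT_PROGRESS_FILE_PATH"], stop) at startup and call stop.set() just before writing the final progress update with percentComplete set to 100.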

Progress File Format#

{
  "taskId": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
  "percentComplete": 20,
  "name": "ckpt-step-2000",
  "metadata": {
    "step-number": 2000,
    "token_accuracy": 0.874
  },
  "lastUpdatedAt": "2025-01-02T15:04:05.999999999Z"
}

  • taskId: Task ID.

  • percentComplete: Integer indicating the completion percentage between 1-100.

  • metadata: Optional field for Task Container to add metadata regarding the upload. Required format: key-value pairs.

  • name: Directory/folder name of the result to be uploaded to NGC Private Registry. Under the UPLOAD strategy the name is restricted: it must be 1-190 characters long, may contain only the characters [0-9a-zA-Z!-_.*'()], and must not start with ./ or ../.

  • lastUpdatedAt: ISO 8601 timestamp (RFC3339Nano format) indicating when the progress file was last updated. It must be updated at least every 3 minutes to signal to NVCF that the Task is still in progress.
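
The naming restrictions for name can be checked before a result is written. This sketch assumes the allowed set is meant as the literal characters ! - _ . * ' ( ) plus ASCII letters and digits (rather than a regular-expression range from ! to _), and that the quote character is a plain ASCII apostrophe; is_valid_result_name is a hypothetical helper.

```python
import re

# Assumption: the allowed set is read as the literal characters
# ! - _ . * ' ( ) plus ASCII letters and digits, not as a regex range.
_RESULT_NAME_RE = re.compile(r"[0-9a-zA-Z!\-_.*'()]{1,190}")


def is_valid_result_name(name: str) -> bool:
    """Check a progress-file `name` against the UPLOAD naming restrictions."""
    if name.startswith(("./", "../")):
        return False
    return _RESULT_NAME_RE.fullmatch(name) is not None
```

Under this reading, "ckpt-step-2000" is accepted, while names containing spaces or slashes are rejected.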

The Task Container must use NVCT_PROGRESS_FILE_PATH environment variable to create/update the progress file on the Worker node. Similarly, the Task Container must use the NVCT_RESULTS_DIR environment variable to create a result-specific subfolder.

An atomic write to the progress file is recommended so that a concurrent reader never sees a partially written file. The Task Container should first write to a new temporary file and then move it over the original progress file, replacing the old contents (for example, with os.rename() in Python). See the sample task container for reference.
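
A minimal sketch of such an atomic write follows. write_progress_atomically is a hypothetical helper; it uses os.replace() instead of os.rename() because os.replace() also overwrites an existing file on Windows, while behaving the same on POSIX. Widening the file mode to 0644 assumes the progress file may be read by a process running as a different user.

```python
import json
import os
import tempfile


def write_progress_atomically(progress: dict, progress_path: str) -> None:
    """Write `progress` as JSON so readers never see a partially written file."""
    directory = os.path.dirname(os.path.abspath(progress_path))
    # Create the temporary file in the target directory so the final rename
    # stays on one filesystem, which is what makes it atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(progress, f)
            f.flush()
            os.fsync(f.fileno())  # Make sure the bytes are on disk before renaming.
        # mkstemp creates the file with mode 0600; widen it in case the reader
        # runs as another user (an assumption about the reader).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, progress_path)  # Swap in the new file in one step.
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The Task Container would call this with the path from NVCT_PROGRESS_FILE_PATH each time percentComplete, name, or lastUpdatedAt changes.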

Intermediate and Final Results#

  • Intermediate Results: As new results are generated, the Task Container should update the progress file with percentComplete (from 1 to 99), name, and optionally result metadata. The system monitors this file and updates the Task’s status and progress accordingly. Consecutive intermediate results with the same percentComplete and name are ignored.

  • Final Result: When the Task completes, the Task Container must update the progress file one last time with percentComplete set to 100, name, and optionally with result metadata. The system marks the Task as COMPLETED.

Task Runtime Duration#

By default, a Task runs indefinitely. However, you can specify the maxRuntimeDuration property when creating a Task to limit its execution time. If maxRuntimeDuration is specified, the Task is terminated gracefully once the duration elapses, allowing it to save any final results. On the GFN backend, maxRuntimeDuration is required and must not exceed 8 hours.

Helm Chart Based Tasks#

Prerequisites#

Note: Ensure that your Helm chart versions do not contain hyphens (-). For example, v1 is acceptable, but v1-test will cause issues.

1. Essential Helm Chart Values#

The following field keys must be defined in values.yaml of your Helm chart. They will be overridden during runtime. Make use of these values as environment variables in your containers.

Key | Description

nvctNcaId | Unique ID of NVIDIA Cloud Account that owns the Task.

nvctTaskId | Unique Task ID. It can be included in the logs.

nvctTaskName | Task name

nvctResultsDir | Location where the result subdirectories should be created.

nvctProgressFilePath | Path to the progress file.

2. Progress Updates#

Make sure the progress file is updated whenever intermediate and final results are generated, regardless of the resultHandlingStrategy specified in the Task definition. Because Tasks are designed to run to completion, ensure that Kubernetes does not restart the container that writes the progress file once it exits. percentComplete must only increase until it reaches 100, so a decrease caused by a container restart will cause the Task to fail. A Kubernetes Job is the recommended way to deploy Task pods. A sample spec is shown below:

apiVersion: batch/v1
kind: Job
metadata:
  name: sample
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ .Release.Name }}
        app.kubernetes.io/instance: {{ .Release.Name }}
    spec:
      restartPolicy: Never
      imagePullSecrets:
      - name: {{ .Values.ngcImagePullSecretName }}
      containers:
      - name: task
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

3. Container Image Pull Secret Management#

For pulling containers defined as part of the Helm chart from NGC Private Registry, define ngcImagePullSecretName in values.yaml. This value will be used in the deployment spec as spec.imagePullSecrets.name for the pods.

For nested Helm charts, define global.ngcImagePullSecretName in values.yaml, which will be referenced in the deployment spec under spec.imagePullSecrets.name for the pods.

Note: Containers specified in the Helm Chart definition should be in the same NGC Organization and Team that the Helm Chart itself is being pulled from.

4. Sample Helm Chart values.yaml#

Please find all the essential Helm Chart keys in the following sample values.yaml:

ngcImagePullSecretName: ""
nvctNcaId: ""
nvctTaskId: ""
nvctTaskName: ""
nvctResultsDir: ""
nvctProgressFilePath: ""
global:
  ngcImagePullSecretName: "" # Only for nested Helm charts

Create a Helm-based Task#

Creation from UI#

Helm-based Tasks can be created from the NGC UI. To do so, select Custom Helm Chart under "What would you like to create".

Input your task name and other information.

Select your Helm Chart and version from the NGC Private Registry.

Fill in the remaining configuration as you would for a container-based Task, and then click Create Task.

Creation from API#

You can also create a Helm-based Task through the REST API with the following request payload:

POST /v1/nvct/tasks
Authorization: Bearer <JWT> with launch_task scope
{
   "name": "helm-based-task",
   "gpuSpecification": {
       "gpu": "H100",
       "backend": "nvcf-qa-cluster-gcp",
       "instanceType": "GCP.GPU.H100_1x",
       "configuration": {
          "key": "value"
       }
   },
   "helmChart": "<helm-chart-url>",
   "resultHandlingStrategy": "UPLOAD",
   "maxRuntimeDuration": "PT15M",
   "maxQueuedDuration": "PT1H",
   "terminationGracePeriodDuration": "PT1M",
   "secrets": [
       {
           "name": "<secret-name>",
           "value": "<secret-value>"
       }
   ],
   "resultsLocation": "<org>/<model-name>"
}

Helm Chart Values Overrides#

To override keys in your Helm chart's values.yaml, provide the key-value pairs in JSON format through the NGC UI when creating the Task.

Alternatively, add the JSON key-value pairs to the gpuSpecification.configuration field of the request payload when creating a Task through the API:

"gpuSpecification": {
  "gpu": "H100",
  "backend": "nvcf-qa-cluster-gcp",
  "instanceType": "GCP.GPU.H100_1x",
  "configuration": {
      // Add your helm chart values overrides here
      "key": "value"
  }
}
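
When the payload is built programmatically, the overrides can be merged into gpuSpecification.configuration before the request is sent. This is a minimal sketch: with_helm_overrides is a hypothetical helper, and the shallow merge (later keys replace earlier ones at the top level only) is a design choice of this example, not documented service behavior.

```python
import json


def with_helm_overrides(gpu_specification: dict, overrides: dict) -> dict:
    """Return a copy of gpuSpecification with `overrides` merged into `configuration`."""
    json.dumps(overrides)  # Raises TypeError if the overrides are not JSON-serializable.
    merged = dict(gpu_specification)
    merged["configuration"] = {
        **gpu_specification.get("configuration", {}),
        **overrides,
    }
    return merged
```

The original dictionary is left untouched, so the same base gpuSpecification can be reused for several Tasks with different overrides.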

Limitations#

When using Helm Charts to create a Task, the following limitations need to be taken into consideration.

1. Disk Size#

The size of the results generated by your Task’s containers is limited by the disk space on the VM. On the GFN backend this is approximately 100 GB for a single-GPU instance and about 250 GB for a dual-GPU instance. The limit varies between clusters.

2. Security Constraints#

Helm Charts must conform to certain security standards to be deployable as a Task. This means that certain Helm and Kubernetes features are restricted in NVCF backends. Your Helm Chart along with the overrides and other deployment metadata will be validated during Task creation to enforce the standards.

Restrictions#

Helm Chart and all its containers must be hosted within the NGC Private Registry.

Follow these steps to upload your Helm Chart and containers to the NGC Private Registry.

Only the following K8s artifacts are supported under Helm Chart namespace:

  • ConfigMaps

  • Secrets

  • Services - Only type: ClusterIP or none

  • Deployments

  • ReplicaSets

  • StatefulSets

  • Jobs

  • CronJobs

  • Pods

  • ServiceAccounts

  • Roles

  • RoleBindings

  • PersistentVolumeClaims (GFN backend only)

Others will be rejected.
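
A chart can be pre-checked against this list before it is submitted. The sketch below is illustrative only and does not reproduce the service's validation: find_unsupported is a hypothetical helper that takes already-parsed manifests (for example, the documents rendered from the chart, loaded as dictionaries), and it treats a Service without an explicit type as ClusterIP, which is one reading of "ClusterIP or none".

```python
# Kinds copied from the list above, spelled as Kubernetes `kind` values.
SUPPORTED_KINDS = {
    "ConfigMap", "Secret", "Service", "Deployment", "ReplicaSet",
    "StatefulSet", "Job", "CronJob", "Pod", "ServiceAccount", "Role",
    "RoleBinding", "PersistentVolumeClaim",
}


def find_unsupported(manifests: list[dict], backend: str) -> list[str]:
    """Return one message per manifest that the restrictions above would reject."""
    problems = []
    for manifest in manifests:
        kind = manifest.get("kind", "<missing kind>")
        name = manifest.get("metadata", {}).get("name", "<unnamed>")
        if kind not in SUPPORTED_KINDS:
            problems.append(f"{kind}/{name}: kind is not supported")
        elif kind == "PersistentVolumeClaim" and backend != "GFN":
            problems.append(f"{kind}/{name}: only supported on the GFN backend")
        elif kind == "Service":
            # Assumption: an unset type counts as ClusterIP.
            service_type = manifest.get("spec", {}).get("type", "ClusterIP")
            if service_type != "ClusterIP":
                problems.append(f"{kind}/{name}: Service type {service_type} is not supported")
    return problems
```

An empty list means none of these particular checks failed; the pod security and volume-type rules below are not covered by this sketch.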

All pods and resources that define a pod template must conform to the Kubernetes Pod Security Standards Baseline and Restricted policies. Many of these restrictions are applied to your pod or pod templates automatically. Only the following pod or pod template volume types are supported:

  • configMap

  • secret

  • persistentVolumeClaim

Chart hooks are not allowed; if specified in the Helm Chart, they will not be executed.