Tasks#
Introduction#
NVCF now supports one-and-done workloads, such as fine-tuning models or building TensorRT engines, that can run for extended periods and for which latency is not critical. These workloads require executing a container with appropriate input and configuration on a GPU-powered worker. Executing such a container can produce a set of results, such as checkpoints. This functionality is supported through Tasks. The container that is wrapped in a Task and executed on a worker is referred to as the Task Container.
Unlike a Function, a Task is neither interactive nor invoked multiple times. Launching a Task is analogous to deploying a Function, with the key difference being that the Task Container runs to completion on a GPU-powered worker. Upon completion, the worker shuts down.
Tasks can be used for:
Fine-tuning existing models using training and validation data to produce checkpoints with weights.
Generating TensorRT engine builds.
Other operations that require execution on a GPU-powered worker.
A sample task container can be found on GitHub.
Task Management and Execution#
Tasks belong to a specific NVIDIA Cloud Account (NCA). The following sections detail the REST endpoints available for Task management and execution.
Lifecycle of a Task#
As a Task progresses through its lifecycle, its status updates accordingly:
QUEUED: The Task is created and waiting to be scheduled.
LAUNCHED: The Task has been scheduled to run.
RUNNING: The Task is currently executing.
COMPLETED: The Task has finished successfully.
CANCELED: The Task has been canceled by the user.
ERRORED: An error occurred during Task execution.
EXCEEDED_MAX_RUNTIME_DURATION: The Task exceeded the specified maximum runtime duration.
EXCEEDED_MAX_QUEUED_DURATION: The Task could not be launched and exceeded the specified maximum queued duration.
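A client that polls a Task usually needs to know when to stop. The following is a minimal sketch of the statuses listed above as a Python enum. The `TERMINAL_STATUSES` set and the `is_terminal` helper are illustrative assumptions: the page does not state which statuses are final, so the grouping below is inferred from the descriptions.

```python
from enum import Enum


class TaskStatus(str, Enum):
    """Task statuses as listed in the lifecycle description."""

    QUEUED = "QUEUED"
    LAUNCHED = "LAUNCHED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"
    ERRORED = "ERRORED"
    EXCEEDED_MAX_RUNTIME_DURATION = "EXCEEDED_MAX_RUNTIME_DURATION"
    EXCEEDED_MAX_QUEUED_DURATION = "EXCEEDED_MAX_QUEUED_DURATION"


# Assumption: a Task in one of these statuses does not change status again.
TERMINAL_STATUSES = frozenset(
    {
        TaskStatus.COMPLETED,
        TaskStatus.CANCELED,
        TaskStatus.ERRORED,
        TaskStatus.EXCEEDED_MAX_RUNTIME_DURATION,
        TaskStatus.EXCEEDED_MAX_QUEUED_DURATION,
    }
)


def is_terminal(status: str) -> bool:
    """Return True if polling can stop; raises ValueError for unknown statuses."""
    return TaskStatus(status) in TERMINAL_STATUSES
```

A polling loop could call `is_terminal(task["status"])` on each Retrieve Task Details response and exit once it returns `True`.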
Environments#
REST API Endpoints#
Create Task#
Endpoint:
POST /v1/nvct/tasks
Authorization: Bearer <SSA-JWT> with 'launch_task' scope
Request Payload:
{
"name": "gpt-3.5-turbo-fine-tuning-task",
"containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
"containerArgs": "python3 main.py", # optional container command
"containerEnvironment": [ # optional environment variables
{"key": "MODEL_NAME", "value": "gpt-3.5-turbo-fine-tune"}
],
"models": [ # Optional model
{
"name": "gpt-3.5-turbo", # Input model
"uri": "nvcr.io/zq9tgr/model/gpt-3.5-turbo/versions/1/files",
"version": "2.0.0"
}
],
"gpuSpecification": {
"gpu": "T10",
"instanceType": "g6.full",
"backend": "GFN"
},
"secrets": [
{
"name": "NGC_API_KEY", # Well-known secret name for uploading results to NGC
"value": "<personal api-key with PRIVATE_REGISTRY scope to upload results to NGC org zq9tgr >"
}
],
"maxRuntimeDuration": "PT7H",
"maxQueuedDuration": "PT6H",
"terminationGracePeriodDuration": "PT15M",
"resultHandlingStrategy": "UPLOAD",
"resultsLocation": "zq9tgr/finetuned-gpt-3.5-turbo"
}
Response:
Status 200
{
"id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"ncaId": "test-nca-id",
"name": "gpt-3.5-turbo-fine-tuning-task",
…
}
Note
The OpenAPI specification describes the full response.
If the resultHandlingStrategy is UPLOAD, you must include NGC_API_KEY as a secret to allow results to be uploaded to your NGC Private Registry.
If maxRuntimeDuration is not specified and GFN is the backend, the request will be rejected with status 400. Additionally, if maxRuntimeDuration exceeds 8 hours for a Task on the GFN backend, the request will be rejected.
When this endpoint is invoked, a new Task is created and queued, and the appropriate entries are added to the Tasks DB to manage the Task’s lifecycle.
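The rules in the note above can be checked on the client before the request is sent. The following is a minimal sketch, not the service's validation logic: `parse_duration` and `validate_create_task` are hypothetical helper names, and the duration parser only handles the time-only ISO 8601 forms that appear on this page (such as PT7H or PT15M).

```python
import re
from datetime import timedelta
from typing import List

# Only time-only ISO 8601 durations such as PT7H, PT15M, or PT1H30M are handled.
_DURATION_RE = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")


def parse_duration(value: str) -> timedelta:
    """Parse a duration like 'PT7H' into a timedelta."""
    match = _DURATION_RE.fullmatch(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def validate_create_task(payload: dict) -> List[str]:
    """Return a list of problems found in a Create Task payload."""
    problems = []

    # UPLOAD needs the NGC_API_KEY secret so results can be pushed to NGC.
    secret_names = {secret.get("name") for secret in payload.get("secrets", [])}
    if payload.get("resultHandlingStrategy") == "UPLOAD" and "NGC_API_KEY" not in secret_names:
        problems.append("resultHandlingStrategy UPLOAD requires an NGC_API_KEY secret")

    # On the GFN backend, maxRuntimeDuration is mandatory and capped at 8 hours.
    if payload.get("gpuSpecification", {}).get("backend") == "GFN":
        max_runtime = payload.get("maxRuntimeDuration")
        if max_runtime is None:
            problems.append("maxRuntimeDuration is required on the GFN backend")
        elif parse_duration(max_runtime) > timedelta(hours=8):
            problems.append("maxRuntimeDuration exceeds 8 hours on the GFN backend")

    return problems
```

An empty list means the payload passes these particular checks; the server may still reject it for other reasons.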
List Tasks#
Endpoint:
GET /v1/nvct/tasks?limit=N&cursor=<cursor-id> # Default N to 10
Authorization: Bearer <SSA-JWT> with 'list_tasks' scope
Response:
Status 200
{
"tasks": [
{
"id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"ncaId": "test-nca-id",
"name": "gpt-3.5-turbo-fine-tuning-task",
"tags": ["gpt", "llm", "3.5"],
"description": "GPT 3.5 Fine-Tuning Task",
"status": "COMPLETED",
"containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
"secretNames": ["secret-t1"],
"createdAt": "2024-04-15T17:57:39.716Z",
"lastUpdatedAt": "2024-04-15T18:50:01.716Z",
"lastHeartbeatAt": "2024-04-15T18:50:01.716Z"
},
{
"id": "679ad430-34b9-4a6e-9537-a060db4a9e6c",
"ncaId": "test-nca-id",
"name": "trt-engine-build-task",
"tags": ["gpt", "llm", "4.0"],
"description": "TRT Engine Build Task",
"status": "RUNNING",
"containerImage": "nvcr.io/zq9tgr/trt-engine-build:1.0.0",
"secretNames": ["secret-t2"],
"createdAt": "2024-04-15T18:57:39.716Z",
"lastUpdatedAt": "2024-04-15T18:50:01.716Z",
"lastHeartbeatAt": "2024-04-15T18:50:01.716Z"
}
],
"cursor": "<UUID>" # To be passed in as query param with the next pagination request
}
Note
The response does not include instance details or secret names for the Tasks it lists.
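Cursor-based pagination over this endpoint can be sketched as a generator. This is an illustration under assumptions: `iter_tasks` and `make_fetcher` are hypothetical names, the base URL is supplied by the caller, and the loop assumes that the last page omits the cursor or returns an empty one, which this page does not state.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional


def iter_tasks(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every Task by following the cursor from page to page."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("tasks", [])
        cursor = page.get("cursor")
        if not cursor:  # Assumption: no cursor means there are no more pages.
            break


def make_fetcher(base_url: str, token: str, limit: int = 10) -> Callable[[Optional[str]], dict]:
    """Build a fetch_page function that calls GET /v1/nvct/tasks."""

    def fetch_page(cursor: Optional[str]) -> dict:
        query = {"limit": limit}
        if cursor:
            query["cursor"] = cursor
        url = f"{base_url}/v1/nvct/tasks?{urllib.parse.urlencode(query)}"
        request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch_page
```

Separating the page fetcher from the iteration keeps the cursor logic testable without network access, for example by passing a function that returns canned pages.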
Retrieve Task Details#
Endpoint:
GET /v1/nvct/tasks/{taskId}
Authorization: Bearer <SSA-JWT> with 'task_details' scope
Response:
Status 200
{
"id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"ncaId": "test-nca-id",
"name": "gpt-3.5-fine-tuning-task",
"tags": ["gpt", "llm", "3.5"],
"description": "GPT 3.5 Fine-Tuning Task",
"status": "QUEUED | LAUNCHED | RUNNING | ERRORED | COMPLETED | CANCELED | EXCEEDED_MAX_RUNTIME_DURATION | EXCEEDED_MAX_QUEUED_DURATION",
"containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
"createdAt": "2024-04-15T17:57:39.716Z",
"lastUpdatedAt": "2024-04-15T18:01:21.650Z",
"lastHeartbeatAt": "2024-04-15T18:01:21.650Z",
"percentComplete": 50,
"secretNames": ["secret-t1"],
"healthInfo": {
"instanceType": "g6.full",
"gpu": "T10",
"backend": "GFN",
"error": "Failed to start the container"
},
"gpuSpecification": {
"gpu": "T10",
"instanceType": "g6.full",
"backend": "GFN"
},
"maxRuntimeDuration": "PT1H",
"activeInstances": [ # Optional - Typically only one worker
{
"instanceId": "bbe8690a-9cc1-4973-bfcb-d5ff64221bb3",
"taskId": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"instanceType": "g6.full",
"instanceStatus": "ACTIVE",
"ncaId": "test-nca-id",
"gpu": "T10",
"backend": "GFN",
"location": "ND-SJC6M-03",
"instanceCreatedAt": "2024-04-15T17:57:50.245Z",
"instanceUpdatedAt": "2024-04-15T18:01:21.650Z"
}
]
}
Delete Task#
Endpoint:
DELETE /v1/nvct/tasks/{taskId}
Authorization: Bearer <SSA-JWT> with 'delete_task' scope
Response:
Status 204
Note
Deleting a Task also removes associated secrets and terminates the GPU-powered instance.
If the Task is currently executing, it will be terminated gracefully, allowing the Task Container to save results within the specified ‘terminationGracePeriodDuration’.
Cancel Task#
Endpoint:
POST /v1/nvct/tasks/{taskId}/cancel
Authorization: Bearer <SSA-JWT> with 'cancel_task' scope
Response:
Status 200
{
"id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"ncaId": "test-nca-id",
"name": "example-task",
"tags": ["gpt", "llm", "3.5"],
"description": "GPT 3.5 Fine-Tuning Task",
"status": "CANCELED",
"containerImage": "nvcr.io/zq9tgr/gpt-3.5-turbo-fine-tune:1.0.0",
"createdAt": "2024-04-15T17:57:39.716Z",
"lastUpdatedAt": "2024-04-15T18:01:21.650Z",
"lastHeartbeatAt": "2024-04-15T18:01:21.650Z"
}
Note
Canceling a Task also terminates the associated GPU-powered instance.
Canceling a Task stops its execution gracefully, allowing it to save any results within the specified ‘terminationGracePeriodDuration’.
List Task Events#
Endpoint:
GET /v1/nvct/tasks/{taskId}/events?limit=N&cursor=<cursor-pos-from-prev-response>
Authorization: Bearer <SSA-JWT> with 'list_task_events' scope
Response:
Status 200
{
"id": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"cursor": "cursor-position-for-next-set-of-events",
"limit": N,
"ncaId": "test-nca-id",
"name": "example-task",
"events": [
{
"id": "679ad430-44b9-5a6e-0537-b060db4a9e6c",
"message": "Status changed from QUEUED to LAUNCHED",
"createdAt": "2024-04-15T17:57:50.245Z"
},
{
"id": "779ad430-54b9-6a6e-1537-c060db4a9e6c",
"message": "Status changed from LAUNCHED to RUNNING",
"createdAt": "2024-04-15T17:58:50.245Z"
},
{
"id": "879ad430-64b9-7a6e-2537-d060db4a9e6c",
"message": "Status changed from RUNNING to COMPLETED",
"createdAt": "2024-04-16T17:57:50.245Z"
}
]
}
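The events in the sample response can be turned into a status timeline. The sketch below assumes that status-change messages always follow the "Status changed from X to Y" wording shown in the example and that all timestamps share the same UTC format, so they sort correctly as strings. `status_timeline` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Assumption: status-change events use exactly this message wording.
_TRANSITION_RE = re.compile(r"Status changed from (\w+) to (\w+)")


def status_timeline(events: List[dict]) -> List[Tuple[str, str, str]]:
    """Return (createdAt, old_status, new_status) tuples in chronological order."""
    timeline = []
    for event in sorted(events, key=lambda item: item["createdAt"]):
        match = _TRANSITION_RE.fullmatch(event.get("message", ""))
        if match:
            timeline.append((event["createdAt"], match.group(1), match.group(2)))
    return timeline
```

Events whose message does not match the pattern are skipped rather than treated as errors.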
List Task Results#
Endpoint:
GET /v1/nvct/tasks/{taskId}/results?limit=N&cursor=<result-id>
Authorization: Bearer <SSA-JWT> with 'list_task_results' scope
Response:
{
"taskId": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"ncaId": "test-nca-id",
"taskName": "fine-tune-gpt-3.5",
"results": [
{
"id": "679ad430-44b9-5a6e-0537-b060db4a9e6c",
"createdAt": "2024-04-15T17:57:50.245Z",
"name": "ckpt-step-2000",
"uri": "nvcr.io/zq9tgr/model/gpt-3.5-ckpt-step-2000/versions/1/files"
},
{
"id": "779ad430-54b9-6a6e-1537-c060db4a9e6c",
"createdAt": "2024-04-15T16:57:50.245Z",
"name": "ckpt-step-1000",
"uri": "nvcr.io/zq9tgr/model/gpt-3.5-ckpt-step-1000/versions/1/files"
}
],
"cursor": "<cursor-position-for-next-set-of-results>"
}
Update Task Secrets#
Endpoint:
PUT /v1/nvct/secrets/tasks/{taskId}
Authorization: Bearer <SSA-JWT> with 'update_secrets' scope
Request Payload:
{
"secrets": [
{
"name": "AWS_ACCESS_KEY_ID",
"value": "shhh!"
},
{
"name": "AWS_SECRET_ACCESS_KEY",
"value": "shhh!shhh!"
}
]
}
Response:
Status 204
Environment Variables Available in Task Container#
In addition to the environment variables that are specified during Task creation, the following environment variables will be made available to the Task Container:
Environment Variable | Description |
---|---|
NVCT_TASK_ID | Unique Task ID. It can be included in the logs. |
NVCT_TASK_NAME | Task name. |
NVCT_NCA_ID | NVIDIA Cloud Account that owns the Task. |
NVCT_PROGRESS_FILE_PATH | Path to the progress file. |
NVCT_RESULTS_DIR | Location where the result subdirectories should be created. |
Models and Resources#
If models are associated with a Task, the Task Container can find the models downloaded from NGC Private Registry under the /config/models/{modelName} folder.
Similarly, if resources are associated with a Task, the Task Container can find the resources downloaded from NGC Private Registry under the /config/resources/{resourceName} folder.
Task Secrets#
If a Task is created with secrets, the secrets are available to the Task Container at the fixed path /var/secrets/secrets.json on the Worker node. Task authors can update or rotate secrets using the Update Task Secrets REST endpoint, which causes /var/secrets/secrets.json to be updated automatically. The Task Container can watch for updates to this file and use the latest secrets as they are rotated.
{
"AWS_ACCESS_KEY_ID": "key-id",
"AWS_SECRET_ACCESS_KEY": "access-key",
..
}
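One way to pick up rotated secrets is to reload the file whenever its modification time changes. The following is a minimal sketch using polling; `SecretsCache` is a hypothetical class, and a production container might prefer a file-watching mechanism instead.

```python
import json
import os

SECRETS_PATH = "/var/secrets/secrets.json"  # Fixed path given in this section.


class SecretsCache:
    """Reload secrets.json only when its modification time changes."""

    def __init__(self, path: str = SECRETS_PATH) -> None:
        self._path = path
        self._mtime_ns = None
        self._secrets = {}

    def get(self) -> dict:
        """Return the latest secrets, re-reading the file after a rotation."""
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self._path, encoding="utf-8") as handle:
                self._secrets = json.load(handle)
            self._mtime_ns = mtime_ns
        return self._secrets
```

Calling `cache.get()["AWS_ACCESS_KEY_ID"]` immediately before each use of the credential keeps the container on the most recent value without re-parsing the file on every call.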
Task Results#
Handling Task Results#
Tasks can produce multiple results during execution. For example, when fine-tuning a model, multiple checkpoints may be created. Each Task has a result handling strategy specified at creation time:
UPLOAD: The system uploads results to a specified location.
NONE: The Task Container handles result management.
Result Handling Strategy: UPLOAD#
Under the UPLOAD strategy:
The Task Container must create result files in a result-specific subfolder on the disk under NVCT_RESULTS_DIR.
The Task Container or Helm Chart must update the lastUpdatedAt field in the progress file every 3 minutes as a heartbeat. The value of the lastUpdatedAt field should be in RFC3339Nano format.
If the difference between lastUpdatedAt and the current timestamp is larger than 5 minutes and percentComplete is not 100, the Task will be marked ERRORED.
The Task Container must update the progress file with percentComplete (between 1 and 100), name, and optional result metadata.
The name property in the progress file must match the name of the result-specific subfolder under the NVCT_RESULTS_DIR folder.
The system uploads the newly generated result to your NGC Private Registry using the NGC_API_KEY secret.
Results are uploaded to the path specified in the resultsLocation property.
The Task Container must use the NVCT_PROGRESS_FILE_PATH environment variable to create/update the progress file on the Worker node. Similarly, the Task Container must use the NVCT_RESULTS_DIR environment variable to create a result-specific subfolder.
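The heartbeat rule above (more than 5 minutes since lastUpdatedAt while percentComplete is not 100) can be expressed as a small check. This is a client-side illustration of the rule as described, not the service's implementation; `parse_rfc3339` and `is_heartbeat_stale` are hypothetical names. Nanoseconds are truncated to microseconds because that is the precision of Python's `datetime`.

```python
import re
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(minutes=5)

_TIMESTAMP_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?(Z|[+-]\d{2}:\d{2})"
)


def parse_rfc3339(value: str) -> datetime:
    """Parse an RFC 3339 timestamp, truncating fractional seconds to microseconds."""
    match = _TIMESTAMP_RE.fullmatch(value)
    if not match:
        raise ValueError(f"not an RFC 3339 timestamp: {value!r}")
    base, fraction, zone = match.groups()
    micros = (fraction or "")[:6].ljust(6, "0")
    zone = "+00:00" if zone == "Z" else zone
    return datetime.strptime(f"{base}.{micros}{zone}", "%Y-%m-%dT%H:%M:%S.%f%z")


def is_heartbeat_stale(progress: dict, now: datetime) -> bool:
    """Return True if the progress record would count as a missed heartbeat."""
    if progress.get("percentComplete") == 100:
        return False  # A finished Task is not subject to the heartbeat rule.
    last_updated = parse_rfc3339(progress["lastUpdatedAt"])
    return now - last_updated > HEARTBEAT_TIMEOUT
```

A container could run this check against its own progress file in tests to confirm that its update cadence stays well inside the 5-minute window.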
Result Handling Strategy: NONE#
Under the NONE strategy:
The Task Container is responsible for creating and uploading results.
The Task Container or Helm Chart must update the lastUpdatedAt field in the progress file every 3 minutes as a heartbeat. The value of the lastUpdatedAt field should be in RFC3339Nano format.
If the difference between lastUpdatedAt and the current timestamp is larger than 5 minutes and percentComplete is not 100, the Task will be marked ERRORED.
The Task Container must create/update the progress file with percentComplete set to 100 at the end to indicate that it exited gracefully, so that the system can mark the Task as COMPLETED.
Credentials for external storage can be provided as Task secrets.
The system updates the Task’s progress based on the progress file but does not handle result uploading.
The Task Container must use the NVCT_PROGRESS_FILE_PATH environment variable to create/update the progress file on the Worker node. Similarly, the Task Container should use the NVCT_RESULTS_DIR environment variable to create a result-specific subfolder for storing results.
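Because long-running work may not produce a result every 3 minutes, the heartbeat is often easiest to maintain from a background thread. The following is a minimal sketch: `start_heartbeat` is a hypothetical helper, and `beat` stands for whatever function the container uses to rewrite the progress file with a fresh lastUpdatedAt.

```python
import threading
from typing import Callable, Tuple


def start_heartbeat(
    beat: Callable[[], None], interval_seconds: float = 180.0
) -> Tuple[threading.Event, threading.Thread]:
    """Call beat() every interval_seconds until the returned event is set."""
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout, so beat() runs once per interval
        # and the loop exits promptly when stop.set() is called.
        while not stop.wait(interval_seconds):
            beat()

    thread = threading.Thread(target=loop, name="progress-heartbeat", daemon=True)
    thread.start()
    return stop, thread
```

The main workload calls `stop.set()` just before writing the final progress record, so the heartbeat cannot overwrite a percentComplete of 100 with an older value.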
Progress File Format#
{
"taskId": "579ad430-34b9-4a6e-9537-a060db4a9e6c",
"percentComplete": 20,
"name": "ckpt-step-2000",
"metadata": {
"step-number": 2000,
"token_accuracy": 0.874
},
"lastUpdatedAt": "2025-01-02T15:04:05.999999999Z"
}
taskId: Task ID.
percentComplete: Integer between 1 and 100 indicating the completion percentage.
metadata: Optional field for the Task Container to add metadata regarding the upload. Required format: key-value pairs.
name: Directory/folder name for the result to be uploaded to NGC Private Registry. For the UPLOAD strategy, the name must be 1-190 characters long. Allowed characters: [0-9a-zA-Z!-_.*'()]. The prefixes ./ and ../ are not allowed.
lastUpdatedAt: ISO 8601 timestamp indicating when the progress file was last updated. It must be updated at least every 3 minutes to signal to NVCF that the Task is in progress.
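The naming restrictions for the name field can be checked before a result folder is created. The sketch below assumes that the bracket expression above lists literal characters (digits, letters, and the punctuation `! - _ . * ' ( )`) rather than a regex range; `is_valid_result_name` is a hypothetical helper.

```python
import re

# Assumption: each symbol in the documented set is a literal allowed character.
_NAME_RE = re.compile(r"[0-9a-zA-Z!\-_.*'()]{1,190}")


def is_valid_result_name(name: str) -> bool:
    """Check a result folder name against the UPLOAD naming restrictions."""
    if name.startswith("./") or name.startswith("../"):
        return False
    return _NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_result_name("ckpt-step-2000")` passes, while a name containing a space or a slash is rejected.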
The Task Container must use the NVCT_PROGRESS_FILE_PATH environment variable to create/update the progress file on the Worker node. Similarly, the Task Container must use the NVCT_RESULTS_DIR environment variable to create a result-specific subfolder.
An atomic write to the progress file is recommended to avoid a race condition in which the file is read while it is only partially written. The Task Container should first write to a new temporary file and then move it over the original progress file (for example, with os.rename() in Python). See the sample task container for reference.
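The write-then-rename pattern described above can be sketched as follows. This is an illustration, not the sample container's code: `write_progress_atomic` and `rfc3339nano_now` are hypothetical names, and `os.replace` is used in place of `os.rename` because it overwrites an existing destination on every platform. The temporary file is created in the same directory as the progress file so that the final move stays on one filesystem.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def rfc3339nano_now() -> str:
    """Current UTC time with nine fractional digits (padded from microseconds)."""
    now = datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%S.%f") + "000Z"


def write_progress_atomic(path: str, progress: dict) -> dict:
    """Write the progress record to path so readers never see a partial file."""
    record = dict(progress, lastUpdatedAt=rfc3339nano_now())
    directory = os.path.dirname(os.path.abspath(path))
    # Note: mkstemp creates the file with owner-only permissions; adjust them
    # if the process reading the progress file runs as a different user.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".progress-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(record, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # Atomic when both paths share a filesystem.
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return record
```

In a Task Container the path would come from the environment, for example `write_progress_atomic(os.environ["NVCT_PROGRESS_FILE_PATH"], {"taskId": os.environ["NVCT_TASK_ID"], "percentComplete": 20, "name": "ckpt-step-2000"})`, where `NVCT_TASK_ID` is assumed to be the variable that carries the Task ID.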
Intermediate and Final Results#
Intermediate Results: As new results are generated, the Task Container should update the progress file with percentComplete (at least 1 and less than 100), name, and optionally with result metadata. The system monitors this file and updates the Task’s status and progress accordingly. Any consecutive intermediate results with duplicate percentComplete and name fields will be ignored.
Final Result: When the Task completes, the Task Container must update the progress file one last time with percentComplete set to 100, name, and optionally with result metadata. The system marks the Task as COMPLETED.
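Staging a result and building the matching progress record can be sketched as one step, so that the name in the record always matches the subfolder. `stage_result` is a hypothetical helper; it returns the record and leaves writing it to the progress file (and adding lastUpdatedAt) to the caller.

```python
import os
from typing import Dict, Optional


def stage_result(
    results_dir: str,
    name: str,
    files: Dict[str, bytes],
    percent_complete: int,
    metadata: Optional[dict] = None,
) -> dict:
    """Write result files under results_dir/name and return the progress record."""
    if not 1 <= percent_complete <= 100:
        raise ValueError("percentComplete must be between 1 and 100")

    # The subfolder name must match the "name" field in the progress file.
    result_dir = os.path.join(results_dir, name)
    os.makedirs(result_dir, exist_ok=True)
    for filename, data in files.items():
        with open(os.path.join(result_dir, filename), "wb") as handle:
            handle.write(data)

    record = {"percentComplete": percent_complete, "name": name}
    if metadata:
        record["metadata"] = metadata
    return record
```

An intermediate checkpoint would pass a percent_complete below 100, and the final result would pass exactly 100, with `results_dir` taken from the NVCT_RESULTS_DIR environment variable.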
Task Runtime Duration#
By default, a Task runs indefinitely. However, you can specify the maxRuntimeDuration property when creating a Task to control its execution duration. If maxRuntimeDuration is specified, the Task will be terminated gracefully after the duration elapses, allowing it to save any final results. If GFN is selected as the backend, maxRuntimeDuration is required and must not exceed 8 hours.
Helm Chart Based Tasks#
Prerequisites#
Note: Ensure that your Helm chart versions do not contain hyphens (-). For example, v1 is acceptable, but v1-test will cause issues.
1. Essential Helm Chart Values#
The following field keys must be defined in values.yaml of your Helm chart. They will be overridden during runtime. Make use of these values as environment variables in your containers.
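Key | Description |
---|---|
ngcImagePullSecretName | Name of the image pull secret used to pull containers from NGC Private Registry. |
nvctTaskId | Unique Task ID. |
nvctTaskName | Task name. |
nvctNcaId | NVIDIA Cloud Account that owns the Task. |
nvctProgressFilePath | Path to the progress file. |
nvctResultsDir | Location where the result subdirectories should be created. |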
2. Progress Updates#
Make sure the progress file is updated whenever intermediate and final results are generated, regardless of the resultHandlingStrategy specified in the Task definition. Because Tasks are designed to run to completion, ensure that the container that writes the progress file is not restarted by Kubernetes after it exits. percentComplete must only ever increase toward 100, so a decrease caused by a container restart will cause the Task to fail. A Kubernetes Job is recommended for deploying Task pods. A sample spec is shown below:
apiVersion: batch/v1
kind: Job
metadata:
  name: sample
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app.kubernetes.io/name: {{ .Release.Name }}
        app.kubernetes.io/instance: {{ .Release.Name }}
    spec:
      restartPolicy: Never
      imagePullSecrets:
        - name: {{ .Values.ngcImagePullSecretName }}
      containers:
        - name: task
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
3. Container Image Pull Secret Management#
For pulling containers defined as part of the Helm chart from NGC Private Registry, define ngcImagePullSecretName in values.yaml. This value will be used in the deployment spec as spec.imagePullSecrets.name for the pods.
For nested Helm charts, define global.ngcImagePullSecretName in values.yaml, which will be referenced in the deployment spec under spec.imagePullSecrets.name for the pods.
Note: Containers specified in the Helm Chart definition should be in the same NGC Organization and Team that the Helm Chart itself is being pulled from.
4. Sample Helm Chart values.yaml#
Please find all the essential Helm Chart keys in the following sample values.yaml:
ngcImagePullSecretName: ""
nvctNcaId: ""
nvctTaskId: ""
nvctTaskName: ""
nvctResultsDir: ""
nvctProgressFilePath: ""
global:
  ngcImagePullSecretName: "" # Only for nested Helm charts
Create a Helm-based Task#
Creation from UI#
Helm-based Tasks can be created from the NGC UI. To do this, select Custom Helm Chart under What would you like to create.
Input your task name and other information.
Select your Helm Chart and version from the NGC Private Registry.
Fill in the remaining configuration as you would for a container-based Task, then click Create Task.
Creation from API#
You can also create a Helm-based Task through REST API calls with the following request payloads:
POST /v1/nvct/tasks
Authorization: Bearer <JWT> with launch_task scope
{
"name": "helm-based-task",
"gpuSpecification": {
"gpu": "H100",
"backend": "nvcf-qa-cluster-gcp",
"instanceType": "GCP.GPU.H100_1x",
"configuration": {
"key": "value"
}
},
"helmChart": "<helm-chart-url>",
"resultHandlingStrategy": "UPLOAD",
"maxRuntimeDuration": "PT15M",
"maxQueuedDuration": "PT1H",
"terminationGracePeriodDuration": "PT1M",
"secrets": [
{
"name": "<secret-name>",
"value": "<secret-value>"
}
],
"resultsLocation": "<org>/<model-name>"
}
Helm Chart Values Overrides#
To override keys in your Helm chart values.yaml, provide the key-value pairs in JSON format through the NGC UI when creating the Task.
Alternatively, add the JSON key-value pairs to the gpuSpecification.configuration field of the request payload when creating a Task through the API.
"gpuSpecification": {
"gpu": "H100",
"backend": "nvcf-qa-cluster-gcp",
"instanceType": "GCP.GPU.H100_1x",
"configuration": {
// Add your helm chart values overrides here
"key": "value"
}
}
Limitations#
When using Helm Charts to create a Task, the following limitations need to be taken into consideration.
1. Disk Size#
The size of any results generated by your Task’s containers is limited by the disk space on the VM. For the GFN backend, this is approximately 100 GB for a single-GPU instance and approximately 250 GB for a dual-GPU instance. The limit varies among clusters.
2. Security Constraints#
Helm Charts must conform to certain security standards to be deployable as a Task. This means that certain Helm and Kubernetes features are restricted in NVCF backends. Your Helm Chart along with the overrides and other deployment metadata will be validated during Task creation to enforce the standards.
Restrictions#
The Helm Chart and all its containers must be hosted within the NGC Private Registry.
Follow these steps to upload your Helm Chart and containers to the NGC Private Registry.
Only the following K8s artifacts are supported under Helm Chart namespace:
ConfigMaps
Secrets
Services - Only type: ClusterIP or none
Deployments
ReplicaSets
StatefulSets
Jobs
CronJobs
Pods
ServiceAccounts
Roles
RoleBindings
PersistentVolumeClaims (GFN backend only)
Others will be rejected.
All pods and resources that define a pod template must conform to the Kubernetes Pod Security Standards Baseline and Restricted policies. Many of these restrictions are applied to your pod or pod templates automatically. Only the following pod or pod template volume types are supported:
configMap
secret
persistentVolumeClaim
No chart hooks are allowed; if specified in the Helm Chart, they will not be executed.
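Some of the restrictions above can be checked locally against rendered manifests before a chart is uploaded. The sketch below operates on manifests that have already been parsed into dictionaries and covers only the kind, Service type, volume type, and hook rules. It is not the validation NVCF performs; `check_manifest` is a hypothetical helper, and treating a missing Service type as ClusterIP follows the Kubernetes default.

```python
from typing import List, Optional

ALLOWED_KINDS = {
    "ConfigMap", "Secret", "Service", "Deployment", "ReplicaSet", "StatefulSet",
    "Job", "CronJob", "Pod", "ServiceAccount", "Role", "RoleBinding",
    "PersistentVolumeClaim",  # GFN backend only, per the list above.
}
ALLOWED_VOLUME_TYPES = {"configMap", "secret", "persistentVolumeClaim"}


def _pod_spec(manifest: dict) -> Optional[dict]:
    """Locate the pod spec for a Pod or for a resource with a pod template."""
    spec = manifest.get("spec") or {}
    if manifest.get("kind") == "Pod":
        return spec
    if manifest.get("kind") == "CronJob":
        spec = (spec.get("jobTemplate") or {}).get("spec") or {}
    return (spec.get("template") or {}).get("spec")


def check_manifest(manifest: dict) -> List[str]:
    """Return the restrictions from this section that the manifest violates."""
    kind = manifest.get("kind")
    if kind not in ALLOWED_KINDS:
        return [f"kind {kind} is not supported"]

    problems = []
    if kind == "Service":
        service_type = (manifest.get("spec") or {}).get("type", "ClusterIP")
        if service_type != "ClusterIP":
            problems.append(f"Service type {service_type} is not supported")

    annotations = (manifest.get("metadata") or {}).get("annotations") or {}
    if "helm.sh/hook" in annotations:
        problems.append("chart hooks are not executed")

    for volume in (_pod_spec(manifest) or {}).get("volumes", []):
        for volume_type in sorted(set(volume) - {"name"} - ALLOWED_VOLUME_TYPES):
            problems.append(
                f"volume {volume.get('name')!r} uses unsupported type {volume_type}"
            )
    return problems
```

The manifests could come from the output of `helm template`, parsed with a YAML library of your choice and passed to `check_manifest` one document at a time.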