REST API Overview and Examples#

The TAO (Train, Adapt, Optimize) API v2 provides a unified, job-centric interface for managing workspaces, datasets, and training jobs. This version simplifies the API structure with consolidated endpoints and improved authentication.

Examples in this section use cURL commands and jq for JSON processing, and assume a Linux machine with both tools pre-installed.

Note

For comprehensive API specifications, see the TAO API Reference.

API v2 Architecture#

The TAO API v2 introduces a unified architecture with the following key improvements:

Unified Jobs Endpoint

All experiment and dataset operations are handled through /api/v2/orgs/{org_name}/jobs with a kind parameter (experiment or dataset).

Environment Variable Authentication

Authentication uses JWT tokens with environment variable support for better security and CI/CD integration.

Resource-Specific Metadata

Dedicated endpoints for workspace, dataset, and job metadata provide clearer access to resource information.

Enhanced Job Control

Comprehensive job management with pause, resume, cancel, and delete operations.

User Authentication#

User authentication is based on an NGC Personal Key. For more details, see the prerequisites in API Setup.

Login and Obtain JWT Token

BASE_URL=https://api.tao.ngc.nvidia.com/api/v2
NGC_ORG_NAME=your_org_name
NGC_API_KEY=nvapi-******

# Login to get JWT token
CREDS=$(curl -s -X POST $BASE_URL/login \
    -H "Content-Type: application/json" \
    -d '{
    "ngc_key": "'"$NGC_API_KEY"'",
    "ngc_org_name": "'"$NGC_ORG_NAME"'"
}')

TOKEN=$(echo "$CREDS" | jq -r '.token')
echo "Token: $TOKEN"
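A failed login typically returns a body without a `token` key, in which case `jq -r` prints the literal string `null`. A small guard like the following (a sketch; the `check_token` helper name is ours, not part of the API) lets scripts fail fast instead of sending requests with a bad token:

```shell
# Guard helper (a sketch; the function name is ours): jq prints the literal
# string "null" when the requested key is missing from the response.
check_token() {
    if [ -z "$1" ] || [ "$1" = "null" ]; then
        echo "Login failed; inspect the /login response body" >&2
        return 1
    fi
}

# Usage:
# check_token "$TOKEN" || exit 1
```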

Using the Token for API Calls

For all subsequent API calls, include the token in the Authorization header:

curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/workspaces \
    -H "Authorization: Bearer $TOKEN"

Note

The API base URL can be retrieved after the cluster is set up. For more details, see the TAO API Setup.

API v2 Endpoints Overview#

The TAO API v2 service is organized around these main resource types:

Workspaces (/api/v2/orgs/{org_name}/workspaces)

  • List workspaces

  • Create workspace

  • Get workspace metadata

  • Delete workspace

  • Backup workspace

  • Restore workspace

Datasets (/api/v2/orgs/{org_name}/datasets)

  • List datasets

  • Create dataset

  • Get dataset metadata

  • Delete dataset

  • Get dataset formats

Jobs - Unified (/api/v2/orgs/{org_name}/jobs)

  • List jobs

  • Create job (experiment or dataset)

  • Get job metadata

  • Get job status

  • Get job logs

  • Pause/Resume/Cancel job

  • Delete job

  • Download job files

  • List base experiments

  • Get job schema

  • Get GPU types

  • Get job events

  • Publish/Remove published model

Inference Microservices (/api/v2/orgs/{org_name}/inference_microservices)

  • Start inference microservice

  • Get microservice status

  • Make inference request

  • Stop microservice

Workspaces#

In TAO 6.0+, cloud workspaces are used to pull datasets and store experiment results in popular cloud storage providers.

Supported Cloud Types:

  • AWS - cloud_type: aws; cloud_specific_details needed: access_key, secret_key, region, bucket_name

  • Azure - cloud_type: azure; cloud_specific_details needed: account_name, access_key, region, container_name

  • HuggingFace - cloud_type: huggingface; cloud_specific_details needed: token (datasets only, not for experiment storage)

  • SLURM - cloud_type: slurm; cloud_specific_details needed: slurm_user, slurm_hostname (list of hostnames for failover), base_results_dir

  • Lepton - cloud_type: lepton; cloud_specific_details needed: lepton_workspace_id, lepton_auth_token, plus a storage backend (AWS, Azure, or SeaweedFS credentials)

  • SeaweedFS - cloud_type: seaweedfs; cloud_specific_details needed: access_key, secret_key, endpoint_url, cloud_bucket_name

Creating a Workspace

WORKSPACE_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/workspaces \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $TOKEN" \
-d '{
    "name": "my_workspace",
    "cloud_type": "aws",
    "cloud_specific_details": {
        "access_key": "AKIAIOSFODNN7EXAMPLE",
        "secret_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "region": "us-west-2",
        "bucket_name": "my-tao-bucket"
    },
    "shared": false
}' | jq -r '.id')
echo $WORKSPACE_ID
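For comparison, creating an Azure workspace uses the same endpoint with the Azure-specific credential fields listed above. The sketch below builds the payload with jq; all values are placeholders:

```shell
# Same endpoint, Azure credentials. All values below are placeholders.
jq -n '{
    name: "my_azure_workspace",
    cloud_type: "azure",
    cloud_specific_details: {
        account_name: "mystorageaccount",
        access_key: "<azure-access-key>",
        region: "westus2",
        container_name: "my-tao-container"
    },
    shared: false
}' > azure_workspace_payload.json
cat azure_workspace_payload.json

# Submit it as in the AWS example:
# curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/workspaces \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $TOKEN" \
#     -d @azure_workspace_payload.json | jq -r '.id'
```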

List Workspaces

curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/workspaces \
    -H "Authorization: Bearer $TOKEN" | jq

Get Workspace Metadata

curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/workspaces/$WORKSPACE_ID \
    -H "Authorization: Bearer $TOKEN" | jq

Delete Workspace

curl -s -X DELETE $BASE_URL/orgs/$NGC_ORG_NAME/workspaces/$WORKSPACE_ID \
    -H "Authorization: Bearer $TOKEN" | jq

Note

For experiments, you must provide cloud storage with read and write access; action artifacts, such as training checkpoints, are pushed to that storage. Datasets also require cloud storage with read and write access, as TAO may need to convert your dataset to a compatible format before training.

Datasets#

Important

Breaking change in TAO 6.26+: Dataset path fields have been renamed from *_path / *_paths to *_uri / *_uris across the API. Update any existing job payloads or scripts that reference the old field names:

  • train_dataset_paths -> train_dataset_uris

  • eval_dataset_path -> eval_dataset_uri

  • inference_dataset_path -> inference_dataset_uri

  • calibration_dataset_path -> calibration_dataset_uri

Dataset URIs now accept the following schemes in addition to local paths: aws://, azure://, lustre://, file://, seaweedfs://.
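If you have job payloads saved with the old field names, a one-shot jq rename covers all four fields. This is a sketch; the sample payload below is illustrative:

```shell
# Rename legacy *_path/*_paths keys to *_uri/*_uris in a saved payload.
# The sample payload here is illustrative.
cat > legacy_payload.json <<'EOF'
{
    "train_dataset_paths": ["aws://my-bucket/train-data"],
    "eval_dataset_path": "aws://my-bucket/eval-data"
}
EOF

jq 'with_entries(.key |= (sub("_paths$"; "_uris") | sub("_path$"; "_uri")))' \
    legacy_payload.json > migrated_payload.json

cat migrated_payload.json
```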

Note

Datasets are optional for derived-action jobs. Actions that operate on a parent job’s output — such as export, gen_trt_engine, and inference — no longer require train_dataset_uris or eval_dataset_uri. The API infers the data source from the parent job. You only need to supply dataset fields when the action genuinely requires new data (e.g., calibration_dataset_uri for INT8/FP8 calibration, or inference_dataset_uri for batch inference on a different dataset).

When passing dataset URIs directly on the job (instead of pre-created dataset IDs), you can also specify dataset_format (e.g., "coco", "kitti") and dataset_type (e.g., "object_detection") to tell the API how to interpret the data. Set skip_dataset_validation: true to bypass format validation for non-standard layouts.
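The fields above combine into a single job payload. The sketch below builds one with jq; the values ("coco", "object_detection", the bucket path) are illustrative, and the actual POST is shown commented, matching the job-creation calls later in this section:

```shell
# Illustrative payload passing dataset URIs directly, with format hints.
# "coco", "object_detection", and the bucket path are placeholder values.
jq -n '{
    kind: "experiment",
    action: "train",
    train_dataset_uris: ["aws://my-bucket/train-data"],
    dataset_format: "coco",
    dataset_type: "object_detection",
    skip_dataset_validation: false
}' > direct_uri_payload.json
cat direct_uri_payload.json

# Submit as usual:
# curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer $TOKEN" \
#     -d @direct_uri_payload.json
```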

You can use either datasets stored in the cloud workspace (via cloud_file_path) or a public dataset with an HTTPS URL.

This example workflow uses object detection data in the COCO dataset format. For more details about the COCO format, refer to the COCO dataset page. If you are using a custom dataset, it must follow the structure depicted below.

$DATA_DIR
├── annotations.json
└── images
    ├── image_name_1.jpg
    ├── image_name_2.jpg
    └── ...

Note

Ensure that the dataset folder structure in cloud_file_path or url matches the model’s requirements. For details, refer to Data Annotation Format.

Object Detection Use Case Example with API v2#

The following example walks you through a complete TAO workflow using the unified v2 API.

Note

Datasets provided in these examples are subject to the following license: Dataset License.

  1. Set Training Dataset URI

    Set the URI pointing to your training data in cloud storage. Supported URI schemes: aws://, azure://, lustre://, file://, seaweedfs://.

    TRAIN_DATASET_URI="<scheme>://my-bucket/train-data"
    
  2. Set Validation Dataset URI

    EVAL_DATASET_URI="<scheme>://my-bucket/eval-data"
    
  3. List Base Experiments

    # List all base experiments
    curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/jobs:list_base_experiments \
        -H "Authorization: Bearer $TOKEN" | jq
    
    # Find specific base experiment (e.g., RT-DETR with ResNet50)
    BASE_EXPERIMENT_ID=$(curl -s -X GET \
        $BASE_URL/orgs/$NGC_ORG_NAME/jobs:list_base_experiments \
        -H "Authorization: Bearer $TOKEN" | \
        jq -r '[.base_experiments[] | select(.network_arch == "rtdetr") | select(.ngc_path | contains("resnet50"))][0] | .id')
    echo $BASE_EXPERIMENT_ID
    
  4. Create Training Job (Unified v2 API)

In API v2, you create jobs directly with all parameters in a single call:

# Get job schema for train action
TRAIN_SCHEMA=$(curl -s -X GET \
    "$BASE_URL/orgs/$NGC_ORG_NAME/jobs:schema?action=train&base_experiment_id=$BASE_EXPERIMENT_ID" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.default')

# Modify specs as needed
TRAIN_SPECS=$(echo "$TRAIN_SCHEMA" | jq '.train.num_epochs=10 | .train.num_gpus=2')

# Create training job
TRAIN_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "kind": "experiment",
        "name": "rtdetr_training_job",
        "network_arch": "rtdetr",
        "encryption_key": "tlt_encode",
        "workspace": "'"$WORKSPACE_ID"'",
        "action": "train",
        "specs": '"$TRAIN_SPECS"',
        "train_dataset_uris": ["'"$TRAIN_DATASET_URI"'"],
        "eval_dataset_uri": "'"$EVAL_DATASET_URI"'",
        "inference_dataset_uri": "'"$EVAL_DATASET_URI"'",
        "calibration_dataset_uri": "'"$TRAIN_DATASET_URI"'",
        "base_experiment_ids": ["'"$BASE_EXPERIMENT_ID"'"],
        "automl_settings": {
            "automl_enabled": false
        }
    }' | jq -r '.id')
echo $TRAIN_JOB_ID

  5. Monitor Training Job Status

    # Get job status
    curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID \
        -H "Authorization: Bearer $TOKEN" | jq '.status'
    
    # Get detailed job metadata
    curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID \
        -H "Authorization: Bearer $TOKEN" | jq
    
    # Get job logs
    curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:logs" \
        -H "Authorization: Bearer $TOKEN"
    
    # Get all status events emitted by the framework for this job
    curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:events" \
        -H "Authorization: Bearer $TOKEN" | jq
    
    # Filter events for a specific AutoML experiment trial
    curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:events?automl_experiment_number=2" \
        -H "Authorization: Bearer $TOKEN" | jq
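    Scripts often need to block until training finishes before launching evaluation. A minimal polling helper can be sketched as follows; the terminal status strings ("Done", "Error", "Canceled") are assumptions, so check the values your deployment actually reports:

```shell
# Minimal polling helper (a sketch): block until a job reaches a terminal
# state. The terminal status values ("Done", "Error", "Canceled") are
# assumptions -- verify the status strings your deployment reports.
wait_for_job() {
    job_id=$1
    while :; do
        status=$(curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$job_id" \
            -H "Authorization: Bearer $TOKEN" | jq -r '.status')
        echo "Job $job_id status: $status"
        case "$status" in
            Done|Error|Canceled) break ;;
        esac
        sleep 30
    done
}

# Usage:
# wait_for_job "$TRAIN_JOB_ID"
```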
    
  6. Job Control Operations

    # Pause a running job
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:pause" \
        -H "Authorization: Bearer $TOKEN" | jq
    
    # Resume a paused job
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:resume" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{"parent_job_id": "", "specs": {}}' | jq
    
    # Cancel a job
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:cancel" \
        -H "Authorization: Bearer $TOKEN" | jq
    
  7. Create Evaluation Job

    After training completes, run evaluation:

    # Get evaluation schema
    EVAL_SCHEMA=$(curl -s -X GET \
        "$BASE_URL/orgs/$NGC_ORG_NAME/jobs:schema?action=evaluate&base_experiment_id=$BASE_EXPERIMENT_ID" \
        -H "Authorization: Bearer $TOKEN" | jq -r '.default')
    
    # Create evaluation job
    EVAL_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "kind": "experiment",
            "name": "rtdetr_evaluation_job",
            "network_arch": "rtdetr",
            "encryption_key": "tlt_encode",
            "workspace": "'"$WORKSPACE_ID"'",
            "action": "evaluate",
            "parent_job_id": "'"$TRAIN_JOB_ID"'",
            "specs": '"$EVAL_SCHEMA"',
            "eval_dataset_uri": "'"$EVAL_DATASET_URI"'"
        }' | jq -r '.id')
    echo $EVAL_JOB_ID
    
  8. Create Inference Job

    # Get inference schema
    INFERENCE_SCHEMA=$(curl -s -X GET \
        "$BASE_URL/orgs/$NGC_ORG_NAME/jobs:schema?action=inference&base_experiment_id=$BASE_EXPERIMENT_ID" \
        -H "Authorization: Bearer $TOKEN" | jq -r '.default')
    
    # Create inference job
    INFERENCE_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "kind": "experiment",
            "name": "rtdetr_inference_job",
            "network_arch": "rtdetr",
            "encryption_key": "tlt_encode",
            "workspace": "'"$WORKSPACE_ID"'",
            "action": "inference",
            "parent_job_id": "'"$TRAIN_JOB_ID"'",
            "specs": '"$INFERENCE_SCHEMA"',
            "inference_dataset_uri": "'"$EVAL_DATASET_URI"'"
        }' | jq -r '.id')
    echo $INFERENCE_JOB_ID
    
  9. Export Model to ONNX

    # Get export schema
    EXPORT_SCHEMA=$(curl -s -X GET \
        "$BASE_URL/orgs/$NGC_ORG_NAME/jobs:schema?action=export&base_experiment_id=$BASE_EXPERIMENT_ID" \
        -H "Authorization: Bearer $TOKEN" | jq -r '.default')
    
    # Create export job
    EXPORT_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "kind": "experiment",
            "name": "rtdetr_export_job",
            "network_arch": "rtdetr",
            "encryption_key": "tlt_encode",
            "workspace": "'"$WORKSPACE_ID"'",
            "action": "export",
            "parent_job_id": "'"$TRAIN_JOB_ID"'",
            "specs": '"$EXPORT_SCHEMA"'
        }' | jq -r '.id')
    echo $EXPORT_JOB_ID
    
  10. Generate TensorRT Engine

    # Get TensorRT engine schema
    TRT_SCHEMA=$(curl -s -X GET \
        "$BASE_URL/orgs/$NGC_ORG_NAME/jobs:schema?action=gen_trt_engine&base_experiment_id=$BASE_EXPERIMENT_ID" \
        -H "Authorization: Bearer $TOKEN" | jq -r '.default')
    
    # Create TensorRT engine generation job
    TRT_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "kind": "experiment",
            "name": "rtdetr_trt_engine_job",
            "network_arch": "rtdetr",
            "encryption_key": "tlt_encode",
            "workspace": "'"$WORKSPACE_ID"'",
            "action": "gen_trt_engine",
            "parent_job_id": "'"$EXPORT_JOB_ID"'",
            "specs": '"$TRT_SCHEMA"'
        }' | jq -r '.id')
    echo $TRT_JOB_ID
    
  11. Download Job Files

    # List job files
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:list_files" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "retrieve_logs": true,
            "retrieve_specs": true
        }' | jq
    
    # Download selective files
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:download_selective_files" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "best_model": true,
            "latest_model": false
        }' > job_files.tar.gz
    
    # Download entire job
    curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:download" \
        -H "Authorization: Bearer $TOKEN" > job_complete.tar.gz
    
  12. Publish Model

    # Publish trained model to NGC
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:publish_model" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "display_name": "RT-DETR Production Model v1.0",
            "description": "Trained RT-DETR model for object detection",
            "team": "ml_team"
        }' | jq
    
    # Remove published model
    curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:remove_published_model" \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer $TOKEN" \
        -d '{
            "team": "ml_team"
        }' | jq
    

Inference Microservices#

Deploy trained models as inference microservices for scalable, real-time inference.

Start Inference Microservice

MICROSERVICE_ID=$(curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG_NAME/inference_microservices:start \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "docker_image": "nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf2.11.0",
        "gpu_type": "a100",
        "num_gpus": 1,
        "parent_job_id": "'"$TRAIN_JOB_ID"'",
        "kind": "experiment",
        "model_path": "/workspace/models/best_model.pth",
        "workspace": "'"$WORKSPACE_ID"'",
        "checkpoint_choose_method": "best_model",
        "network_arch": "rtdetr"
    }' | jq -r '.id')
echo $MICROSERVICE_ID

Check Microservice Status

curl -s -X GET \
    "$BASE_URL/orgs/$NGC_ORG_NAME/inference_microservices/$MICROSERVICE_ID:status" \
    -H "Authorization: Bearer $TOKEN" | jq

Make Inference Request

# Base64-encoded image inference
curl -s -X POST \
    "$BASE_URL/orgs/$NGC_ORG_NAME/inference_microservices/$MICROSERVICE_ID:inference" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "input": ["data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEA..."],
        "model": "rtdetr_model"
    }' | jq

# Cloud media path inference
curl -s -X POST \
    "$BASE_URL/orgs/$NGC_ORG_NAME/inference_microservices/$MICROSERVICE_ID:inference" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "media": "s3://my-bucket/path/to/image.jpg",
        "prompt": "Detect objects in this image"
    }' | jq

Stop Microservice

curl -s -X POST \
    "$BASE_URL/orgs/$NGC_ORG_NAME/inference_microservices/$MICROSERVICE_ID:stop" \
    -H "Authorization: Bearer $TOKEN" | jq

Job Events#

When a job fails or produces unexpected results, the standard status field only shows a summary. The events endpoint gives you the full, chronological log of every status message the training framework emitted — useful for debugging loss spikes, epoch-by-epoch progress, or understanding exactly where an AutoML trial went wrong.

Endpoint: GET /api/v2/orgs/{org_name}/jobs/{job_id}:events

Query parameters:

  • automl_experiment_number (integer, optional): When set, returns only the events from that specific AutoML trial number. Omit to retrieve events from all trials.

Example:

# Get all events for a job
curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:events" \
    -H "Authorization: Bearer $TOKEN" | jq

# Filter events for AutoML trial 3
curl -s -X GET "$BASE_URL/orgs/$NGC_ORG_NAME/jobs/$TRAIN_JOB_ID:events?automl_experiment_number=3" \
    -H "Authorization: Bearer $TOKEN" | jq

Response (JobEventsRsp):

{
    "events": [
        "Epoch 1/10 - train_loss: 0.842",
        "Epoch 2/10 - train_loss: 0.713",
        "Epoch 3/10 - train_loss: 0.621 - val_miou: 0.543"
    ]
}
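For scripting, the events array slices cleanly with jq. The snippet below works on a captured response (the sample mirrors the JobEventsRsp shape above) and prints only the most recent events:

```shell
# Slice the last two events out of a captured JobEventsRsp.
# The sample response below mirrors the shape shown above.
EVENTS_JSON='{"events": [
    "Epoch 1/10 - train_loss: 0.842",
    "Epoch 2/10 - train_loss: 0.713",
    "Epoch 3/10 - train_loss: 0.621 - val_miou: 0.543"
]}'

echo "$EVENTS_JSON" | jq -r '.events[-2:][]' > last_events.txt
cat last_events.txt
```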

Job Execution Backends#

By default, jobs run on the local Kubernetes cluster (local backend). TAO 6.26+ adds support for running jobs on remote SLURM clusters and Lepton AI as alternative execution backends. The backend is selected per-job via the backend_details field.

Supported backends:

  • local: Default. Runs jobs on the Kubernetes cluster hosting FTMS.

  • slurm: Submits jobs to a remote SLURM cluster. Requires a SLURM workspace with SSH connectivity.

  • lepton: Runs jobs on the Lepton AI platform. Requires a Lepton workspace with authentication credentials.

Creating a SLURM workspace:

SLURM_WORKSPACE_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/workspaces \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "name": "slurm_cluster",
        "cloud_type": "slurm",
        "cloud_specific_details": {
            "slurm_user": "tao_user",
            "slurm_hostname": ["slurm-login-01.example.com", "slurm-login-02.example.com"],
            "base_results_dir": "/scratch/tao_results"
        }
    }' | jq -r '.id')
echo $SLURM_WORKSPACE_ID

  • slurm_user (string, required): SSH user on the SLURM login node.

  • slurm_hostname (list of strings, required): One or more login node hostnames. Multiple hostnames enable automatic failover if the primary node is unreachable.

  • base_results_dir (string, optional): Directory on the SLURM cluster where job results are written.

Creating a Lepton workspace:

LEPTON_WORKSPACE_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/workspaces \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "name": "lepton_workspace",
        "cloud_type": "lepton",
        "cloud_specific_details": {
            "lepton_workspace_id": "my-lepton-workspace",
            "lepton_auth_token": "lpt_...",
            "storage_backend": {
                "storage_type": "aws",
                "access_key": "AKIA...",
                "secret_key": "...",
                "cloud_bucket_name": "my-bucket",
                "cloud_region": "us-west-2"
            }
        }
    }' | jq -r '.id')
echo $LEPTON_WORKSPACE_ID

  • lepton_workspace_id (string, required): Your Lepton AI workspace identifier.

  • lepton_auth_token (string, required): Lepton API authentication token.

  • storage_backend (object, required): Cloud storage used by the Lepton workspace. Set storage_type to aws, azure, or seaweedfs and provide the corresponding credentials.

Running a job on SLURM:

Pass backend_details when creating the job:

TRAIN_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "kind": "experiment",
        "name": "rtdetr_slurm_train",
        "network_arch": "rtdetr",
        "workspace": "'"$SLURM_WORKSPACE_ID"'",
        "action": "train",
        "specs": '"$TRAIN_SPECS"',
        "train_dataset_uris": ["'"$TRAIN_DATASET_URI"'"],
        "base_experiment_ids": ["'"$BASE_EXPERIMENT_ID"'"],
        "backend_details": {
            "backend_type": "slurm",
            "partition": "gpu_a100",
            "cluster_name": "my-cluster"
        }
    }' | jq -r '.id')
echo $TRAIN_JOB_ID

backend_details fields:

  • backend_type (string, required): "local", "slurm", or "lepton".

  • partition (string, optional, SLURM only): SLURM partition to submit the job to.

  • cluster_name (string, optional, SLURM only): SLURM cluster name.

  • platform_id (string, optional, Lepton only): Lepton platform identifier.
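A Lepton-backed job follows the same pattern as the SLURM example above. The abbreviated sketch below shows only the fields that change; the platform ID is a placeholder, and a real request still needs workspace, specs, and dataset fields:

```shell
# Abbreviated sketch of a Lepton job payload; only backend_details differs
# from the SLURM example. "<lepton-platform-id>" is a placeholder.
jq -n '{
    kind: "experiment",
    name: "rtdetr_lepton_train",
    network_arch: "rtdetr",
    action: "train",
    backend_details: {
        backend_type: "lepton",
        platform_id: "<lepton-platform-id>"
    }
}' > lepton_job_payload.json
cat lepton_job_payload.json
```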

Note

When backend_details is omitted, the job runs on the local Kubernetes cluster. AutoML jobs on SLURM backends automatically resume interrupted trials after preemption.

Preventing Accidental Duplicate Artifacts#

By default, the API blocks creation of jobs, datasets, or workspaces that are identical to an already-existing artifact. This prevents common mistakes like submitting the same training job twice from a script or notebook.

If you intentionally want to create a duplicate — for example, to re-run a job with different random seeds while keeping the same configuration — pass "force_create": true in the request body.

This applies to:

  • POST /api/v2/orgs/{org_name}/jobs

  • POST /api/v2/orgs/{org_name}/datasets

  • POST /api/v2/orgs/{org_name}/workspaces

Example — re-running a training job with the same config:

TRAIN_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "kind": "experiment",
        "name": "rtdetr_training_job",
        "network_arch": "rtdetr",
        "encryption_key": "tlt_encode",
        "workspace": "'"$WORKSPACE_ID"'",
        "action": "train",
        "specs": '"$TRAIN_SPECS"',
        "train_dataset_uris": ["'"$TRAIN_DATASET_URI"'"],
        "base_experiment_ids": ["'"$BASE_EXPERIMENT_ID"'"],
        "force_create": true
    }' | jq -r '.id')
echo $TRAIN_JOB_ID

Note

force_create defaults to false. Omit it in normal usage — the safeguard is on by default.

Calibration Datasets for INT8/FP8 TensorRT Engines#

When generating a quantized TensorRT engine (gen_trt_engine action), TensorRT needs to see a representative sample of your data to determine the optimal quantization scale factors. This is called calibration. You provide this data by pointing to a calibration dataset via calibration_dataset_uri.

Typically you can reuse your training or validation dataset for calibration — it does not need to be a separate dataset, but it should be representative of the inputs your deployed model will see.

Valid dataset intent values (dataset_intent field):

  • training

  • evaluation

  • testing

  • calibration (new — use this when the dataset is purely for quantization calibration)

The following networks support gen_trt_engine with INT8/FP8 calibration:

  • Object Detection: rtdetr, deformable_detr, dino, grounding_dino, mask_grounding_dino

  • Segmentation: mask2former, oneformer, segformer

  • Classification: classification_pyt, nvdinov2

  • OCR: ocdnet, ocrnet

  • Depth Estimation: depth_net (mono, stereo)

  • Autonomous Driving: sparse4d

  • Change Detection: visual_changenet (classify, segment)

  • Multimodal: clip

When creating a gen_trt_engine job, supply the calibration dataset via calibration_dataset_uri:

TRT_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "kind": "experiment",
        "name": "rtdetr_trt_int8_job",
        "network_arch": "rtdetr",
        "encryption_key": "tlt_encode",
        "workspace": "'"$WORKSPACE_ID"'",
        "action": "gen_trt_engine",
        "parent_job_id": "'"$EXPORT_JOB_ID"'",
        "calibration_dataset_uri": "'"$TRAIN_DATASET_URI"'",
        "specs": '"$TRT_SCHEMA"'
    }' | jq -r '.id')
echo $TRT_JOB_ID

Dataset Processing Jobs#

Create dataset processing jobs using the unified jobs endpoint:

Dataset Conversion Job

CONVERT_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "kind": "dataset",
        "train_dataset_uris": ["'"$TRAIN_DATASET_URI"'"],
        "action": "convert",
        "specs": {
            "output_format": "tfrecords",
            "train_split": 0.8,
            "val_split": 0.2,
            "shuffle": true
        }
    }' | jq -r '.id')
echo $CONVERT_JOB_ID

Workspace Backup and Restore#

Backup Workspace

curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/workspaces/$WORKSPACE_ID:backup" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "backup_file_name": "mongodb_backup_20251110.gz"
    }' | jq

Restore Workspace

curl -s -X POST "$BASE_URL/orgs/$NGC_ORG_NAME/workspaces/$WORKSPACE_ID:restore" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "backup_file_name": "mongodb_backup_20251110.gz"
    }' | jq

Note

  • Restore action is recommended when reinstalling the FTMS Helm Chart or if ptmPull is set to False.

  • Workspace used for restore must refer to a cloud bucket which contains a backup file generated by the FTMS backup action.

Job Management Features#

Graceful Job Termination#

TAO FTMS supports graceful termination of training jobs, allowing them to complete their current checkpoint and upload results before shutting down. This ensures that no training progress is lost when pausing or stopping jobs.

Using Graceful Pause#

When pausing a job, you can specify the graceful parameter in the request body to allow the job to finish its current training epoch and upload checkpoints:

# Graceful pause (recommended) - allows checkpoint upload before stopping
curl -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs/$JOB_ID:pause \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"graceful": true}'

# Abrupt pause - stops immediately without uploading
curl -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs/$JOB_ID:pause \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"graceful": false}'

Additional Termination Options#

When running training jobs, you can configure graceful termination behavior using these top-level parameters:

  • retain_checkpoints_for_resume (boolean): Retain intermediate checkpoints for resuming training later (useful for Hyperband AutoML)

  • early_stop_epoch (integer): Specify a predefined epoch number to stop training at, triggering graceful termination and checkpoint upload

These options are specified as top-level parameters in the job run request:

curl -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "kind": "experiment",
    "action": "train",
    "specs": {
      "train": {
        "num_epochs": 10,
        "num_gpus": 2
      }
    },
    "retain_checkpoints_for_resume": true,
    "early_stop_epoch": 50
  }'

Job Timeout Configuration#

TAO FTMS provides per-job timeout management for long-running training jobs. Each job has its own timeout value (default: 60 minutes), providing fine-grained control over different types of operations.

Specifying Timeout When Running a Job#

The timeout_minutes parameter is specified as a top-level field in the job run request body, not inside the specs.

# Training job with 3-hour timeout
curl -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "kind": "experiment",
    "action": "train",
    "specs": {
      "train": {
        "num_epochs": 10,
        "num_gpus": 2
      }
    },
    "timeout_minutes": 180
  }'

Parameters#

  • timeout_minutes (integer, optional): Timeout in minutes for the job. Default: 60 minutes. Must be at least 1 minute.

  • This is a top-level parameter alongside action, specs, parent_job_id, etc., NOT inside specs.

Jobs that exceed their timeout are automatically terminated to prevent runaway processes from consuming cluster resources indefinitely.

Best Practices#

  • Adjust the default 60-minute timeout to your workload: set realistic values based on dataset size and model complexity.

  • Training jobs typically need longer timeouts than evaluation/inference jobs.

  • Consider hardware capabilities (GPU type and memory) when setting timeout values.

  • Monitor job progress through the status API to adjust timeouts if needed.

  • For large-scale training (large models or extensive datasets), increase the timeout accordingly.

Cloud File Operations with Progress Tracking#

TAO FTMS provides enhanced visibility into cloud storage operations with detailed progress tracking for uploads and downloads. Progress information is available when you fetch the job metadata through the status API.

Progress Tracking Features#

  • Real-time progress updates for dataset downloads

  • Upload progress tracking when saving checkpoints to cloud storage

  • File count and size information for large model checkpoints

  • Current file being transferred with individual file progress

  • Overall transfer progress across all files

  • Transfer speed and ETA estimation

  • Docker image pull and extraction status reflected on job status (see below)

Accessing Progress Information#

Progress updates are included in the job metadata when you query job status:

# Get job status to see progress
curl -s -X GET $BASE_URL/orgs/$NGC_ORG_NAME/jobs/$JOB_ID \
  -H "Authorization: Bearer $TOKEN" | jq

Download Progress Examples#

When downloading datasets or pretrained models, you’ll see progress updates like these in the job status response:

NGC Model Download#

Current file download: NGC: nvdinov2_vitg (nvidia/tao)
Current file download Progress: 1.2 GB
Total Download Progress: 1/7 files (14.3%), 1.2 GB/25.7 GB (4.6%)
Remaining: 6 files, 24.5 GB, ETA: 0:06:14

HuggingFace Model Download#

Current file download: HF: nvidia/Cosmos-Reason1-7B
Current file download Progress: 7.1 GB/15.5 GB (45.7%)
Total Download Progress: 3/7 files (42.9%), 11.7 GB/25.7 GB (45.3%)
Remaining: 4 files, 14.1 GB, ETA: 0:01:55

Dataset Download#

Current file download: mvtec_mgcn_train/images.tar.gz
Current file download Progress: 80.0 MB/5.0 GB (1.6%)
Total Download Progress: 5/7 files (71.4%), 20.1 GB/25.7 GB (78.2%)
Remaining: 2 files, 5.6 GB, ETA: 0:00:31

Upload Progress Example#

When uploading model checkpoints or results to cloud storage:

Current file upload: train/model_epoch_000_step_00117.pth
Current file upload Progress: 8.0 MB/1.0 GB (0.8%)
Total Upload Progress: 2/4 files (50.0%), 8.0 MB/1.0 GB (0.8%)
Remaining: 2 files, 1.0 GB, ETA: 0:44:39
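
When scripting around these status messages, you can extract individual fields with standard shell tools. This sketch assumes the exact "ETA:" wording shown in the examples above; the field layout may vary between releases.

```shell
# Pull the ETA field out of a progress line with the format shown above.
progress_line="Remaining: 2 files, 1.0 GB, ETA: 0:44:39"
eta=$(printf '%s\n' "$progress_line" | sed -n 's/.*ETA: \([0-9:]*\).*/\1/p')
echo "ETA: $eta"
```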

Progress Information Includes#

  • Current file: Name and source of the file being transferred

  • Current file progress: Bytes transferred and percentage for the current file

  • Total progress: Overall completion across all files (count and size)

  • Remaining: Files and data size yet to be transferred

  • ETA: Estimated time to completion based on current transfer rate

Docker Image Pull and Extraction Status#

Previously, a job that was pulling its container image would appear stuck in a pending state with no visible progress — making it impossible to tell whether the cluster was busy or the image download was just slow.

The job status now surfaces the image pull and extraction phase directly. When you query job metadata, you will see messages like:

Pulling image: nvcr.io/nvidia/tao/tao-toolkit:6.26.0
Image pull progress: layer 3/12 (25.0%)
Extracting image layers: 6/12 complete

This makes it clear the job is active and lets you estimate how long the startup will take before the actual training begins.

Use Cases#

  • Monitoring large dataset downloads from cloud storage

  • Tracking model checkpoint uploads during training

  • Observing PTM (PreTrained Model) downloads from NGC or HuggingFace

  • Monitoring experiment artifact uploads to cloud workspaces

  • Verifying transfer completion and detecting stalled operations

  • Monitoring Docker image pull and extraction for custom framework jobs

AutoML#

AutoML is a TAO Toolkit API service that automatically selects deep learning hyperparameters for a chosen model and dataset.

Supported AutoML Algorithms

The automl_algorithm field accepts the following values:

  • bayesian: Gaussian-process-based adaptive search. Key parameters: automl_max_recommendations

  • hyperband: Successive halving with adaptive resource allocation. Key parameters: automl_R, automl_nu, epoch_multiplier

  • bohb: Bayesian Optimization and Hyperband. Key parameters: automl_kde_samples, automl_top_n_percent, automl_min_points_in_model

  • bfbo: Batch First Bayesian Optimization. Key parameters: automl_max_recommendations

  • asha: Asynchronous Successive Halving Algorithm. Key parameters: automl_max_concurrent, automl_max_trials

  • pbt: Population-Based Training. Key parameters: automl_population_size, automl_eval_interval

  • dehb: Differential Evolution with Hyperband. Key parameters: automl_mutation_factor, automl_crossover_prob

  • hyperband_es: Hyperband with Early Stopping. Key parameters: automl_early_stop_threshold, automl_min_early_stop_epochs

See the AutoML docs for full parameter descriptions and algorithm explanations.
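
As an illustration of the algorithm-specific parameters listed above, the fragment below shows a hyperband configuration for automl_settings. The parameter names come from the list above; the values are made up and should be tuned for your workload.

```shell
# Hypothetical automl_settings fragment for the hyperband algorithm.
# Values are illustrative only.
AUTOML_SETTINGS='{
    "automl_enabled": true,
    "automl_algorithm": "hyperband",
    "automl_R": 27,
    "automl_nu": 3,
    "epoch_multiplier": 1
}'
echo "$AUTOML_SETTINGS"
# Spliced into a job-creation body as:  "automl_settings": '"$AUTOML_SETTINGS"'
```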

W&B Integration

When Weights & Biases is configured for your experiment, AutoML automatically logs a summary table of all trials to your W&B project. The table includes each trial’s hyperparameters, validation metrics, and final status, making it easy to compare runs and identify the best configuration.

Create Training Job with AutoML

AUTOML_JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG_NAME/jobs \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d '{
        "kind": "experiment",
        "name": "automl_training_job",
        "network_arch": "classification_pyt",
        "encryption_key": "tlt_encode",
        "workspace": "'"$WORKSPACE_ID"'",
        "action": "train",
        "specs": '"$TRAIN_SPECS"',
        "train_dataset_uris": ["'"$TRAIN_DATASET_URI"'"],
        "eval_dataset_uri": "'"$EVAL_DATASET_URI"'",
        "base_experiment_ids": ["'"$BASE_EXPERIMENT_ID"'"],
        "automl_settings": {
            "automl_enabled": true,
            "automl_algorithm": "bayesian",
            "automl_max_recommendations": 20,
            "automl_delete_intermediate_ckpt": true,
            "metric": "mAP"
        }
    }' | jq -r '.id')
echo $AUTOML_JOB_ID

Get AutoML Defaults

curl -s -X GET \
    "$BASE_URL/orgs/$NGC_ORG_NAME/automl:get_param_details?base_experiment_id=$BASE_EXPERIMENT_ID" \
    -H "Authorization: Bearer $TOKEN" | jq

The metric field inside automl_settings lets you override which validation metric AutoML uses to rank trials. When omitted, the default metric for the network is used (e.g., mAP for detection models, accuracy for classification). Set this when your model reports multiple metrics and you want AutoML to optimize for a specific one.

See the AutoML docs for more details.

Migration from v1 to v2#

Key differences when migrating from API v1:

Endpoint Changes

  • v1: /api/v1/orgs/{org}/experiments → v2: /api/v2/orgs/{org}/jobs (with kind: "experiment")

  • v1: /api/v1/orgs/{org}/datasets/{id}/actions → v2: /api/v2/orgs/{org}/jobs (with kind: "dataset")

Job Creation

  • v1: Two-step process (create experiment, then run action)

  • v2: Single-step job creation with all parameters

Authentication

  • v1: File-based config

  • v2: Environment variables and JWT tokens

Metadata Access

  • v1: Generic /metadata endpoint

  • v2: Resource-specific endpoints (/workspaces/{id}, /datasets/{id}, /jobs/{id})

Additional Resources#

For complete API specifications, see the TAO API Reference.