Migrating to TAO 7.0

From the TAO CLI to Agent Prompts

Migration guide for the API-less, skill-bank-driven workflow.

Plugin: tao-skills (models/, data/, platform/, applications/).

1. What Changed

Previous TAO releases shipped a hosted Fine-Tuning Microservice (FTMS) plus the nvidia-tao-client package, which provided a TAO CLI and a Python SDK. Both talked to a REST API that hosted workspaces, datasets, jobs, and inference microservices as server-side state.

This release removes the API surface. There is no FTMS server, no central database, and no tao login to authenticate against. Instead, you load the tao-skills plugin in an agent (Claude Code or any compatible coding agent) and describe what you want in plain English. The agent reads the relevant SKILL.md files, builds the right command, and runs it on your local Docker daemon, or on SLURM, Kubernetes, or Brev if you ask for a remote platform.

Concretely, three things go away:

  • The REST API. There is no https://<host>/api/v2/... to point clients at. Every CLI verb is now an agent prompt that invokes a skill directly.

  • Server-side state (workspaces, datasets, jobs as DB rows). You manage your own cloud paths and your own local artifact directories. The agent helps you keep track inside a session, but nothing is persisted on a server.

  • UUIDs. WORKSPACE_ID, DATASET_ID, JOB_ID exported as shell variables are replaced by natural references (“the training run I just kicked off”, “the checkpoint at ./runs/dino/”).

What stays the same: the model containers (DINO, CLIP, Visual ChangeNet, etc.), the AutoML algorithms, the dataset formats. Only the orchestration layer changed — from a REST service to a skill-bank-driven agent.

2. Quick Start

Replace your CLI install with the agent plus the skill plugin.

Before

pip install nvidia-tao-client
tao login --ngc-key $NGC_KEY --ngc-org-name $NGC_ORG
tao --help

After

# In a Claude Code session:
/plugin marketplace add https://github.com/NVIDIA-TAO/tao-skills-bank.git#7.0.1
/plugin install tao-skills@tao-skill-bank

# Export credentials in your shell BEFORE launching the agent.
# Use your shell's secret-loading mechanism of choice — for example a
# password manager that prints to stdout, your OS keychain, or a CI-injected
# environment. Do NOT write these values to a file on disk.
export NGC_KEY="$(security find-generic-password -s NGC_KEY -w)"
export NGC_ORG="my-org"
export ACCESS_KEY="$(op read 'op://Private/AWS/access_key')"
export SECRET_KEY="$(op read 'op://Private/AWS/secret_key')"
export S3_BUCKET_NAME="my-bucket"
export S3_ENDPOINT_URL="https://s3.us-west-1.amazonaws.com"

# Now launch the agent so it inherits the exported env.
claude

# Verify your GPU host is ready (only needed for the local-docker platform):
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Note

Security policy. Set every credential as an environment variable in the shell session BEFORE launching the agent. The agent never reads credential values directly; the model containers pick them up from the inherited environment at docker run time. DO NOT write secrets to a file on disk — files are leaky (they end up in shell history, backups, Spotlight indexes, and accidental commits). Use your password manager, OS keychain, or a CI secret injector to materialise the value into the shell env at launch time, and let it disappear when the session ends.

3. Required Environment Variables

All TAO credentials and tunables are read from the shell environment that launched the agent. The agent never reads a credentials file from disk, and neither do the containers it spawns — every value below must be exported in your shell BEFORE you start the agent, and the agent must be launched from that same shell so the values are inherited.

Note

Why env vars, not files. Files are leaky: shell history, periodic backups, Spotlight / Windows Search indexes, and accidental git add . commits all silently capture them. Keep secrets in the shell process only, populated at launch time from your password manager, your OS keychain, or a CI secret injector. When the shell exits, the secret is gone.

3.1 The Complete Variable List

NGC (required for every TAO container pull from nvcr.io)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| NGC_KEY | NGC personal API key, used as the docker login password for nvcr.io. | ngc.nvidia.com → Setup → Generate API Key. | Yes |
| NGC_ORG | NGC organization slug. Containers pull from nvcr.io/<org>/.... | Your NGC org name (e.g. nvstaging, nvidia). | Yes |
| NGC_TEAM | Optional team scope for model publishing. | Your NGC team slug. | Only for publish-model |

S3 / S3-compatible object storage (required when datasets live on AWS or similar)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| ACCESS_KEY | S3 access key id. Plumbed into containers as AWS_ACCESS_KEY_ID. | AWS console → IAM, or your S3-compatible provider. | Yes if using s3:// |
| SECRET_KEY | S3 secret key. Plumbed into containers as AWS_SECRET_ACCESS_KEY. | Same provider as ACCESS_KEY. | Yes if using s3:// |
| S3_BUCKET_NAME | Default bucket the agent assumes for s3:// shorthand. | Your bucket name. | Recommended |
| S3_ENDPOINT_URL | S3 endpoint URL. Required for non-AWS S3-compatible storage (MinIO, Wasabi, NVCF storage, …); leave unset for vanilla AWS. | Your provider’s endpoint. | If non-AWS S3 |
| AWS_REGION | Region for AWS S3 operations (e.g. us-west-1). | Your bucket region. | Optional |

Azure Blob Storage (only if you use azure:// URIs)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| AZURE_STORAGE_ACCOUNT | Azure storage account name. | Azure portal → Storage account. | Yes if using azure:// |
| AZURE_STORAGE_KEY | Storage account access key. Prefer Azure CLI auth or a SAS token where possible; key-based auth is the simplest fallback. | Azure portal → Storage account → Access keys. | Yes if using azure:// |

HuggingFace (only if pulling private models / gated checkpoints)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| HF_TOKEN | HuggingFace access token. Aliased as HUGGINGFACE_TOKEN by some skills. | huggingface.co → Settings → Access Tokens. | Only for HF models |

Remote platform: Kubernetes (only if --platform=kubernetes)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| KUBECONFIG | Path to your kubeconfig (not a secret itself; the file it points at is). The context selected must target a cluster with the NVIDIA GPU Operator installed. | Provided by your cluster admin. | Yes for k8s |

Remote platform: SLURM (only if --platform=slurm)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| (SSH agent) | SLURM uses ssh to a head node, not a token. Ensure ssh-agent is running and your private key is added (ssh-add) BEFORE launching the agent. | Your existing SSH key. | Yes for slurm |

AutoML LLM brain (only for algorithm in {llm, hybrid, autoresearch})

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| AUTOML_LLM_ENDPOINT | OpenAI-compatible endpoint URL. Default https://inference-api.nvidia.com. | Your NIM / OpenAI / vLLM endpoint. | Yes for LLM AutoML |
| AUTOML_LLM_MODEL | LLM model name passed to the endpoint. | Endpoint’s model registry (e.g. meta/llama-3.1-70b-instruct, gcp/google/gemini-3.1-pro-preview). | Yes for LLM AutoML |
| AUTOML_LLM_API_KEY | Bearer token for the LLM endpoint. | Your provider (NVIDIA NIM key, OpenAI key, etc.). | Yes for LLM AutoML |
| NVIDIA_API_KEY | Fallback when AUTOML_LLM_API_KEY isn’t set and the endpoint is NVIDIA NIM. | build.nvidia.com → Get API Key. | Optional fallback |

Observability (optional but recommended for sweeps)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| WANDB_API_KEY | Weights & Biases API key. Without it, WandB tracking is silently disabled. | wandb.ai → Settings. | Only for WandB |

Configuration (not secrets, but required for some flows)

| Variable | Purpose | Where to get it | Required? |
|---|---|---|---|
| TAO_SKILL_BANK_PATH | Filesystem path to the skill bank. The session-start hook sets this; export yourself only if you script your own runner outside the agent. | Usually ~/tao-skills-bank. | Auto-set |
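The variable list above lends itself to a presence-only preflight before you launch the agent. The following is a minimal sketch, not part of the skill bank: the function name tao_env_report is hypothetical, and which names you pass depends on the storage and platforms you use. It reports set or missing per variable and never prints a value.

```shell
# Hypothetical helper (not shipped with tao-skills): report which variables
# are exported in this shell without ever printing their values.
tao_env_report() {
  missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what the agent inherits.
    if [ -n "$(printenv "$name")" ]; then
      printf '%-20s set\n' "$name"
    else
      printf '%-20s MISSING\n' "$name"
      missing=$((missing + 1))
    fi
  done
  # Exit status is non-zero when at least one variable is absent.
  [ "$missing" -eq 0 ]
}

# Example: NGC variables are always required; the S3 pair only for s3:// URIs.
# tao_env_report NGC_KEY NGC_ORG ACCESS_KEY SECRET_KEY && claude
```

Because the function returns a non-zero status when anything is missing, it can gate the agent launch as in the commented example.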

4. The New Mental Model

Every CLI verb maps to one or more skills under ~/tao-skills-bank/. Knowing which skill the agent reaches for makes the prompts much easier to write:

| Skill layer | What it owns |
|---|---|
| models/<network>/ | Container image, per-action command, accepted dataset format, required dataset URIs, spec template, AutoML notes. One per network (dino, segformer, clip, …). |
| applications/tao-train-single-step | Standard fine-tune workflow: train → eval → export, composing the model and platform skills. |
| applications/tao-run-automl | Hyperparameter optimization (bayesian, hyperband, ASHA, LLM-guided, autoresearch). Drives the model skill repeatedly. |
| platform/tao-run-on-local-docker | Default backend: a plain docker run --gpus all on the local host. |
| platform/{tao-run-on-slurm,tao-run-on-kubernetes,tao-run-on-brev} | Remote backends: same container, different launcher. Switch by asking the agent for a different platform. |
| data/* | Data preparation skills: kNN mining, embedding generation, captioning, AOI mining, anomaly generation, KPI analysis. They cover what tao create-job --kind dataset used to do. |
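To see which skills from these layers are actually present on disk, you can walk the skill bank yourself. This is a sketch under two assumptions: every skill directory contains a SKILL.md (as the per-network layout in this guide suggests), and the bank lives at TAO_SKILL_BANK_PATH or, failing that, ~/tao-skills-bank. The function name list_skills is hypothetical.

```shell
# Sketch: print every directory under the skill bank that holds a SKILL.md,
# relative to the bank root. Assumes one SKILL.md per skill directory.
list_skills() {
  root="${1:-${TAO_SKILL_BANK_PATH:-$HOME/tao-skills-bank}}"
  root="${root%/}"   # normalise away a trailing slash
  find "$root" -type f -name SKILL.md | sort | while IFS= read -r file; do
    dir="${file%/SKILL.md}"
    printf '%s\n' "${dir#"$root"/}"
  done
}

# Example:
# list_skills                 # uses TAO_SKILL_BANK_PATH or ~/tao-skills-bank
# list_skills /path/to/bank   # explicit root
```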

5. How to Read Each Entry

Sections 6–13 each contain a three-column table. The left column is the CLI you used to type. The middle column is an agent prompt that does the same thing; treat it as a concrete starting point and say it in your own words. The right column lists the skills the agent will reach for so you know where to look if something needs tweaking.

Note

Legend. Rows tagged (removed) document CLI verbs that have no equivalent in the new workflow — they managed FTMS-only state that no longer exists. Section 15 collects all of them in one place.

6. Authentication, Workspaces, and Datasets

Without a server, there is no login, no workspace, no dataset registry. Authentication to NGC and to your cloud bucket lives entirely in your shell environment — every credential is exported in the shell that launches the agent and inherited by every container the agent spawns. “Workspace” collapses to your cloud bucket URI prefix; “dataset” collapses to a path inside that prefix.

Note

Reminder. Secrets must never be written to a file. Export NGC_KEY, NGC_ORG, ACCESS_KEY, SECRET_KEY, S3_BUCKET_NAME, S3_ENDPOINT_URL, HF_TOKEN, WANDB_API_KEY, AZURE_*, etc. in the shell before launching the agent. If a row below tells you to “set X”, that always means export X=... in your shell, not editing a file.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao login --ngc-key $K --ngc-org-name $O | “Are my NGC credentials exported in this shell?” Agent runs [ -n "$NGC_KEY" ] && [ -n "$NGC_ORG" ], a presence-only check that never reads the values. To change credentials, export NGC_KEY=... NGC_ORG=... and restart the agent. Do not save them to a file. | (shell env) |
| tao logout | “Unset my TAO credentials in this shell.” Agent emits unset NGC_KEY NGC_ORG ACCESS_KEY SECRET_KEY ... which you run in your shell. Quitting the shell achieves the same thing. | (shell env) |
| tao whoami | “Which TAO credentials does the agent see?” Reports present/absent per variable, never the value. | (shell env) |
| tao --version | “What version of the skill bank is loaded?” Reports skill-bank git SHA and the TAO container image versions pinned in versions.yaml. | (versions.yaml) |
| tao <net> create-workspace-aws | (removed) No server-side workspaces exist. Export ACCESS_KEY / SECRET_KEY / S3_BUCKET_NAME / S3_ENDPOINT_URL in your shell BEFORE launching the agent, then reference cloud URIs directly (s3://my-bucket/data/train) in train prompts. Do not save the keys to a file. | |
| tao <net> create-workspace-azure | (removed) Same story: export AZURE_STORAGE_ACCOUNT / AZURE_STORAGE_KEY (or use Azure CLI auth) in your shell before launching the agent, then reference azure://... URIs. | |
| tao <net> list-workspaces | (removed) No server-side state to list. | |
| tao <net> get-workspace-metadata | (removed) No server-side state. | |
| tao <net> update-workspace | (removed) No server-side state to update. | |
| tao <net> delete-workspace | (removed) No server-side state to delete. | |
| tao <net> backup-workspace / restore-workspace | (removed) There is no workspace DB. Back up your local artifact dir and cloud bucket the usual way. | |
| tao <net> create-dataset | (removed) Datasets are no longer registered server-side. Place data at s3://bucket/path (or azure://, file://, lustre://) and pass the URI directly into train prompts. | |
| tao <net> list-datasets | (removed) Use aws s3 ls s3://bucket/ (or the agent: “list the contents of s3://my-bucket/data/”). | |
| tao <net> get-dataset-metadata | (removed) Use the cloud CLI (aws s3 ls --recursive s3://...) or ask the agent to inspect the bucket. | |
| tao <net> update-dataset / delete-dataset | (removed) Same: cloud-side ops via aws s3 / az storage / gsutil. | |
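Since a workspace is now just a bucket prefix and a dataset is a path inside it, the old pair of workspace bucket and --cloud-file-path collapses into a single URI. A minimal sketch of that mapping follows; the helper name dataset_uri is hypothetical, and it assumes S3_BUCKET_NAME is set as described in Section 3.

```shell
# Sketch: turn an old --cloud-file-path value into the s3:// URI you now
# pass directly in a prompt. Requires S3_BUCKET_NAME to be set.
dataset_uri() {
  bucket="${S3_BUCKET_NAME:?export S3_BUCKET_NAME first}"
  path="${1#/}"   # drop one leading slash so the URI has no empty segment
  printf 's3://%s/%s\n' "$bucket" "$path"
}

# Example (bucket name taken from the end-to-end example in this guide):
# S3_BUCKET_NAME=mybucket dataset_uri /data/train
```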

7. Inspecting Models, Schemas, and AutoML Defaults

The CLI hit a REST endpoint to learn what each network supports. The agent reads the same information directly from the model skill’s SKILL.md and references/skill_info.yaml — no network call.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> list-base-experiments --filter-param network_arch=$N | “Which pretrained checkpoints (PTMs) does $N support?” Agent reads models/<net>/SKILL.md (the “Base experiments” / “PTM map” section) and references/skill_info.yaml#pretrained_models. | models/<net>/ |
| tao <net> get-job-schema --action train --base-experiment-id $PTM | “Show me $N’s default training spec.” Reads references/spec_template_train.yaml (or the matching template for the requested action). Suggests edits in chat. | models/<net>/ |
| tao <net> get-automl-defaults --base-experiment-id $PTM --action train | “What AutoML hyperparameters does $N expose by default?” Combines the model SKILL’s “AutoML / HPO Notes” with the param generator described in applications/tao-run-automl/SKILL.md. | applications/tao-run-automl, models/<net>/ |
| tao <net> get-automl-param-details --parameters epochs,batch_size | “Explain the AutoML search ranges for epochs and batch_size on $N.” | applications/tao-run-automl, models/<net>/ |
| tao <net> get-gpu-types | “What GPUs are visible to my platform?” For local-docker: nvidia-smi. For Kubernetes: queries the cluster for node GPU shapes. Different per backend. | platform/<chosen>/ |
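If you want to open the same files the agent reads, the spec template path can be derived from the layout listed in the Reference section (models/<network>/references/spec_template_<action>.yaml). The sketch below only builds the path and makes no claim about the file's contents. The function name spec_template_path is hypothetical, and the first argument is whatever the skill directory is called under models/ in your copy of the bank.

```shell
# Sketch: build the path of a model skill's spec template for one action.
# $1 = skill directory name under models/, $2 = action (defaults to train).
spec_template_path() {
  skill_dir="$1"
  action="${2:-train}"
  root="${TAO_SKILL_BANK_PATH:-$HOME/tao-skills-bank}"
  printf '%s/models/%s/references/spec_template_%s.yaml\n' \
    "${root%/}" "$skill_dir" "$action"
}

# Example:
# less "$(spec_template_path tao-train-dino train)"
```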

8. Training and the Experiment Action Chain

tao create-job --kind experiment was the single biggest CLI surface. The agent replaces it with applications/tao-train-single-step for one-off runs and applications/tao-run-automl for sweeps, both composing the model and platform skills. Action chaining via --parent-job-id becomes “now do the next step on the artifacts you just produced.”

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> create-job --kind experiment --action train --workspace-id $WS --base-experiment-id $PTM --train-dataset $DS_TR --eval-dataset $DS_EV --specs @train.yaml | “Train $N on s3://bucket/train, eval against s3://bucket/val, starting from the default PTM, overrides: epochs=10, batch_size=4, num_classes=5.” Agent fetches the schema from the model SKILL, applies your overrides, runs the container, streams logs to your session. | applications/tao-train-single-step, models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action evaluate --parent-job-id $JOB_TR | “Now evaluate the checkpoint we just trained against the val set.” The agent finds the most recent train artifact dir in this session and points evaluate.checkpoint at it. | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action prune --parent-job-id $JOB_TR | “Prune the trained $N model to 50% of channels.” Same pattern: parent-checkpoint resolved from session context. | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action retrain --parent-job-id $JOB_PR | “Retrain the pruned model.” | applications/tao-train-single-step, models/<net>/ |
| tao <net> create-job --kind experiment --action distill --parent-job-id $JOB_TR | “Distill the trained model into a smaller backbone.” | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action quantize --parent-job-id $JOB_TR | “Quantize the trained model.” | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action export --parent-job-id $JOB_TR | “Export the trained model to ONNX.” | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action gen_trt_engine --parent-job-id $JOB_EXP | “Build a TensorRT engine from the ONNX we just exported.” | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action inference --parent-job-id $JOB_TRT | “Run TensorRT inference on the test set.” | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind experiment --action auto_label --parent-job-id $JOB_TR | “Run MAL auto-labeling on the unlabeled dataset using the trained model.” | models/tao-train-mask-auto-label + platform/<backend>/ |
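Where --parent-job-id used to identify the upstream job, the upstream artifact directory now plays that role. The sketch below shows one way to find the most recent run directory yourself. It assumes runs are named <prefix>-<timestamp> under a common root (as in ./runs/dino-<timestamp>/) and that the timestamp format sorts lexicographically, for example YYYYMMDD-HHMMSS. The function name latest_run is hypothetical, and this is not how the agent resolves artifacts internally.

```shell
# Sketch: print the newest run directory for a given prefix.
# Assumes directory names sort chronologically (e.g. dino-20250102-090000).
latest_run() {
  runs_root="$1"
  prefix="$2"
  latest=""
  for dir in "$runs_root"/"$prefix"-*/; do
    [ -d "$dir" ] || continue   # the glob matched nothing
    latest="${dir%/}"           # globs expand in sorted order; keep the last
  done
  # Print the result; return non-zero when no run directory was found.
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Example:
# latest_run ./runs dino
```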

9. AutoML

tao create-job --automl-settings @automl.json routed through the FTMS server. The agent now drives the same AutoMLRunner directly through applications/tao-run-automl.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> get-automl-defaults --base-experiment-id $PTM --action train --output @automl.json | “Show me the AutoML defaults for $N and save them.” | applications/tao-run-automl, models/<net>/ |
| tao <net> create-job --kind experiment --action train --automl-settings @automl.json ... | “Run AutoML on $N with the bayesian algorithm for 20 trials, optimizing val_mAP50.” Algorithm options: bayesian, hyperband, asha, bohb, dehb, pbt, hyperband_es, llm, hybrid, autoresearch. Ask the agent which one fits your budget. | applications/tao-run-automl, models/<net>/ + platform/<backend>/ |
| (implicit in CLI; was hidden behind --automl-settings) | “Use the LLM-guided AutoML algorithm with NVIDIA NIM as the brain.” LLM algorithms read llm_endpoint, llm_model, and llm_api_key from your prompt or from the shell env (AUTOML_LLM_ENDPOINT, AUTOML_LLM_MODEL, AUTOML_LLM_API_KEY / NVIDIA_API_KEY). Export them before launch; never commit them to a file. | applications/tao-run-automl |
| (implicit; set via WANDB_API_KEY env) | “Track this AutoML sweep in Weights & Biases under project ‘tao-hpo’.” export WANDB_API_KEY=... in the shell that launches the agent, then mention ‘track in W&B project tao-hpo’. | applications/tao-run-automl |
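The API-key precedence for the LLM-guided algorithms (AUTOML_LLM_API_KEY first, NVIDIA_API_KEY as fallback) can be checked in your own shell before a sweep. The sketch below is illustrative only. The function name llm_settings_report is hypothetical, it does not test whether the endpoint is actually NVIDIA NIM (the condition under which the fallback applies), and it reports only which variable would supply the key, never the key itself.

```shell
# Sketch: show which environment variable would supply the LLM API key,
# mirroring the documented precedence. Never prints the key value.
llm_settings_report() {
  if [ -n "${AUTOML_LLM_API_KEY:-}" ]; then
    key_source="AUTOML_LLM_API_KEY"
  elif [ -n "${NVIDIA_API_KEY:-}" ]; then
    key_source="NVIDIA_API_KEY"   # fallback, intended for NVIDIA NIM endpoints
  else
    key_source="none"
  fi
  printf 'endpoint=%s model=%s key_from=%s\n' \
    "${AUTOML_LLM_ENDPOINT:-unset}" "${AUTOML_LLM_MODEL:-unset}" "$key_source"
}
```

A result of key_from=none means the agent will have nothing to inherit and you should export one of the two variables before launching it.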

10. Dataset Preparation Jobs

tao create-job --kind dataset (and the data_services notebook commands) handled format conversion, image validation, augmentation, captioning, and similar pre-training prep. These split across two places now: the model skill’s dataset_convert action (when it’s network-specific) and the data/* skill family.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> create-job --kind dataset --action dataset_convert ... | “Convert my raw KITTI data into TFRecords for $N.” The model SKILL’s skill_info.yaml exposes dataset_convert for the networks that need it (DetectNet, FasterRCNN, etc.). | models/<net>/ + platform/<backend>/ |
| tao <net> create-job --kind dataset --action augment ... | “Augment my training images (rotate, brightness, blur).” Use the data-services-style flow if no model-specific augment action exists. | data/* or models/<net>/ |
| tao <net> create-job --kind dataset --action validate_images ... | “Validate my training images and remove the corrupted ones.” | data/* (image validation) |
| tao <net> create-job --kind dataset --action annotation_format_convert ... | “Convert my KITTI annotations to COCO.” | data/* or scripted via the agent |
| tao <net> create-job --kind dataset --action analyze ... | “Analyze the class distribution and image stats of my dataset.” | agent-scripted (no published skill) |
| tao <net> create-job --kind dataset --action auto_label ... | “Auto-label this unlabeled image folder using MAL.” | models/tao-train-mask-auto-label + platform/<backend>/ |
| (no direct CLI verb) | “Mine the nearest neighbors of these query images in my unlabeled pool.” | data/tao-mine-aoi-images (DEFT embed-then-mine workflow) |

11. Monitoring Runs and Downloading Artifacts

Without an API there is no job DB to query. Instead, the agent inspects whatever the platform skill manages: containers on the local Docker daemon, jobs in your SLURM queue, or pods on your Kubernetes cluster.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> list-jobs --filter-param status=Running | “What TAO jobs are currently running?” Local-docker: docker ps. SLURM: squeue. Kubernetes: kubectl get jobs. | platform/<chosen>/ |
| tao <net> get-job-status --job-id $JOB | “Is my training run still going?” The agent recognises the run by name or by the latest container if you don’t name it. | platform/<chosen>/ |
| tao <net> get-job-metadata --job-id $JOB | “Show me the full details of the training run.” Combines docker inspect (or platform equivalent) with the local artifact directory. | platform/<chosen>/ |
| tao <net> get-job-logs --job-id $JOB | “Tail the logs of my training run.” Local: docker logs -f. SLURM: tail -f slurm-<id>.out. Kubernetes: kubectl logs -f. | platform/<chosen>/ |
| tao <net> cancel-job --job-id $JOB | “Cancel the training run.” Local: docker stop. SLURM: scancel. Kubernetes: kubectl delete job. | platform/<chosen>/ |
| tao <net> pause-job / resume-job | “Pause my training run … now resume it.” Practical only on local-docker (docker pause / unpause). Remote backends don’t generally pause GPU jobs. | platform/tao-run-on-local-docker |
| tao <net> list-job-files --job-id $JOB | “What files did the training run produce?” Agent lists the run’s artifact directory (usually ./runs/<name>/ for local-docker or the cloud results path for remote). | (filesystem) |
| tao <net> download-entire-job --job-id $JOB --workdir ./out | “Download all artifacts from my training run to ./out.” Local-docker: just copy the artifact dir. Remote: agent calls the platform skill’s sync verb (aws s3 sync, az storage blob download, …). | (cloud CLI or filesystem) |
| tao <net> download-job-files --job-id $JOB --workdir ./out --best-model true | “Download only the best checkpoint and the spec from my training run.” | (cloud CLI or filesystem) |
| tao <net> update-job | (removed) No central job DB to update. Use file-system tags or your own bookkeeping. | |
| tao <net> delete-job | “Delete the training run.” No central job DB; the agent invokes the platform skill’s native cleanup. Local Docker: docker stop + docker rm (containers run with --rm auto-remove on exit). Kubernetes: sdk.cancel_job runs delete_namespaced_job with foreground propagation; finished jobs also auto-delete via ttl_seconds_after_finished. SLURM: scancel cancels in-flight jobs; finished jobs remain in sacct history. Brev: brev delete removes the instance. Artifacts on the cloud bucket are deleted with the matching cloud CLI (aws s3 rm, az storage blob delete, gsutil rm); local artifacts with rm -rf. | platform/<chosen>/ |
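The per-platform commands in the table above can be summarised as a lookup from verb and platform to a native command. The sketch below only prints the command, it does not run anything. The function name platform_cmd and the platform labels are hypothetical, and the job/<name> form for kubectl logs is an assumption about how the run is addressed on your cluster.

```shell
# Sketch: print (do not run) the native command for a verb on a platform.
# Usage: platform_cmd <list|logs|cancel> <local-docker|slurm|kubernetes> [id]
platform_cmd() {
  verb="$1"
  platform="$2"
  id="${3:-}"
  case "$platform:$verb" in
    local-docker:list)   echo "docker ps" ;;
    slurm:list)          echo "squeue" ;;
    kubernetes:list)     echo "kubectl get jobs" ;;
    local-docker:logs)   echo "docker logs -f $id" ;;
    slurm:logs)          echo "tail -f slurm-$id.out" ;;
    kubernetes:logs)     echo "kubectl logs -f job/$id" ;;   # assumed addressing
    local-docker:cancel) echo "docker stop $id" ;;
    slurm:cancel)        echo "scancel $id" ;;
    kubernetes:cancel)   echo "kubectl delete job $id" ;;
    *) echo "unsupported: $verb on $platform" >&2; return 1 ;;
  esac
}

# Example:
# platform_cmd logs local-docker dino-smoke-01
```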

12. Inference Serving (Microservices)

The FTMS inference-microservice surface is replaced by the model skill’s inference action running in serving mode on the chosen platform.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> start-inference-microservice --network-arch $N --docker-image $IMG --num-gpus 1 --parent-job-id $JOB | “Serve the $N model from the last training run on port 8080.” Agent reads the model SKILL’s inference action, mounts the trained checkpoint, runs the container in detached mode with a port-forward. | models/<net>/ + platform/<backend>/ |
| tao <net> inference-request --microservice-job-id $IMS --input 'a cat' | “Send this prompt to the $N inference server: ‘a cat in a hat’.” Agent constructs the LLM / VLM / diffusion request body that matches the container’s sidecar API and calls it via curl. | (direct HTTP to the container) |
| tao <net> get-inference-microservice-status --microservice-job-id $IMS | “Is the $N inference server still up?” Local: docker inspect. Remote: platform status verb. | platform/<chosen>/ |
| tao <net> stop-inference-microservice --microservice-job-id $IMS | “Stop the $N inference server.” Local: docker stop. Remote: platform stop verb. | platform/<chosen>/ |

13. Publishing Models

Publishing was the FTMS endpoint that pushed a trained checkpoint to NGC. With no API in the loop you push via the NGC CLI or docker push directly. The agent can help construct the commands but does not call them itself.

| CLI command | Agent prompt | Skills used |
|---|---|---|
| tao <net> publish-model --job-id $JOB --display-name 'DINO v1' --description '...' --team-name myteam | “Help me publish my trained $N model to NGC team ‘myteam’ as ‘DINO v1’.” Agent emits the ngc registry model upload-version (or equivalent) command and you run it. | (ngc CLI / docker push) |
| tao <net> remove-published-model --job-id $JOB --team-name myteam | “Help me unpublish $JOB from team ‘myteam’.” | (ngc CLI) |

14. End-to-End Example

Compare a typical object-detection pipeline (train → eval → export → TRT → inference) in both forms.

Before (CLI + REST API)

tao login --ngc-key $NGC_KEY --ngc-org-name $NGC_ORG

WS=$(tao dino create-workspace-aws --name WS --cloud-region us-west-1 \
  --cloud-bucket-name mybucket --access-key $AK --secret-key $SK \
  --output json | jq -r .id)

TR=$(tao dino create-dataset --dataset-type object_detection \
  --dataset-format coco --workspace-id $WS --cloud-file-path /data/train \
  --use-for training --output json | jq -r .id)

EV=$(tao dino create-dataset --dataset-type object_detection \
  --dataset-format coco --workspace-id $WS --cloud-file-path /data/val \
  --use-for evaluation --output json | jq -r .id)

# wait for pull_complete on both...

PTM=$(tao dino list-base-experiments --filter-param network_arch=dino \
  --output json | jq -r '.[0].id')

tao dino get-job-schema --action train --base-experiment-id $PTM \
  --output @train.yaml
# hand-edit train.yaml...

JOB_TR=$(tao dino create-job --kind experiment --action train \
  --encryption-key tlt_encode --workspace-id $WS \
  --base-experiment-id $PTM --train-dataset $TR --eval-dataset $EV \
  --specs @train.yaml --output json | jq -r .id)

# poll get-job-status...

tao dino create-job --kind experiment --action evaluate \
  --parent-job-id $JOB_TR --eval-dataset $EV --specs @eval.yaml

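# capture the export job id so the TensorRT step can chain from it
JOB_EXP=$(tao dino create-job --kind experiment --action export \
  --parent-job-id $JOB_TR --specs @export.yaml --output json | jq -r .id)

# capture the engine job id so inference can chain from it
JOB_TRT=$(tao dino create-job --kind experiment --action gen_trt_engine \
  --parent-job-id $JOB_EXP --specs @trt.yaml --output json | jq -r .id)

tao dino create-job --kind experiment --action inference \
  --parent-job-id $JOB_TRT --specs @infer.yaml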

After (single agent prompt)

“Fine-tune DINO on the COCO data at s3://mybucket/data/train, evaluating against s3://mybucket/data/val. Use the default PTM. Override epochs to 10, batch_size to 4, num_classes to 5. When training finishes, evaluate, export to ONNX, build a TensorRT engine, and run TRT inference on the test images at s3://mybucket/data/test. Run everything on the local Docker daemon.”

The agent will: confirm the GPU host is ready (docker run --runtime=nvidia --gpus all ubuntu nvidia-smi), read models/tao-train-dino/SKILL.md and applications/tao-train-single-step/SKILL.md, build the train spec from the model’s spec template overlaid with your three values, run the DINO container with --gpus all and the S3 creds plumbed in, stream the logs, then chain evaluate → export → gen_trt_engine → inference using the previous step’s artifact path each time. Artifacts land in ./runs/dino-<timestamp>/.
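The chaining described above follows the same parent relationships as the --parent-job-id flags in the CLI version: evaluate and export both read the train output, gen_trt_engine reads the export output, and inference reads the engine. The sketch below prints that plan for a run directory. The per-action subdirectory layout and the function names parent_of and plan_chain are hypothetical and serve only as illustration; the actual artifact layout is decided by the skills.

```shell
# Sketch: which upstream step feeds each action, mirroring the
# --parent-job-id wiring of the CLI script in this section.
parent_of() {
  case "$1" in
    evaluate|export) echo train ;;
    gen_trt_engine)  echo export ;;
    inference)       echo gen_trt_engine ;;
    *) return 1 ;;
  esac
}

# Print, for each requested action, the (hypothetical) directory it reads.
plan_chain() {
  run_dir="$1"
  shift
  for action in "$@"; do
    parent="$(parent_of "$action")" || return 1
    printf '%s reads %s/%s\n' "$action" "$run_dir" "$parent"
  done
}

# Example:
# plan_chain ./runs/dino-01 evaluate export gen_trt_engine inference
```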

15. CLI Commands with No Equivalent

These CLI verbs managed FTMS server state that no longer exists. None of them have a direct prompt equivalent — instead, the underlying need is met by the cloud (aws/azure/gcp CLIs), the local filesystem, the Docker daemon, or simply by not needing the abstraction any more.

| CLI verb | Why it goes away / what to do instead |
|---|---|
| tao <net> create-workspace-{aws,azure,huggingface} | No workspace registry. Export cloud credentials (ACCESS_KEY/SECRET_KEY/AZURE_*/HF_TOKEN) as environment variables in your shell before launching the agent; reference s3://, azure://, hf:// URIs directly. Never write the credentials to a file. |
| tao <net> list-workspaces / get-workspace-metadata / update-workspace / delete-workspace | No workspace registry. Cloud-side operations via aws/az/gsutil. |
| tao <net> backup-workspace / restore-workspace | No workspace DB to back up. Snapshot your local artifact dir and cloud bucket the usual way. |
| tao <net> create-dataset / list-datasets / get-dataset-metadata / update-dataset / delete-dataset | No dataset registry. Cloud paths replace dataset IDs; cloud CLIs replace listing/inspection. |
| tao <net> update-job | No job DB. Tag artifacts on the filesystem or in your own notes. |
| tao <net> delete-job | No central job DB. The agent invokes the platform skill’s native cleanup: Local Docker docker stop / docker rm (or rely on --rm); Kubernetes sdk.cancel_job (deletes the Job and pods, plus ttl_seconds_after_finished auto-cleanup); SLURM scancel (jobs remain in sacct history); Brev brev delete for the instance. Artifacts on cloud buckets are removed via the matching cloud CLI; local artifacts via rm -rf. |

16. Gotchas During Migration

  • Identifiers are paths, not UUIDs. Where you used to script JOB_ID=$(...) and pass it through --parent-job-id, the agent works with artifact directories and container names. Name your runs (“train DINO; tag the run dino-smoke-01”) and the agent will use that name as the artifact-dir suffix and the container name.

  • State lives where the work runs. Local-docker: ./runs/<name>/. SLURM: under $SLURM_SUBMIT_DIR. Kubernetes: on the cloud bucket the job wrote to. Don’t expect a single dashboard — ask the agent to list runs on a given platform.

  • Concurrent training is up to you. The FTMS server serialised jobs through its queue. Now nothing prevents you from kicking off two trainings at once on the same GPU — watch your memory.

  • AutoML still wants TAO_SKILL_BANK_PATH. applications/tao-run-automl reads model skills from the bank; if TAO_SKILL_BANK_PATH isn’t set, it errors with “No skill config found.” The session-start hook sets it automatically; if you script your own runner, export it yourself.

  • AutoML LLM endpoints don’t auto-default. For llm, hybrid, and autoresearch algorithms, the agent will prompt you for llm_endpoint, llm_model, and llm_api_key before launching. Export them as environment variables (AUTOML_LLM_ENDPOINT, AUTOML_LLM_MODEL, AUTOML_LLM_API_KEY) before launching the agent if you want to skip the prompt — do not save them to a file.

  • Remote platforms still need their preflight. SLURM needs an SSH-reachable head node and your SSH agent loaded. Kubernetes needs a kubeconfig context with the GPU Operator installed. Brev needs BREV_API_TOKEN exported in your shell. The agent checks env-var presence before launching and will tell you exactly what is missing.

  • Secrets in env vars, never in files. All TAO secrets — NGC_KEY, ACCESS_KEY, SECRET_KEY, HF_TOKEN, WANDB_API_KEY, AUTOML_LLM_API_KEY, etc. — must be exported in your shell before you launch the agent. Files on disk are leaky: they end up in shell history, in backups, in Spotlight / Windows Search indexes, and occasionally in accidental git add . commits. If you need a way to materialise a secret into the shell at launch time, use your password manager’s CLI (1Password op, Bitwarden bw, Keychain security find-generic-password, AWS SSM / Secrets Manager, HashiCorp Vault). Let the value live only in the shell process — when the shell exits, the secret is gone.

  • Logs are platform-native. tao get-job-logs used to return a single text blob via REST. Now the agent runs docker logs / tail -f slurm-<id>.out / kubectl logs against your local shell — the output stays where it natively lives.

  • Spec files still matter. You can either describe overrides in prose (“epochs=10, batch_size=4”) or point the agent at an existing YAML (“use the spec at ./train.yaml”). Both are merged onto the model SKILL’s default template before running.
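For the concurrent-training gotcha above, a quick look at free GPU memory before starting a second run can save a failed job. The sketch below parses the CSV output of nvidia-smi; the function name gpu_free_mib is hypothetical, and it assumes the three queried fields arrive in the order index, memory.used, memory.total with values in MiB.

```shell
# Sketch: read "index, used, total" lines (MiB) on stdin and print the free
# memory per GPU.
gpu_free_mib() {
  awk -F', *' '{ printf "gpu%d free=%d MiB\n", $1, $3 - $2 }'
}

# Example (on a host with the NVIDIA driver installed):
# nvidia-smi --query-gpu=index,memory.used,memory.total \
#   --format=csv,noheader,nounits | gpu_free_mib
```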

17. Reference

  • Skill bank root: ~/tao-skills-bank/ (cloned by the tao-skills plugin).

  • Per-network skills: models/<network>/SKILL.md + references/skill_info.yaml + references/spec_template_<action>.yaml.

  • Standard fine-tune workflow: applications/tao-train-single-step/SKILL.md.

  • Hyperparameter optimization: applications/tao-run-automl/SKILL.md.

  • Local Docker conventions: platform/tao-run-on-local-docker/SKILL.md.

  • Remote platforms: platform/tao-run-on-slurm/, platform/tao-run-on-kubernetes/, platform/tao-run-on-brev/.

  • Data preparation skills: data/* (tao-mine-aoi-images, tao-analyze-gaps-visual-changenet, tao-route-visual-changenet-samples, tao-generate-image-grounding, tao-generate-referring-expressions, tao-generate-video-reasoning-annotations).

  • Credentials: export NGC_KEY, NGC_ORG, ACCESS_KEY, SECRET_KEY, S3_BUCKET_NAME, S3_ENDPOINT_URL, HF_TOKEN, WANDB_API_KEY, AUTOML_LLM_API_KEY, etc. as environment variables in the shell that launches the agent — never write them to a file on disk. The .env.example shipped with the skill bank documents variable NAMES only; treat it as a checklist, not a template to copy.

  • Image and SDK version pins: ~/tao-skills-bank/versions.yaml.