Running TAO via Fine-Tuning Micro-Services (FTMS)#
Fine-Tuning Micro-Services (FTMS) provides a comprehensive API-driven interface for training, optimizing, and deploying deep learning models. The service features the new TAO API v2 with a unified job-centric architecture, along with an integrated Python SDK and CLI for seamless access to all TAO functionality.
Navigate to any section below to get started:
Fine-Tuning Micro-Services Overview#
Get started with a high-level introduction to Fine-Tuning Micro-Services (FTMS), including the new TAO API v2 architecture, unified jobs interface, and enhanced authentication. Learn about the three access methods: REST API, Python SDK, and CLI.
Microservices Prerequisites and Setup#
Follow step-by-step guidance to configure prerequisites and prepare your environment for deployment across various platforms.
Kubernetes Deployment#
Deploy the microservice using Kubernetes with detailed explanations of configurable values in the Helm chart.
Docker Compose Deployment#
Deploy the microservice using Docker Compose with configurable settings for simplified deployment.
Air-gapped Environment Deployment#
Deploy TAO Toolkit in secure, isolated environments without internet connectivity using pre-downloaded models and SeaweedFS storage.
Python SDK and CLI#
Interact with the TAO API v2 using two integrated tools:
- Python SDK
The nvidia-tao package provides programmatic access to all TAO operations through a TaoClient class. Features include:
Environment variable authentication
Unified job creation and management
Workspace and dataset operations
Inference microservice control
Comprehensive error handling
- Command-Line Interface (CLI)
The tao command provides terminal access to all TAO functionality, organized by network architecture. Features include:
36+ supported network architectures (classification_pyt, rtdetr, mask2former, etc.)
Consistent command structure across all networks
Interactive authentication with tao login
Job management, workspace operations, and dataset handling
Inference microservice deployment
Installation
pip install nvidia-tao
REST API Overview and Examples#
Access comprehensive documentation of the TAO API v2 endpoints, including request/response formats and complete workflow examples. The v2 API features:
Unified Jobs Endpoint - Single endpoint for all experiment and dataset operations
Environment Variable Authentication - JWT token-based auth with CI/CD integration
Resource-Specific Metadata - Dedicated endpoints for workspaces, datasets, and jobs
Enhanced Job Control - Pause, resume, cancel, and delete operations
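One way to picture the unified jobs design is that every control operation above becomes a verb on a single jobs resource. The sketch below models that idea in plain Python; the action-suffix paths (e.g. :pause) are illustrative assumptions, not the authoritative endpoint reference, which lives in the API Reference section.

```python
# Illustrative sketch only: the real endpoint paths are defined in the
# TAO API v2 OpenAPI spec; the ":pause"-style action suffixes are assumptions.
def job_control_request(job_id: str, action: str) -> tuple:
    """Map a job-control action to an HTTP method and a v2 jobs path."""
    actions = {
        "pause":  ("POST",   f"/api/v2/jobs/{job_id}:pause"),
        "resume": ("POST",   f"/api/v2/jobs/{job_id}:resume"),
        "cancel": ("POST",   f"/api/v2/jobs/{job_id}:cancel"),
        "delete": ("DELETE", f"/api/v2/jobs/{job_id}"),
    }
    if action not in actions:
        raise ValueError(f"unsupported job action: {action}")
    return actions[action]

method, path = job_control_request("abc123", "pause")
```

The point of the sketch is the consolidation: one resource, four verbs, rather than a separate endpoint family per operation.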
Inference Microservice#
Deploy trained models as persistent inference servers for fast, repeated inference without model reloading overhead. Features include:
Start/stop microservices via API, SDK, or CLI
Support for multiple inference modes (base64, cloud media paths, VLM with prompts)
Scalable GPU allocation
Real-time status monitoring
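The three inference modes listed above differ mainly in what the request carries. The following sketch shows one plausible payload shape per mode; the field names ("mode", "media", "prompt") are illustrative assumptions, not the documented request schema.

```python
# Hypothetical payload builder for the three inference modes; field names
# are assumptions chosen to mirror the modes described above.
import base64

def build_inference_payload(mode: str, media: str, prompt: str = None) -> dict:
    if mode == "base64":
        # Inline the media bytes directly in the request body.
        return {"mode": "base64",
                "media": base64.b64encode(media.encode()).decode()}
    if mode == "cloud":
        # Reference media already in cloud storage by path.
        return {"mode": "cloud", "media": media}
    if mode == "vlm":
        # Vision-language models additionally take a text prompt.
        if prompt is None:
            raise ValueError("VLM mode requires a text prompt")
        return {"mode": "vlm", "media": media, "prompt": prompt}
    raise ValueError(f"unknown inference mode: {mode}")
```

Because the microservice stays resident, repeated calls with payloads like these skip the model-reloading overhead entirely.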
AutoML#
Discover the supported AutoML algorithms (Bayesian, Hyperband), their configuration, and how to use them for automated hyperparameter optimization. AutoML is fully integrated with the v2 API and can be enabled during job creation.
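Since AutoML is enabled at job-creation time, a natural way to think about it is as an extra section of the job spec alongside the training hyperparameters. The key names below (enabled, algorithm, parameters) are assumptions for illustration; only the two algorithm choices come from the documentation.

```python
# Hypothetical AutoML-enabled job spec; key names are illustrative
# assumptions. Bayesian and Hyperband are the documented algorithms.
def automl_job_spec(algorithm: str = "bayesian") -> dict:
    if algorithm not in ("bayesian", "hyperband"):
        raise ValueError("TAO AutoML supports Bayesian and Hyperband search")
    return {
        "specs": {"epochs": 100},
        "automl": {
            "enabled": True,
            "algorithm": algorithm,
            # Hyperparameters the search is allowed to vary:
            "parameters": ["learning_rate", "batch_size"],
        },
    }
```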
API Reference#
Access comprehensive OpenAPI specifications for TAO API v2 and previous versions. The v2 API documentation includes:
Interactive Swagger UI
ReDoc documentation
OpenAPI specs (JSON/YAML)
Complete endpoint reference
Request/response schemas
Example notebooks
After deployment, access the v2 API documentation at:
Swagger UI: /api/v2/swagger
ReDoc: /api/v2/redoc
OpenAPI Specs: /api/v2/openapi.json
Quick Start Guide#
1. Install the SDK/CLI
pip install nvidia-tao
2. Authenticate
# Using CLI
tao login --ngc-key YOUR_NGC_KEY --ngc-org-name YOUR_ORG
# Or using Python SDK
from tao_sdk.client import TaoClient
client = TaoClient()
client.login(ngc_key="YOUR_NGC_KEY", ngc_org_name="YOUR_ORG")
3. Create Your First Job
# Using CLI
tao classification_pyt create-job \
--kind experiment \
--name "my_first_job" \
--encryption-key "my_key" \
--workspace "workspace_id" \
--action train \
--specs '{"epochs": 100, "learning_rate": 0.001}' \
--train-datasets '["dataset_id"]'
# Using Python SDK
job = client.create_job(
kind="experiment",
name="my_first_job",
network_arch="classification_pyt",
encryption_key="my_key",
workspace="workspace_id",
action="train",
specs={"epochs": 100, "learning_rate": 0.001},
train_datasets=["dataset_id"]
)
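Both the CLI and SDK calls above go through the unified v2 jobs endpoint, so the single-step creation reduces to one JSON body carrying the same fields. The sketch below assembles that body; the exact endpoint path and header format are assumptions, so treat the API Reference as authoritative.

```python
# Sketch of single-step job creation as a raw JSON body, mirroring the
# CLI/SDK fields above. Endpoint path and auth-header format are assumptions.
import json

def create_job_body() -> str:
    body = {
        "kind": "experiment",
        "name": "my_first_job",
        "network_arch": "classification_pyt",
        "encryption_key": "my_key",
        "workspace": "workspace_id",
        "action": "train",
        "specs": {"epochs": 100, "learning_rate": 0.001},
        "train_datasets": ["dataset_id"],
    }
    return json.dumps(body)

# POST this body to the unified jobs endpoint (e.g. /api/v2/jobs), sending
# the JWT obtained at login in the request headers.
```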
4. Monitor Progress
# CLI
tao classification_pyt get-job-status --job-id "job_id"
# Python SDK
status = client.get_job_status("job_id")
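For unattended runs, the status call above is typically wrapped in a polling loop. The sketch below uses a stub in place of TaoClient so it runs offline, and the status strings ("Running", "Done", etc.) are illustrative assumptions; the real client returns richer job metadata.

```python
# Generic polling pattern around get_job_status. StubClient stands in for
# TaoClient so the sketch is runnable offline; status strings are assumptions.
import time

class StubClient:
    """Fake client that reports "Done" on the third status check."""
    def __init__(self):
        self._calls = 0

    def get_job_status(self, job_id: str) -> str:
        self._calls += 1
        return "Done" if self._calls >= 3 else "Running"

def wait_for_job(client, job_id: str, poll_seconds: float = 0.01) -> str:
    """Poll until the job reaches a terminal state, then return it."""
    while True:
        status = client.get_job_status(job_id)
        if status in ("Done", "Error", "Canceled"):
            return status
        time.sleep(poll_seconds)

final_status = wait_for_job(StubClient(), "job_id")
```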
Migration from v1 to v2#
If you’re using TAO API v1, the v2 API offers significant improvements:
Key Changes
Unified /jobs endpoint replaces separate /experiments and dataset action endpoints
Single-step job creation instead of a two-step process
Environment variable authentication instead of file-based configuration
Resource-specific metadata endpoints
Enhanced job control operations
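The key changes above amount to an endpoint consolidation: several v1 resource- and action-specific endpoints collapse into the single v2 jobs endpoint. The mapping below sketches that idea; the precise v1 paths are assumptions reconstructed from the description, not the published v1 spec.

```python
# Rough picture of the v1 -> v2 consolidation. The v1 paths here are
# assumptions based on the migration notes above.
V1_TO_V2 = {
    "/api/v1/experiments": "/api/v2/jobs",            # experiment actions
    "/api/v1/datasets/{id}/actions": "/api/v2/jobs",  # dataset actions
}

def v2_endpoint(v1_path: str) -> str:
    """Return the consolidated v2 endpoint for a known v1 path."""
    return V1_TO_V2.get(v1_path, v1_path)
```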
Migration Benefits
Simpler API structure with fewer endpoints
Better authentication for CI/CD pipelines
Comprehensive job management
Inference microservice support
Improved error handling
See the individual documentation sections for detailed migration guidance and examples.