Inference Microservice#
The TAO Inference Microservice provides a persistent, long-running inference server for deploying trained models. Unlike a one-time batch inference job, which loads the model, runs inference, and then tears down its execution context, an inference microservice keeps the model loaded in memory, enabling fast, repeated inference requests without the overhead of reloading the model.
Overview#
An inference microservice is a containerized model server that loads your trained model once, keeps it in memory, and runs as a persistent Kubernetes StatefulSet. The microservice accepts multiple inference requests without reloading the model, provides fast response times for real-time inference, and offers graceful status handling during initialization and model loading.
How an Inference Microservice Works#
The inference microservice operates as a long-running server that:
Loads your trained model once and keeps it in memory
Runs as a persistent Kubernetes StatefulSet
Accepts multiple inference requests without reloading the model
Provides fast response times for real-time inference
Handles initialization and model loading gracefully with clear status messages
When to Use Inference Microservices#
Use inference microservices when you need to:
Run multiple inference requests on the same model
Minimize latency by avoiding repeated model loading
Test different prompts or inputs interactively
Build applications that require real-time model predictions
Deploy models for production inference workloads
When to Use One-Time Inference Jobs#
Use one-time inference jobs when you need to:
Run inference once on a fixed batch of data
Evaluate model performance on a test set
Process a large dataset and save all results to cloud storage
Save computational resources after inference completes
Architecture#
The inference microservice architecture consists of three main components: the TAO API Gateway, which routes requests; a Kubernetes StatefulSet, which manages the container lifecycle; and the inference microservice container, which hosts the Flask server and the loaded model.
Key Features#
The inference microservice provides the following key features:
Persistent model loading: Loads models once and keeps them in memory for fast inference
Graceful status handling: Server starts immediately and provides clear status messages during initialization and model loading
Cloud storage integration: Automatically downloads media files from cloud storage (AWS, Azure) for inference
Flexible input formats: Each model defines its own input parameter format (media, images, text, prompts, etc.)
Health monitoring: Exposes health check and status endpoints for monitoring service readiness
Automatic resource management: StatefulSet deployment ensures proper resource allocation and cleanup
Getting Started#
Prerequisites#
Before using inference microservices:
Completed TAO API setup: Follow the instructions in the TAO API Setup guide.
Trained model: Complete a training job with a trained model checkpoint.
Cloud workspace: Configure cloud storage for model artifacts (see Workspaces).
Authentication: Ensure that you have valid NGC credentials and an API token.
Basic Workflow#
The typical inference microservice workflow consists of four steps:
Start the inference microservice (creates StatefulSet, loads model).
Run inference requests (send inputs, receive predictions).
Check status (monitor health and readiness).
Stop the service (cleanup resources).
# 1. Start the inference microservice
curl -X POST $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"parent_id": "'"$TRAIN_JOB_ID"'"}'
# 2. Run inference (once model is ready)
curl -X POST $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$JOB_ID/inference_microservice/inference \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"prompt": "What is in this image?", "media": "s3://bucket/image.jpg"}'
# 3. Check status
curl -X GET $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$JOB_ID/inference_microservice/status \
-H "Authorization: Bearer $TOKEN"
# 4. Stop the service
curl -X POST $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$JOB_ID/inference_microservice/stop \
-H "Authorization: Bearer $TOKEN"
API Endpoints#
The inference microservice provides four main endpoints:
1. Start Inference Microservice#
Endpoint: POST /api/v1/orgs/{org}/experiments/{experiment_id}/inference_microservice/start
Purpose: Creates a new persistent inference server in a Kubernetes StatefulSet.
Request body:
{
"parent_id": "train_job_id"
}
Parameters:
parent_id (required): The job ID of the completed training job whose model you want to deploy.
Response:
{
"job_id": "uuid-for-inference-microservice",
"status": "pending",
"message": "Inference microservice starting"
}
Example:
INFERENCE_MS_JOB_ID=$(curl -s -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"parent_id": "'"$TRAIN_JOB_ID"'"
}' | jq -r '.job_id')
echo "Inference Microservice Job ID: $INFERENCE_MS_JOB_ID"
What happens:
Kubernetes creates a StatefulSet for the inference microservice.
The container starts and launches the Flask server.
The microservice downloads model files from cloud storage in the background.
The microservice loads the model into memory in the background.
The server becomes ready to accept inference requests.
Note
The server starts immediately but may not be ready for inference requests. Use the status endpoint to check when the model is loaded and ready.
2. Run Inference#
Endpoint: POST /api/v1/orgs/{org}/experiments/{experiment_id}/jobs/{job_id}/inference_microservice/inference
Purpose: Send inference requests to the running model server.
Request body:
The request body format is model-specific. Each model defines its own parameter names and formats.
Parameters:
The parameters vary by model. Common parameter patterns include:
media, images, files: Input files (images or videos)
text, prompt, question: Text input
conv_mode, mode: Conversation or inference mode
temperature, top_p, max_tokens: Generation parameters
confidence_threshold, batch_size: Model-specific settings
Examples:
Vision-language models (VILA, Cosmos-RL):
{
"media": ["s3://bucket/video.mp4", "s3://bucket/image.jpg"],
"text": "Describe what you see in these media files",
"conv_mode": "auto"
}
Image classification models:
{
"images": ["s3://bucket/image1.jpg", "s3://bucket/image2.jpg"],
"batch_size": 1
}
Object detection models:
{
"images": ["s3://bucket/image.jpg"],
"confidence_threshold": 0.5
}
Response:
Success (HTTP 200):
{
"status": "completed",
"results": {
"response": "Model prediction or generated text",
"confidence": 0.95,
"additional_metadata": "..."
},
"job_id": "inference-microservice-job-id",
"message": "Inference completed successfully"
}
Model loading (HTTP 202):
{
"job_id": "inference-microservice-job-id",
"status": "loading",
"message": "Model is currently loading, please wait and try again"
}
Server initializing (HTTP 202):
{
"job_id": "inference-microservice-job-id",
"status": "initializing",
"message": "Server is initializing (downloading files, setting up), please wait"
}
Examples:
VLM inference example (Cosmos-RL):
curl -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"media": "s3://my-bucket/videos/sample.mp4",
"text": "What action is happening in this video?",
"conv_mode": "auto"
}' | jq '.'
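Object detection inference example (illustrative; the same endpoint is used, and the bucket path is a placeholder):
curl -X POST \
  $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "images": ["s3://my-bucket/images/street.jpg"],
    "confidence_threshold": 0.5
  }' | jq '.'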
Note
The microservice automatically downloads media files (images and videos) from your cloud storage bucket. Ensure files are accessible with the workspace credentials.
3. Check Status#
Endpoint: GET /api/v1/orgs/{org}/experiments/{experiment_id}/jobs/{job_id}/inference_microservice/status
Purpose: Monitor service health and model loading status.
Response:
{
"job_id": "inference-microservice-job-id",
"model_loaded": true,
"model_loading": false,
"server_initializing": false,
"initialization_error": null,
"model_load_error": null,
"status": "healthy",
"pod_name": "statefulset-pod-name",
"pod_status": "Running"
}
Status fields:
model_loaded (Boolean): Whether the model is fully loaded and ready
model_loading (Boolean): Whether model loading is in progress
server_initializing (Boolean): Whether server initialization is in progress
initialization_error (string): Error message if initialization failed
model_load_error (string): Error message if model loading failed
status (string): Overall service health status
pod_name (string): Kubernetes pod name
pod_status (string): Kubernetes pod status
Examples:
Check if model is ready:
curl -s -X GET \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
-H "Authorization: Bearer $TOKEN" | jq '.'
Poll for readiness:
# Wait until model is loaded
while true; do
STATUS=$(curl -s -X GET \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
-H "Authorization: Bearer $TOKEN")
MODEL_LOADED=$(echo $STATUS | jq -r '.model_loaded')
if [ "$MODEL_LOADED" == "true" ]; then
echo "Model is ready!"
break
fi
echo "Model loading... waiting 10 seconds"
sleep 10
done
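If you prefer a bounded wait, the same loop can include a timeout guard. The following is a sketch; the 600-second limit is an arbitrary example:
# Wait until the model is loaded, but give up after MAX_WAIT seconds
MAX_WAIT=600
ELAPSED=0
while [ $ELAPSED -lt $MAX_WAIT ]; do
  MODEL_LOADED=$(curl -s -X GET \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
    -H "Authorization: Bearer $TOKEN" | jq -r '.model_loaded')
  if [ "$MODEL_LOADED" == "true" ]; then
    echo "Model is ready!"
    break
  fi
  echo "Model not ready yet... waiting 10 seconds"
  sleep 10
  ELAPSED=$((ELAPSED + 10))
done
if [ "$MODEL_LOADED" != "true" ]; then
  echo "Timed out waiting for the model to load"
fi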
4. Stop Service#
Endpoint: POST /api/v1/orgs/{org}/experiments/{experiment_id}/jobs/{job_id}/inference_microservice/stop
Purpose: Gracefully shut down the inference microservice and clean up resources.
Response:
{
"status": "stopped",
"message": "Inference microservice stopped successfully",
"job_id": "inference-microservice-job-id"
}
Examples:
Stop the inference microservice:
curl -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/stop \
-H "Authorization: Bearer $TOKEN"
What happens:
Kubernetes deletes the StatefulSet.
The container terminates gracefully.
The microservice unloads the model from memory.
Kubernetes cleans up the resources.
Warning
Once stopped, you must create a new inference microservice to run more inference requests. The same job ID cannot be reused.
Status Codes and Error Handling#
Understanding HTTP Status Codes#
The inference microservice uses standard HTTP status codes to communicate the service state:
200 OK: Request successful, model is ready and inference completed
202 Accepted: Request received, but service is still initializing or model is loading; retry after a short wait
503 Service Unavailable: Service is not ready (initialization/loading failed) or encountered an error
404 Not Found: Job ID not found or service not started
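When scripting against the API, you can capture the HTTP status code with curl and branch on it. The following is a minimal sketch; the request body is a placeholder and must match your model's input format:
# Send an inference request; capture the response body and HTTP status code separately
RESPONSE=$(curl -s -w "\n%{http_code}" -X POST \
  $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"media": "s3://my-bucket/images/img1.jpg", "text": "Describe this image"}')
HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
BODY=$(echo "$RESPONSE" | sed '$d')

case "$HTTP_CODE" in
  200) echo "$BODY" | jq '.results' ;;                             # inference completed
  202) echo "Not ready yet: $(echo "$BODY" | jq -r '.message')" ;;  # retry after a short wait
  *)   echo "Request failed with HTTP $HTTP_CODE: $BODY" ;;         # 404, 503, or other error
esac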
Server Lifecycle States#
The inference microservice goes through several states during its lifecycle:
Server initializing (HTTP 202)
The Flask server has started but is still:
Downloading model files from cloud storage
Setting up cloud storage connections
Preparing the environment
Response:
{ "job_id": "uuid", "status": "initializing", "message": "Server is initializing (downloading files, setting up), please wait" }
Action: Retry after 5-10 seconds.
Initialization failed (HTTP 503)
Server initialization encountered an error.
Response:
{ "job_id": "uuid", "status": "error", "error": "Server initialization failed: network timeout downloading model files" }
Action: Check cloud storage credentials and network connectivity. Stop and restart the service.
Model loading (HTTP 202)
Initialization complete. Model is being loaded into memory.
Response:
{ "job_id": "uuid", "status": "loading", "message": "Model is currently loading, please wait and try again" }
Action: Retry every 10-15 seconds. Model loading can take several minutes for large models.
Model load failed (HTTP 503)
Model loading encountered an error.
Response:
{ "job_id": "uuid", "status": "error", "error": "Model failed to load: CUDA out of memory" }
Action: Check model requirements and available GPU resources. Stop and restart with appropriate resources.
Model not started (HTTP 503)
Model loading has not been initiated.
Response:
{ "job_id": "uuid", "status": "not_ready", "message": "Model not loaded yet, please try again later" }
Action: Wait for model loading to start, then retry.
Model ready (HTTP 200)
Model is loaded and ready for inference.
Response:
{ "status": "completed", "results": { "response": "Model prediction", "metadata": "..." }, "job_id": "uuid", "message": "Inference completed successfully" }
Action: Continue sending inference requests.
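Putting these states together, a client can retry on HTTP 202 and abort on other errors. The following bash sketch uses the environment variables from the examples in this guide; the payload passed at the end is a placeholder:
# Retry inference while the service reports 202 (initializing or loading); abort on other errors
send_inference_with_retry() {
  local payload="$1"
  local attempts=0
  while [ $attempts -lt 30 ]; do
    HTTP_CODE=$(curl -s -o /tmp/inference_body.json -w "%{http_code}" -X POST \
      $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d "$payload")
    if [ "$HTTP_CODE" = "200" ]; then
      jq '.' /tmp/inference_body.json          # inference result
      return 0
    elif [ "$HTTP_CODE" = "202" ]; then
      echo "Service not ready ($(jq -r '.status' /tmp/inference_body.json)); retrying in 15 seconds..."
      sleep 15
    else
      echo "Request failed with HTTP $HTTP_CODE:"
      jq '.' /tmp/inference_body.json
      return 1
    fi
    attempts=$((attempts + 1))
  done
  echo "Timed out waiting for the service to become ready"
  return 1
}

send_inference_with_retry '{"media": "s3://my-bucket/videos/sample.mp4", "text": "Describe this video", "conv_mode": "auto"}'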
Complete Workflow Example#
This section provides a complete end-to-end example using the Cosmos-RL VLM model.
Setup#
# Environment setup
export BASE_URL="http://<your-api-host>:<port>/api/v1"
export NGC_ORG="your-org-name"
export NGC_API_KEY="your-ngc-api-key"
# Authenticate
TOKEN=$(curl -s -X POST $BASE_URL/login \
-H "Content-Type: application/json" \
-d '{
"ngc_key": "'"$NGC_API_KEY"'",
"ngc_org_name": "'"$NGC_ORG"'"
}' | jq -r '.token')
echo "Authenticated with token"
# Set experiment and training job IDs (from previous training)
EXPERIMENT_ID="your-experiment-id"
TRAIN_JOB_ID="your-completed-training-job-id"
Step 1: Start Inference Microservice#
echo "Starting inference microservice..."
INFERENCE_MS_JOB_ID=$(curl -s -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"parent_id": "'"$TRAIN_JOB_ID"'"
}' | jq -r '.job_id')
echo "Inference Microservice Job ID: $INFERENCE_MS_JOB_ID"
Step 2: Wait for Model to Load#
echo "Waiting for model to load..."
while true; do
STATUS=$(curl -s -X GET \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
-H "Authorization: Bearer $TOKEN")
MODEL_LOADED=$(echo $STATUS | jq -r '.model_loaded')
MODEL_LOADING=$(echo $STATUS | jq -r '.model_loading')
SERVER_INIT=$(echo $STATUS | jq -r '.server_initializing')
INIT_ERROR=$(echo $STATUS | jq -r '.initialization_error')
LOAD_ERROR=$(echo $STATUS | jq -r '.model_load_error')
# Check for errors
if [ "$INIT_ERROR" != "null" ]; then
echo "Initialization error: $INIT_ERROR"
exit 1
fi
if [ "$LOAD_ERROR" != "null" ]; then
echo "Model load error: $LOAD_ERROR"
exit 1
fi
# Check if ready
if [ "$MODEL_LOADED" == "true" ]; then
echo "Model is ready!"
break
fi
# Status message
if [ "$SERVER_INIT" == "true" ]; then
echo "Server initializing (downloading files)... waiting 10s"
elif [ "$MODEL_LOADING" == "true" ]; then
echo "Model loading... waiting 10s"
else
echo "Waiting for service... 10s"
fi
sleep 10
done
Step 3: Run Inference Requests#
echo "Running inference..."
# Single video inference
RESULT=$(curl -s -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"media": "s3://my-bucket/videos/sample.mp4",
"text": "What is happening in this video?",
"conv_mode": "auto"
}')
echo "Inference result:"
echo $RESULT | jq '.'
# Multiple images inference
RESULT2=$(curl -s -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"media": [
"s3://my-bucket/images/img1.jpg",
"s3://my-bucket/images/img2.jpg"
],
"text": "Compare these two images",
"conv_mode": "auto"
}')
echo "Second inference result:"
echo $RESULT2 | jq '.'
Step 4: Stop the Service#
echo "Stopping inference microservice..."
curl -s -X POST \
$BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/stop \
-H "Authorization: Bearer $TOKEN" | jq '.'
echo "Inference microservice stopped"
Troubleshooting#
Common Issues and Solutions#
Service never becomes ready#
Symptoms: Status endpoint shows server_initializing: true for a long time.
Possible causes:
Large model files are taking time to download.
Network connectivity to cloud storage is slow or unreliable.
Cloud storage credentials are incorrect.
Solutions:
Check cloud storage credentials in workspace configuration.
Verify network connectivity to cloud storage.
Check Kubernetes pod logs: kubectl logs <pod-name>.
Ensure sufficient bandwidth for model file downloads.
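The status endpoint reports the pod name in its pod_name field. If you need to find the pod manually, a quick, illustrative approach (namespace and naming depend on your deployment) is:
# Locate the inference microservice pod (name and namespace depend on your deployment)
kubectl get pods -A | grep -i inference
# Follow the logs of the pod reported by the status endpoint
kubectl logs -f <pod-name>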
Model loading fails#
Symptoms: Status endpoint shows model_load_error with error message.
Possible causes:
CUDA out of memory: GPU memory is insufficient for the model.
Solution: Use GPUs with more memory, or reduce model size.
Model file not found: Model checkpoint is missing.
Solution: Verify that the training job completed successfully and the checkpoint was saved.
Invalid model format: Model checkpoint is corrupted.
Solution: Retrain the model or check checkpoint integrity.
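For the CUDA out-of-memory case, you can confirm GPU memory pressure from inside the pod, assuming nvidia-smi is available in the container image:
# Inspect GPU memory usage inside the inference microservice pod
kubectl exec <pod-name> -- nvidia-smi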
Inference request timeouts#
Symptoms: Requests repeatedly return HTTP 202 and never succeed.
Possible causes:
Model is still loading (for very large models).
Service crashed after starting.
Resource constraints (CPU or memory) prevent the model from loading or starting.
Solutions:
Increase retry timeout and wait longer.
Check status endpoint for errors.
Check Kubernetes pod status: kubectl get pods.
Review pod logs for crash information.
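Pod events often reveal crash loops or resource problems that logs alone do not show; for example:
# Show restart counts and recent events (for example, OOMKilled or image pull errors)
kubectl get pods
kubectl describe pod <pod-name>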
Inference returns unexpected results#
Symptoms: Model predictions are incorrect or nonsensical.
Possible causes:
Input format is incorrect.
Model is not fully trained.
Parameters are incompatible with this model.
Solutions:
Verify that input format matches model expectations.
Check model training metrics and convergence.
Review model-specific documentation for parameter requirements.
Cannot stop service#
Symptoms: Stop endpoint returns an error, or service continues running.
Solutions:
Check Kubernetes StatefulSet status: kubectl get statefulsets.
Manually delete the StatefulSet if needed: kubectl delete statefulset <name>.
Check for pending operations on the job.