Inference Microservice#

The TAO Inference Microservice provides a persistent, long-running inference server for deploying trained models. Unlike an atomic batched inference job that loads the model, runs inference, and terminates the execution context, inference microservices keep the model loaded in memory, enabling fast, repeated inference requests without the overhead of model reloading.

Overview#

An inference microservice is a containerized model server that loads your trained model once, keeps it in memory, and runs as a persistent Kubernetes StatefulSet. The microservice accepts multiple inference requests without reloading the model, provides fast response times for real-time inference, and offers graceful status handling during initialization and model loading.

How an Inference Microservice Works#

The inference microservice operates as a long-running server that:

  • Loads your trained model once and keeps it in memory

  • Runs as a persistent Kubernetes StatefulSet

  • Accepts multiple inference requests without reloading the model

  • Provides fast response times for real-time inference

  • Handles initialization and model loading gracefully with clear status messages

When to Use Inference Microservices#

Use inference microservices when you need to:

  • Run multiple inference requests on the same model

  • Minimize latency by avoiding repeated model loading

  • Test different prompts or inputs interactively

  • Build applications that require real-time model predictions

  • Deploy models for production inference workloads

When to Use One-Time Inference Jobs#

Use one-time inference jobs when you need to:

  • Run inference once on a fixed batch of data

  • Evaluate model performance on a test set

  • Process a large dataset and save all results to cloud storage

  • Save computational resources after inference completes

Architecture#

The inference microservice architecture consists of three main components: the TAO API Gateway that routes requests, a Kubernetes StatefulSet that manages the container lifecycle, and the inference microservice container that hosts the Flask server and loaded model. The following diagram illustrates how these components interact:

Inference Microservice Architecture Diagram
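
If you have kubectl access to the cluster, you can observe these components directly. The following sketch lists the StatefulSet and pod created for a running microservice and tails the Flask server logs; resource names vary by deployment, so substitute the names reported by your cluster.

# List the StatefulSets and pods backing inference microservices (names are cluster-specific)
kubectl get statefulsets
kubectl get pods

# Tail the Flask server logs of a specific pod (substitute a pod name from the output above)
kubectl logs <pod-name> --follow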

Key Features#

The inference microservice provides the following key features:

  • Persistent model loading: Loads models once and keeps them in memory for fast inference

  • Graceful status handling: Server starts immediately and provides clear status messages during initialization and model loading

  • Cloud storage integration: Automatically downloads media files from cloud storage (AWS, Azure) for inference

  • Flexible input formats: Each model defines its own input parameter format (media, images, text, prompts, etc.)

  • Health monitoring: Exposes health check and status endpoints for monitoring service readiness

  • Automatic resource management: StatefulSet deployment ensures proper resource allocation and cleanup

Getting Started#

Prerequisites#

Before using inference microservices:

  1. Completed TAO API setup: Follow the instructions in the TAO API Setup guide.

  2. Trained model: Complete a training job that produces a model checkpoint.

  3. Cloud workspace: Configure cloud storage for model artifacts (see Workspaces).

  4. Authentication: Ensure that you have valid NGC credentials and an API token.

Basic Workflow#

The typical inference microservice workflow consists of four steps:

  1. Start the inference microservice (creates StatefulSet, loads model).

  2. Run inference requests (send inputs, receive predictions).

  3. Check status (monitor health and readiness).

  4. Stop the service (cleanup resources).

# 1. Start the inference microservice and capture the returned job ID
JOB_ID=$(curl -s -X POST $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"parent_id": "'"$TRAIN_JOB_ID"'"}' | jq -r '.job_id')

# 2. Run inference (once model is ready)
curl -X POST $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$JOB_ID/inference_microservice/inference \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"prompt": "What is in this image?", "media": "s3://bucket/image.jpg"}'

# 3. Check status
curl -X GET $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$JOB_ID/inference_microservice/status \
    -H "Authorization: Bearer $TOKEN"

# 4. Stop the service
curl -X POST $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$JOB_ID/inference_microservice/stop \
    -H "Authorization: Bearer $TOKEN"

API Endpoints#

The inference microservice provides four main endpoints:

1. Start Inference Microservice#

Endpoint: POST /api/v1/orgs/{org}/experiments/{experiment_id}/inference_microservice/start

Purpose: Creates a new persistent inference server in a Kubernetes StatefulSet.

Request body:

{
    "parent_id": "train_job_id"
}

Parameters:

  • parent_id (required): The job ID of the completed training job whose model you want to deploy

Response:

{
    "job_id": "uuid-for-inference-microservice",
    "status": "pending",
    "message": "Inference microservice starting"
}

Example:

INFERENCE_MS_JOB_ID=$(curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "parent_id": "'"$TRAIN_JOB_ID"'"
    }' | jq -r '.job_id')

echo "Inference Microservice Job ID: $INFERENCE_MS_JOB_ID"

What happens:

  1. Kubernetes creates a StatefulSet for the inference microservice.

  2. The container starts and launches the Flask server.

  3. The microservice downloads model files from cloud storage in the background.

  4. The microservice loads the model into memory in the background.

  5. The server becomes ready to accept inference requests.

Note

The server starts immediately but may not be ready for inference requests. Use the status endpoint to check when the model is loaded and ready.

2. Run Inference#

Endpoint: POST /api/v1/orgs/{org}/experiments/{experiment_id}/jobs/{job_id}/inference_microservice/inference

Purpose: Send inference requests to the running model server.

Request body:

The request body format is model-specific. Each model defines its own parameter names and formats.

Parameters:

The parameters vary by model. Common parameter patterns include:

  • media, images, files: Input files (images or videos)

  • text, prompt, question: Text input

  • conv_mode, mode: Conversation or inference mode

  • temperature, top_p, max_tokens: Generation parameters

  • confidence_threshold, batch_size: Model-specific settings

Examples:

Vision-language models (VILA, Cosmos-RL):

{
    "media": ["s3://bucket/video.mp4", "s3://bucket/image.jpg"],
    "text": "Describe what you see in these media files",
    "conv_mode": "auto"
}

Image classification models:

{
    "images": ["s3://bucket/image1.jpg", "s3://bucket/image2.jpg"],
    "batch_size": 1
}

Object detection models:

{
    "images": ["s3://bucket/image.jpg"],
    "confidence_threshold": 0.5
}

Response:

Success (HTTP 200):

{
    "status": "completed",
    "results": {
        "response": "Model prediction or generated text",
        "confidence": 0.95,
        "additional_metadata": "..."
    },
    "job_id": "inference-microservice-job-id",
    "message": "Inference completed successfully"
}

Model loading (HTTP 202):

{
    "job_id": "inference-microservice-job-id",
    "status": "loading",
    "message": "Model is currently loading, please wait and try again"
}

Server initializing (HTTP 202):

{
    "job_id": "inference-microservice-job-id",
    "status": "initializing",
    "message": "Server is initializing (downloading files, setting up), please wait"
}

Examples:

VLM inference example (Cosmos-RL):

curl -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "media": "s3://my-bucket/videos/sample.mp4",
        "text": "What action is happening in this video?",
        "conv_mode": "auto"
    }' | jq '.'

Note

The microservice automatically downloads media files (images and videos) from your cloud storage bucket. Ensure files are accessible with the workspace credentials.
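
For example, the media referenced in a request must already exist in the workspace bucket. A minimal sketch, assuming the AWS CLI is configured with the same credentials as the workspace and using placeholder bucket and object names:

# Stage a local file in the workspace bucket so the microservice can download it
aws s3 cp ./sample.mp4 s3://my-bucket/videos/sample.mp4

# The request can then reference the object as "s3://my-bucket/videos/sample.mp4"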

3. Check Status#

Endpoint: GET /api/v1/orgs/{org}/experiments/{experiment_id}/jobs/{job_id}/inference_microservice/status

Purpose: Monitor service health and model loading status.

Response:

{
    "job_id": "inference-microservice-job-id",
    "model_loaded": true,
    "model_loading": false,
    "server_initializing": false,
    "initialization_error": null,
    "model_load_error": null,
    "status": "healthy",
    "pod_name": "statefulset-pod-name",
    "pod_status": "Running"
}

Status fields:

  • model_loaded (Boolean): Whether the model is fully loaded and ready

  • model_loading (Boolean): Whether model loading is in progress

  • server_initializing (Boolean): Whether server initialization is in progress

  • initialization_error (string): Error message if initialization failed

  • model_load_error (string): Error message if model loading failed

  • status (string): Overall service health status

  • pod_name (string): Kubernetes pod name

  • pod_status (string): Kubernetes pod status
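
As a quick illustration of how these fields combine, the following jq expression (an optional sketch, not a dedicated endpoint) reduces the status response to a single readiness verdict; see also the examples below.

# Summarize readiness from the status response: "ready", "error", or "waiting"
curl -s -X GET \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
    -H "Authorization: Bearer $TOKEN" \
    | jq -r 'if .model_loaded then "ready"
             elif (.initialization_error != null or .model_load_error != null) then "error"
             else "waiting" end'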

Examples:

Check if model is ready:

curl -s -X GET \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
    -H "Authorization: Bearer $TOKEN" | jq '.'

Poll for readiness:

# Wait until model is loaded
while true; do
    STATUS=$(curl -s -X GET \
        $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
        -H "Authorization: Bearer $TOKEN")

    MODEL_LOADED=$(echo $STATUS | jq -r '.model_loaded')

    if [ "$MODEL_LOADED" == "true" ]; then
        echo "Model is ready!"
        break
    fi

    echo "Model loading... waiting 10 seconds"
    sleep 10
done

4. Stop Service#

Endpoint: POST /api/v1/orgs/{org}/experiments/{experiment_id}/jobs/{job_id}/inference_microservice/stop

Purpose: Gracefully shut down the inference microservice and clean up resources.

Response:

{
    "status": "stopped",
    "message": "Inference microservice stopped successfully",
    "job_id": "inference-microservice-job-id"
}

Examples:

Stop the inference microservice:

curl -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/stop \
    -H "Authorization: Bearer $TOKEN"

What happens:

  1. Kubernetes deletes the StatefulSet.

  2. The container terminates gracefully.

  3. The microservice unloads the model from memory.

  4. Kubernetes cleans up the resources.

Warning

Once stopped, you must create a new inference microservice to run more inference requests. The same job ID cannot be reused.
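
To resume serving after a stop, call the start endpoint again with the same parent training job; it returns a fresh job ID. A minimal sketch:

# Start a new inference microservice for the same trained model after a stop
NEW_MS_JOB_ID=$(curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"parent_id": "'"$TRAIN_JOB_ID"'"}' | jq -r '.job_id')

echo "New Inference Microservice Job ID: $NEW_MS_JOB_ID"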

Status Codes and Error Handling#

Understanding HTTP Status Codes#

The inference microservice uses standard HTTP status codes to communicate the service state:

  • 200 OK: Request successful, model is ready and inference completed

  • 202 Accepted: Request received, but service is still initializing or model is loading; retry after a short wait

  • 503 Service Unavailable: Service is not ready (initialization/loading failed) or encountered an error

  • 404 Not Found: Job ID not found or service not started
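
In client code, branch on the HTTP status code rather than only on the response body. The sketch below calls the inference endpoint documented above, retries while the service returns 202, and gives up on any other error; the request body, retry interval, and attempt limit are illustrative choices.

# Send an inference request, retrying while the service reports 202 (initializing or loading)
for attempt in $(seq 1 30); do
    HTTP_CODE=$(curl -s -o /tmp/inference_response.json -w "%{http_code}" -X POST \
        $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
        -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"media": "s3://my-bucket/videos/sample.mp4", "text": "Describe this video", "conv_mode": "auto"}')

    if [ "$HTTP_CODE" == "200" ]; then
        jq '.' /tmp/inference_response.json
        break
    elif [ "$HTTP_CODE" == "202" ]; then
        echo "Attempt $attempt: service not ready yet (HTTP 202), retrying in 10s"
        sleep 10
    else
        echo "Request failed with HTTP $HTTP_CODE:"
        jq '.' /tmp/inference_response.json
        break
    fi
done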

Server Lifecycle States#

The inference microservice goes through several states during its lifecycle:

  1. Server initializing (HTTP 202)

    The Flask server has started but is still:

    • Downloading model files from cloud storage

    • Setting up cloud storage connections

    • Preparing the environment

    Response:

    {
        "job_id": "uuid",
        "status": "initializing",
        "message": "Server is initializing (downloading files, setting up), please wait"
    }
    

    Action: Retry after 5-10 seconds.

  2. Initialization failed (HTTP 503)

    Server initialization encountered an error.

    Response:

    {
        "job_id": "uuid",
        "status": "error",
        "error": "Server initialization failed: network timeout downloading model files"
    }
    

    Action: Check cloud storage credentials and network connectivity. Stop and restart the service.

  3. Model loading (HTTP 202)

    Initialization complete. Model is being loaded into memory.

    Response:

    {
        "job_id": "uuid",
        "status": "loading",
        "message": "Model is currently loading, please wait and try again"
    }
    

    Action: Retry every 10-15 seconds. Model loading can take several minutes for large models.

  4. Model load failed (HTTP 503)

    Model loading encountered an error.

    Response:

    {
        "job_id": "uuid",
        "status": "error",
        "error": "Model failed to load: CUDA out of memory"
    }
    

    Action: Check model requirements and available GPU resources. Stop and restart with appropriate resources.

  5. Model not started (HTTP 503)

    Model loading has not been initiated.

    Response:

    {
        "job_id": "uuid",
        "status": "not_ready",
        "message": "Model not loaded yet, please try again later"
    }
    

    Action: Wait for model loading to start, then retry.

  6. Model ready (HTTP 200)

    Model is loaded and ready for inference.

    Response:

    {
        "status": "completed",
        "results": {
            "response": "Model prediction",
            "metadata": "..."
        },
        "job_id": "uuid",
        "message": "Inference completed successfully"
    }
    

    Action: Continue sending inference requests.

Complete Workflow Example#

This section provides a complete end-to-end example using the Cosmos-RL VLM model.

Setup#

# Environment setup
export BASE_URL="http://<your-api-host>:<port>/api/v1"
export NGC_ORG="your-org-name"
export NGC_API_KEY="your-ngc-api-key"

# Authenticate
TOKEN=$(curl -s -X POST $BASE_URL/login \
    -H "Content-Type: application/json" \
    -d '{
        "ngc_key": "'"$NGC_API_KEY"'",
        "ngc_org_name": "'"$NGC_ORG"'"
    }' | jq -r '.token')

echo "Authenticated with token"

# Set experiment and training job IDs (from previous training)
EXPERIMENT_ID="your-experiment-id"
TRAIN_JOB_ID="your-completed-training-job-id"

Step 1: Start Inference Microservice#

echo "Starting inference microservice..."
INFERENCE_MS_JOB_ID=$(curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/inference_microservice/start \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "parent_id": "'"$TRAIN_JOB_ID"'"
    }' | jq -r '.job_id')

echo "Inference Microservice Job ID: $INFERENCE_MS_JOB_ID"

Step 2: Wait for Model to Load#

echo "Waiting for model to load..."
while true; do
    STATUS=$(curl -s -X GET \
        $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
        -H "Authorization: Bearer $TOKEN")

    MODEL_LOADED=$(echo $STATUS | jq -r '.model_loaded')
    MODEL_LOADING=$(echo $STATUS | jq -r '.model_loading')
    SERVER_INIT=$(echo $STATUS | jq -r '.server_initializing')
    INIT_ERROR=$(echo $STATUS | jq -r '.initialization_error')
    LOAD_ERROR=$(echo $STATUS | jq -r '.model_load_error')

    # Check for errors
    if [ "$INIT_ERROR" != "null" ]; then
        echo "Initialization error: $INIT_ERROR"
        exit 1
    fi

    if [ "$LOAD_ERROR" != "null" ]; then
        echo "Model load error: $LOAD_ERROR"
        exit 1
    fi

    # Check if ready
    if [ "$MODEL_LOADED" == "true" ]; then
        echo "Model is ready!"
        break
    fi

    # Status message
    if [ "$SERVER_INIT" == "true" ]; then
        echo "Server initializing (downloading files)... waiting 10s"
    elif [ "$MODEL_LOADING" == "true" ]; then
        echo "Model loading... waiting 10s"
    else
        echo "Waiting for service... 10s"
    fi

    sleep 10
done

Step 3: Run Inference Requests#

echo "Running inference..."

# Single video inference
RESULT=$(curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "media": "s3://my-bucket/videos/sample.mp4",
        "text": "What is happening in this video?",
        "conv_mode": "auto"
    }')

echo "Inference result:"
echo $RESULT | jq '.'

# Multiple images inference
RESULT2=$(curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/inference \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "media": [
            "s3://my-bucket/images/img1.jpg",
            "s3://my-bucket/images/img2.jpg"
        ],
        "text": "Compare these two images",
        "conv_mode": "auto"
    }')

echo "Second inference result:"
echo $RESULT2 | jq '.'

Step 4: Stop the Service#

echo "Stopping inference microservice..."
curl -s -X POST \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/stop \
    -H "Authorization: Bearer $TOKEN" | jq '.'

echo "Inference microservice stopped"

Troubleshooting#

Common Issues and Solutions#

Service never becomes ready#

Symptoms: The status endpoint shows server_initializing: true for a long time.

Possible causes:

  • Large model files are taking time to download.

  • Network connectivity to cloud storage is slow or unreliable.

  • Cloud storage credentials are incorrect.

Solutions:

  1. Check cloud storage credentials in workspace configuration.

  2. Verify network connectivity to cloud storage.

  3. Check Kubernetes pod logs: kubectl logs <pod-name> (a sketch follows this list).

  4. Ensure sufficient bandwidth for model file downloads.
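
For step 3, you can resolve the pod name from the status endpoint instead of searching for it manually. A minimal sketch, assuming kubectl access to the namespace where TAO runs (add -n <namespace> if it is not the current default):

# Resolve the pod backing the microservice from the status endpoint, then inspect its logs
POD_NAME=$(curl -s -X GET \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
    -H "Authorization: Bearer $TOKEN" | jq -r '.pod_name')

kubectl logs $POD_NAME --tail=100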

Model loading fails#

Symptoms: The status endpoint shows model_load_error with an error message.

Possible causes:

  • CUDA out of memory: GPU memory is insufficient for the model.

    Solution: Use a GPU with more memory, or reduce the model size.

  • Model file not found: Model checkpoint is missing.

    Solution: Verify that the training job completed successfully and the checkpoint was saved.

  • Invalid model format: Model checkpoint is corrupted.

    Solution: Retrain the model or check checkpoint integrity.

Inference request timeouts#

Symptoms: Requests repeatedly return HTTP 202 and never succeed.

Possible causes:

  • Model is still loading (for very large models).

  • Service crashed after starting.

  • Resource constraints (CPU or memory) prevent the model from loading or starting.

Solutions:

  1. Increase retry timeout and wait longer.

  2. Check status endpoint for errors.

  3. Check Kubernetes pod status: kubectl get pods.

  4. Review pod logs for crash information.

Inference returns unexpected results#

Symptoms: Model predictions are incorrect or nonsensical.

Possible causes:

  • Input format is incorrect.

  • Model is not fully trained.

  • Parameters are incompatible with this model.

Solutions:

  1. Verify that input format matches model expectations.

  2. Check model training metrics and convergence.

  3. Review model-specific documentation for parameter requirements.

Cannot stop service#

Symptoms: Stop endpoint returns an error, or service continues running.

Solutions:

  1. Check Kubernetes StatefulSet status: kubectl get statefulsets.

  2. Manually delete the StatefulSet if needed: kubectl delete statefulset <name> (a sketch follows this list).

  3. Check for pending operations on the job.
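
For step 2, the sketch below derives the StatefulSet name from the pod name reported by the status endpoint, relying on the standard Kubernetes convention that StatefulSet pods are named <statefulset-name>-<ordinal>; verify the name with kubectl get before deleting anything.

# Resolve the pod name, derive the StatefulSet name, and delete it manually
POD_NAME=$(curl -s -X GET \
    $BASE_URL/orgs/$NGC_ORG/experiments/$EXPERIMENT_ID/jobs/$INFERENCE_MS_JOB_ID/inference_microservice/status \
    -H "Authorization: Bearer $TOKEN" | jq -r '.pod_name')
STS_NAME=${POD_NAME%-*}   # strip the trailing pod ordinal, e.g. "-0"

kubectl get statefulset $STS_NAME
kubectl delete statefulset $STS_NAME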