Start a Knowledge Distillation (KD) Customization Job#

Learn how to use the NeMo Microservices Platform to create a Knowledge Distillation (KD) job, transferring knowledge from a large teacher model to a smaller student model using your own dataset.

About Knowledge Distillation#

Knowledge distillation is a technique for transferring knowledge from a large, high-capacity teacher model to a smaller student model. The distilled model (student) often achieves higher accuracy than models trained using standard language modeling loss alone.

KD is useful when you want to deploy smaller models without losing much accuracy compared to a large model.
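
As a rough sketch of the idea (a common formulation; the exact loss used by the microservice may differ), logit-pair distillation trains the student to match the teacher's per-token output distributions while still optimizing the standard language-modeling objective:

\mathcal{L} = (1-\alpha)\,\mathcal{L}_{\mathrm{LM}} + \alpha\,T^{2}\,\mathrm{KL}\left(\mathrm{softmax}(z_{\mathrm{teacher}}/T)\,\|\,\mathrm{softmax}(z_{\mathrm{student}}/T)\right)

Here z_teacher and z_student are the models' logits, T is a softening temperature, and alpha weights the distillation term against the language-modeling loss.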

Prerequisites#

Platform Prerequisites#

New to using NeMo microservices?

NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.

If you’re new to the platform, complete these foundational tutorials first:

  1. Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end

  2. Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work

If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.

Learn more: Entity Concepts

NeMo Customizer Prerequisites#

Microservice Setup Requirements and Environment Variables

Before starting, make sure you have:

  • Access to NeMo Customizer

  • The huggingface_hub Python package installed

  • (Optional) Weights & Biases account and API key for enhanced visualization

Set up environment variables:

# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NAMESPACE="default"
export DATASET_NAME="test-dataset"

# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value"  # Or your actual HF token

# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"

Replace the placeholder values with your actual service URLs and credentials.

Tutorial-Specific Prerequisites#

  • A teacher model (already fine-tuned and available as a customization target)

  • The requests Python package installed

Notes and Limitations#

  • Only logit-pair distillation is currently supported.

  • LoRA adapters can’t be used as teacher models.


Select Teacher and Student Models#

You need two models available as customization targets:

  • Teacher model: A large, fine-tuned model

  • Student model: A smaller model you want to distill knowledge into

Both models must use the same tokenizer. Only GPT-based NeMo 2.0 checkpoints are supported for now.
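
To confirm that both models are registered as customization targets before you submit a job, you can list the available targets. This is a minimal sketch using the requests package; the /v1/customization/targets path is an assumption, so check your NeMo Customizer API reference if your deployment differs:

import os
import requests

# List registered customization targets (endpoint path is an assumption)
resp = requests.get(
    f"{os.environ['CUSTOMIZER_BASE_URL']}/v1/customization/targets",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Print each target as namespace/name so you can spot the teacher and student
for target in resp.json().get("data", []):
    print(f"{target.get('namespace')}/{target.get('name')}")

Verify that both the teacher and the student appear in the output before continuing.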


Select Customization Config#

You can either find an existing customization config to use or create a new one.

Find Available Configs#

Identify which customization configurations support distillation training. KD customization jobs require a configuration whose training options include a training_type of distillation and a finetuning_type of all_weights.

  1. Get all customization configurations.

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Find configurations that support distillation training
    configs = client.customization.configs.list(
        filter={
            "training_type": "distillation",
            "finetuning_type": "all_weights"
        }
    )
    
    print(f"Found {len(configs.data)} distillation configurations")
    for config in configs.data:
        print(f"Config: {config.name}")
        print(f"  Training options: {len(config.training_options)}")
        for option in config.training_options:
            if option.training_type == "distillation":
                print(f"    - {option.training_type}/{option.finetuning_type}: {option.num_gpus} GPUs")
                print(f"      ✓ Supports distillation")
    
    The same query using cURL:
    
    curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?filter%5Btraining_type%5D=distillation&filter%5Bfinetuning_type%5D=all_weights" \
      -H 'Accept: application/json' | jq
    
  2. Review the response to find a model configuration that includes distillation in its training_options.

    Example Response
    {
      "object": "list",
      "data": [
        {
          "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
          "namespace": "default",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for Supervised Fine-Tuning (SFT) training data items.",
                "properties": {
                  "prompt": {
                    "description": "The prompt for the entry",
                    "title": "Prompt",
                    "type": "string"
                  },
                  "completion": {
                    "description": "The completion to train on",
                    "title": "Completion",
                    "type": "string"
                  }
                },
                "required": ["prompt", "completion"],
                "title": "SFTDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            },
            {
              "training_type": "sft",
              "finetuning_type": "all_weights",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        },
        {
          "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
          "namespace": "nvidia",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for embedding training data items.",
                "properties": {
                  "query": {
                    "description": "The query to use as an anchor",
                    "title": "Query",
                    "type": "string"
                  },
                  "pos_doc": {
                    "description": "A document that should match positively with the anchor",
                    "title": "Positive Document",
                    "type": "string"
                  },
                  "neg_doc": {
                    "description": "Documents that should not match with the anchor",
                    "title": "Negative Documents",
                    "type": "array",
                    "items": {"type": "string"}
                  }
                },
                "required": ["query", "pos_doc", "neg_doc"],
                "title": "EmbeddingDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora_merged",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        }
      ]
    }
    

Create Config#

If no suitable configuration is available, create one that includes a distillation training option:

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Create a customization config with distillation support
config = client.customization.configs.create(
    name="llama-3.2-1b-instruct@v1.0.0+A100",
    namespace="default",
    description="Configuration for Llama 3.2 1B with distillation support",
    target="meta/llama-3.2-1b-instruct",
    training_options=[
        {
            "training_type": "sft",
            "finetuning_type": "lora",
            "num_gpus": 1,
            "tensor_parallel_size": 1,
            "pipeline_parallel_size": 1,
            "use_sequence_parallel": False,
            "micro_batch_size": 1
        },
        {
            "training_type": "distillation",
            "finetuning_type": "all_weights",
            "num_gpus": 1,
            "tensor_parallel_size": 1,
            "pipeline_parallel_size": 1,
            "use_sequence_parallel": False,
            "micro_batch_size": 1
        }
    ],
    training_precision="bf16",
    max_seq_length=2048
)

print(f"Created config: {config.name}")
print(f"Training options: {len(config.training_options)}")
curl -X POST \
  "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct@v1.0.0+A100",
    "namespace": "default",
    "description": "Configuration for Llama 3.2 1B with distillation support",
    "target": "meta/llama-3.2-1b-instruct",
    "training_options": [
       {
          "training_type": "sft",
          "finetuning_type": "lora",
          "num_gpus": 1,
          "tensor_parallel_size": 1,
          "pipeline_parallel_size": 1,
          "use_sequence_parallel": false,
          "micro_batch_size": 1
      },
       {
          "training_type": "distillation",
          "finetuning_type": "all_weights",
          "num_gpus": 1,
          "tensor_parallel_size": 1,
          "pipeline_parallel_size": 1,
          "use_sequence_parallel": false,
          "micro_batch_size": 1
      }
    ],
    "training_precision": "bf16",
    "max_seq_length": 2048
  }' | jq

For detailed information about creating configs, see Create Customization Config.


Create Datasets#

Prepare your training and validation datasets in the same format required for SFT jobs. The dataset should be the same as (or similar to) the one used to fine-tune the teacher model.

Refer to the Format Training Datasets tutorial for details on dataset structure and upload instructions.
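
For reference, each line of the training and validation files follows the prompt/completion schema shown in the config response above, for example {"prompt": "...", "completion": "..."}. The following sketch uploads prepared files to the NeMo Data Store through its Hugging Face-compatible API; the file names and in-repo paths are illustrative, and the Format Training Datasets tutorial remains the authoritative reference:

import os
from huggingface_hub import HfApi

repo_id = f"{os.environ['NAMESPACE']}/{os.environ['DATASET_NAME']}"

# The NeMo Data Store exposes a Hugging Face-compatible API at ${DS_HOST}/v1/hf
api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Upload local training and validation files (local paths are illustrative)
api.upload_file(
    path_or_fileobj="training.jsonl",
    path_in_repo="training/training.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)
api.upload_file(
    path_or_fileobj="validation.jsonl",
    path_in_repo="validation/validation.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

Depending on your setup, you may also need to register the dataset with the Entity Store so the job can reference it by namespace and name; the Format Training Datasets tutorial covers that step.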


Start Model Customization Job#

Set Hyperparameters#

When creating a KD job, set the following in your job configuration:

  • training_type: distillation

  • finetuning_type: all_weights (the only supported option)

  • distillation.teacher: The name of the teacher Target (must already exist)

Example hyperparameters section:

{
  "hyperparameters": {
    "training_type": "distillation",
    "finetuning_type": "all_weights",
    "epochs": 3,
    "batch_size": 4,
    "learning_rate": 0.00005,
    "distillation": {
      "teacher": "meta/llama-3.2-3b-instruct@v1.0.0"
    }
  }
}

Create and Submit Customization Job#

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Set up WandB API key for enhanced visualization
extra_headers = {}
if os.getenv('WANDB_API_KEY'):
    extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
# Create a knowledge distillation customization job
job = client.customization.jobs.create(
    config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
    dataset={
        "name": "test-dataset",
        "namespace": "default"
    },
    hyperparameters={
        "training_type": "distillation",
        "finetuning_type": "all_weights",
        "epochs": 3,
        "batch_size": 4,
        "learning_rate": 0.00005,
        "distillation": {
            "teacher": "meta/llama-3.2-3b-instruct@v1.0.0"
        }
    },
    extra_headers=extra_headers
)

print(f"Created distillation job with ID: {job.id}")
print(f"Job status: {job.status}")
print(f"Output model: {job.output_model}")
curl -X "POST" \
  "${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "wandb-api-key: ${WANDB_API_KEY}" \
  -d '{
    "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
    "dataset": {"name": "test-dataset", "namespace": "default"},
    "hyperparameters": {
      "training_type": "distillation",
      "finetuning_type": "all_weights",
      "epochs": 3,
      "batch_size": 4,
      "learning_rate": 0.00005,
      "distillation": {
        "teacher": "meta/llama-3.2-3b-instruct@v1.0.0"
      }
    }
  }' | jq

Important

The config field must include a version, for example: meta/llama-3.2-1b-instruct@v1.0.0+A100. Omitting the version will result in an error like:

{ "detail": "Version is not specified in the config URN: meta/llama-3.2-1b-instruct" }

You can find the correct config URN (with version) by inspecting the output of the /v1/customization/configs endpoint. In the response, the name field already includes the version (for example, meta/llama-3.2-1b-instruct@v1.0.0+A100).

Example curl:

curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?page_size=1000" -H 'Accept: application/json' | jq '.data[] | "\(.namespace)/\(.name)"'

Next Steps#

Learn how to check customization job metrics using the job ID to monitor training progress and performance.
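
For example, you can poll the job until training finishes. This is a minimal sketch; the GET /v1/customization/jobs/{id}/status path and the status values are assumptions, so confirm them against the job metrics documentation:

import os
import time
import requests

job_id = "<your-job-id>"  # Returned when you created the job

while True:
    # Poll the job status endpoint (path and status values are assumptions)
    resp = requests.get(
        f"{os.environ['CUSTOMIZER_BASE_URL']}/v1/customization/jobs/{job_id}/status",
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    status = resp.json().get("status")
    print(f"Job status: {status}")
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(60)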

Then view your results at wandb.ai under the nvidia-nemo-customizer project.

Note

The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.