Start a Knowledge Distillation (KD) Customization Job#

Learn how to use the NeMo Microservices Platform to create a Knowledge Distillation (KD) job, transferring knowledge from a large teacher model to a smaller student model using your own dataset.

About Knowledge Distillation#

Knowledge distillation is a technique for transferring knowledge from a large, high-capacity teacher model to a smaller student model. The distilled model (student) often achieves higher accuracy than models trained using standard language modeling loss alone.

KD is useful when you want to deploy smaller models without losing much accuracy compared to a large model.
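
As a rough sketch of the idea (a common formulation; the exact loss used by the microservice may differ), logit-pair distillation trains the student to match the teacher's per-token output distributions while still optimizing the standard language-modeling objective:

\mathcal{L} = (1-\alpha)\,\mathcal{L}_{\mathrm{LM}} + \alpha\,T^{2}\,\mathrm{KL}\left(\mathrm{softmax}(z_{\mathrm{teacher}}/T)\,\|\,\mathrm{softmax}(z_{\mathrm{student}}/T)\right)

Here z_teacher and z_student are the models' logits, T is a softening temperature, and alpha weights the distillation term against the language-modeling loss.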

Prerequisites#

Platform Prerequisites#

New to using NeMo microservices?

NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.

If you’re new to the platform, complete these foundational tutorials first:

  1. Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end

  2. Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work

If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.

Learn more: Entity Concepts

NeMo Customizer Prerequisites#

Microservice Setup Requirements and Environment Variables

Before starting, make sure you have:

  • Access to NeMo Customizer

  • The huggingface_hub Python package installed

  • (Optional) Weights & Biases account and API key for enhanced visualization

Set up environment variables:

# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NAMESPACE="default"
export DATASET_NAME="test-dataset"

# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value"  # Or your actual HF token

# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"

Replace the placeholder values with your actual service URLs and credentials.

Tutorial-Specific Prerequisites#

  • A teacher model (already fine-tuned and available as a customization target)

  • The requests Python package installed

Notes and Limitations#

  • Only logit-pair distillation is currently supported.

  • LoRA adapters can’t be used as teacher models.


Select Teacher and Student Models#

You need two models available as customization targets:

  • Teacher model: A large, fine-tuned model

  • Student model: A smaller model you want to distill knowledge into

Both models must use the same tokenizer. Only GPT-based NeMo 2.0 checkpoints are supported for now.
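
To confirm that both models are registered as customization targets before you submit a job, you can list the available targets. This is a minimal sketch using the requests package; the /v1/customization/targets path is an assumption, so check your NeMo Customizer API reference if your deployment differs:

import os
import requests

# List registered customization targets (endpoint path is an assumption)
resp = requests.get(
    f"{os.environ['CUSTOMIZER_BASE_URL']}/v1/customization/targets",
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Print each target as namespace/name so you can spot the teacher and student
for target in resp.json().get("data", []):
    print(f"{target.get('namespace')}/{target.get('name')}")

Verify that both the teacher and the student appear in the output before continuing.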


Select Customization Config#

You can either find an existing customization config to use or create a new one.

Find Available Configs#

Identify which customization configurations support distillation training. KD customization jobs require a configuration whose training options include a training_type of distillation and a finetuning_type of all_weights.

  1. Get all customization configurations.

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Find configurations that support distillation training
    configs = client.customization.configs.list(
        filter={
            "training_type": "distillation",
            "finetuning_type": "all_weights"
        }
    )
    
    print(f"Found {len(configs.data)} distillation configurations")
    for config in configs.data:
        print(f"Config: {config.name}")
        print(f"  Training options: {len(config.training_options)}")
        for option in config.training_options:
            if option.training_type == "distillation":
                print(f"    - {option.training_type}/{option.finetuning_type}: {option.num_gpus} GPUs")
                print(f"      ✓ Supports distillation")
    
    The same query using cURL:
    
    curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?filter%5Btraining_type%5D=distillation&filter%5Bfinetuning_type%5D=all_weights" \
      -H 'Accept: application/json' | jq
    
  2. Review the response to find a model configuration that includes distillation in its training_options.

    Example Response
    {
      "object": "list",
      "data": [
        {
          "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
          "namespace": "default",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for Supervised Fine-Tuning (SFT) training data items.",
                "properties": {
                  "prompt": {
                    "description": "The prompt for the entry",
                    "title": "Prompt",
                    "type": "string"
                  },
                  "completion": {
                    "description": "The completion to train on",
                    "title": "Completion",
                    "type": "string"
                  }
                },
                "required": ["prompt", "completion"],
                "title": "SFTDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            },
            {
              "training_type": "sft",
              "finetuning_type": "all_weights",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        },
        {
          "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
          "namespace": "nvidia",
          "dataset_schemas": [
            {
              "title": "Newline-Delimited JSON File",
              "type": "array",
              "items": {
                "description": "Schema for embedding training data items.",
                "properties": {
                  "query": {
                    "description": "The query to use as an anchor",
                    "title": "Query",
                    "type": "string"
                  },
                  "pos_doc": {
                    "description": "A document that should match positively with the anchor",
                    "title": "Positive Document",
                    "type": "string"
                  },
                  "neg_doc": {
                    "description": "Documents that should not match with the anchor",
                    "title": "Negative Documents",
                    "type": "array",
                    "items": {"type": "string"}
                  }
                },
                "required": ["query", "pos_doc", "neg_doc"],
                "title": "EmbeddingDatasetItemSchema",
                "type": "object"
              }
            }
          ],
          "training_options": [
            {
              "training_type": "sft",
              "finetuning_type": "lora_merged",
              "num_gpus": 1,
              "num_nodes": 1,
              "tensor_parallel_size": 1,
              "use_sequence_parallel": false
            }
          ]
        }
      ]
    }
    

Create Config#

If no suitable configuration is available, create one that includes a distillation training option:

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Create a customization config with distillation support
config = client.customization.configs.create(
    name="llama-3.2-1b-instruct@v1.0.0+A100",
    namespace="default",
    description="Configuration for Llama 3.2 1B with distillation support",
    target="meta/llama-3.2-1b-instruct",
    training_options=[
        {
            "training_type": "sft",
            "finetuning_type": "lora",
            "num_gpus": 1,
            "tensor_parallel_size": 1,
            "pipeline_parallel_size": 1,
            "use_sequence_parallel": False,
            "micro_batch_size": 1
        },
        {
            "training_type": "distillation",
            "finetuning_type": "all_weights",
            "num_gpus": 1,
            "tensor_parallel_size": 1,
            "pipeline_parallel_size": 1,
            "use_sequence_parallel": False,
            "micro_batch_size": 1
        }
    ],
    training_precision="bf16",
    max_seq_length=2048
)

print(f"Created config: {config.name}")
print(f"Training options: {len(config.training_options)}")
curl -X POST \
  "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "llama-3.2-1b-instruct@v1.0.0+A100",
    "namespace": "default",
    "description": "Configuration for Llama 3.2 1B with distillation support",
    "target": "meta/llama-3.2-1b-instruct",
    "training_options": [
       {
          "training_type": "sft",
          "finetuning_type": "lora",
          "num_gpus": 1,
          "tensor_parallel_size": 1,
          "pipeline_parallel_size": 1,
          "use_sequence_parallel": false,
          "micro_batch_size": 1
      },
       {
          "training_type": "distillation",
          "finetuning_type": "all_weights",
          "num_gpus": 1,
          "tensor_parallel_size": 1,
          "pipeline_parallel_size": 1,
          "use_sequence_parallel": false,
          "micro_batch_size": 1
      }
    ],
    "training_precision": "bf16",
    "max_seq_length": 2048
  }' | jq

For detailed information about creating configs, see Create Customization Config.


Create Datasets#

Prepare your training and validation datasets in the same format required for SFT jobs. The dataset should be the same as (or similar to) the one used to fine-tune the teacher model.

Refer to the Format Training Datasets tutorial for details on dataset structure and upload instructions.
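
For reference, each line of the training and validation files follows the prompt/completion schema shown in the config response above, for example {"prompt": "...", "completion": "..."}. The following sketch uploads prepared files to the NeMo Data Store through its Hugging Face-compatible API; the file names and in-repo paths are illustrative, and the Format Training Datasets tutorial remains the authoritative reference:

import os
from huggingface_hub import HfApi

repo_id = f"{os.environ['NAMESPACE']}/{os.environ['DATASET_NAME']}"

# The NeMo Data Store exposes a Hugging Face-compatible API at ${DS_HOST}/v1/hf
api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])
api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

# Upload local training and validation files (local paths are illustrative)
api.upload_file(
    path_or_fileobj="training.jsonl",
    path_in_repo="training/training.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)
api.upload_file(
    path_or_fileobj="validation.jsonl",
    path_in_repo="validation/validation.jsonl",
    repo_id=repo_id,
    repo_type="dataset",
)

Depending on your setup, you may also need to register the dataset with the Entity Store so the job can reference it by namespace and name; the Format Training Datasets tutorial covers that step.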


Start Model Customization Job#

Set Hyperparameters#

When creating a KD job, set the following in your job configuration:

  • training_type: distillation

  • finetuning_type: all_weights (the only supported option)

  • distillation.teacher: The name of the teacher Target (must already exist)

Example hyperparameters section:

{
  "hyperparameters": {
    "training_type": "distillation",
    "finetuning_type": "all_weights",
    "epochs": 3,
    "batch_size": 4,
    "learning_rate": 0.00005,
    "distillation": {
      "teacher": "meta/llama-3.2-3b-instruct@v1.0.0"
    }
  }
}

Create and Submit Customization Job#

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)
# Set up WandB API key for enhanced visualization
extra_headers = {}
if os.getenv('WANDB_API_KEY'):
    extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
# Create a knowledge distillation customization job
job = client.customization.jobs.create(
    config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
    dataset={
        "name": "test-dataset",
        "namespace": "default"
    },
    hyperparameters={
        "training_type": "distillation",
        "finetuning_type": "all_weights",
        "epochs": 3,
        "batch_size": 4,
        "learning_rate": 0.00005,
        "distillation": {
            "teacher": "meta/llama-3.2-3b-instruct@v1.0.0"
        }
    },
    extra_headers=extra_headers
)

print(f"Created distillation job with ID: {job.id}")
print(f"Job status: {job.status}")
print(f"Output model: {job.output_model}")
curl -X "POST" \
  "${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "wandb-api-key: ${WANDB_API_KEY}" \
  -d '{
    "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
    "dataset": {"name": "test-dataset", "namespace": "default"},
    "hyperparameters": {
      "training_type": "distillation",
      "finetuning_type": "all_weights",
      "epochs": 3,
      "batch_size": 4,
      "learning_rate": 0.00005,
      "distillation": {
        "teacher": "meta/llama-3.2-3b-instruct@v1.0.0"
      }
    }
  }' | jq

Important

The config field must include a version, for example: meta/llama-3.2-1b-instruct@v1.0.0+A100. Omitting the version will result in an error like:

{ "detail": "Version is not specified in the config URN: meta/llama-3.2-1b-instruct" }

You can find the correct config URN (with version) by inspecting the output of the /v1/customization/configs endpoint. In the response, the name field already includes the version (for example, meta/llama-3.2-1b-instruct@v1.0.0+A100).

Example curl:

curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?page_size=1000" -H 'Accept: application/json' | jq '.data[] | "\(.namespace)/\(.name)"'

Next Steps#

Learn how to check customization job metrics using the job ID to monitor training progress and performance.
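
For example, you can poll the job until training finishes. This is a minimal sketch; the GET /v1/customization/jobs/{id}/status path and the status values are assumptions, so confirm them against the job metrics documentation:

import os
import time
import requests

job_id = "<your-job-id>"  # Returned when you created the job

while True:
    # Poll the job status endpoint (path and status values are assumptions)
    resp = requests.get(
        f"{os.environ['CUSTOMIZER_BASE_URL']}/v1/customization/jobs/{job_id}/status",
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    status = resp.json().get("status")
    print(f"Job status: {status}")
    if status in ("completed", "failed", "cancelled"):
        break
    time.sleep(60)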

Then view your results at wandb.ai under the nvidia-nemo-customizer project.

Note

The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.