Start a DPO Customization Job#

Learn how to use the NeMo Microservices Platform to create a DPO (Direct Preference Optimization) job using a custom dataset.

DPO is an advanced fine-tuning technique for preference-based alignment. If you’re new to fine-tuning, consider starting with the LoRA or Full SFT tutorials first.

About DPO Customization Jobs#

Direct Preference Optimization (DPO) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized using the reference model. For more details, refer to the DPO paper.
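
For reference, the DPO objective from the paper is:

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]

where y_w and y_l are the chosen and rejected responses for prompt x, \pi_{\mathrm{ref}} is the frozen reference model, \sigma is the sigmoid function, and \beta scales the implicit KL penalty against the reference policy. In NeMo Customizer, \beta corresponds to the ref_policy_kl_penalty hyperparameter used later in this tutorial.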

DPO shares similarities with Full SFT training workflows but differs in a few key ways:

DPO vs SFT Training Comparison#

| Aspect | SFT (Supervised Fine-Tuning) | DPO (Direct Preference Optimization) |
| --- | --- | --- |
| Data Requirements | Labeled instruction-response pairs where the desired output is explicitly provided | Pairwise preference data, where for a given input, one response is explicitly preferred over another |
| Learning Objective | Directly teaches the model to generate a specific “correct” response | Directly optimizes the model to align with human preferences by maximizing the probability of preferred responses and minimizing that of rejected ones, without needing an explicit reward model |
| Alignment Focus | Aligns the model with the specific examples present in its training data | Aligns the model with broader human preferences, which can be more effective for subjective tasks or those without a single “correct” answer |
| Computational Efficiency | Standard fine-tuning efficiency | More computationally efficient than full RLHF methods, as it bypasses the need to train a separate reward model |

Prerequisites#

Before you can start a DPO job, make sure that you have the following:

  • Access to a NeMo Customizer Microservice.

  • Completed the Manage Entities tutorial series, or set up a dedicated project.

  • The huggingface_hub Python package installed.

Set Up Environment#

Before starting, set up the required environment variable for the Customizer service:

export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"

Replace <your-customizer-service-url> with the actual URL of your NeMo Customizer service.


Select Model#

Find Available Configs#

First, identify which model customization configurations are available for you to use. The response describes the models and the training techniques each supports. DPO jobs require a model that supports the following:

  • finetuning_type: all_weights

  • training_type: dpo

Note

GPU requirements are typically higher for all_weights than with PEFT techniques like LoRA.

  1. Get all customization configurations.

    curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/configs?filter%5Btraining_type%5D=dpo&filter%5Bfinetuning_type%5D=all_weights" \
      -H 'Accept: application/json' | jq
    
  2. Review the response to find a model that meets your requirements.

    Example Response
    {
      "object": "list",
      "data": [
       {
        "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
        "namespace": "default",
        "dataset_schemas": [
          {
            "title": "Newline-Delimited JSON File",
            "type": "array",
            "items": {
                "description": "Schema for Direct Preference Optimization (DPO) training data items.\n\nDefines the structure for training data used in DPO fine-tuning.",
                "properties": {
                    "prompt": {
                        "description": "The prompt for the entry",
                        "title": "Prompt",
                        "type": "string"
                    },
                    "chosen_response": {
                        "description": "The chosen response to train on",
                        "title": "Chosen Response",
                        "type": "string"
                    },
                    "rejected_response": {
                        "description": "The rejected response to train on",
                        "title": "Rejected Response",
                        "type": "string"
                    }
                },
                "required": [
                    "prompt",
                    "chosen_response",
                    "rejected_response"
                ],
                "title": "DPODatasetItemSchema",
                "type": "object"
            },
            "description": "Newline-delimited JSON (application/jsonlines) file containing objects"
          }
        ],
        "training_options": [
          {
            "training_type": "dpo",
            "finetuning_type": "all_weights",
            "num_gpus": 1,
            "num_nodes": 1,
            "tensor_parallel_size": 1,
            "use_sequence_parallel": false
          },
          {
            "training_type": "sft",
            "finetuning_type": "lora",
            "num_gpus": 1,
            "num_nodes": 1,
            "tensor_parallel_size": 1,
            "use_sequence_parallel": false
          },
          {
            "training_type": "sft",
            "finetuning_type": "all_weights",
            "num_gpus": 1,
            "num_nodes": 1,
            "tensor_parallel_size": 1,
            "use_sequence_parallel": false
          }
        ]
       }
      ]
    }
    

The response shows that Llama 3.2 1B Instruct is available for DPO and requires 1 GPU to train.
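
If you prefer Python, here is a minimal sketch of the same query using requests, based on the curl call and response shape above (it assumes CUSTOMIZER_BASE_URL is set as described in Set Up Environment):

import os
import requests

# Query the configs endpoint with the same filters as the curl example above
base_url = os.environ["CUSTOMIZER_BASE_URL"]
resp = requests.get(
    f"{base_url}/v1/customization/configs",
    params={"filter[training_type]": "dpo", "filter[finetuning_type]": "all_weights"},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Print each config that offers a DPO all_weights option, with its GPU requirements
for config in resp.json()["data"]:
    for option in config.get("training_options", []):
        if option["training_type"] == "dpo" and option["finetuning_type"] == "all_weights":
            print(f"{config['name']}: {option['num_gpus']} GPU(s) on {option['num_nodes']} node(s)")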

Review Dataset Schema#

You can examine the dataset_schemas field in the response to understand what data format your model requires.

The schema outlines the specific fields and data types your dataset needs to include, formatted as newline-delimited JSON (NDJSON). In the next section, we’ll walk through an example to help you understand the schema structure.

Create Datasets#

Now that we know the required shape of the dataset expected by the model configuration, we can prepare our training and validation files and upload them to the dataset.

Prepare Files#

  1. Create two files, train.jsonl and validation.jsonl.

  2. Populate the files with data that’s compatible with the meta/llama-3.2-1b-instruct@v1.0.0+A100 model.

Note

Each record should be on a single line in your .jsonl file, with no line breaks within the JSON objects.

{"prompt": [{"role": "user", "content": "You are a senior python developer with years of experience writing standard, high quality and reusable code in python. You have expert level understanding of flask framework. You code with latest coding standards and best practices that are as latest as of 2021. You write compact, easy to read code with short comments. You respond with code, when you are given a problem or topic."}], "chosen_response": " Sure, I'd be happy to help with any Python or Flask-related problems or topics you have. Please provide the specific problem or topic you'd like assistance with, and I'll respond with the appropriate code and explanations.", "rejected_response": "I'm an AI language model and don't have personal experiences or the ability to write code in real-time. However, I can certainly help you by providing a Python Flask code snippet based on your requirements. Here's a simple example of a Flask application that follows best practices:\n\n```python\nfrom flask import Flask, jsonify\n\n# Create a Flask web server from the '__main__' file\napp = Flask(__name__)\n\n# Define a route for the root URL ('/')\n@app.route('/')\ndef home():\n    \"\"\"\n    This function returns a friendly greeting to the user.\n    \"\"\"\n    return \"Hello, World!\"\n\n# Define a route for a JSON response\n@app.route('/json')\ndef get_json():\n    \"\"\"\n    This function returns a JSON response.\n    \"\"\"\n    data = {\n        'message': 'Hello, JSON!',\n        'status': 200\n    }\n    return jsonify(data)\n\n# Run the Flask web server\nif __name__ == '__main__':\n    app.run(debug=True)\n```\n\nThis code creates a simple Flask web server with two routes: one for a plain text response and another for a JSON response. The `jsonify` function is used to convert a Python dictionary into a JSON response. The server runs in debug mode, which is useful for development but should be turned off in a production environment."}

Upload Training Data#

Initialize Client#

You need to upload the training files to the training path in NeMo Data Store, and validation files to the validation path. You can have multiple files in each path and they will all be used.

To set up the Hugging Face API client, you’ll need these configuration values:

  • Host URL for the entity store service

  • Host URL for the data storage service

  • A namespace to organize your resources

  • Name of your dataset

from huggingface_hub import HfApi

# Configuration
ENTITY_HOST = "<your-entity-store-url>"  # Replace with the public URL of your Entity Store
DS_HOST = "<your-datastore-url>"  # Replace with the public URL of your Data Store
NAMESPACE = "default"
DATASET_NAME = "test-dataset"  # Dataset name must be unique within the namespace

# Initialize Hugging Face API client
HF_API = HfApi(endpoint=f"{DS_HOST}/v1/hf", token="")

Create Namespaces#

Create the namespace we defined in our configuration values in both the NeMo Entity Store and the NeMo Data Store so that they match.

import requests

def create_namespaces(entity_host, ds_host, namespace):
    # Create namespace in entity store
    entity_store_url = f"{entity_host}/v1/namespaces"
    resp = requests.post(entity_store_url, json={"id": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store during Namespace creation: {resp.status_code}"

    # Create namespace in datastore
    nds_url = f"{ds_host}/v1/datastore/namespaces"
    resp = requests.post(nds_url, data={"namespace": namespace})
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from datastore during Namespace creation: {resp.status_code}"

create_namespaces(ENTITY_HOST, DS_HOST, NAMESPACE)

Set Up Dataset Repository#

Create a dataset repository in NeMo Data Store.

def setup_dataset_repo(hf_api, namespace, dataset_name, entity_host):
    repo_id = f"{namespace}/{dataset_name}"

    # Create the repo in datastore
    hf_api.create_repo(repo_id, repo_type="dataset", exist_ok=True)

    # Create dataset in entity store
    entity_store_url = f"{entity_host}/v1/datasets"
    payload = {
        "name": dataset_name,
        "namespace": namespace,
        "files_url": f"hf://datasets/{repo_id}"
    }
    resp = requests.post(entity_store_url, json=payload)
    assert resp.status_code in (200, 201, 409, 422), \
        f"Unexpected response from Entity Store creating dataset: {resp.status_code}"

    return repo_id

repo_id = setup_dataset_repo(HF_API, NAMESPACE, DATASET_NAME, ENTITY_HOST)

Upload Files#

Upload the training and validation files to the dataset.

def upload_dataset_files(hf_api, repo_id):
    # Upload training file
    hf_api.upload_file(
        path_or_fileobj="train.jsonl",
        path_in_repo="training/training_file.jsonl",
        repo_id=repo_id,
        repo_type="dataset",
        revision="main",
        commit_message=f"Training file for {repo_id}"
    )

    # Upload validation file
    hf_api.upload_file(
        path_or_fileobj="validation.jsonl",
        path_in_repo="validation/validation_file.jsonl",
        repo_id=repo_id,
        repo_type="dataset",
        revision="main",
        commit_message=f"Validation file for {repo_id}"
    )

upload_dataset_files(HF_API, repo_id)
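
As an optional sanity check, you can list the files now in the repository with the same client; training/training_file.jsonl and validation/validation_file.jsonl should appear in the output.

# Verify the upload by listing the files in the dataset repo
print(HF_API.list_repo_files(repo_id, repo_type="dataset"))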

Checkpoint

At this point, we’ve uploaded our training and validation files to the dataset and are ready to define the details of our customization job.

Start Model Customization Job#

Important

The config field must include a version, for example: meta/llama-3.2-1b-instruct@v1.0.0+A100. Omitting the version will result in an error like:

{ "detail": "Version is not specified in the config URN: meta/llama-3.2-1b-instruct" }

You can find the correct config URN (with version) by inspecting the output of the /customization/configs endpoint. Use the name and version fields to construct the URN as name@version.

Set Hyperparameters#

While model customization configurations come with default settings, you can customize your training by specifying additional hyperparameters in the hyperparameters field of your customization job.

To train with DPO, we must:

  1. Set the training_type to dpo (Direct Preference Optimization).

  2. Set the finetuning_type to all_weights.

To override default DPO-specific hyperparameters, include the hyperparameters.dpo field.

Example configuration:

{
  "hyperparameters": {
    "training_type": "dpo",
    "finetuning_type": "all_weights",
    "epochs": 2,
    "batch_size": 16,
    "learning_rate": 0.00005,
    "dpo": {
      "ref_policy_kl_penalty": 0.05,
      "preference_loss_weight": 1,
      "preference_average_log_probs": false,
      "sft_loss_weight": 0,
      "sft_average_log_probs": false
    }
  }
}

Note

For more information on hyperparameter options and their description, review the Hyperparameter Options reference.

Create and Submit Training Job#

Use the following command to start a DPO training job. Replace meta/llama-3.2-1b-instruct@v1.0.0+A100 with your chosen model configuration (including the version), and test-dataset with your dataset name.

  1. Create a job using the model configuration (config), dataset, and hyperparameters we defined in the previous sections.

    from nemo_microservices import NeMoMicroservices
    import os
    
    # Initialize the client
    # Note: Set CUSTOMIZER_BASE_URL environment variable to your Customizer service URL
    client = NeMoMicroservices(
        base_url=os.environ['CUSTOMIZER_BASE_URL']
    )
    
    # Set up WandB API key for enhanced visualization
    extra_headers = {}
    if os.getenv('WANDB_API_KEY'):
        extra_headers['wandb-api-key'] = os.getenv('WANDB_API_KEY')
    
    # Create a DPO customization job
    job = client.customization.jobs.create(
        config="meta/llama-3.2-1b-instruct@v1.0.0+A100",
        dataset={
            "namespace": "default",
            "name": "test-dataset"
        },
        hyperparameters={
            "training_type": "dpo",
            "finetuning_type": "all_weights",
            "epochs": 2,
            "batch_size": 16
        },
        output_model="default/dpo_llama_3@v1",
        extra_headers=extra_headers
    )
    
    print(f"Created DPO job:")
    print(f"  Job ID: {job.id}")
    print(f"  Status: {job.status}")
    print(f"  Output model: {job.output_model}")
    
    curl -X "POST" \
      "${CUSTOMIZER_BASE_URL}/v1/customization/jobs" \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -H "wandb-api-key: ${WANDB_API_KEY}" \
      -d '
        {
        "config": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
        "dataset": {"namespace": "default", "name": "test-dataset"},
        "hyperparameters": {
          "training_type": "dpo",
          "finetuning_type": "all_weights",
          "epochs": 2,
          "batch_size": 16
        },
        "output_model": "default/dpo_llama_3@v1"
    }' | jq
    
  2. Review the response.

    Example Response
    {
      "id": "cust-S2qNunob3TNW6JjN75ESCG",
      "created_at": "2025-03-17T02:26:52.731523",
      "updated_at": "2025-03-17T02:26:52.731526",
      "namespace": "default",
      "config": {
         "base_model": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
         "precision": "bf16-mixed",
         "num_gpus": 1,
         "num_nodes": 1,
         "micro_batch_size": 1,
         "tensor_parallel_size": 1,
         "max_seq_length": 4096,
         "prompt_template": "{prompt} {completion}"
      },
      "dataset": {"namespace": "default", "name": "test-dataset"},
      "hyperparameters": {
        "finetuning_type": "all_weights",
        "training_type": "dpo",
        "batch_size": 16,
        "epochs": 2,
        "learning_rate": 0.000009,
        "sequence_packing_enabled": false
      },
      "output_model": "default/dpo_llama_3@v1",
      "status": "created",
      "project": "test-project",
      "ownership": {
        "created_by": "me",
        "access_policies": {
          "arbitrary": "json"
        }
      }
    }
    
  3. Copy the following values from the response:

    • id

    • output_model

You can monitor the job status as detailed in getting the job status.
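
For example, you can poll the job status with the id you copied (a sketch assuming the standard Customizer jobs API; replace <id> with your job ID):

curl -X GET "${CUSTOMIZER_BASE_URL}/v1/customization/jobs/<id>/status" \
  -H 'Accept: application/json' | jq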

Deploy the model#

Once the job finishes, Customizer uploads the full model weights to the Data Store.

Important

Unlike LoRA adapters, NIM doesn’t deploy all-weights customized models automatically; you must deploy a new NIM with these weights. This requires direct access to the Kubernetes cluster. If necessary, ask your cluster administrator to perform the following steps.

Create a PVC for Model Weights#

  1. Create a file named pvc_definition.yaml that contains the following code.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: dpo-custom-weights
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: <Your ReadWriteMany storage class>
      resources:
        requests:
          storage: 20Gi
    

    Note

    You must select an existing storage class that supports ReadWriteMany.

  2. Apply this definition to create a PVC.

    kubectl apply -f pvc_definition.yaml
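
    You can confirm the claim was created. Depending on your storage class, it may remain Pending until the first pod mounts it.

    kubectl get pvc dpo-custom-weights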
    

Download Weights into PVC#

Let’s create a Kubernetes Job whose pod mounts the PVC and downloads our custom weights.

  1. Create a file named pod_definition.yaml that defines the Job.

    Update default/dpo_llama_3 to match the output_model from the Create and Submit Training Job section, excluding the @ and everything after it.

    Set --revision to everything after the @ in the same output_model value (v1 in this example).

    The Job’s pod exits as soon as the model finishes downloading to the PVC.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: dpo-pvc-job-hf-cli
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      template:
        metadata:
          name: dpo-pvc-job-hf-cli
        spec:
          securityContext:
            runAsUser: 1000
            fsGroup: 1000
          containers:
          - name: dpo-pvc-job-hf-cli
            image: nvcr.io/nvidia/nemo-microservices/customizer-api:25.04
            command: [ "/bin/sh" ]
            args: ["-c", "mkdir -p -m 775 /mount/models/all_weights/ && huggingface-cli download default/dpo_llama_3 --revision v1 --local-dir /mount/models/all_weights"]
            env:
            - name: HF_TOKEN
              value: "token"
            - name: HF_ENDPOINT
              value: "http://nemo-data-store:3000/v1/hf"
            - name: HF_HOME
              value: /home/nvs
            volumeMounts:
            - name: mount-models
              mountPath: /mount/models
          restartPolicy: Never
          volumes:
          - name: mount-models
            persistentVolumeClaim:
              claimName: dpo-custom-weights
          imagePullSecrets:
          - name: nvcrimagepullsecret
      backoffLimit: 1
    

    Note

    The model weights are around 15 GB and can take around 20 minutes to download.

  2. Apply the pod definition.

    kubectl apply -f pod_definition.yaml
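
    You can watch the download and confirm that the Job completes:

    kubectl get job dpo-pvc-job-hf-cli
    kubectl logs -f job/dpo-pvc-job-hf-cli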
    

Start a NIM#

Now let’s deploy a NIM with your custom weights. The NIM will build an optimized TensorRT-LLM engine automatically.

  1. First, create a Helm values file named nim_dpo.yaml.

    image:
      repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
      tag: 1.6.0
    imagePullSecrets:
     - name: nvcrimagepullsecret
    service:
     labels:
       app.nvidia.com/nim-type: inference
    env:
     - name: NIM_FT_MODEL
       value: /model-store/all_weights
     - name: NIM_SERVED_MODEL_NAME
       value: "llama3.2-1b-custom-weights"
     - name: NIM_CUSTOM_MODEL_NAME
       value: custom_1
    persistence:
     enabled: true
     existingClaim: dpo-custom-weights
     accessMode: ReadWriteMany
    
  2. Download the NIM Helm chart from NGC.

    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.7.0.tgz --username='$oauthtoken' --password=<YOUR API KEY>
    
  3. Perform a Helm install.

    helm install nim ./nim-llm-1.7.0.tgz -f nim_dpo.yaml
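
    The NIM pod can take a while to become ready while it builds the TensorRT-LLM engine. Watch its progress with:

    kubectl get pods -w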
    
  4. Add a DNS entry for your NIM (optional).

If you are not using NIM Proxy, you need to add a DNS entry. This will depend on your cluster.

  5. Query your NIM with the custom values. The DNS entry you query will depend on your setup. If you’re using the Beginner Tutorial, this hostname will be http://nim.test.

    curl -X POST "<Your NIM hostname>/v1/completions" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "model": "llama3.2-1b-custom-weights",
          "prompt": "Extract from the following context the minimal span word for word that best answers the question.\n- If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.\n- If you do not know the answer to a question, please do not share false information.\n- If the answer is not in the context, the answer should be \"?\".\n- Your answer should not include any other text than the answer to the question. Do not include any other text like \"Here is the answer to the question:\" or \"The minimal span word for word that best answers the question is:\" or anything like that.\n\nContext: When is the upcoming GTC event? GTC 2018 attracted over 8,400 attendees. Due to the COVID pandemic of 2020, GTC 2020 was converted to a digital event and drew roughly 59,000 registrants. The 2021 GTC keynote, which was streamed on YouTube on April 12, included a portion that was made with CGI using the Nvidia Omniverse real-time rendering platform. This next GTC will take place in the middle of March, 2023. Answer:",
          "max_tokens": 128
       }' | jq
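
    As a quick health check, NIM also exposes an OpenAI-compatible models endpoint; the model name you set in NIM_SERVED_MODEL_NAME should appear in the list.

    curl -X GET "<Your NIM hostname>/v1/models" \
       -H 'accept: application/json' | jq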
    

Conclusion#

You have successfully started a DPO job and deployed a NIM with your custom weights. You can now use the NIM endpoint to interact with your fine-tuned model and evaluate its performance on your specific use case.

If you included a WandB API key, you can view your training results at wandb.ai under the nvidia-nemo-customizer project.

Note

The W&B integration is optional. When enabled, we’ll send training metrics to W&B using your API key. While we encrypt your API key and don’t log it internally, please review W&B’s terms of service before use.

Next Steps#

Learn how to check customization job metrics using the id.