nat.plugins.customizer.dpo.trajectory_builder#
DPO (Direct Preference Optimization) Trajectory Builder.
This module provides a trajectory builder that collects preference data from workflows that produce TTC_END intermediate steps with TTCEventData.
The builder: 1. Runs evaluation to collect intermediate steps 2. Filters for TTC_END steps with the configured name 3. Extracts data from TTCEventData (turn_id, candidate_index, score, input, output) 4. Groups candidates by turn_id 5. Generates preference pairs based on score differences 6. Builds trajectories with DPOItem episodes for DPO training
Attributes#
Classes#
Parsed candidate from a TTC intermediate step. |
|
A preference pair for DPO training. |
|
Trajectory builder for DPO (Direct Preference Optimization) training. |
Module Contents#
- logger#
- PromptType#
- class CandidateStep#
Parsed candidate from a TTC intermediate step.
Represents a single candidate response that was generated and scored for a particular turn in the workflow.
- prompt: PromptType#
Input prompt that produced this response (string or list of OpenAIMessage).
- class PreferencePair#
A preference pair for DPO training.
Represents a single (prompt, chosen, rejected) triple where the chosen response has a higher score than the rejected response.
- prompt: PromptType#
Input prompt (same for both responses).
- class DPOTrajectoryBuilder(
- trajectory_builder_config: nat.plugins.customizer.dpo.config.DPOTrajectoryBuilderConfig,
Bases:
nat.finetuning.interfaces.trajectory_builder.TrajectoryBuilderTrajectory builder for DPO (Direct Preference Optimization) training.
This builder collects preference pairs from workflows that produce TTC_END intermediate steps with TTCEventData. It uses the structured data model to extract turn_id, candidate_index, score, input (prompt), and output.
Key features: - Uses TTCEventData model directly (no brittle dictionary key configuration) - Supports prompts as strings or list of OpenAIMessage - Exhaustive or best-vs-worst pair generation modes - Configurable score difference filtering - Grouping by example for curriculum learning - Builds trajectories with DPOItem episodes
Example workflow integration:
trajectory_builders: dpo_builder: _type: dpo_traj_builder ttc_step_name: dpo_candidate_move exhaustive_pairs: true min_score_diff: 0.05
Initialize the DPO Trajectory Builder.
- Args:
trajectory_builder_config: Configuration for the builder.
- evaluation_runs: dict[str, asyncio.Task[nat.eval.config.EvaluationRunOutput]]#
- async start_run(run_id: str, meta: dict | None = None) None#
Start a single evaluation run to collect intermediate steps.
- Args:
run_id: Unique identifier for this run. meta: Optional metadata for the run.
- Raises:
ValueError: If a run with this ID is already in progress.
- async finalize( ) nat.data_models.finetuning.TrajectoryCollection#
Wait for evaluation, collect TTC steps, and build DPO trajectories.
This method: 1. Waits for the evaluation run to complete 2. Collects and groups candidates by turn_id using TTCEventData 3. Generates preference pairs 4. Builds trajectories with DPOItem episodes 5. Groups trajectories by example for curriculum learning
- Args:
run_id: Unique identifier for the run. meta: Optional metadata for the run.
- Returns:
TrajectoryCollection with DPO preference trajectories.
- Raises:
ValueError: If no run with this ID exists.
- log_progress( ) None#
Log trajectory building progress.
- Args:
run_id: The training run ID. metrics: Dictionary of metrics to log. output_dir: Optional output directory override.
- _collect_candidates(
- eval_result: nat.eval.config.EvaluationRunOutput,
Extract TTC_END intermediate steps and group by turn_id.
This method: 1. Iterates through all evaluation input items 2. Filters for TTC_END steps with the configured name 3. Extracts data from TTCEventData model directly 4. Groups candidates by (example_id, turn_id)
- Args:
eval_result: The evaluation run output.
- Returns:
Dictionary mapping turn keys to lists of candidates.
- _is_target_step( ) bool#
Check if an intermediate step is a target TTC step.
- Args:
step: The intermediate step to check.
- Returns:
True if this is a TTC_END step with the configured name.
- _parse_candidate(
- example_id: str,
- step: nat.data_models.intermediate_step.IntermediateStep,
Parse a CandidateStep from a TTC intermediate step using TTCEventData.
- Args:
example_id: The example ID this step belongs to. step: The intermediate step to parse.
- Returns:
CandidateStep if parsing succeeds, None otherwise.
- _extract_prompt(input_data: Any) PromptType#
Extract prompt from TTCEventData.input.
Handles both string prompts and list of OpenAIMessage.
- Args:
input_data: The input field from TTCEventData.
- Returns:
String prompt or list of OpenAIMessage.
- _generate_preference_pairs(
- candidates_by_turn: dict[str, list[CandidateStep]],
Generate preference pairs from grouped candidates.
- If exhaustive_pairs=True:
For candidates [A, B, C] with scores [0.9, 0.7, 0.5]: Pairs: (A>B), (A>C), (B>C) - all pairwise comparisons
- If exhaustive_pairs=False:
For candidates [A, B, C] with scores [0.9, 0.7, 0.5]: Pairs: (A>C) only - best vs worst
- Args:
candidates_by_turn: Dictionary mapping turn keys to candidate lists.
- Returns:
List of preference pairs.
- _generate_exhaustive_pairs(
- sorted_candidates: list[CandidateStep],
Generate all pairwise comparisons where score(chosen) > score(rejected).
- Args:
sorted_candidates: Candidates sorted by score (descending).
- Returns:
List of preference pairs, sorted by score difference (descending).
- _generate_best_vs_worst_pair(
- sorted_candidates: list[CandidateStep],
Generate a single pair: best candidate vs worst candidate.
- Args:
sorted_candidates: Candidates sorted by score (descending).
- Returns:
List with at most one preference pair.
- _build_trajectories(
- pairs: list[PreferencePair],
Convert preference pairs to Trajectory format with DPOItem episodes.
Each trajectory contains: - episode: [DPOItem] with prompt, chosen_response, rejected_response - reward: score_diff (if reward_from_score_diff) or chosen_score - metadata: Contains pair information for tracking
- Args:
pairs: List of preference pairs.
- Returns:
List of trajectories with DPOItem episodes.
- _group_by_example(
- trajectories: list[nat.data_models.finetuning.Trajectory],
Group trajectories by example ID for curriculum learning.
This grouping enables: - Filtering by average reward per example - Expansion from easy to hard examples
- Args:
trajectories: List of trajectories to group.
- Returns:
List of trajectory lists, where each inner list contains trajectories for one example.