Output Projections
The output_projection block in a config maps raw Data Designer records into the schema expected by downstream training steps. Each projection type extracts specific columns and writes one JSON object per line.
OpenAI Messages
Produces single-turn OpenAI chat-format records. Use for SFT chat data that feeds data_prep/sft_packing or AutoModel SFT.
YAML:
output_projection:
  type: openai_messages
  user_field: user_query  # column containing the user turn
  assistant_field: assistant_response  # column containing the assistant turn
  metadata_fields: [persona, topic]  # additional columns to include at top level
Output (one JSON object per line):
{
  "messages": [
    {"role": "user", "content": "How do I calibrate the sensor threshold?"},
    {"role": "assistant", "content": "Set the threshold in the device settings under Calibration → Sensor Range. A value of 0.85 works well for most environments."}
  ],
  "persona": "engineer",
  "topic": "industrial sensor calibration"
}
Fields:
| Field | Required | Description |
|---|---|---|
| type | yes | Must be openai_messages |
| user_field | yes | Column name for the user message content |
| assistant_field | yes | Column name for the assistant message content |
| metadata_fields | no | List of additional column names to include at the top level |
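The mapping described above can be sketched in a few lines of Python. This is an illustration only, not the library's implementation: `project_openai_messages` is a hypothetical helper, the sample row is made up, and raising `KeyError` for a missing column is an assumption about behavior the text does not specify.

```python
import json


def project_openai_messages(record, user_field, assistant_field, metadata_fields=()):
    """Hypothetical helper: build one chat-format record from a raw row."""
    out = {
        "messages": [
            {"role": "user", "content": record[user_field]},
            {"role": "assistant", "content": record[assistant_field]},
        ]
    }
    # Copy only the requested metadata columns to the top level.
    # Assumption: a missing column raises KeyError rather than being skipped.
    for name in metadata_fields:
        out[name] = record[name]
    return out


row = {
    "user_query": "How do I calibrate the sensor threshold?",
    "assistant_response": "Set the threshold in the device settings.",
    "persona": "engineer",
    "topic": "industrial sensor calibration",
    "seed_id": 17,  # not listed in metadata_fields, so it is not projected
}

projected = project_openai_messages(
    row, "user_query", "assistant_response", ["persona", "topic"]
)
# One JSON object per line, as in a JSONL file.
line = json.dumps(projected, ensure_ascii=False)
print(line)
```

Columns that are not named in user_field, assistant_field, or metadata_fields (here, the made-up `seed_id`) do not appear in the output.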
DPO Preference
Produces preference triples for DPO training. Use with rl_pref.yaml and the llm_judge column pattern. Output feeds data_prep/rl_prep.
YAML:
output_projection:
  type: dpo_preference
  prompt_field: prompt  # column containing the input prompt
  response_a_field: response_a  # column containing the first candidate response
  response_b_field: response_b  # column containing the second candidate response
  judge_field: judge  # column containing the judge's structured output
  winner_field: winner  # key inside the judge output that holds "A" or "B"
Output (one JSON object per line):
{
  "prompt": "Explain why retrieval-augmented generation can reduce hallucinations in enterprise assistants.",
  "chosen": "RAG grounds the model in retrieved passages, so factual claims are tied to source documents rather than purely to learned weights.",
  "rejected": "RAG is better because it uses the internet and knows more things than a regular model."
}
Fields:
| Field | Required | Description |
|---|---|---|
| type | yes | Must be dpo_preference |
| prompt_field | yes | Column name for the input prompt |
| response_a_field | yes | Column name for candidate A |
| response_b_field | yes | Column name for candidate B |
| judge_field | yes | Column name for the judge's structured output |
| winner_field | yes | Key within the judge output JSON that holds "A" or "B" |
The projection raises ValueError if the value stored under winner_field is not "A" or "B". Configure the llm_judge column so that its structured output contains that key with exactly one of those two values.
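A minimal Python sketch of the winner-to-chosen/rejected mapping follows. `project_dpo_preference` is a hypothetical helper written for illustration, not the shipped code; accepting the judge output as either a JSON string or a mapping is an assumption, and the exact error message is invented.

```python
import json


def project_dpo_preference(record, prompt_field, response_a_field,
                           response_b_field, judge_field, winner_field):
    """Hypothetical helper: turn a judged pair into prompt/chosen/rejected."""
    judge = record[judge_field]
    # Assumption: the judge output may arrive as a JSON string or a mapping.
    if isinstance(judge, str):
        judge = json.loads(judge)

    winner = judge[winner_field]
    if winner not in ("A", "B"):
        raise ValueError(f"winner must be 'A' or 'B', got {winner!r}")

    a = record[response_a_field]
    b = record[response_b_field]
    # The winning candidate becomes "chosen"; the other becomes "rejected".
    chosen, rejected = (a, b) if winner == "A" else (b, a)
    return {"prompt": record[prompt_field], "chosen": chosen, "rejected": rejected}


row = {
    "prompt": "Explain why RAG can reduce hallucinations.",
    "response_a": "RAG is better because it knows more things.",
    "response_b": "RAG grounds claims in retrieved passages.",
    "judge": {"winner": "B", "reasoning": "B cites the grounding mechanism."},
}

triple = project_dpo_preference(
    row, "prompt", "response_a", "response_b", "judge", "winner"
)
print(json.dumps(triple))
```

Any other value under the winner key, such as a tie label, raises ValueError in this sketch, which mirrors the behavior described above.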
Structured Messages
Produces multi-turn records with messages and an optional tools array.
The shipped customer_support_tools.yaml config generates this shape with an llm_text column that returns a JSON object; llm_structured columns can also feed this projection when their output is a mapping with the same fields.
Output feeds data_prep/sft_packing or AutoModel SFT.
YAML:
output_projection:
  type: structured_messages
  source_field: conversation  # column containing the structured JSON object
  messages_field: messages  # key inside the structured object for the messages array
  tools_field: tools  # key inside the structured object for the tools array
  metadata_fields: [customer_name, issue, urgency, channel]
Output (one JSON object per line):
{
  "messages": [
    {"role": "system", "content": "You are a helpful ecommerce support agent."},
    {"role": "user", "content": "I haven't received my order yet."},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_001", "type": "function", "function": {"name": "lookup_order", "arguments": "{\"order_id\":\"ORD-10492\"}"}}]},
    {"role": "tool", "tool_call_id": "call_001", "name": "lookup_order", "content": "{\"status\":\"delayed\",\"eta\":\"tomorrow\"}"},
    {"role": "assistant", "content": "Your order is delayed and will arrive tomorrow. I can arrange an expedited replacement if needed."}
  ],
  "tools": [{"type": "function", "function": {"name": "lookup_order", "description": "Look up order status by ID.", "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}, "required": ["order_id"]}}}],
  "customer_name": "Priya",
  "issue": "late delivery",
  "urgency": "frustrated",
  "channel": "web_chat"
}
Fields:
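| Field | Required | Description |
|---|---|---|
| type | yes | Must be structured_messages |
| source_field | yes | Column containing the structured JSON conversation object |
| messages_field | no | Key in the source_field object that holds the messages array |
| tools_field | no | Key in the source_field object that holds the tools array |
| metadata_fields | no | List of additional column names to include at the top level |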
The source_field column value may be a JSON string or a dict; both are handled.
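The string-or-dict handling can be illustrated with a short Python sketch. `project_structured_messages` is a hypothetical helper, not the shipped implementation; the default key names `messages` and `tools`, and the choice to omit `tools` from the output when the source object has none, are assumptions made for the example.

```python
import json


def project_structured_messages(record, source_field,
                                messages_field="messages",  # assumed default
                                tools_field="tools",        # assumed default
                                metadata_fields=()):
    """Hypothetical helper: lift messages/tools out of a structured column."""
    source = record[source_field]
    # Accept either a JSON string or an already-parsed mapping.
    if isinstance(source, str):
        source = json.loads(source)

    out = {"messages": source[messages_field]}
    # Assumption: the optional tools array is emitted only when present.
    tools = source.get(tools_field)
    if tools:
        out["tools"] = tools
    for name in metadata_fields:
        out[name] = record[name]
    return out


conversation = {
    "messages": [
        {"role": "user", "content": "I haven't received my order yet."},
        {"role": "assistant", "content": "Let me check that for you."},
    ],
    "tools": [{"type": "function", "function": {"name": "lookup_order"}}],
}

as_dict = {"conversation": conversation, "channel": "web_chat"}
as_string = {"conversation": json.dumps(conversation), "channel": "web_chat"}

from_dict = project_structured_messages(as_dict, "conversation", metadata_fields=["channel"])
from_string = project_structured_messages(as_string, "conversation", metadata_fields=["channel"])
print(from_dict == from_string)
```

Both input forms yield the same projected record in this sketch, since the string form is parsed before the keys are read.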