Multi-Turn Agent (Tool-Calling) SFT with NeMo AutoModel
Multi-Turn Agent (Tool-Calling) SFT with NeMo AutoModel
This guide fine-tunes Qwen2.5-3B for multi-turn agentic tool use with NeMo AutoModel: given a set of tool definitions and a conversation that interleaves tool calls and tool responses, the model learns to emit correct tool_calls (tool name + arguments) and to use the returned results across several turns.
This is the multi-turn counterpart to the single-turn Function Calling with FunctionGemma guide. That guide maps one user query to one set of tool calls via make_xlam_dataset; this guide handles full multi-turn agent traces (user → tool_call → tool response → assistant → ...) via make_agent_chat_dataset.
What is Multi-Turn Agent SFT?
A multi-turn agent trace contains, in order:
- a user request,
- one or more assistant turns that may emit
tool_calls(with parallel calls in a single turn), - tool responses paired back to those calls, and
- a final assistant answer that uses the tool results.
The training data is a set of such traces plus the tool schema available to the model. The dataset adapter renders each trace through the tokenizer’s chat template with answer_only_loss_mask=True, so only assistant and tool-call tokens contribute to the loss — user and tool tokens are masked out.
The Task
We fine-tune on the llamafactory/glaive_toolcall_en dataset (ShareGPT function-calling traces).
Guide Overview
Hardware Requirements
- Full-parameter SFT: the example below shards across 8 GPUs with FSDP2. Qwen2.5-3B is small enough to also train on a single 80 GB GPU, which runs unsharded (FSDP2 only applies across multiple GPUs).
- LoRA / PEFT: trainable on a single GPU (see the PEFT variant at the end).
glaive_toolcall_enis small (a few thousand traces), so a full pass is fast.
Step 0 — Environment Setup
This guide runs inside the NeMo AutoModel Docker container:
Outside the container, install from source with uv (the project standard): uv sync from a checkout of the repo. Avoid pip install for development setups.
Step 1 — Explore the glaive_toolcall Dataset
Each row carries a tools schema (JSON string) and a ShareGPT conversations list whose from field is one of human, gpt, function_call, or observation.
Illustrative output:
How the adapter renders a trace
make_agent_chat_dataset converts each trace to OpenAI chat-completions format (merging consecutive function_call turns into one assistant message with parallel tool_calls, pairing observation turns back to those calls), then tokenizes with answer_only_loss_mask=True:
This prints (the dataset injects a 4th ___PAD_TOKEN_IDS___ key alongside the standard three):
Only the assistant/tool-call spans are supervised; the user prompt and tool responses are -100 in labels.
Step 2 — Evaluate the Base Model (Before Fine-Tuning)
Render a held-out prompt up to the point the model should call a tool, pass the tools schema through the chat template, and generate:
Greedy decoding (do_sample=False) on base Qwen2.5-3B answers in prose, then hallucinates an HTML page and repeats it until the 256-token budget runs out:
Instead of a clean get_stock_price(symbol="AAPL") tool call, the base model answers in prose and fabricates content — exactly the failure fine-tuning fixes.
Greedy output is deterministic for a given model snapshot + Transformers version, so you may see slight variation from the trace above.
For a rigorous, repeatable score, NeMo AutoModel ships a generation-based tool-call accuracy evaluator (nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator) that builds held-out prompts with make_agent_chat_eval_samples, parses generated tool calls with nemo_automodel.components.eval.tool_call_parser, and compares against the ground-truth calls. Step 4 wires it to run automatically during training.
Step 3 — Training Configuration
Use the ready-made config at examples/llm_finetune/agent/qwen2_5_3b_function_calling.yaml. The key blocks:
Why answer_only_loss_mask? The agent dataset supervises only the tokens the model must produce — assistant text and tool_calls. User turns and tool responses are masked to -100, so the model is never trained to “predict the environment”. This is on by default inside make_agent_chat_dataset.
make_agent_chat_dataset exposes three flags worth knowing for agent traces:
train_on_last_turn_only— supervise only the final assistant turn (mask_history).mask_reasoning_content— render assistantreasoning_content(thinking) into the prompt but exclude it from the loss.drop_history_reasoning_content— strip prior-turn thinking from the prompt entirely (matches inference). Most coherent together withtrain_on_last_turn_only=true.
seq_length only takes effect when truncation: true (to cap length) or padding: max_length (to pad). With the defaults (truncation: false, padding: false), seq_length is ignored and long traces pass through uncapped — risking OOM. Set truncation: true if you want a hard cap.
Step 4 — Launch Fine-Tuning
What to watch
lossshould fall steadily as the model learns the tool-call format.tool_call/accuracy(and relatedtool_call/*keys) report generation-based tool-call correctness at each validation step — this is the metric that distinguishes “format learned” from “tools actually correct”.
Under FSDP2 (the default strategy here), the in-loop generation eval is skipped unless you set tool_call_eval.run_on_fsdp2: true, because in-loop generation with sharded weights is expensive. If you keep it off, evaluate tool-call accuracy from a saved checkpoint instead (Step 5).
Illustrative training log (replace with your run):
Checkpoints are written under the configured checkpoint_dir, each with an HF-compatible model/consolidated/ directory and LATEST / LOWEST_VAL symlinks.
Step 5 — Evaluate the Fine-Tuned Model
Load the consolidated checkpoint (HF-compatible) and re-run the Step 2 prompt; the fine-tuned model should now emit a clean tool call:
Expected (fine-tuned): a structured call such as get_stock_price with {"symbol": "AAPL"}.
Step 6 — Results Comparison
Illustrative tool-call accuracy on a held-out slice (replace with your run):
The numbers above are placeholders to show the expected shape of the result (base low, fine-tuned high). Fill in your own from the tool_call/* metrics in the training log or a checkpoint eval.
PEFT (LoRA) variant
For a single-GPU run, use the LoRA config examples/llm_finetune/agent/qwen2_5_3b_function_calling_lora.yaml, which adds a peft block on top of the same dataset and eval setup:
See the SFT and PEFT guide for tuning LoRA rank, alpha, and target modules.