Multi-Turn Agent (Tool-Calling) SFT with NeMo AutoModel


This guide fine-tunes Qwen2.5-3B for multi-turn agentic tool use with NeMo AutoModel: given a set of tool definitions and a conversation that interleaves tool calls and tool responses, the model learns to emit correct tool_calls (tool name + arguments) and to use the returned results across several turns.

This is the multi-turn counterpart to the single-turn Function Calling with FunctionGemma guide. That guide maps one user query to one set of tool calls via make_xlam_dataset; this guide handles full multi-turn agent traces (user → tool_call → tool response → assistant → ...) via make_agent_chat_dataset.

What is Multi-Turn Agent SFT?

A multi-turn agent trace contains, in order:

  • a user request,
  • one or more assistant turns that may emit tool_calls (possibly several parallel calls in a single turn),
  • tool responses paired back to those calls, and
  • a final assistant answer that uses the tool results.

The training data is a set of such traces plus the tool schema available to the model. The dataset adapter renders each trace through the tokenizer’s chat template with answer_only_loss_mask=True, so only assistant and tool-call tokens contribute to the loss; user and tool tokens are masked out.
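To make the trace structure concrete, here is a minimal sketch of one trace in OpenAI chat-completions style, with a check that every tool call has a matching tool response. The message values, the call_0 id, and the helper unpaired_tool_call_ids are illustrative assumptions, not part of NeMo AutoModel.

```python
# Hypothetical trace; all values are made up for illustration.
trace = [
    {"role": "user", "content": "What's Apple's stock price right now?"},
    {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_stock_price", "arguments": {"symbol": "AAPL"}},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_0", "content": '{"price": 150.75}'},
    {"role": "assistant", "content": "Apple (AAPL) is trading at $150.75."},
]


def unpaired_tool_call_ids(messages):
    """Return ids that appear as a call without a response, or a response without a call."""
    called, answered = set(), set()
    for msg in messages:
        for call in msg.get("tool_calls", []):
            called.add(call["id"])
        if msg["role"] == "tool":
            answered.add(msg["tool_call_id"])
    return called ^ answered  # symmetric difference: whatever is left is unpaired


# Which messages would carry loss under answer-only masking (assistant turns only).
supervised = [msg["role"] == "assistant" for msg in trace]

print(unpaired_tool_call_ids(trace))  # set()
print(supervised)                     # [False, True, False, True]
```

A well-formed trace yields an empty set; a non-empty result points at a call or response that lost its partner during data preparation.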

The Task

We fine-tune on the llamafactory/glaive_toolcall_en dataset (ShareGPT function-calling traces).

| Model | Behavior |
| --- | --- |
| Base model | Often answers in free-form text, invents tool names, or emits malformed argument JSON. |
| After SFT | Emits structured tool_calls whose names and arguments match the provided tool schema, and chains them across turns. |

Guide Overview

| Step | Description |
| --- | --- |
| Step 0 | Environment setup |
| Step 1 | Explore the glaive_toolcall dataset |
| Step 2 | Evaluate the base model (before fine-tuning) |
| Step 3 | Training configuration |
| Step 4 | Launch fine-tuning |
| Step 5 | Evaluate the fine-tuned model |
| Step 6 | Compare results |

Hardware Requirements

  • Full-parameter SFT: the example below shards across 8 GPUs with FSDP2. Qwen2.5-3B is small enough to also train on a single 80 GB GPU, which runs unsharded (FSDP2 only applies across multiple GPUs).
  • LoRA / PEFT: trainable on a single GPU (see the PEFT variant at the end).
  • glaive_toolcall_en is small (a few thousand traces), so a full pass is fast.

Step 0 — Environment Setup

This guide runs inside the NeMo AutoModel Docker container:

$ docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo-automodel:26.06.00
$ huggingface-cli login # for gated model/dataset access if needed
$ cd /opt/Automodel

Outside the container, install from source with uv (the project standard) by running uv sync from a checkout of the repo. Avoid pip install for development setups.


Step 1 — Explore the glaive_toolcall Dataset

Each row carries a tools schema (JSON string) and a ShareGPT conversations list whose from field is one of human, gpt, function_call, or observation.

from datasets import load_dataset

ds = load_dataset("llamafactory/glaive_toolcall_en", split="train")
print(f"train: {len(ds)} traces")

ex = ds[0]
print("fields:", list(ex.keys()))  # e.g. ['conversations', 'tools', 'system']
print("\ntools:", ex["tools"][:200], "...")

for turn in ex["conversations"]:
    print(f" {turn['from']:>14} | {turn['value'][:70]}")

Illustrative output:

train: 5000 traces
fields: ['conversations', 'tools', 'system']
tools: [{"name": "get_stock_price", "description": "Get the current stock price", "parameters": {...}}] ...
human | Hi, can you tell me the current stock price of Apple?
function_call | {"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}
observation | {"price": 150.75}
gpt | The current stock price of Apple (AAPL) is $150.75.
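The tools field is a JSON string of flat entries (name, description, parameters), while the chat-template call in Step 2 passes tools wrapped as {"type": "function", "function": ...}. The following is a minimal sketch of decoding and wrapping the schema by hand. The row is a hand-built stand-in for a dataset row, the elided parameters object is borrowed from the Step 2 schema, and wrap_tools is a hypothetical helper; how the adapter handles this internally is not shown here.

```python
import json
from collections import Counter

# Stand-in for one dataset row, so the sketch runs without downloading anything.
schema = [
    {
        "name": "get_stock_price",
        "description": "Get the current stock price",
        "parameters": {
            "type": "object",
            "properties": {"symbol": {"type": "string"}},
            "required": ["symbol"],
        },
    }
]
row = {
    "tools": json.dumps(schema),  # the dataset stores the schema as a JSON string
    "conversations": [
        {"from": "human", "value": "Hi, can you tell me the current stock price of Apple?"},
        {"from": "function_call", "value": '{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}'},
        {"from": "observation", "value": '{"price": 150.75}'},
        {"from": "gpt", "value": "The current stock price of Apple (AAPL) is $150.75."},
    ],
}


def wrap_tools(tools_json):
    """Decode the JSON string and wrap each flat entry in the {"type": "function"} envelope."""
    return [{"type": "function", "function": spec} for spec in json.loads(tools_json)]


tools = wrap_tools(row["tools"])
turn_counts = Counter(turn["from"] for turn in row["conversations"])

print(tools[0]["function"]["name"])  # get_stock_price
print(dict(turn_counts))
```

Counting the from values across the full dataset in the same way gives a quick picture of how many traces actually contain function_call turns.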

How the adapter renders a trace

make_agent_chat_dataset converts each trace to OpenAI chat-completions format (merging consecutive function_call turns into one assistant message with parallel tool_calls, pairing observation turns back to those calls), then tokenizes with answer_only_loss_mask=True:

from transformers import AutoTokenizer
from nemo_automodel.components.datasets.llm.agent_chat import make_agent_chat_dataset

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
dataset = make_agent_chat_dataset(
    tokenizer=tok,
    dataset_name="llamafactory/glaive_toolcall_en",
    split="train[:100]",
    seq_length=4096,
    truncation=True,  # see the seq_length note in Step 3
)

sample = dataset[0]
print("keys:", list(sample.keys()))  # input_ids, labels, attention_mask, ___PAD_TOKEN_IDS___
supervised = sum(1 for x in sample["labels"] if x != -100)
print(f"supervised tokens: {supervised} / {len(sample['labels'])}")

This prints (the dataset injects a 4th ___PAD_TOKEN_IDS___ key alongside the standard three):

keys: ['input_ids', 'labels', 'attention_mask', '___PAD_TOKEN_IDS___']
supervised tokens: 127 / 502

Only the assistant/tool-call spans are supervised; the user prompt and tool responses are -100 in labels.
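The conversion step described under "How the adapter renders a trace" can be sketched in plain Python. This is an illustration of the idea only, not the make_agent_chat_dataset implementation: sharegpt_to_openai, the call_N ids, the in-order pairing rule, and the two-ticker example values are all assumptions.

```python
import json

ROLE_MAP = {"human": "user", "gpt": "assistant"}


def sharegpt_to_openai(conversations):
    """Convert ShareGPT turns to chat messages, merging consecutive function_call turns."""
    messages = []
    pending_ids = []  # tool-call ids still waiting for an observation
    next_id = 0
    prev_from = None
    for turn in conversations:
        src, value = turn["from"], turn["value"]
        if src == "function_call":
            call = json.loads(value)
            entry = {
                "id": f"call_{next_id}",
                "type": "function",
                "function": {"name": call["name"], "arguments": call["arguments"]},
            }
            next_id += 1
            if prev_from == "function_call":
                # Consecutive calls become parallel tool_calls on one assistant message.
                messages[-1]["tool_calls"].append(entry)
            else:
                messages.append({"role": "assistant", "content": "", "tool_calls": [entry]})
            pending_ids.append(entry["id"])
        elif src == "observation":
            # Assumption: observations arrive in the same order as the calls they answer.
            call_id = pending_ids.pop(0) if pending_ids else None
            messages.append({"role": "tool", "tool_call_id": call_id, "content": value})
        else:
            messages.append({"role": ROLE_MAP[src], "content": value})
        prev_from = src
    return messages


conversations = [
    {"from": "human", "value": "Compare the stock prices of Apple and Microsoft."},
    {"from": "function_call", "value": '{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}'},
    {"from": "function_call", "value": '{"name": "get_stock_price", "arguments": {"symbol": "MSFT"}}'},
    {"from": "observation", "value": '{"price": 150.75}'},
    {"from": "observation", "value": '{"price": 310.10}'},
    {"from": "gpt", "value": "Apple is at $150.75 and Microsoft is at $310.10."},
]
messages = sharegpt_to_openai(conversations)
print([m["role"] for m in messages])  # ['user', 'assistant', 'tool', 'tool', 'assistant']
```

The two function_call turns collapse into a single assistant message with two tool_calls, and each observation becomes a tool message tied to one of them by id.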


Step 2 — Evaluate the Base Model (Before Fine-Tuning)

Render a held-out prompt up to the point the model should call a tool, pass the tools schema through the chat template, and generate:

import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "Qwen/Qwen2.5-3B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval().to("cuda")

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price",
        "parameters": {"type": "object", "properties": {"symbol": {"type": "string"}}, "required": ["symbol"]},
    },
}]
messages = [{"role": "user", "content": "What's Apple's stock price right now?"}]

inputs = tok.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Greedy decoding (do_sample=False) on base Qwen2.5-3B answers in prose, then hallucinates an HTML page and repeats it until the 256-token budget runs out:

I'm looking up Apple's stock price. 🚀
<!DOCTYPE html>
<html>
<head>
<title>Stock Price Lookup</title>
</head>
<body>
<h1>Stock Price Lookup</h1>
<p>Apple's stock price is $135.00.</p>
</body>
</html>
... (the HTML block repeats until the token budget is exhausted)

Instead of a clean get_stock_price(symbol="AAPL") tool call, the base model answers in prose and fabricates content — exactly the failure fine-tuning fixes.

Greedy output is deterministic only for a given model snapshot and Transformers version, so your output may differ slightly from the trace above if either one differs.

For a rigorous, repeatable score, NeMo AutoModel ships a generation-based tool-call accuracy evaluator (nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator). It builds held-out prompts with make_agent_chat_eval_samples, parses generated tool calls with nemo_automodel.components.eval.tool_call_parser, and compares them against the ground-truth calls. The tool_call_eval block in the Step 3 config wires it into training, so it runs at validation steps once you launch in Step 4 (subject to the FSDP2 caveat noted there).
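To show what "parse the generated calls and compare them with the ground truth" means in practice, here is a minimal standalone sketch. It is not the ToolCallAccuracyEvaluator or tool_call_parser code. It assumes the model wraps each call in Qwen-style <tool_call>...</tool_call> tags around a JSON object, and parse_tool_calls and score are hypothetical helpers whose metrics may differ from the library's.

```python
import json
import re

# Assumption: each call is a JSON object between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract (name, arguments) pairs; malformed JSON is treated as no call."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            calls.append((obj["name"], obj.get("arguments", {})))
    return calls


def score(generated_text, expected_calls):
    """Compare predicted calls with the expected ones, by name only and exactly."""
    predicted = parse_tool_calls(generated_text)
    return {
        "name_match": [n for n, _ in predicted] == [n for n, _ in expected_calls],
        "exact_match": predicted == expected_calls,
    }


expected = [("get_stock_price", {"symbol": "AAPL"})]
good = '<tool_call>\n{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}\n</tool_call>'
wrong_args = '<tool_call>\n{"name": "get_stock_price", "arguments": {"symbol": "APPL"}}\n</tool_call>'
prose = "I'm looking up Apple's stock price."

print(score(good, expected))        # {'name_match': True, 'exact_match': True}
print(score(wrong_args, expected))  # {'name_match': True, 'exact_match': False}
print(score(prose, expected))       # {'name_match': False, 'exact_match': False}
```

Separating the name match from the exact match keeps the distinction between a model that has learned the call format and one that also fills in the right arguments.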


Step 3 — Training Configuration

Use the ready-made config at examples/llm_finetune/agent/qwen2_5_3b_function_calling.yaml. The key blocks:

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-3B
  attn_implementation: sdpa
  output_hidden_states: true # required by FusedLinearCrossEntropy below

loss_fn:
  _target_: nemo_automodel.components.loss.linear_ce.FusedLinearCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.llm.agent_chat.make_agent_chat_dataset
  dataset_name: llamafactory/glaive_toolcall_en
  split: train
  seq_length: 4096
  tokenizer:
    pretrained_model_name_or_path: Qwen/Qwen2.5-3B

# Generation-based tool-call accuracy eval, runs at every val step alongside val_loss.
tool_call_eval:
  _target_: nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator
  dataset_name: llamafactory/glaive_toolcall_en
  split: train[:128]
  max_eval_samples: 128
  max_new_tokens: 256
  max_prompt_tokens: 3584

Why answer_only_loss_mask? The agent dataset supervises only the tokens the model must produce — assistant text and tool_calls. User turns and tool responses are masked to -100, so the model is never trained to “predict the environment”. This is on by default inside make_agent_chat_dataset.
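The masking rule can be illustrated on toy token ids. This is a sketch of the idea, not the adapter's code: build_labels and the (role, token_ids) segment layout are assumptions, and real spans come from the chat template rather than hand-written lists.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_labels(segments, supervised_roles=("assistant",)):
    """segments: list of (role, token_ids). Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        # Keep the token id as the label only where the model is expected to produce it.
        labels.extend(ids if role in supervised_roles else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


segments = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),      # tool call emitted by the model
    ("tool", [31, 32, 33, 34]),   # environment response, never supervised
    ("assistant", [41, 42, 43]),  # final answer
]
input_ids, labels = build_labels(segments)
supervised = sum(1 for x in labels if x != IGNORE_INDEX)

print(labels)  # [-100, -100, -100, 21, 22, -100, -100, -100, -100, 41, 42, 43]
print(f"supervised tokens: {supervised} / {len(labels)}")  # supervised tokens: 5 / 12
```

The ratio printed here is the same quantity as the "supervised tokens" count from Step 1, computed on toy data.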

make_agent_chat_dataset exposes three flags worth knowing for agent traces:

  • train_on_last_turn_only — supervise only the final assistant turn (mask_history).
  • mask_reasoning_content — render assistant reasoning_content (thinking) into the prompt but exclude it from the loss.
  • drop_history_reasoning_content — strip prior-turn thinking from the prompt entirely (matches inference). Most coherent together with train_on_last_turn_only=true.
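As a rough illustration of two of these flags at the message level, the sketch below is one reading of the descriptions above, not the library's implementation. drop_history_reasoning, supervised_turns, and the sample messages are hypothetical; mask_reasoning_content acts on tokens and is not shown.

```python
def drop_history_reasoning(messages):
    """Return a copy with reasoning_content removed from all but the last assistant message."""
    last = max((i for i, m in enumerate(messages) if m["role"] == "assistant"), default=-1)
    cleaned = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant" and i != last:
            msg = {k: v for k, v in msg.items() if k != "reasoning_content"}
        cleaned.append(msg)
    return cleaned


def supervised_turns(messages, last_turn_only=False):
    """Indices of the assistant messages that would contribute to the loss."""
    idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    return idx[-1:] if last_turn_only else idx


messages = [
    {"role": "user", "content": "What's Apple's stock price?"},
    {"role": "assistant", "content": "", "reasoning_content": "I need the price tool."},
    {"role": "tool", "content": '{"price": 150.75}'},
    {"role": "assistant", "content": "It is $150.75.", "reasoning_content": "The tool returned 150.75."},
]

cleaned = drop_history_reasoning(messages)
print("reasoning_content" in cleaned[1])                # False
print("reasoning_content" in cleaned[3])                # True
print(supervised_turns(messages))                       # [1, 3]
print(supervised_turns(messages, last_turn_only=True))  # [3]
```

Used together, the two transformations leave a prompt without stale thinking and a loss that covers only the final turn, which is the combination the last bullet recommends.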

seq_length only takes effect when truncation: true (to cap length) or padding: max_length (to pad). With the defaults (truncation: false, padding: false), seq_length is ignored and long traces pass through uncapped — risking OOM. Set truncation: true if you want a hard cap.
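The interaction between seq_length, truncation, and padding can be summarized with a small sketch. apply_length_policy is a hypothetical function that mirrors the behavior described in the note above; it is not the dataset's code, and the pad id of 0 is an arbitrary placeholder.

```python
def apply_length_policy(ids, seq_length, truncation=False, padding=False, pad_id=0):
    """Cap or pad a token list; with both flags off, seq_length has no effect."""
    out = list(ids)
    if truncation and len(out) > seq_length:
        out = out[:seq_length]  # hard cap
    if padding == "max_length" and len(out) < seq_length:
        out = out + [pad_id] * (seq_length - len(out))  # pad up to seq_length
    return out


long_trace = list(range(1, 11))  # 10 tokens
short_trace = [1, 2]

print(len(apply_length_policy(long_trace, seq_length=4)))                         # 10
print(len(apply_length_policy(long_trace, seq_length=4, truncation=True)))        # 4
print(len(apply_length_policy(short_trace, seq_length=4, padding="max_length")))  # 4
print(len(apply_length_policy(long_trace, seq_length=4, padding="max_length")))   # 10
```

The first and last calls show the risky case: without truncation, a trace longer than seq_length passes through at full length.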


Step 4 — Launch Fine-Tuning

$ automodel --nproc-per-node=8 examples/llm_finetune/agent/qwen2_5_3b_function_calling.yaml 2>&1 | tee train_agent_sft.log

What to watch

  • loss should fall steadily as the model learns the tool-call format.
  • tool_call/accuracy (and related tool_call/* keys) report generation-based tool-call correctness at each validation step — this is the metric that distinguishes “format learned” from “tools actually correct”.

Under FSDP2 (the default strategy here), the in-loop generation eval is skipped unless you set tool_call_eval.run_on_fsdp2: true, because in-loop generation with sharded weights is expensive. If you keep it off, evaluate tool-call accuracy from a saved checkpoint instead (Step 5).

Illustrative training log (replace with your run):

step 0 | loss 1.84 | grad_norm 12.1 | lr 1.0e-6 | mem 41 GiB | tps/gpu 380
step 50 | loss 0.42 | grad_norm 4.3 | lr 9.8e-6 | mem 41 GiB | tps/gpu 410
step 200 | loss 0.21 | grad_norm 2.1 | lr 7.4e-6 | mem 41 GiB | tps/gpu 412
step 500 | loss 0.14 | grad_norm 1.6 | lr 3.1e-6 | mem 41 GiB | tps/gpu 415
Validation:
step 100 | val_loss 0.23 | tool_call/accuracy 0.61
step 300 | val_loss 0.18 | tool_call/accuracy 0.78
step 500 | val_loss 0.16 | tool_call/accuracy 0.83

Checkpoints are written under the configured checkpoint_dir, each with an HF-compatible model/consolidated/ directory and LATEST / LOWEST_VAL symlinks.


Step 5 — Evaluate the Fine-Tuned Model

Load the consolidated checkpoint (HF-compatible) and re-run the Step 2 prompt; the fine-tuned model should now emit a clean tool call:

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

CKPT = os.path.realpath("agent_checkpoints/qwen2_5_3b_glaive/LOWEST_VAL")
consolidated = os.path.join(CKPT, "model", "consolidated")

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
model = AutoModelForCausalLM.from_pretrained(consolidated, torch_dtype=torch.bfloat16, device_map="auto").eval()

# same `messages` + `tools` as Step 2
inputs = tok.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Expected (fine-tuned): a structured call such as get_stock_price with {"symbol": "AAPL"}.
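Once the output has been parsed into a name and an arguments dict, a quick sanity check against the tool schema catches invented tool names and missing arguments. The sketch below assumes the call is already parsed; validate_call is a hypothetical helper, get_weather is a made-up tool name used only as a negative example, and the check covers names and required keys rather than full JSON Schema validation.

```python
# Same schema shape as the Step 2 `tools` list, repeated so the sketch is self-contained.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price",
        "parameters": {"type": "object", "properties": {"symbol": {"type": "string"}}, "required": ["symbol"]},
    },
}]


def validate_call(call, tools):
    """Return a list of problems; an empty list means the call fits the schema."""
    specs = {t["function"]["name"]: t["function"] for t in tools}
    spec = specs.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]
    params = spec.get("parameters", {})
    args = call.get("arguments", {})
    problems = [f"missing required argument: {k}" for k in params.get("required", []) if k not in args]
    problems += [f"unexpected argument: {k}" for k in args if k not in params.get("properties", {})]
    return problems


print(validate_call({"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}, tools))  # []
print(validate_call({"name": "get_stock_price", "arguments": {}}, tools))
print(validate_call({"name": "get_weather", "arguments": {"city": "Paris"}}, tools))
```

An empty list for the fine-tuned model's call, and non-empty lists for the base model's output, is the qualitative change this step is looking for.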


Step 6 — Results Comparison

Illustrative tool-call accuracy on a held-out slice (replace with your run):

| Metric | Base | Fine-Tuned |
| --- | --- | --- |
| Tool-name accuracy | ~0.30 | ~0.95 |
| Argument-match accuracy | ~0.15 | ~0.85 |
| Overall tool-call accuracy | ~0.20 | ~0.83 |

The numbers above are placeholders to show the expected shape of the result (base low, fine-tuned high). Fill in your own from the tool_call/* metrics in the training log or a checkpoint eval.


PEFT (LoRA) variant

For a single-GPU run, use the LoRA config examples/llm_finetune/agent/qwen2_5_3b_function_calling_lora.yaml, which adds a peft block on top of the same dataset and eval setup:

$ automodel examples/llm_finetune/agent/qwen2_5_3b_function_calling_lora.yaml

See the SFT and PEFT guide for tuning LoRA rank, alpha, and target modules.