Multi-Turn Agent (Tool-Calling) SFT with NeMo AutoModel

This guide fine-tunes Qwen2.5-3B for multi-turn agentic tool use with NeMo AutoModel: given a set of tool definitions and a conversation that interleaves tool calls and tool responses, the model learns to emit correct tool_calls (tool name + arguments) and to use the returned results across several turns.

This is the multi-turn counterpart to the single-turn Function Calling with FunctionGemma guide. That guide maps one user query to one set of tool calls via make_xlam_dataset; this guide handles full multi-turn agent traces (user → tool_call → tool response → assistant → ...) via make_agent_chat_dataset.

What is Multi-Turn Agent SFT?

A multi-turn agent trace contains, in order:

a user request,
one or more assistant turns that may emit tool_calls (possibly several parallel calls in a single turn),
tool responses paired back to those calls, and
a final assistant answer that uses the tool results.

The training data is a set of such traces plus the tool schema available to the model. The dataset adapter renders each trace through the tokenizer’s chat template with answer_only_loss_mask=True, so only assistant and tool-call tokens contribute to the loss — user and tool tokens are masked out.
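The masking rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the adapter's implementation: build_labels and the toy token ids are hypothetical, and the one-position label shift used for next-token prediction is left out for clarity.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_labels(segments):
    """Build (input_ids, labels) from (role, token_ids) pairs in conversation order.

    Labels copy the token ids for assistant segments and are IGNORE_INDEX
    everywhere else, so only assistant tokens contribute to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token ids (hypothetical); a real tokenizer would produce these.
segments = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),      # tool call
    ("tool", [31, 32, 33, 34]),   # tool response
    ("assistant", [41, 42, 43]),  # final answer
]
input_ids, labels = build_labels(segments)
supervised = sum(1 for x in labels if x != IGNORE_INDEX)
print(f"supervised tokens: {supervised} / {len(labels)}")
```

Both assistant segments (the tool call and the final answer) stay supervised, while the user request and the tool response are masked.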

The Task

We fine-tune on the llamafactory/glaive_toolcall_en dataset (ShareGPT function-calling traces).

Model	Behavior
Base model	Often answers in free-form text, invents tool names, or emits malformed argument JSON.
After SFT	Emits structured `tool_calls` whose names and arguments match the provided tool schema, and chains them across turns.

Guide Overview

Step	Description
Step 0	Environment setup
Step 1	Explore the `glaive_toolcall` dataset
Step 2	Evaluate the base model (before fine-tuning)
Step 3	Training configuration
Step 4	Launch fine-tuning
Step 5	Evaluate the fine-tuned model
Step 6	Compare results

Hardware Requirements

Full-parameter SFT: the example below shards across 8 GPUs with FSDP2. Qwen2.5-3B is small enough to also train on a single 80 GB GPU, which runs unsharded (FSDP2 only applies across multiple GPUs).
LoRA / PEFT: trainable on a single GPU (see the PEFT variant at the end).
glaive_toolcall_en is small (a few thousand traces), so a full pass is fast.
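As a rough sanity check on the single-GPU claim, the back-of-the-envelope arithmetic below estimates memory for weights, gradients, and optimizer state only. It is a sketch under stated assumptions: roughly 3.1 billion parameters, and 16 bytes per parameter (bf16 weights and gradients plus fp32 master weights and two fp32 Adam moments). The precision settings of the actual recipe may differ, and activations are not counted.

```python
GIB = 1024 ** 3


def training_state_gib(n_params, bytes_per_param=16, n_gpus=1):
    """Rough per-GPU memory (GiB) for weights, gradients, and optimizer state.

    Activations, temporary buffers, and allocator overhead are NOT included.
    """
    return n_params * bytes_per_param / n_gpus / GIB


N_PARAMS = 3.1e9  # assumption: approximate parameter count of Qwen2.5-3B

single = training_state_gib(N_PARAMS)
sharded = training_state_gib(N_PARAMS, n_gpus=8)
print(f"1 GPU : ~{single:.0f} GiB of training state")
print(f"8 GPUs: ~{sharded:.1f} GiB of training state per GPU")
```

Under these assumptions the unsharded training state stays well below 80 GB, leaving the remainder for activations; sharding across 8 GPUs divides that state by eight.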

Step 0 — Environment Setup

This guide runs inside the NeMo AutoModel Docker container:

$ docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo-automodel:26.06.00

$ huggingface-cli login   # for gated model/dataset access if needed
$ cd /opt/Automodel

Outside the container, install from source with uv (the project standard) by running uv sync from a checkout of the repo. Avoid pip install for development setups.

Step 1 — Explore the `glaive_toolcall` Dataset

Each row carries a tools schema (JSON string) and a ShareGPT conversations list whose from field is one of human, gpt, function_call, or observation.

from datasets import load_dataset

ds = load_dataset("llamafactory/glaive_toolcall_en", split="train")
print(f"train: {len(ds)} traces")

ex = ds[0]
print("fields:", list(ex.keys()))           # e.g. ['conversations', 'tools', 'system']
print("\ntools:", ex["tools"][:200], "...")

for turn in ex["conversations"]:
    print(f"  {turn['from']:>14} | {turn['value'][:70]}")

Illustrative output:

train: 5000 traces
fields: ['conversations', 'tools', 'system']
tools: [{"name": "get_stock_price", "description": "Get the current stock price", "parameters": {...}}] ...
            human | Hi, can you tell me the current stock price of Apple?
    function_call | {"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}
      observation | {"price": 150.75}
              gpt | The current stock price of Apple (AAPL) is $150.75.
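Because the tools field is a JSON string of flat entries, it has to be parsed and wrapped before it can be passed to a chat template in OpenAI function-tool form. The sketch below shows one way to do that; to_openai_tools is a hypothetical helper written for this guide, not part of NeMo AutoModel, and the sample schema mirrors the illustrative output above.

```python
import json


def to_openai_tools(tools_json):
    """Parse a JSON string of tool specs and wrap flat entries as function tools."""
    wrapped = []
    for spec in json.loads(tools_json):
        if spec.get("type") == "function" and "function" in spec:
            wrapped.append(spec)  # already in the wrapped form
        else:
            wrapped.append({"type": "function", "function": spec})
    return wrapped


# Illustrative schema string, shaped like the `tools` field shown above.
tools_json = json.dumps([{
    "name": "get_stock_price",
    "description": "Get the current stock price",
    "parameters": {
        "type": "object",
        "properties": {"symbol": {"type": "string"}},
        "required": ["symbol"],
    },
}])

tools = to_openai_tools(tools_json)
print(tools[0]["type"], tools[0]["function"]["name"])
```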

How the adapter renders a trace

make_agent_chat_dataset converts each trace to OpenAI chat-completions format (merging consecutive function_call turns into one assistant message with parallel tool_calls, pairing observation turns back to those calls), then tokenizes with answer_only_loss_mask=True:

from transformers import AutoTokenizer
from nemo_automodel.components.datasets.llm.agent_chat import make_agent_chat_dataset

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
dataset = make_agent_chat_dataset(
    tokenizer=tok,
    dataset_name="llamafactory/glaive_toolcall_en",
    split="train[:100]",
    seq_length=4096,
    truncation=True,   # see the seq_length note in Step 3
)

sample = dataset[0]
print("keys:", list(sample.keys()))                       # input_ids, labels, attention_mask, ___PAD_TOKEN_IDS___
supervised = sum(1 for x in sample["labels"] if x != -100)
print(f"supervised tokens: {supervised} / {len(sample['labels'])}")

This prints (the dataset injects a fourth key, ___PAD_TOKEN_IDS___, alongside the standard three):

keys: ['input_ids', 'labels', 'attention_mask', '___PAD_TOKEN_IDS___']
supervised tokens: 127 / 502

Only the assistant/tool-call spans are supervised; the user prompt and tool responses are -100 in labels.
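The ShareGPT-to-chat-completions conversion described in this section can be sketched as follows. This is an approximation of the described behavior, not the adapter's code: sharegpt_to_openai is a hypothetical name, observations are paired with calls purely by order, and the sample trace is a toy example.

```python
import json


def sharegpt_to_openai(conversations):
    """Convert ShareGPT turns into OpenAI-style chat messages."""
    messages = []
    pending = []      # tool names still waiting for an observation
    prev_from = None
    for turn in conversations:
        src, value = turn["from"], turn["value"]
        if src == "human":
            messages.append({"role": "user", "content": value})
        elif src == "gpt":
            messages.append({"role": "assistant", "content": value})
        elif src == "function_call":
            call = json.loads(value)
            args = call.get("arguments", {})
            if isinstance(args, str):  # some traces store arguments as a JSON string
                args = json.loads(args)
            tool_call = {"type": "function",
                         "function": {"name": call["name"], "arguments": args}}
            if prev_from == "function_call":
                # Consecutive calls merge into one assistant message (parallel calls).
                messages[-1]["tool_calls"].append(tool_call)
            else:
                messages.append({"role": "assistant", "content": "",
                                 "tool_calls": [tool_call]})
                pending = []
            pending.append(call["name"])
        elif src == "observation":
            # Pair each observation with the oldest unanswered call.
            name = pending.pop(0) if pending else None
            messages.append({"role": "tool", "name": name, "content": value})
        else:
            raise ValueError(f"unknown turn type: {src!r}")
        prev_from = src
    return messages


# Toy trace with two parallel calls.
conversations = [
    {"from": "human", "value": "Compare Apple and Microsoft stock prices."},
    {"from": "function_call", "value": '{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}'},
    {"from": "function_call", "value": '{"name": "get_stock_price", "arguments": {"symbol": "MSFT"}}'},
    {"from": "observation", "value": '{"price": 150.75}'},
    {"from": "observation", "value": '{"price": 310.20}'},
    {"from": "gpt", "value": "Apple is at $150.75 and Microsoft is at $310.20."},
]
messages = sharegpt_to_openai(conversations)
print([m["role"] for m in messages])
```

The two function_call turns collapse into a single assistant message carrying two tool_calls, followed by one tool message per observation.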

Step 2 — Evaluate the Base Model (Before Fine-Tuning)

Render a held-out prompt up to the point the model should call a tool, pass the tools schema through the chat template, and generate:

import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "Qwen/Qwen2.5-3B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval().to("cuda")

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Get the current stock price",
        "parameters": {"type": "object", "properties": {"symbol": {"type": "string"}}, "required": ["symbol"]},
    },
}]
messages = [{"role": "user", "content": "What's Apple's stock price right now?"}]

inputs = tok.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Greedy decoding (do_sample=False) on base Qwen2.5-3B answers in prose, then hallucinates an HTML page and repeats it until the 256-token budget runs out:

I'm looking up Apple's stock price. 🚀
<!DOCTYPE html>
<html>
<head>
  <title>Stock Price Lookup</title>
</head>
<body>
  <h1>Stock Price Lookup</h1>
  <p>Apple's stock price is $135.00.</p>
</body>
</html>
... (the HTML block repeats until the token budget is exhausted)

Instead of a clean get_stock_price(symbol="AAPL") tool call, the base model answers in prose and fabricates content — exactly the failure fine-tuning fixes.

Greedy output is deterministic only for a fixed model snapshot and Transformers version, so your output may differ slightly from the trace above if either one differs.

For a rigorous, repeatable score, NeMo AutoModel ships a generation-based tool-call accuracy evaluator (nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator) that builds held-out prompts with make_agent_chat_eval_samples, parses generated tool calls with nemo_automodel.components.eval.tool_call_parser, and compares against the ground-truth calls. The Step 3 config wires it to run automatically during the Step 4 training run.
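To make the parse-and-compare idea concrete, here is a minimal stand-alone sketch. It is not the library's parser: parse_tool_calls and calls_match are hypothetical helpers, and the code assumes the Qwen2.5 chat-template convention of wrapping each call in <tool_call>...</tool_call> tags around a JSON object with name and arguments keys.

```python
import json
import re

# Assumption: each generated call is wrapped in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract well-formed tool calls from generated text; skip malformed JSON."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed argument JSON counts as no call
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


def calls_match(predicted, expected):
    """Order-insensitive exact match on tool names and arguments."""
    def canon(calls):
        return sorted(json.dumps(c, sort_keys=True) for c in calls)
    return canon(predicted) == canon(expected)


expected = [{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}]
good = '<tool_call>\n{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}\n</tool_call>'
prose = "I'm looking up Apple's stock price."

print(calls_match(parse_tool_calls(good), expected))   # True
print(calls_match(parse_tool_calls(prose), expected))  # False
```

A free-form answer like the base-model output above parses to an empty call list and therefore scores as a miss.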

Step 3 — Training Configuration

Use the ready-made config at examples/llm_finetune/agent/qwen2_5_3b_function_calling.yaml. The key blocks:

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-3B
  attn_implementation: sdpa
  output_hidden_states: true   # required by FusedLinearCrossEntropy below

loss_fn:
  _target_: nemo_automodel.components.loss.linear_ce.FusedLinearCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.llm.agent_chat.make_agent_chat_dataset
  dataset_name: llamafactory/glaive_toolcall_en
  split: train
  seq_length: 4096
  tokenizer:
    pretrained_model_name_or_path: Qwen/Qwen2.5-3B

# Generation-based tool-call accuracy eval, runs at every val step alongside val_loss.
tool_call_eval:
  _target_: nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator
  dataset_name: llamafactory/glaive_toolcall_en
  split: train[:128]
  max_eval_samples: 128
  max_new_tokens: 256
  max_prompt_tokens: 3584

Why answer_only_loss_mask? The agent dataset supervises only the tokens the model must produce — assistant text and tool_calls. User turns and tool responses are masked to -100, so the model is never trained to “predict the environment”. This is on by default inside make_agent_chat_dataset.

make_agent_chat_dataset exposes three flags worth knowing for agent traces:

train_on_last_turn_only — supervise only the final assistant turn (mask_history).
mask_reasoning_content — render assistant reasoning_content (thinking) into the prompt but exclude it from the loss.
drop_history_reasoning_content — strip prior-turn thinking from the prompt entirely (matches inference). Most coherent together with train_on_last_turn_only=true.
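The message-level effect of two of these flags can be sketched as below. This reflects one reading of the flag descriptions, not the library's implementation: prepare_messages is a hypothetical helper, and mask_reasoning_content is not modeled because it operates on tokens rather than on messages.

```python
import copy


def prepare_messages(messages, train_on_last_turn_only=False,
                     drop_history_reasoning_content=False):
    """Return (messages, supervised_indices) for one trace.

    supervised_indices lists the assistant messages whose tokens would be trained on.
    """
    messages = copy.deepcopy(messages)  # leave the caller's trace untouched
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    last = assistant_idx[-1] if assistant_idx else None

    if drop_history_reasoning_content:
        # Strip thinking from every assistant turn except the final one.
        for i in assistant_idx:
            if i != last:
                messages[i].pop("reasoning_content", None)

    if train_on_last_turn_only and last is not None:
        supervised = [last]
    else:
        supervised = assistant_idx
    return messages, supervised


# Toy trace (hypothetical content).
trace = [
    {"role": "user", "content": "Price of AAPL?"},
    {"role": "assistant", "content": "", "reasoning_content": "Need the price tool."},
    {"role": "tool", "content": '{"price": 150.75}'},
    {"role": "assistant", "content": "AAPL is at $150.75.", "reasoning_content": "Report it."},
]
prepared, supervised = prepare_messages(
    trace, train_on_last_turn_only=True, drop_history_reasoning_content=True
)
print(supervised, "reasoning_content" in prepared[1], "reasoning_content" in prepared[3])
```

With both flags on, only the final assistant turn is supervised and only that turn keeps its reasoning, which is why the two flags pair naturally.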

seq_length only takes effect when truncation: true (to cap length) or padding: max_length (to pad). With the defaults (truncation: false, padding: false), seq_length is ignored and long traces pass through uncapped — risking OOM. Set truncation: true if you want a hard cap.
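The interaction between seq_length, truncation, and padding described above can be captured in a small model of the resulting sample length. effective_length is a hypothetical helper that only mirrors the rules stated in this note; it does not call the dataset code.

```python
def effective_length(n_tokens, seq_length, truncation=False, padding=False):
    """Length of a tokenized sample under the seq_length rules described above."""
    length = n_tokens
    if truncation:
        length = min(length, seq_length)   # hard cap
    if padding == "max_length":
        length = max(length, seq_length)   # pad short samples up to seq_length
    return length


print(effective_length(6000, 4096))                        # 6000: seq_length ignored
print(effective_length(6000, 4096, truncation=True))       # 4096: capped
print(effective_length(500, 4096, padding="max_length"))   # 4096: padded
print(effective_length(500, 4096, truncation=True))        # 500: short sample unchanged
```

The first case is the risky default: a 6000-token trace passes through unchanged even though seq_length is set to 4096.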

Step 4 — Launch Fine-Tuning

$ automodel --nproc-per-node=8 examples/llm_finetune/agent/qwen2_5_3b_function_calling.yaml 2>&1 | tee train_agent_sft.log

What to watch

loss should fall steadily as the model learns the tool-call format.
tool_call/accuracy (and related tool_call/* keys) report generation-based tool-call correctness at each validation step — this is the metric that distinguishes “format learned” from “tools actually correct”.

Under FSDP2 (the default strategy here), the in-loop generation eval is skipped unless you set tool_call_eval.run_on_fsdp2: true, because in-loop generation with sharded weights is expensive. If you keep it off, evaluate tool-call accuracy from a saved checkpoint instead (Step 5).

Illustrative training log (replace with your run):

step    0 | loss 1.84 | grad_norm 12.1 | lr 1.0e-6 | mem 41 GiB | tps/gpu 380
step   50 | loss 0.42 | grad_norm  4.3 | lr 9.8e-6 | mem 41 GiB | tps/gpu 410
step  200 | loss 0.21 | grad_norm  2.1 | lr 7.4e-6 | mem 41 GiB | tps/gpu 412
step  500 | loss 0.14 | grad_norm  1.6 | lr 3.1e-6 | mem 41 GiB | tps/gpu 415
Validation:
  step 100 | val_loss 0.23 | tool_call/accuracy 0.61
  step 300 | val_loss 0.18 | tool_call/accuracy 0.78
  step 500 | val_loss 0.16 | tool_call/accuracy 0.83

Checkpoints are written under the configured checkpoint_dir, each with an HF-compatible model/consolidated/ directory and LATEST / LOWEST_VAL symlinks.
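A small helper can resolve one of these symlinks to its consolidated directory and fall back to LATEST when LOWEST_VAL is absent. This is a sketch under assumptions: resolve_consolidated is a hypothetical helper, and the step_500 directory name in the demo is made up to exercise it on a throwaway directory.

```python
import os
import tempfile


def resolve_consolidated(checkpoint_dir, tags=("LOWEST_VAL", "LATEST")):
    """Return the model/consolidated path behind the first usable symlink tag."""
    for tag in tags:
        link = os.path.join(checkpoint_dir, tag)
        if os.path.exists(link):
            consolidated = os.path.join(os.path.realpath(link), "model", "consolidated")
            if os.path.isdir(consolidated):
                return consolidated
    raise FileNotFoundError(f"no consolidated checkpoint found under {checkpoint_dir}")


# Demo on a temporary directory; "step_500" is a hypothetical step-directory name.
with tempfile.TemporaryDirectory() as root:
    step_dir = os.path.join(root, "step_500")
    os.makedirs(os.path.join(step_dir, "model", "consolidated"))
    os.symlink(step_dir, os.path.join(root, "LATEST"))  # no LOWEST_VAL in this demo
    found = resolve_consolidated(root)
    print(os.path.relpath(found, os.path.realpath(root)))
```

The returned path is what AutoModelForCausalLM.from_pretrained expects when loading the HF-compatible weights.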

Step 5 — Evaluate the Fine-Tuned Model

Load the consolidated checkpoint (HF-compatible) and re-run the Step 2 prompt; the fine-tuned model should now emit a clean tool call:

import os, torch
from transformers import AutoTokenizer, AutoModelForCausalLM

CKPT = os.path.realpath("agent_checkpoints/qwen2_5_3b_glaive/LOWEST_VAL")
consolidated = os.path.join(CKPT, "model", "consolidated")

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
model = AutoModelForCausalLM.from_pretrained(consolidated, torch_dtype=torch.bfloat16, device_map="auto").eval()

# same `messages` + `tools` as Step 2
inputs = tok.apply_chat_template(messages, tools=tools, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Expected (fine-tuned): a structured call such as get_stock_price with {"symbol": "AAPL"}.
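Once the model emits a call, continuing the conversation means appending the assistant tool-call turn and the tool's response before generating again. The sketch below shows only that bookkeeping, with no model involved; get_stock_price, TOOL_REGISTRY, and run_tool_turn are hypothetical stand-ins, and the returned price is canned data.

```python
import json


def get_stock_price(symbol):
    """Hypothetical tool implementation that returns canned data."""
    return {"price": 150.75}


TOOL_REGISTRY = {"get_stock_price": get_stock_price}


def run_tool_turn(messages, tool_calls):
    """Append the assistant tool-call turn plus one tool response per call."""
    messages = list(messages)  # do not mutate the caller's list
    messages.append({
        "role": "assistant",
        "content": "",
        "tool_calls": [{"type": "function", "function": call} for call in tool_calls],
    })
    for call in tool_calls:
        result = TOOL_REGISTRY[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})
    return messages


history = [{"role": "user", "content": "What's Apple's stock price right now?"}]
parsed_calls = [{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}]
history = run_tool_turn(history, parsed_calls)
print([m["role"] for m in history])
```

Passing the extended history back through apply_chat_template with add_generation_prompt=True would then prompt the model for its final answer using the tool result.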

Step 6 — Results Comparison

Illustrative tool-call accuracy on a held-out slice (replace with your run):

Metric	Base	Fine-Tuned
Tool-name accuracy	~0.30	~0.95
Argument-match accuracy	~0.15	~0.85
Overall tool-call accuracy	~0.20	~0.83

The numbers above are placeholders to show the expected shape of the result (base low, fine-tuned high). Fill in your own from the tool_call/* metrics in the training log or a checkpoint eval.
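If you collect predicted and ground-truth calls yourself, the three rows of the table can be aggregated roughly as follows. The metric definitions here are assumptions chosen for illustration (per-sample, order-insensitive matching) and may not match how the library's evaluator computes its tool_call/* keys; tool_call_metrics is a hypothetical helper and the sample data is a toy set.

```python
import json


def _key(value):
    """Canonical string form so dicts compare independently of key order."""
    return json.dumps(value, sort_keys=True)


def tool_call_metrics(samples):
    """samples: list of (predicted_calls, expected_calls); each call has name and arguments."""
    if not samples:
        raise ValueError("need at least one sample")
    name_hits = arg_hits = full_hits = 0
    for predicted, expected in samples:
        name_hits += sorted(c["name"] for c in predicted) == sorted(c["name"] for c in expected)
        arg_hits += (sorted(_key(c["arguments"]) for c in predicted)
                     == sorted(_key(c["arguments"]) for c in expected))
        full_hits += sorted(_key(c) for c in predicted) == sorted(_key(c) for c in expected)
    n = len(samples)
    return {
        "tool_name_accuracy": name_hits / n,
        "argument_match_accuracy": arg_hits / n,
        "overall_accuracy": full_hits / n,
    }


gold = [{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}]
samples = [
    (gold, gold),                                                          # fully correct
    ([{"name": "get_stock_price", "arguments": {"symbol": "APPL"}}], gold),  # wrong argument
    ([], gold),                                                            # no call emitted
    ([{"name": "lookup_price", "arguments": {"symbol": "AAPL"}}], gold),     # invented tool name
]
metrics = tool_call_metrics(samples)
print(metrics)
```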

PEFT (LoRA) variant

For a single-GPU run, use the LoRA config examples/llm_finetune/agent/qwen2_5_3b_function_calling_lora.yaml, which adds a peft block on top of the same dataset and eval setup:

$ automodel examples/llm_finetune/agent/qwen2_5_3b_function_calling_lora.yaml

See the SFT and PEFT guide for tuning LoRA rank, alpha, and target modules.