Download this tutorial as a Jupyter notebook
Tool-Calling Fine-Tuning with Synthetic Data#
Learn how to fine-tune Llama 3.2 1B Instruct for tool calling using LoRA and synthetic data from Data Designer.
About#
Tool calling enables LLMs to interact with external functions and APIs by generating structured function calls instead of plain text. Fine-tuning a model specifically for tool calling improves its ability to select the correct function and produce valid arguments.
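To make "structured function calls instead of plain text" concrete, here is what an assistant turn that calls a tool looks like. The `get_weather` function and its arguments are illustrative, not part of this tutorial's dataset; arguments are shown as a dict, matching the training format used later in this tutorial.

```python
import json

# Illustrative only: "get_weather" is a hypothetical tool, not one generated
# in this tutorial. Instead of prose, the assistant message carries a
# structured tool_calls field naming the function and its arguments.
assistant_turn = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": {"city": "Berlin", "unit": "celsius"},
            },
        }
    ],
}

print(json.dumps(assistant_turn["tool_calls"][0]["function"], indent=2))
```

Fine-tuning teaches the model to emit exactly this structure: the right function name and well-formed arguments.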
This tutorial walks through the full workflow:
Data preparation – Generate a synthetic dataset with Data Designer, convert it to OpenAI chat format, and filter for compatibility with the Llama 3.2 chat template
LoRA fine-tuning – Train a LoRA adapter on the prepared dataset
Deployment and inference – Deploy the fine-tuned model and compare tool calling behavior against the base model
Evaluation – Measure tool calling accuracy using the platform’s evaluation service
Prerequisites#
Before starting this tutorial, ensure you have:
Installed the Python SDK with the Data Designer extra (uv pip install nemo-platform[data-designer])
Completed the Quickstart to install and deploy NeMo Platform locally
(Optional if running outside of Quickstart) Authenticated with the platform using the CLI: nemo auth login (for non-default URLs: nemo auth login --base-url <YOUR_NMP_BASE_URL>)
Set your Hugging Face token as an environment variable before running the notebook (export HF_TOKEN="<your-token>")
Accepted the dataset license for Salesforce/xlam-function-calling-60k on Hugging Face (used for evaluation)
1. Initialize SDK#
The SDK reads its cluster URL and authentication credentials from the config file managed by nemo auth login. For the default Quickstart deployment at http://localhost:8080, no extra configuration is needed.
We also define the provider and model we will be using for synthetic data generation. These must be available in your NeMo Platform deployment.
Note: This tutorial uses config-based initialization (NeMoPlatform(workspace=...)) instead of the explicit base_url pattern shown elsewhere in the docs to make it easier to switch between Quickstart and running on a cluster.
import os
from datasets import load_dataset as hf_load_dataset
from nemo_platform import NeMoPlatform, ConflictError
import data_designer.config as dd
from pydantic import BaseModel, Field
from typing import List
WORKSPACE = "default"
sdk = NeMoPlatform(workspace=WORKSPACE)
MODEL_PROVIDER = f"{WORKSPACE}/nvidia-build"
MODEL_ID = "nvidia/nemotron-3-nano-30b-a3b"
MODEL_ALIAS = "nemotron"
model_configs = [
dd.ModelConfig(
provider=MODEL_PROVIDER,
model=MODEL_ID,
alias=MODEL_ALIAS,
inference_parameters=dd.ChatCompletionInferenceParams(
temperature=1.0,
top_p=1.0,
max_tokens=2048,
),
)
]
2. Generate synthetic data#
Define tool calling schemas#
We use Pydantic models to define the exact structure of our tool definitions and tool calls.
Data Designer enforces these schemas during generation.
class ParameterDef(BaseModel):
"""A single parameter in a function definition."""
name: str = Field(description="Parameter name")
type: str = Field(description="JSON schema type (string, integer, number, boolean, array, object)")
description: str = Field(description="What this parameter does")
class FunctionDef(BaseModel):
"""A tool/function definition."""
name: str = Field(description="Function name in snake_case")
description: str = Field(description="What this function does")
parameters: List[ParameterDef] = Field(description="List of function parameters")
class ToolDefinitions(BaseModel):
"""A set of available tools."""
tools: List[FunctionDef] = Field(description="2-4 tool/function definitions")
class ArgumentDef(BaseModel):
"""A single argument name/value pair in a tool call instance."""
name: str = Field(description="Parameter name")
value: str = Field(description="The value to pass, as a JSON-encoded string")
class ToolCallArg(BaseModel):
"""A single tool call with arguments."""
name: str = Field(description="Name of the function to call")
arguments: List[ArgumentDef] = Field(description="Arguments to pass to the function")
class ExpectedToolCalls(BaseModel):
"""The expected tool calls for a given query. The model we're fine-tuning has a chat template that only supports one tool call per query in its template."""
tool_calls: List[ToolCallArg] = Field(description="The tool call to be made.")
Build data columns#
The config builder provides an intuitive interface for defining your dataset schema.
config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)
config_builder.add_column(
dd.SamplerColumnConfig(
name="domain",
sampler_type="category",
params={"values": [
"weather",
"calendar",
"web_search",
"math",
"file_management",
"database",
"email",
"navigation",
"e_commerce",
"social_media",
]},
)
)
config_builder.add_column(
dd.LLMStructuredColumnConfig(
name="tools",
prompt=(
"Generate 2-4 realistic API function definitions for the '{{ domain }}' domain. "
"Each function should have a descriptive snake_case name, clear description, "
"and well-typed parameters. Make them diverse — include functions with "
"different parameter counts and types."
),
output_format=ToolDefinitions,
model_alias=MODEL_ALIAS,
)
)
config_builder.add_column(
dd.LLMTextColumnConfig(
name="user_query",
prompt=(
"Given these available tools:\n{{ tools }}\n\n"
"Generate a natural, realistic user query that would require calling "
"one of these tools. The query should be something a real user "
"would type. Vary complexity, but due to the limitations of our "
"downstream model only require single tool calls. "
"Respond with only the query, no other text."
),
model_alias=MODEL_ALIAS,
)
)
config_builder.add_column(
dd.LLMStructuredColumnConfig(
name="expected_tool_call",
prompt=(
"Given this user query: '{{ user_query }}'\n"
"And these available tools: {{ tools }}\n\n"
"Determine exactly which tool should be called and with what arguments. "
"Be precise with argument values — they should directly address the user's request."
),
output_format=ExpectedToolCalls,
model_alias=MODEL_ALIAS,
)
)
Preview the dataset#
Use the preview method to generate a small sample quickly.
Inspect the results for quality and format issues.
Adjust column configurations, prompts, or parameters as needed.
Re-run the preview until satisfied.
preview = sdk.data_designer.preview(config_builder, num_records=4)
You can view one record at a time to inspect quality:
preview.display_sample_record()
Or view the entire preview dataset:
preview.dataset
Scale up#
The preview functionality lets you iterate quickly on your data to ensure the right structure and quality. Once ready, the create method is used to generate the full training dataset.
For this tutorial we generate 60 samples (~5 minutes) to keep iteration fast. For production use, generate 500+ samples (~1 hour) for significantly better results — in our testing, 500 samples yielded ~93% function name accuracy, compared to ~1.4% for the un-fine-tuned base model.
job = sdk.data_designer.create(config_builder, num_records=60)
job.wait_until_done()
Fetch the dataset from the job outputs and check out a preview.
results = job.download_artifacts()
dataset = results.load_dataset()
dataset.head()
Convert Dataset#
Fine-tuning requires the OpenAI messages+tools format with specific nesting.
Data Designer generated flat structures — now we wrap them in the OpenAI format.
Key transformations: wrap tools/calls in {"type": "function", "function": {...}}, convert parameter lists to properties dicts, and add "content": "" to the assistant message.
import json
def to_openai_format(row):
"""Convert a DD-generated row to the OpenAI tool calling format expected by the Customizer."""
openai_tools = []
for tool_def in row["tools"]["tools"]:
properties = {
p["name"]: {"type": p["type"], "description": p["description"]}
for p in tool_def["parameters"]
}
openai_tools.append({
"type": "function",
"function": {
"name": tool_def["name"],
"description": tool_def["description"],
"parameters": {"type": "object", "properties": properties},
},
})
openai_tool_calls = []
for tc in row["expected_tool_call"]["tool_calls"]:
arguments = {}
for arg in tc["arguments"]:
try:
arguments[arg["name"]] = json.loads(arg["value"])
            except (json.JSONDecodeError, TypeError):
                arguments[arg["name"]] = arg["value"]
openai_tool_calls.append({
"type": "function",
"function": {
"name": tc["name"],
"arguments": arguments,
},
})
return {
"messages": [
{"role": "user", "content": row["user_query"]},
{"role": "assistant", "content": "", "tool_calls": openai_tool_calls},
],
"tools": openai_tools,
}
training_data = dataset.apply(to_openai_format, axis=1, result_type="expand")
# Llama 3.2 1B's chat template supports only single tool calls per turn — filter out any
# generated samples that contain multiple tool calls in the assistant response.
pre_filter = len(training_data)
training_data = training_data[
training_data["messages"].apply(lambda msgs: len(msgs[1].get("tool_calls", [])) == 1)
].reset_index(drop=True)
print(f"Filtered {pre_filter} -> {len(training_data)} rows (kept single tool-call only)")
training_data.head()
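If you want a quick structural sanity check before splitting the data, something like the following verifies the nesting the Customizer expects. The `check_record` helper and the sample record are illustrative, written to mirror the structure produced by `to_openai_format` above.

```python
def check_record(record: dict) -> bool:
    """Lightweight structural check for one OpenAI-format training record."""
    messages = record.get("messages", [])
    tools = record.get("tools", [])
    if len(messages) != 2 or not tools:
        return False
    user, assistant = messages
    if user.get("role") != "user" or assistant.get("role") != "assistant":
        return False
    calls = assistant.get("tool_calls", [])
    # Llama 3.2 1B's chat template supports a single tool call per turn.
    return len(calls) == 1 and calls[0].get("type") == "function"

# Hypothetical record mirroring the structure produced by to_openai_format.
sample = {
    "messages": [
        {"role": "user", "content": "What's 12 factorial?"},
        {"role": "assistant", "content": "", "tool_calls": [
            {"type": "function",
             "function": {"name": "math_factorial", "arguments": {"number": 12}}},
        ]},
    ],
    "tools": [{"type": "function",
               "function": {"name": "math_factorial",
                            "description": "Compute the factorial of a number.",
                            "parameters": {"type": "object", "properties": {}}}}],
}
print(check_record(sample))  # True
```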
Sample and Split into Train / Validation#
We use all available synthetic examples and split them 85/15 into training and validation sets. Evaluation uses a separate golden dataset (xLAM) prepared in the next section.
import random
from pathlib import Path
SEED = 1234
random.seed(SEED)
SAMPLE_SIZE = 60
training_records = training_data.to_dict("records")
sampled = random.sample(training_records, min(SAMPLE_SIZE, len(training_records)))
train_size = max(1, int(0.85 * len(sampled)))
val_size = max(1, len(sampled) - train_size)
train_data = sampled[:train_size]
val_data = sampled[train_size:]
print(f"Training: {len(train_data)} examples")
print(f"Validation: {len(val_data)} examples")
# Save training and validation data
DATASET_PATH = Path("tool-calling-dataset").absolute()
os.makedirs(DATASET_PATH, exist_ok=True)
def save_jsonl(filepath: str, data: list) -> None:
with open(filepath, "w") as f:
for entry in data:
f.write(json.dumps(entry) + "\n")
save_jsonl(f"{DATASET_PATH}/training.jsonl", train_data)
print(f"\nSaved training.jsonl ({len(train_data)} rows)")
save_jsonl(f"{DATASET_PATH}/validation.jsonl", val_data)
print(f"Saved validation.jsonl ({len(val_data)} rows)")
Prepare Evaluation Dataset from xLAM#
Rather than evaluating on the same synthetic data we trained on, we use the Salesforce/xlam-function-calling-60k golden dataset as an independent test set. This gives a more meaningful measure of how well fine-tuning generalizes.
We download the dataset, convert it to OpenAI format, filter for single tool-call examples (matching our model’s capability), and sample a test set sized at 15% of our training data.
LIMIT_TOOL_PROPERTIES = 8
def normalize_type(param_type: str) -> str:
"""Normalize Python type hints to JSON schema types."""
param_type = param_type.strip()
if "," in param_type and "default" in param_type:
param_type = param_type.split(",")[0].strip()
if param_type.startswith("default="):
return "string"
param_type = param_type.replace(", optional", "").strip()
if param_type.startswith("Callable"):
return "string"
if param_type.startswith(("Tuple", "List[", "Set")) or param_type in ("list", "set"):
return "array"
type_mapping = {
"str": "string", "int": "integer", "float": "number",
"bool": "boolean", "list": "array", "dict": "object",
"List": "array", "Dict": "object", "Set": "array",
}
return type_mapping.get(param_type, "string")
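To make the normalization rules concrete, here are a few representative xLAM type hints and what they map to. The function body below repeats `normalize_type` from above so the snippet runs standalone; the example inputs are illustrative.

```python
# Same logic as normalize_type above, repeated so this snippet is self-contained.
def normalize_type(param_type: str) -> str:
    param_type = param_type.strip()
    if "," in param_type and "default" in param_type:
        param_type = param_type.split(",")[0].strip()
    if param_type.startswith("default="):
        return "string"
    param_type = param_type.replace(", optional", "").strip()
    if param_type.startswith("Callable"):
        return "string"
    if param_type.startswith(("Tuple", "List[", "Set")) or param_type in ("list", "set"):
        return "array"
    type_mapping = {
        "str": "string", "int": "integer", "float": "number",
        "bool": "boolean", "list": "array", "dict": "object",
    }
    return type_mapping.get(param_type, "string")

print(normalize_type("str"))                # string
print(normalize_type("int, optional"))      # integer (", optional" suffix stripped)
print(normalize_type("List[str]"))          # array
print(normalize_type("str, default='km'"))  # string (default clause dropped)
print(normalize_type("SomeCustomType"))     # string (fallback)
```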
def convert_xlam_tools(tools):
"""Convert xLAM tool definitions to OpenAI function spec format."""
if isinstance(tools, str):
tools = json.loads(tools)
if not isinstance(tools, list):
return []
openai_tools = []
for tool in tools:
if not isinstance(tool, dict) or not isinstance(tool.get("parameters"), dict):
continue
properties = {}
for name, info in tool["parameters"].items():
if not isinstance(info, dict):
continue
param = {
"description": info.get("description", ""),
"type": normalize_type(info.get("type", "")),
}
properties[name] = param
if len(properties) > LIMIT_TOOL_PROPERTIES:
return []
openai_tools.append({
"type": "function",
"function": {
"name": tool["name"],
"description": tool["description"],
"parameters": {"type": "object", "properties": properties},
},
})
return openai_tools
def convert_xlam_example(example):
"""Convert a single xLAM example to OpenAI eval format."""
tools = convert_xlam_tools(example.get("tools", "[]"))
if not tools:
return None
answers = json.loads(example.get("answers", "[]"))
tool_calls = [
{"type": "function", "function": {"name": t["name"], "arguments": t.get("arguments", {})}}
for t in answers
]
if len(tool_calls) != 1:
return None
return {
"messages": [{"role": "user", "content": example["query"]}],
"tools": tools,
"tool_calls": tool_calls,
}
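For reference, xLAM records store tools and answers as JSON-encoded strings, which is why both converters above call json.loads before processing them. A hypothetical record (field values illustrative) parses like this:

```python
import json

# Hypothetical record in the shape the converters above expect:
# "tools" and "answers" arrive as JSON strings, not parsed objects.
example = {
    "query": "What is the factorial of 5?",
    "tools": json.dumps([{
        "name": "math_factorial",
        "description": "Compute n!.",
        "parameters": {"n": {"type": "int", "description": "The input number"}},
    }]),
    "answers": json.dumps([{"name": "math_factorial", "arguments": {"n": 5}}]),
}

answers = json.loads(example["answers"])
print(answers[0]["name"], answers[0]["arguments"])
```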
xlam_dataset = hf_load_dataset("Salesforce/xlam-function-calling-60k", split="train")
print(f"xLAM dataset size: {len(xlam_dataset)}")
xlam_converted = [r for ex in xlam_dataset if (r := convert_xlam_example(ex)) is not None]
print(f"After filtering (single tool-call, ≤{LIMIT_TOOL_PROPERTIES} params): {len(xlam_converted)}")
eval_size = max(1, int(0.15 * len(train_data)))
eval_rows = random.sample(xlam_converted, min(eval_size, len(xlam_converted)))
print(f"Sampled {len(eval_rows)} eval examples (15% of {len(train_data)} training examples)")
EVAL_DATASET_NAME = "tool-calling-eval-dataset"
EVAL_DATASET_PATH = Path("tool-calling-eval-dataset").absolute()
os.makedirs(EVAL_DATASET_PATH, exist_ok=True)
save_jsonl(f"{EVAL_DATASET_PATH}/evaluation.jsonl", eval_rows)
print(f"\nSaved to {EVAL_DATASET_PATH}/evaluation.jsonl")
print("\nSample evaluation row:")
print(json.dumps(eval_rows[0], indent=2))
Upload Training Dataset#
Create a fileset and upload the training/validation data to the platform:
DATASET_NAME = "tool-calling-dataset"
try:
sdk.files.filesets.create(
name=DATASET_NAME,
description="synthetic tool-calling training and validation data in OpenAI chat format",
)
print(f"Created fileset: {DATASET_NAME}")
except ConflictError:
print(f"Fileset '{DATASET_NAME}' already exists, continuing...")
sdk.files.upload(
local_path=DATASET_PATH,
remote_path="",
fileset=DATASET_NAME,
)
print("\nTraining data files:")
print(json.dumps(
[f.model_dump() for f in sdk.files.list(fileset=DATASET_NAME).data],
indent=2,
))
Upload Evaluation Dataset#
Upload the evaluation data as a separate fileset:
try:
sdk.files.filesets.create(
name=EVAL_DATASET_NAME,
description="synthetic tool-calling evaluation data (messages, tools, ground-truth tool_calls)",
)
print(f"Created fileset: {EVAL_DATASET_NAME}")
except ConflictError:
print(f"Fileset '{EVAL_DATASET_NAME}' already exists, continuing...")
sdk.files.upload(
local_path=EVAL_DATASET_PATH,
remote_path="",
fileset=EVAL_DATASET_NAME,
)
print("\nEvaluation data files:")
print(json.dumps(
[f.model_dump() for f in sdk.files.list(fileset=EVAL_DATASET_NAME).data],
indent=2,
))
3. Model and Secrets Setup#
Configure HuggingFace Authentication#
We will fine-tune meta-llama/Llama-3.2-1B-Instruct from HuggingFace. Since this is a gated model, you need to:
Create a HuggingFace token with Read access in your HuggingFace settings
Accept the model's license on the model page
HF_TOKEN = os.environ.get("HF_TOKEN")
if HF_TOKEN is None:
raise ValueError("HF_TOKEN is not set.")
def create_or_get_secret(name: str, value: str | None, label: str):
if not value:
raise ValueError(f"{label} is not set")
try:
secret = sdk.secrets.create(
name=name,
data=value,
)
print(f"Created secret: {name}")
return secret
except ConflictError:
print(f"Secret '{name}' already exists, continuing...")
return sdk.secrets.retrieve(name=name)
hf_secret = create_or_get_secret("hf-token", HF_TOKEN, "HF_TOKEN")
print(hf_secret.model_dump_json(indent=2))
Create Base Model FileSet and Model Entity#
Create a fileset pointing to Llama 3.2 1B Instruct on HuggingFace, then create a Model Entity that references it. The model will be downloaded at training time.
For tool calling, set tool_call_config in the fileset’s metadata. The NIM serving layer needs to know how to parse tool calls from the model’s raw output, but that information isn’t embedded in the model checkpoint itself – it depends on the model family and chat template. By declaring it on the fileset, the platform’s model-spec task merges it into the Model Entity’s spec.tool_call_config, which the deployment backend then reads to configure the NIM automatically:
tool_call_parser – tells the NIM how to parse tool calls from the model output (e.g., "llama3_json" for Llama 3.x models)
auto_tool_choice – enables the model to decide when to call tools
import time
from nemo_platform.types.files import HuggingfaceStorageConfigParam
HF_REPO_ID = "meta-llama/Llama-3.2-1B-Instruct"
MODEL_NAME = "llama-3-2-1b-base"
try:
base_model_fs = sdk.files.filesets.create(
name=MODEL_NAME,
description="Llama 3.2 1B Instruct base model from HuggingFace",
storage=HuggingfaceStorageConfigParam(
type="huggingface",
repo_id=HF_REPO_ID,
repo_type="model",
token_secret=hf_secret.name,
),
metadata={
"model": {
"tool_calling": {
"tool_call_parser": "llama3_json",
"auto_tool_choice": True,
}
}
},
)
print(f"Created base model fileset: {MODEL_NAME}")
except ConflictError:
print(f"Base model fileset already exists. Skipping creation.")
base_model_fs = sdk.files.filesets.retrieve(
name=MODEL_NAME,
)
try:
base_model = sdk.models.create(
name=MODEL_NAME,
fileset=f"{WORKSPACE}/{MODEL_NAME}",
)
print(f"Created Model Entity: {MODEL_NAME}")
except ConflictError:
print(f"Base model already exists. Updating fileset if different.")
base_model = sdk.models.update(
name=MODEL_NAME,
fileset=f"{WORKSPACE}/{MODEL_NAME}",
)
print(f"\nBase model fileset: fileset://{WORKSPACE}/{base_model.name}")
print("Base model fileset files list:")
print(json.dumps(
[f.model_dump() for f in sdk.files.list(fileset=MODEL_NAME).data],
indent=2,
))
# Wait for ModelSpec to be populated from the checkpoint
print("\nWaiting for ModelSpec to be populated...")
SPEC_TIMEOUT_SECONDS = 120
spec_start = time.time()
while not base_model.spec:
if time.time() - spec_start > SPEC_TIMEOUT_SECONDS:
raise TimeoutError(f"ModelSpec not populated within {SPEC_TIMEOUT_SECONDS} seconds")
time.sleep(2)
base_model = sdk.models.retrieve(
name=MODEL_NAME,
)
print(f"ModelSpec populated: {base_model.spec.model_dump()}")
4. Create LoRA Fine-Tuning Job#
Create a customization job using SFT training with LoRA PEFT. The peft=LoRaParamsParam() parameter enables LoRA instead of full-weight fine-tuning.
LoRA defaults (can be overridden in LoRaParamsParam()):
rank: LoRA rank (dimensionality of the low-rank matrices)
alpha: Scaling factor for LoRA updates
target_modules: Which model layers to apply LoRA to
import uuid
from nemo_platform.types.customization import (
CustomizationJobInputParam,
SftTrainingParam,
ParallelismParamsParam,
LoRaParamsParam,
DeploymentParamsParam,
)
job_suffix = uuid.uuid4().hex[:4]
JOB_NAME = f"tool-calling-lora-{job_suffix}"
job = sdk.customization.jobs.create(
name=JOB_NAME,
spec=CustomizationJobInputParam(
model=f"{WORKSPACE}/{base_model.name}",
dataset=f"fileset://{WORKSPACE}/{DATASET_NAME}",
training=SftTrainingParam(
type="sft",
epochs=4,
batch_size=4,
learning_rate=0.0001,
max_seq_length=2048,
micro_batch_size=1,
peft=LoRaParamsParam(),
parallelism=ParallelismParamsParam(
num_gpus_per_node=1,
num_nodes=1,
tensor_parallel_size=1,
pipeline_parallel_size=1,
),
),
deployment_config=DeploymentParamsParam(
lora_enabled=True,
),
),
)
print(job.model_dump_json(indent=2))
Track Training Progress#
import time
from IPython.display import clear_output
TERMINAL_JOB_STATUSES = {"completed", "cancelled", "error"}
def wait_for_job(poll_fn, label, timeout_minutes=60, poll_interval=10, display_fn=None):
"""Poll a platform job until it reaches a terminal state.
Both Customizer and Evaluator use the Core Jobs service, so the same
PlatformJobStatus values apply: created, pending, active, cancelled,
cancelling, error, completed, paused, pausing, resuming.
Args:
poll_fn: Callable returning a status object with .name and .status attributes.
label: Display label for progress output.
timeout_minutes: Maximum time to wait before returning.
poll_interval: Seconds between polls.
display_fn: Optional callable(status) to print extra details after the header.
Returns:
The final status object.
"""
start = time.time()
timeout = timeout_minutes * 60
while True:
status = poll_fn()
elapsed = time.time() - start
elapsed_min, elapsed_sec = divmod(int(elapsed), 60)
clear_output(wait=True)
print(f"[{label}] Job: {status.name}")
print(f"[{label}] Status: {status.status}")
print(f"[{label}] Elapsed: {elapsed_min}m {elapsed_sec}s")
if display_fn:
display_fn(status)
if status.status in TERMINAL_JOB_STATUSES:
print(f"\n[{label}] Job finished: {status.status}")
return status
if elapsed > timeout:
print(f"\n[{label}] Timeout after {timeout_minutes} minutes")
return status
time.sleep(poll_interval)
def training_progress(status):
"""Extract and display training step progress."""
for step in status.steps or []:
if step.name == "customization-training-job":
for task in step.tasks or []:
details = task.status_details or {}
s, mx = details.get("step"), details.get("max_steps")
if s is not None and mx is not None:
print(f"Training: Step {s}/{mx} ({(s / mx) * 100:.1f}%)")
if phase := details.get("phase"):
print(f"Phase: {phase}")
return
print("Training step not started yet")
print("Defined wait_for_job helper function")
job_status = wait_for_job(
poll_fn=lambda: sdk.customization.jobs.get_status(name=job.name),
label="Training",
timeout_minutes=120,
display_fn=training_progress,
)
5. Verify Auto-Deployed Model#
Since we set lora_enabled=True in the customization job’s deployment_config, the platform automatically creates a NIM deployment for the base model after training completes. The LoRA adapter is attached to the base model entity (enabled by default) and the deployment serves both the base weights and the adapter through a single NIM instance.
ADAPTER_NAME = job.spec.output.name
print(f"Looking for adapter: {ADAPTER_NAME}")
# The adapter may not be attached to the model entity immediately after
# training completes — poll until it appears.
ADAPTER_TIMEOUT = 120
adapter_start = time.time()
adapter = None
while time.time() - adapter_start < ADAPTER_TIMEOUT:
base_model = sdk.models.retrieve(name=MODEL_NAME)
matches = [a for a in (base_model.adapters or []) if a.name == ADAPTER_NAME]
if matches:
adapter = matches[0]
break
print(f"Adapter not yet attached, retrying... ({int(time.time() - adapter_start)}s)")
time.sleep(5)
if adapter is None:
raise TimeoutError(
f"Adapter '{ADAPTER_NAME}' not found on model '{MODEL_NAME}' within {ADAPTER_TIMEOUT}s"
)
print(f"Base model: {base_model.name}")
print(f"Adapter:\n{adapter.model_dump_json(indent=2)}")
Track Deployment Status#
DEPLOYMENT_NAME = f"sft-deploy-{MODEL_NAME}"
TIMEOUT_MINUTES = 30
start_time = time.time()
timeout_seconds = TIMEOUT_MINUTES * 60
print(f"Monitoring deployment '{DEPLOYMENT_NAME}'...")
print(f"Timeout: {TIMEOUT_MINUTES} minutes\n")
while True:
deployment_status = sdk.inference.deployments.retrieve(
name=DEPLOYMENT_NAME,
)
elapsed = time.time() - start_time
elapsed_min = int(elapsed // 60)
elapsed_sec = int(elapsed % 60)
clear_output(wait=True)
print(f"Deployment: {DEPLOYMENT_NAME}")
print(f"Status: {deployment_status.status}")
print(f"Elapsed time: {elapsed_min}m {elapsed_sec}s")
if deployment_status.status == "READY":
print("\nDeployment is ready!")
if not sdk.models.wait_for_gateway(DEPLOYMENT_NAME, workspace=WORKSPACE, timeout=60):
raise RuntimeError("Inference gateway did not become ready")
break
if deployment_status.status in ("FAILED", "ERROR", "TERMINATED", "LOST", "DELETED"):
raise RuntimeError(f"Deployment failed with status: {deployment_status.status}")
if elapsed > timeout_seconds:
raise TimeoutError(f"Deployment timeout after {TIMEOUT_MINUTES} minutes")
time.sleep(15)
6. Compare Base vs Fine-Tuned Inference#
Test the deployed model with a tool calling prompt. Since lora_enabled=True, the NIM can serve both the base model and the LoRA-adapted model through the same deployment.
test_messages = [
{"role": "user", "content": "Calculate the factorial of 12 using math functions."},
]
test_tools = [
{
"type": "function",
"function": {
"name": "math_factorial",
"description": "Calculate the factorial of a given number.",
"parameters": {
"type": "object",
"properties": {
"number": {
"type": "integer",
"description": "The number for which factorial needs to be calculated.",
}
},
"required": ["number"],
},
},
}
]
def test_tool_calling(model_name: str, label: str):
"""Send a tool calling request and display the response."""
response = sdk.inference.gateway.model.post(
"v1/chat/completions",
name=model_name,
body={
"messages": test_messages,
"tools": test_tools,
"tool_choice": "auto",
"temperature": 0,
"max_tokens": 256,
},
)
print(f"{'=' * 60}")
print(f" {label}")
print(f"{'=' * 60}")
print(json.dumps(response, indent=2))
test_tool_calling(MODEL_NAME, "BASE MODEL (before fine-tuning)")
test_tool_calling(ADAPTER_NAME, "FINE-TUNED MODEL (after LoRA)")
7. Evaluate with Tool Calling Metrics#
Use the platform’s evaluation service to systematically measure tool calling accuracy on the test dataset. The Tool Calling metric produces two scores:
function_name_accuracy – Did the model call the correct function?
function_name_and_args_accuracy – Did it call the correct function with the correct arguments?
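As a mental model, both scores are exact-match rates over the predicted tool calls. The toy scorer below illustrates the relationship between the two metrics; it is a hand-rolled sketch, not the platform's actual implementation.

```python
def score(predictions, references):
    """Toy illustration: exact-match rates over (name, arguments) pairs.
    Not the platform's implementation of the Tool Calling metric."""
    name_hits = args_hits = 0
    for pred, ref in zip(predictions, references):
        if pred["name"] == ref["name"]:
            name_hits += 1
            if pred["arguments"] == ref["arguments"]:
                args_hits += 1
    n = len(references)
    return {
        "function_name_accuracy": name_hits / n,
        "function_name_and_args_accuracy": args_hits / n,
    }

refs = [{"name": "get_weather", "arguments": {"city": "Berlin"}},
        {"name": "math_factorial", "arguments": {"number": 12}}]
preds = [{"name": "get_weather", "arguments": {"city": "Paris"}},   # right fn, wrong args
         {"name": "math_factorial", "arguments": {"number": 12}}]   # exact match
print(score(preds, refs))
# {'function_name_accuracy': 1.0, 'function_name_and_args_accuracy': 0.5}
```

Note that function_name_and_args_accuracy can never exceed function_name_accuracy, since an argument match only counts when the name matches too.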
Create Metric Entity#
METRIC_NAME = "tool-calling-accuracy"
try:
sdk.evaluation.metrics.create(
name=METRIC_NAME,
type="tool-calling",
reference="{{tool_calls}}",
)
print(f"Created metric: {METRIC_NAME}")
except ConflictError:
print(f"Metric '{METRIC_NAME}' already exists, continuing...")
Launch Evaluation Jobs#
Submit evaluation jobs for both the fine-tuned adapter and the base model. Both jobs start running on the platform in parallel.
from nemo_platform.types.evaluation import (
MetricOnlineJobParam,
EvaluationJobParamsParam,
)
ft_eval_job = sdk.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric=f"{WORKSPACE}/{METRIC_NAME}",
model=f"{WORKSPACE}/{ADAPTER_NAME}",
prompt_template={
"messages": "{{ messages | tojson }}",
"tools": "{{ tools | tojson }}",
"tool_choice": "auto",
},
dataset=f"{WORKSPACE}/{EVAL_DATASET_NAME}",
params=EvaluationJobParamsParam(parallelism=8),
),
)
print(f"Fine-tuned model eval job: {ft_eval_job.name}")
base_eval_job = sdk.evaluation.metric_jobs.create(
spec=MetricOnlineJobParam(
metric=f"{WORKSPACE}/{METRIC_NAME}",
model=f"{WORKSPACE}/{MODEL_NAME}",
prompt_template={
"messages": "{{ messages | tojson }}",
"tools": "{{ tools | tojson }}",
"tool_choice": "auto",
},
dataset=f"{WORKSPACE}/{EVAL_DATASET_NAME}",
params=EvaluationJobParamsParam(parallelism=8),
),
)
print(f"Base model eval job: {base_eval_job.name}")
Wait for Model Evaluation Jobs#
ft_result = wait_for_job(
poll_fn=lambda: sdk.evaluation.metric_jobs.retrieve(name=ft_eval_job.name),
label="Fine-tuned eval",
)
print(ft_result)
base_result = wait_for_job(
poll_fn=lambda: sdk.evaluation.metric_jobs.retrieve(name=base_eval_job.name),
label="Base eval",
)
print(base_result)
Compare Results#
def get_scores(eval_job) -> dict:
"""Download aggregate scores from a completed evaluation job."""
result = sdk.evaluation.metric_jobs.results.aggregate_scores.download(
job=eval_job.name,
)
scores = {}
for score in result.scores:
scores[score.name] = score.mean
return scores
ft_scores = get_scores(ft_eval_job)
base_scores = get_scores(base_eval_job)
print("=" * 70)
print(f"{'Metric':<40} {'Base Model':>12} {'Fine-Tuned':>12}")
print("=" * 70)
for metric_name in ["function_name_accuracy", "function_name_and_args_accuracy"]:
base_val = base_scores.get(metric_name, "N/A")
ft_val = ft_scores.get(metric_name, "N/A")
base_str = f"{base_val:.3f}" if isinstance(base_val, (int, float)) else str(base_val)
ft_str = f"{ft_val:.3f}" if isinstance(ft_val, (int, float)) else str(ft_val)
print(f"{metric_name:<40} {base_str:>12} {ft_str:>12}")
print("=" * 70)
if isinstance(ft_scores.get("function_name_accuracy"), (int, float)) and isinstance(
base_scores.get("function_name_accuracy"), (int, float)
):
improvement = ft_scores["function_name_accuracy"] - base_scores["function_name_accuracy"]
print(f"\nFunction name accuracy improvement: {improvement:+.3f}")
Done!#
Next Steps:
Hyperparameter tuning: See the Training Configuration Reference for all available hyperparameters, valid ranges, and tuning guidance.
Troubleshooting:
If you run into issues with customization jobs see Troubleshooting Customizer.
If you run into issues with evaluation jobs see Troubleshooting Evaluator.