nemo_automodel.components.eval.tool_call_evaluator#

Generation-based evaluator for tool-call accuracy during agent SFT.

The loss-only validation that ships with the training recipe cannot distinguish “loss going down because the model learned the format” from “loss going down because the model is overfitting the response style while emitting wrong tool names”. This evaluator closes that gap by running model.generate() on held-out prompts that terminate right before an assistant tool-call turn, parsing the generated text with nemo_automodel.components.eval.tool_call_parser, and comparing against the ground-truth tool calls extracted from the dataset.

The evaluator is intentionally framework-agnostic: it operates on any HuggingFace-style model with a .generate() method and a tokenizer that supports apply_chat_template(..., tools=...). Distributed sharding and all-reduce of metrics are left to the caller (the training recipe), which already has the dist environment in hand.

Module Contents#

Classes#

ToolCallAccuracyEvaluator

Generation-based tool-call accuracy evaluator for agent SFT.

Data#

API#

nemo_automodel.components.eval.tool_call_evaluator.logger#

'getLogger(…)'

nemo_automodel.components.eval.tool_call_evaluator._METRIC_KEYS#

('has_call', 'name_correct', 'args_json_valid', 'args_field_recall', 'args_field_precision', 'args_e…

class nemo_automodel.components.eval.tool_call_evaluator.ToolCallAccuracyEvaluator(
*,
dataset_name: Optional[str] = None,
path: Optional[Union[str, List[str]]] = None,
split: str = 'train',
limit_dataset_samples: Optional[int] = None,
max_eval_samples: Optional[int] = None,
max_new_tokens: int = 256,
max_prompt_tokens: Optional[int] = None,
do_sample: bool = False,
metric_prefix: str = 'tool_call',
sample_shard: Optional[tuple] = None,
raise_on_cuda_oom: bool = True,
run_on_fsdp2: bool = False,
)#

Generation-based tool-call accuracy evaluator for agent SFT.

The evaluator lazily loads a list of eval samples (one per assistant tool-call position in the source dataset). On each call to evaluate it renders each sample's prompt_messages and tools through the tokenizer's chat template, generates a continuation, parses any tool calls out of the generated text, and aggregates per-sample metrics into a corpus-level dict.

Constructor args (all keyword-only):
  • dataset_name – HF Hub dataset id to load eval samples from. Mutually exclusive with path.

  • path – Local JSON/JSONL file (or list of files) to load eval samples from. Mutually exclusive with dataset_name.

  • split – Dataset split (only used with dataset_name).

  • limit_dataset_samples – Cap on dialogues read before expansion.

  • max_eval_samples – Cap on total expanded eval samples.

  • max_new_tokens – Generation budget per sample.

  • max_prompt_tokens – If set, prompts longer than this many tokens are skipped (logged once). Prevents OOM on degenerate samples.

  • do_sample – Generation sampling toggle. Defaults to greedy decoding for reproducibility across validation checkpoints.

  • metric_prefix – Prefix applied to all returned metric keys.

  • sample_shard – Optional (rank, world_size) tuple. When set, only every world_size-th sample starting at rank is processed; the caller is responsible for all-reducing the returned _count and weighted-summed metrics.
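The sample_shard behaviour described above amounts to a strided slice over the expanded sample list. The following is a minimal sketch under that assumption; shard_samples is a hypothetical helper written for illustration and is not the evaluator's own _iter_my_samples.

```python
from typing import Any, Dict, List, Optional, Tuple


def shard_samples(
    samples: List[Dict[str, Any]],
    sample_shard: Optional[Tuple[int, int]] = None,
) -> List[Dict[str, Any]]:
    """Return the samples one rank would process (hypothetical helper)."""
    if sample_shard is None:
        # No sharding requested: every sample is processed locally.
        return list(samples)
    rank, world_size = sample_shard
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    # Every world_size-th sample, starting at index `rank`.
    return samples[rank::world_size]


samples = [{"id": i} for i in range(7)]
shards = [shard_samples(samples, (rank, 3)) for rank in range(3)]
print([[s["id"] for s in shard] for shard in shards])
# [[0, 3, 6], [1, 4], [2, 5]]
```

Shards can differ in size when the sample count is not a multiple of world_size, which is why the per-rank _count has to travel with the metrics.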

Initialization

_cleanup_cuda() None#
_greedy_generate_manual(
model,
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
max_new_tokens: int,
eos_token_id: Optional[int],
) torch.Tensor#

Greedy decode using only model.forward().

Several Automodel custom model classes (notably Qwen2ForCausalLM) inherit from HFCheckpointingMixin + Qwen2PreTrainedModel but not from transformers.generation.GenerationMixin, so the FSDP-wrapped instance has no .generate() method. We fall back to a minimal token-by-token greedy decode that only requires the forward pass to return logits. No KV cache, so cost is O(L * (P + L)) per sample where P is prompt length and L is max_new_tokens; this is fine for the small eval budgets used here (default 256 tokens).
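The cache-free loop described above can be illustrated without torch. This is a pure-Python sketch, not the module's implementation: greedy_decode and toy_forward are hypothetical names, and the "model" is any callable that maps the full token sequence to next-token logits.

```python
from typing import Callable, List, Optional, Sequence


def greedy_decode(
    forward: Callable[[List[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
    eos_token_id: Optional[int] = None,
) -> List[int]:
    """Greedy decode by re-running `forward` on the whole sequence each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # No KV cache: the full sequence (prompt + generated) is re-encoded.
        logits = forward(ids)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(next_id)
        if eos_token_id is not None and next_id == eos_token_id:
            break
    return ids[len(prompt_ids):]


tokens_processed: List[int] = []


def toy_forward(ids: List[int]) -> List[float]:
    """Toy stand-in for a model: always prefers (last token + 1) mod 5."""
    tokens_processed.append(len(ids))
    target = (ids[-1] + 1) % 5
    return [1.0 if t == target else 0.0 for t in range(5)]


print(greedy_decode(toy_forward, [0, 1], max_new_tokens=8, eos_token_id=4))
# [2, 3, 4]
print(sum(tokens_processed))
# 9
```

In the toy run, three steps re-encode 2, 3, and 4 tokens, which is where the O(L * (P + L)) bound comes from.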

_load_samples() List[Dict[str, Any]]#
_iter_my_samples() List[Dict[str, Any]]#
_render_prompt_ids(
tokenizer,
sample: Dict[str, Any],
skip_reasons: Optional[Dict[str, int]] = None,
) Optional[List[int]]#

Render one eval sample's prompt through apply_chat_template.

We deliberately split the chat-template render (tokenize=False) from the tokenization step: some templates / transformers versions return a list of token strings under tokenize=True, which then crashes torch.tensor(..., dtype=long) downstream. Going through text first sidesteps that and matches the canonical HF usage shown in the model cards.

Returns None if the template raises (e.g. doesn't accept the tools kwarg) or if the prompt exceeds max_prompt_tokens.
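The render-then-tokenize split can be sketched as follows. This is an illustration, not the module's code: render_prompt_ids is a hypothetical free function, the sample is assumed to be a dict with prompt_messages and tools keys, and add_generation_prompt=True and add_special_tokens=False are assumptions about how the prompt is prepared.

```python
from typing import Any, Dict, List, Optional


def render_prompt_ids(
    tokenizer: Any,
    sample: Dict[str, Any],
    max_prompt_tokens: Optional[int] = None,
) -> Optional[List[int]]:
    """Render a chat prompt to text first, then tokenize the text."""
    try:
        # Step 1: template render only, so the result is always a string.
        text = tokenizer.apply_chat_template(
            sample["prompt_messages"],
            tools=sample.get("tools"),
            tokenize=False,
            add_generation_prompt=True,  # assumption: prompt ends at the assistant turn
        )
    except Exception:
        # For example, a template that does not accept the `tools` kwarg.
        return None
    # Step 2: tokenize the rendered text into integer ids.
    ids = list(tokenizer(text, add_special_tokens=False)["input_ids"])
    if max_prompt_tokens is not None and len(ids) > max_prompt_tokens:
        return None  # over-long prompts are skipped rather than truncated
    return ids
```

The resulting list of integers can then be wrapped in a long tensor for generation.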

evaluate(model, tokenizer) Dict[str, float]#

Run generation-based tool-call evaluation against model.

Caller is expected to have placed the model in eval mode and on the appropriate device. The evaluator infers the device from the first model parameter so it works with FSDP, DDP, or single-GPU layouts without explicit configuration.

Parameters:
  • model – a HuggingFace causal-LM with a .generate() method.

  • tokenizer – tokenizer paired with model; must have a chat template that supports the tools kwarg.

Returns:

Dict of metric name -> float. All metric values are means in [0.0, 1.0] except _count which is the number of samples actually scored on this rank.
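Because each rank returns means over its own shard, a caller combining ranks has to weight every mean by that rank's _count. The sketch below shows the arithmetic on plain dicts; merge_rank_metrics is a hypothetical helper, and the key names (including the "/" separator after the prefix) are assumptions for illustration only. In a real recipe the two sums would typically be reduced across ranks with torch.distributed.all_reduce rather than gathered into a list.

```python
from typing import Dict, List


def merge_rank_metrics(
    per_rank: List[Dict[str, float]],
    count_key: str,
) -> Dict[str, float]:
    """Combine per-rank mean metrics into count-weighted global means."""
    total = sum(metrics[count_key] for metrics in per_rank)
    merged: Dict[str, float] = {count_key: float(total)}
    if total == 0:
        return merged  # nothing was scored on any rank
    for key in per_rank[0]:
        if key == count_key:
            continue
        # Weighted sum of per-rank means, divided by the global sample count.
        weighted = sum(metrics[key] * metrics[count_key] for metrics in per_rank)
        merged[key] = weighted / total
    return merged


# Hypothetical key names; the real keys depend on metric_prefix.
rank0 = {"tool_call/name_correct": 0.5, "tool_call/_count": 6}
rank1 = {"tool_call/name_correct": 1.0, "tool_call/_count": 2}
print(merge_rank_metrics([rank0, rank1], count_key="tool_call/_count"))
# {'tool_call/_count': 8.0, 'tool_call/name_correct': 0.625}
```

A plain average of the two rank means would give 0.75 here, overstating accuracy because the smaller shard would count as much as the larger one.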