nat.plugins.langchain.langsmith.langsmith_evaluation_callback#

Attributes#

logger

_LS_PIPELINE_LATENCY_S

_LS_THROUGHPUT_RPS

_LS_RETRY_DELAY_S

Classes#

LangSmithEvaluationCallback

Links OTEL traces to LangSmith experiments for structured eval result viewing.

Functions#

_estimate_indexing_time(→ float)

Estimate the time (seconds) for LangSmith to index expected_count runs.

_humanize_dataset_name(→ str)

Convert a raw dataset name to title case (underscores and hyphens become spaces).

_span_id_to_langsmith_run_id(→ str)

Derive LangSmith run_id from OTEL span_id.

_eager_link_run_to_item(→ bool)

Link a run to an eval item using a pre-computed run_id (no polling required).

_get_run_input_str(→ str)

Extract a comparable input string from an OTEL run.

_link_run_to_item(→ bool)

Link a single OTEL run to an eval item in LangSmith.

_normalize_input(→ str)

Strip JSON quoting and whitespace for robust comparison.

_match_and_link_otel_runs(→ int)

Match OTEL runs to eval items by substring and link them in LangSmith.

_find_unlinked_items_for_feedback_fallback(→ list[Any])

Return items that are still unlinked after eager+substring linking.

_retry_unlinked_references(→ int)

Retry setting reference_example_id for items whose link silently failed.

_create_run_feedback_for_unlinked_items(→ int)

Create run-level feedback for items that could not be linked to dataset examples.

_backfill_feedback_for_unlinked_items(→ int)

Backfill run-level feedback for items that remained unlinked.

Module Contents#

logger#
_LS_PIPELINE_LATENCY_S: float = 10.0#
_LS_THROUGHPUT_RPS: float = 10.0#
_LS_RETRY_DELAY_S: float = 10.0#
_estimate_indexing_time(expected_count: int) float#

Estimate the time (seconds) for LangSmith to index expected_count runs.
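Given the module constants above and the formula quoted in the `_retry_unlinked_references` docstring, the estimate reduces to a one-liner; a minimal sketch (constant values copied from this module):

```python
_LS_PIPELINE_LATENCY_S = 10.0  # fixed pipeline latency floor (seconds)
_LS_THROUGHPUT_RPS = 10.0      # assumed indexing throughput (runs/second)

def estimate_indexing_time(expected_count: int) -> float:
    # Latency floor plus the time to push expected_count runs through indexing.
    return _LS_PIPELINE_LATENCY_S + expected_count / _LS_THROUGHPUT_RPS
```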

_humanize_dataset_name(name: str) str#

Convert a raw dataset name to title case (underscores and hyphens become spaces).
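A plausible sketch of the described transformation (the exact whitespace handling is an assumption):

```python
import re

def humanize_dataset_name(name: str) -> str:
    # Replace runs of underscores/hyphens with a single space, then title-case.
    return re.sub(r"[_-]+", " ", name).strip().title()
```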

_span_id_to_langsmith_run_id(span_id: int) str#

Derive LangSmith run_id from OTEL span_id.

LangSmith deterministically maps OTEL span_ids to run UUIDs: the first 8 bytes are zeroed, the last 8 bytes are the span_id.
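The mapping described above can be sketched directly with the standard library (big-endian byte order is assumed here):

```python
import uuid

def span_id_to_run_id(span_id: int) -> str:
    # First 8 bytes zeroed, last 8 bytes carry the 64-bit OTEL span_id.
    return str(uuid.UUID(bytes=b"\x00" * 8 + span_id.to_bytes(8, "big")))
```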

_eager_link_run_to_item(...) bool#

Link a run to an eval item using a pre-computed run_id (no polling required).

Uses the deterministic span_id-to-run_id mapping to call update_run() immediately via LangSmith's write path, bypassing the indexing delay. Returns True if the linking succeeded.

_get_run_input_str(run: Any) str#

Extract a comparable input string from an OTEL run.

OTEL spans store inputs in various formats depending on the framework. This normalizes to a plain string for comparison.
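A hypothetical normalization along these lines (the `"input"` key convention and fallback behavior are assumptions, not this module's exact logic):

```python
import json
from typing import Any

def get_run_input_str(inputs: Any) -> str:
    # Unwrap a conventional "input" key when the framework exported a dict;
    # otherwise fall back to a deterministic JSON / str dump.
    if isinstance(inputs, dict):
        if "input" in inputs:
            return str(inputs["input"])
        return json.dumps(inputs, sort_keys=True)
    return str(inputs)
```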

_link_run_to_item(...) bool#

Link a single OTEL run to an eval item in LangSmith.

Sets reference_example_id on the run (links it to the dataset example) and attaches evaluator scores as feedback. Returns True if successful.

_normalize_input(text: str) str#

Strip JSON quoting and whitespace for robust comparison.
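One way to implement this, assuming a single layer of JSON string quoting is what needs stripping:

```python
def normalize_input(text: str) -> str:
    # Drop surrounding whitespace, then one layer of JSON string quoting.
    s = text.strip()
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        s = s[1:-1].strip()
    return s
```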

_match_and_link_otel_runs(...) int#

Match OTEL runs to eval items by substring and link them in LangSmith.

OTEL traces are exported asynchronously in batches, so they may not all be available immediately. This function retries up to max_retries times, waiting retry_delay seconds between attempts.

On each attempt, fetches all root runs in the project and matches them to eval items using substring comparison: if the eval item’s input text appears anywhere in the OTEL run’s input (or vice versa), they match. Matched runs get reference_example_id set and evaluator scores attached as feedback.

Returns the number of successfully matched and linked runs.
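The substring match and retry loop described above can be sketched as follows; `fetch_runs` and `link` stand in for the LangSmith queries and are hypothetical callables, not this module's API:

```python
import time

def inputs_match(item_input: str, run_input: str) -> bool:
    # Symmetric substring test: either side may embed the other.
    a, b = item_input.strip(), run_input.strip()
    return bool(a) and bool(b) and (a in b or b in a)

def match_with_retries(fetch_runs, items, link, max_retries: int, retry_delay: float) -> int:
    # Retry because OTEL spans are exported asynchronously in batches.
    linked = 0
    pending = list(items)
    for _ in range(max_retries):
        runs = fetch_runs()
        still_pending = []
        for item in pending:
            run = next((r for r in runs if inputs_match(item["input"], r["input"])), None)
            if run is not None and link(run, item):
                linked += 1
            else:
                still_pending.append(item)
        pending = still_pending
        if not pending:
            break
        time.sleep(retry_delay)
    return linked
```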

_find_unlinked_items_for_feedback_fallback(
*,
client: Any,
project_name: str,
items: list[Any],
example_ids: dict[Any, str],
) list[Any]#

Return items that are still unlinked after eager+substring linking.

If runs cannot be queried, treat all candidate items as unlinked so feedback can still be recorded at the experiment level.

_retry_unlinked_references(
*,
client: Any,
project_name: str,
items: list[Any],
example_ids: dict[Any, str],
max_attempts: int | None = None,
retry_delay: float | None = None,
) int#

Retry setting reference_example_id for items whose link silently failed.

update_run() can return 200 OK before the run is fully indexed, causing the reference_example_id to be silently dropped. This function queries list_runs to discover truly unlinked items and retries the link.

Because runs may not be indexed when this function first runs (especially for fast-completing items where Phase 2 was skipped), we retry up to max_attempts times with retry_delay seconds between each attempt.

When not explicitly provided, max_attempts and retry_delay are computed from the dataset size using the same empirical indexing constants as _match_and_link_otel_runs:

retry_delay   = _LS_RETRY_DELAY_S  (10 s)
indexing_time = pipeline_latency + (item_count / throughput)
max_attempts  = clamp(indexing_time / retry_delay, min=3, max=10)

Returns the total number of items whose reference was successfully retried.
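The parameter computation above, written out as a sketch (constant values from this module; the rounding mode inside the clamp is an assumption):

```python
import math

_LS_PIPELINE_LATENCY_S = 10.0
_LS_THROUGHPUT_RPS = 10.0
_LS_RETRY_DELAY_S = 10.0

def retry_params(item_count: int) -> tuple[int, float]:
    retry_delay = _LS_RETRY_DELAY_S
    indexing_time = _LS_PIPELINE_LATENCY_S + item_count / _LS_THROUGHPUT_RPS
    # Clamp attempts to [3, 10]; ceil is assumed for the division.
    max_attempts = max(3, min(10, math.ceil(indexing_time / retry_delay)))
    return max_attempts, retry_delay
```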

_create_run_feedback_for_unlinked_items(
*,
client: Any,
items: list[Any],
) int#

Create run-level feedback for items that could not be linked to dataset examples.

For each item with a root_span_id, derives the LangSmith run_id deterministically and attaches evaluator scores as feedback on that run. Items without a root_span_id are skipped (rare — both callbacks set needs_root_span_ids = True).

_backfill_feedback_for_unlinked_items(
*,
client: Any,
project_name: str,
items: list[Any],
example_ids: dict[Any, str],
) int#

Backfill run-level feedback for items that remained unlinked.

class LangSmithEvaluationCallback(
*,
project: str,
experiment_prefix: str = 'NAT',
)#

Links OTEL traces to LangSmith experiments for structured eval result viewing.

Pre-creates the OTEL project as an experiment (with reference_dataset_id) so OTEL traces land in an experiment project. After eval completes, retroactively links OTEL runs to dataset examples and attaches evaluator feedback scores.

needs_root_span_ids = True#
_client#
_project#
_experiment_prefix = 'NAT'#
_dataset_id: str | None = None#
_dataset_name: str | None = None#
_example_ids: dict[Any, str]#
get_eval_project_name() str#

Return a unique eval project name with auto-incrementing Run #.

Called from evaluate.py BEFORE the OTEL exporter starts to set the project name on the config. Each eval run gets its own experiment.
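One hypothetical scheme for the auto-incrementing suffix (the "Run N" naming pattern and the `existing_names` lookup are assumptions for illustration, not this class's exact logic):

```python
import re

def next_eval_project_name(base: str, existing_names: list[str]) -> str:
    # Append "Run N", one past the highest N already present for this base.
    pattern = re.compile(re.escape(base) + r" Run (\d+)$")
    runs = [int(m.group(1)) for name in existing_names
            if (m := pattern.match(name)) is not None]
    return f"{base} Run {max(runs, default=0) + 1}"
```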

on_dataset_loaded(*, dataset_name: str, items: list) None#
_pre_create_experiment_project() None#

Pre-create the OTEL project with reference_dataset_id so it’s an experiment.

on_eval_complete(result: nat.eval.eval_callbacks.EvalResult) None#