Artifact Tracking#
nemo_runspec.artifacts provides the tracking infrastructure that connects data prep, training, and evaluation stages through versioned artifact references. It supports two backends that can run simultaneously:
Manifest-based tracking (nemo_runspec.manifest_tracker): writes structured JSON manifests to any fsspec-compatible storage (local, Lustre, S3, GCS, HF Hub). Always reliable, zero external dependencies.
W&B artifact tracking: logs artifacts to Weights & Biases for team collaboration, UI browsing, and lineage graphs.
Manifests are the always-on foundation; W&B is best-effort on top.
For artifact types (PretrainBlendsArtifact, SFTDataArtifact, etc.) and the Pydantic base class, see nemotron.kit artifacts.
Configuration#
env.toml#
Configure artifact tracking in your env.toml. These are team-wide defaults merged into every job config by build_job_config():
# Both backends: manifest always, wandb best-effort
[artifacts]
wandb = true
[artifacts.manifest]
root = "/lustre/team/artifacts"
[wandb]
project = "nemotron"
entity = "YOUR-TEAM"
Backend combinations:
| Setup | Use Case |
|---|---|
| Manifest + wandb | Recommended. Manifest is always reliable; wandb adds UI + team features |
| Manifest only | Offline / local development, no wandb dependency |
| wandb only | Legacy behavior, wandb-only tracking |
| Neither | No artifact tracking |
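As a minimal sketch, the four combinations above can be derived from the `[artifacts]` section alone. This assumes a backend counts as active when `manifest.root` is set to a non-empty value and when `wandb` is truthy; `active_backends` is a hypothetical helper for illustration, not part of nemo_runspec.

```python
from typing import Any, Mapping, Optional, Tuple


def active_backends(artifacts: Optional[Mapping[str, Any]]) -> Tuple[bool, bool]:
    """Return (manifest_active, wandb_active) for an [artifacts] section.

    Assumption: manifest tracking is on when manifest.root is non-empty,
    and wandb tracking is on when the wandb flag is truthy.
    """
    artifacts = artifacts or {}
    manifest_root = (artifacts.get("manifest") or {}).get("root")
    manifest_active = bool(manifest_root)
    wandb_active = bool(artifacts.get("wandb", False))
    return manifest_active, wandb_active


# The four rows of the table, expressed as parsed config dictionaries.
both = active_backends({"wandb": True, "manifest": {"root": "/lustre/team/artifacts"}})
manifest_only = active_backends({"manifest": {"root": "/lustre/team/artifacts"}})
wandb_only = active_backends({"wandb": True})
neither = active_backends({"wandb": False, "manifest": {"root": None}})
```

The last call mirrors the `artifacts.manifest.root=null artifacts.wandb=false` override used to disable tracking for a single run.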
Disabling per-run#
Override via dotlist on any command:
# Run with manifest only (no wandb)
uv run nemotron super3 sft --run dlw artifacts.wandb=false
# Fully disable artifact tracking
uv run nemotron super3 sft --run dlw artifacts.manifest.root=null artifacts.wandb=false
Manifest Storage Backends#
The manifest.root value supports any fsspec-compatible URI:
| Backend | Example | Notes |
|---|---|---|
| Local / Lustre / NFS | | Best for training clusters |
| S3 | | Requires AWS credentials |
| GCS | | Requires GCP credentials |
| HF Hub | | Great for sharing; batched commits |
Config Flow#
build_job_config() merges [artifacts] from env.toml into the job config (YAML overrides env.toml)
extract_train_config() preserves the top-level artifacts: section
Inside the container, setup_artifact_tracking() reads config.artifacts to initialize backends
build_env_vars() extracts WANDB_PROJECT / WANDB_ENTITY from [wandb] in env.toml
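The merge precedence in the first step (job YAML overrides env.toml defaults) can be illustrated with a plain recursive dictionary merge. This is a sketch only: `deep_merge` is a hypothetical stand-in, and the real `build_job_config()` may merge differently, for example through OmegaConf.

```python
import copy
from typing import Any, Dict


def deep_merge(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge overrides into defaults; override values win."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are tables: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Team-wide defaults, as they would look after parsing env.toml.
env_defaults = {
    "artifacts": {"wandb": True, "manifest": {"root": "/lustre/team/artifacts"}},
}
# A job YAML that only turns wandb off.
job_yaml = {"artifacts": {"wandb": False}}

job_config = deep_merge(env_defaults, job_yaml)
```

Here the job keeps the team's manifest root while its own `wandb: false` takes precedence over the env.toml default.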
The Two-Function API#
Scripts use two functions from nemo_runspec.artifacts:
from nemo_runspec.artifacts import setup_artifact_tracking, log_artifact
setup_artifact_tracking(config): Initialize backends and resolve references#
Called early, before dataclass conversion. Reads config.artifacts, initializes the active backends, and resolves ${art:...} references.
config = load_omegaconf_yaml(config_path)
tracking = setup_artifact_tracking(config) # BEFORE dataclass conversion
cfg = omegaconf_to_dataclass(config, MyConfig)
Returns an ArtifactTrackingResult with flags:
tracking.manifest: whether manifest backend is active
tracking.wandb: whether wandb backend is active
tracking.qualified_names: wandb artifact qualified names for lineage registration
log_artifact(artifact, tracking): Save to all active backends#
Saves the artifact to all active backends. Replaces artifact.save() in data-prep scripts:
artifact = SFTDataArtifact(path=output_dir, pack_size=4096, ...)
log_artifact(artifact, tracking) # logs to manifest + wandb
When both backends are active:
artifact.save() logs via the global lineage tracker (WandbTracker after init_wandb_from_env())
log_artifact additionally writes to ManifestTracker (stored reference survives the global overwrite)
When only one backend is active, artifact.save() handles everything and log_artifact is effectively a pass-through.
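One way to read "manifest always, wandb best-effort" is sketched below: a manifest failure propagates, while a wandb failure is logged and swallowed. This is an interpretation, not the library's implementation; `log_everywhere` and the callables passed to it are hypothetical.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger("artifact_sketch")


def log_everywhere(
    artifact_name: str,
    manifest_write: Optional[Callable[[str], None]],
    wandb_log: Optional[Callable[[str], None]],
) -> List[str]:
    """Log to every active backend and return the names of those that succeeded."""
    succeeded = []
    if manifest_write is not None:
        # Manifest is the foundation: let any error surface to the caller.
        manifest_write(artifact_name)
        succeeded.append("manifest")
    if wandb_log is not None:
        try:
            wandb_log(artifact_name)
            succeeded.append("wandb")
        except Exception as exc:  # best-effort: record the failure and continue
            logger.warning("wandb logging failed for %s: %s", artifact_name, exc)
    return succeeded


written: List[str] = []


def offline_wandb(name: str) -> None:
    raise ConnectionError("wandb unreachable")


result = log_everywhere("nano3-sft-data", written.append, offline_wandb)
```

With wandb unreachable, the manifest record is still written and the call returns normally.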
Resolution Priority#
When resolving ${art:...} references:
| Active Backends | Resolution Method |
|---|---|
| wandb + manifest | Resolve via wandb API (downloads artifact, gets lineage) |
| manifest only | Resolve from local/fsspec manifest files |
| wandb only | Resolve via wandb API |
| neither | Fall through to local resolver |
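The priority table reduces to a short decision function: wandb wins whenever it is active, then manifest, then the local resolver. The function name and the returned labels are hypothetical and only mirror the table rows.

```python
def resolution_method(manifest: bool, wandb: bool) -> str:
    """Pick how ${art:...} references are resolved, following the table above."""
    if wandb:
        # wandb takes priority whether or not manifest is also active.
        return "wandb_api"
    if manifest:
        return "manifest_files"
    return "local_resolver"


methods = {
    "wandb + manifest": resolution_method(manifest=True, wandb=True),
    "manifest only": resolution_method(manifest=True, wandb=False),
    "wandb only": resolution_method(manifest=False, wandb=True),
    "neither": resolution_method(manifest=False, wandb=False),
}
```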
Usage Patterns#
Train Scripts (self-contained packager)#
Train scripts are inlined into main.py by the self-contained packager. They use setup_artifact_tracking for initialization and stage-specific monkey-patches for checkpoint logging:
tracking = setup_artifact_tracking(config, artifacts_key="run")
# Stage-specific monkey-patches based on active backends
if tracking.manifest and tracking.wandb:
    patch_checkpoint_logging_both()
elif tracking.wandb:
    patch_wandb_checkpoint_logging()
elif tracking.manifest:
    patch_manifest_checkpoint_logging()
# Wandb lineage: register input artifacts after wandb.init()
if tracking.wandb:
    patch_wandb_init_for_lineage(
        artifact_qualified_names=tracking.qualified_names,
        tags=["pretrain"],
    )
Data Prep Scripts (code packager)#
Data prep scripts use the code packager (full rsync), so the complete nemo_runspec API is available at runtime. They use log_artifact for explicit artifact saving:
tracking = setup_artifact_tracking(config)
if tracking.wandb:
    init_wandb_from_env()  # creates wandb run for metrics
# ... run pipeline ...
log_artifact(artifact, tracking) # logs to manifest + wandb
wandb_kit.finish_run(exit_code=0)
Manifest Directory Structure#
{root}/
├── nano3-pretrain-data/
│   ├── v1/
│   │   ├── manifest.json    # Full provenance record
│   │   └── metadata.json    # Resolver-compatible format
│   ├── v2/
│   │   ├── manifest.json
│   │   └── metadata.json
│   └── latest               # Plain text file: "v2"
│
├── nano3-sft-model/
│   ├── v1/
│   │   ├── manifest.json
│   │   └── metadata.json
│   └── latest
Zero-copy: Data stays at its original location. Only JSON metadata is written.
latest file: Plain text containing the version directory name (e.g., v2). Works on all filesystems, with no symlink issues.
Version discovery: Scan v*/ directories. latest provides O(1) latest lookup.
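The layout above can be reproduced on a local filesystem with a few lines of standard-library Python. This sketch only writes manifest.json and the latest pointer (metadata.json is omitted for brevity), and the helper names are hypothetical rather than the ManifestTracker API.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import List, Optional

VERSION_DIR = re.compile(r"^v(\d+)$")


def list_versions(artifact_dir: Path) -> List[int]:
    """Version discovery: scan v*/ directories and return sorted version numbers."""
    versions = []
    if artifact_dir.is_dir():
        for child in artifact_dir.iterdir():
            match = VERSION_DIR.match(child.name)
            if child.is_dir() and match:
                versions.append(int(match.group(1)))
    return sorted(versions)


def write_version(root: Path, name: str, manifest: dict) -> str:
    """Write the next version directory and point `latest` at it."""
    artifact_dir = root / name
    existing = list_versions(artifact_dir)
    version = existing[-1] + 1 if existing else 1
    version_dir = artifact_dir / f"v{version}"
    version_dir.mkdir(parents=True)
    record = dict(manifest, name=name, version=version)
    (version_dir / "manifest.json").write_text(json.dumps(record, indent=2))
    # Plain text pointer instead of a symlink, so it works on any filesystem.
    (artifact_dir / "latest").write_text(f"v{version}")
    return f"v{version}"


def read_latest(root: Path, name: str) -> Optional[str]:
    """O(1) lookup of the newest version directory name."""
    latest = root / name / "latest"
    return latest.read_text().strip() if latest.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_path = {"path": "/lustre/data/nano3/sft/output/splits"}  # data itself is not copied
    first = write_version(root, "nano3-sft-data", data_path)
    second = write_version(root, "nano3-sft-data", data_path)
    latest = read_latest(root, "nano3-sft-data")
    versions = list_versions(root / "nano3-sft-data")
```

Only JSON is written under the root; the `path` field keeps pointing at the original data location, which is what makes the scheme zero-copy.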
manifest.json#
Full provenance and lineage record:
{
  "name": "nano3-sft-data",
  "version": 2,
  "type": "SFTDataArtifact",
  "path": "/lustre/data/nano3/sft/output/splits",
  "created_at": "2026-03-11T10:30:00-07:00",
  "producer": "nemo_abc123",
  "metadata": {
    "pack_size": 4096,
    "total_tokens": 15000000
  },
  "inputs": ["hf://nvidia/dataset-a"],
  "used_artifacts": ["nano3-pretrain-data:v2"]
}
metadata.json#
Resolver-compatible format: the full Pydantic model_dump() output. The ${art:data,field} resolver reads fields from this file.
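A field lookup against metadata.json might look like the sketch below, which follows the latest pointer when no version is given. It assumes artifact fields such as `pack_size` sit at the top level of the dumped model; `read_artifact_field` is a hypothetical helper, not the actual resolver.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def read_artifact_field(
    root: Path, name: str, field: str, version: Optional[str] = None
) -> Any:
    """Read one field from {root}/{name}/{version}/metadata.json."""
    artifact_dir = root / name
    if version is None:
        # Follow the plain-text `latest` pointer.
        version = (artifact_dir / "latest").read_text().strip()
    metadata = json.loads((artifact_dir / version / "metadata.json").read_text())
    return metadata[field]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    version_dir = root / "nano3-sft-data" / "v2"
    version_dir.mkdir(parents=True)
    # Stand-in for a Pydantic model_dump() of an SFT data artifact.
    dumped = {"path": "/lustre/data/nano3/sft/output/splits", "pack_size": 4096}
    (version_dir / "metadata.json").write_text(json.dumps(dumped))
    (root / "nano3-sft-data" / "latest").write_text("v2")

    pack_size = read_artifact_field(root, "nano3-sft-data", "pack_size")
    data_path = read_artifact_field(root, "nano3-sft-data", "path", version="v2")
```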
CLI Display#
Job Submission Panel#
Shows which artifact stores are active before the job runs:
Job Submission
├── configs
│   ├── job: /path/to/job.yaml
│   └── train: /path/to/train.yaml
├── artifacts
│   ├── manifest: /lustre/.../artifacts
│   └── wandb: ✓ enabled
└── mode: attached
Completion Panel#
After a data prep or training job completes:
╭── Step Complete (super3-pretrain-data-tiny) ──╮
│ ...                                           │
│ Manifest: /lustre/.../super3-pretrain-data/v1 │
│ W&B: https://wandb.ai/romeyn/nemotron/..      │
╰───────────────────────────────────────────────╯
Viewing Lineage#
From Manifest Files#
Inspect manifests directly on the filesystem:
# See latest version
cat /lustre/artifacts/nano3-sft-data/latest
# v2
# Read manifest
cat /lustre/artifacts/nano3-sft-data/v2/manifest.json
# {"name": "nano3-sft-data", "version": 2, "path": "...", ...}
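The same inspection can be scripted. The sketch below walks `used_artifacts` recursively to list everything upstream of an artifact, assuming references use the `name:vN` form shown in manifest.json and that a bare name means the latest version. The helper names are hypothetical, and the demo manifests are created in a temporary directory.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def load_manifest(root: Path, ref: str) -> dict:
    """Load manifest.json for 'name:vN', or for the latest version of 'name'."""
    name, _, version = ref.partition(":")
    artifact_dir = root / name
    if not version:
        version = (artifact_dir / "latest").read_text().strip()
    return json.loads((artifact_dir / version / "manifest.json").read_text())


def upstream(root: Path, ref: str, seen: Optional[List[str]] = None) -> List[str]:
    """Collect all upstream artifact references, nearest first."""
    seen = [] if seen is None else seen
    for parent in load_manifest(root, ref).get("used_artifacts", []):
        if parent not in seen:
            seen.append(parent)
            upstream(root, parent, seen)
    return seen


def write_manifest(root: Path, name: str, version: int, used: List[str]) -> None:
    """Demo helper: create a minimal manifest and update the latest pointer."""
    version_dir = root / name / f"v{version}"
    version_dir.mkdir(parents=True)
    record = {"name": name, "version": version, "used_artifacts": used}
    (version_dir / "manifest.json").write_text(json.dumps(record))
    (root / name / "latest").write_text(f"v{version}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_manifest(root, "nano3-pretrain-data", 2, [])
    write_manifest(root, "nano3-sft-data", 2, ["nano3-pretrain-data:v2"])
    write_manifest(root, "nano3-sft-model", 1, ["nano3-sft-data:v2"])
    lineage = upstream(root, "nano3-sft-model")
```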
In W&B#
Navigate to your project's Artifacts tab
Select any artifact (e.g., ModelArtifact-rl)
Click the Graph view to see upstream dependencies
Troubleshooting#
"Artifact not found"#
Check artifact name and version (inspect manifest directory or W&B UI)
Verify manifest.root in env.toml points to the correct location
For wandb: ensure project and entity match, and run wandb login
Manifest path empty in completion panel#
Verify [artifacts.manifest] is configured in env.toml
Check that the manifest root directory is writable
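For a local, Lustre, or NFS root, writability can be checked by creating and discarding a temporary file, as in the sketch below. `manifest_root_is_writable` is a hypothetical helper; it does not apply to remote roots such as S3, GCS, or HF Hub, where credentials and permissions need to be checked instead.

```python
import tempfile
from pathlib import Path


def manifest_root_is_writable(root: str) -> bool:
    """Return True if a file can be created inside a local manifest root."""
    path = Path(root)
    if not path.is_dir():
        return False
    try:
        # The temporary file is removed as soon as the context exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    writable = manifest_root_is_writable(tmp)
    missing = manifest_root_is_writable(str(Path(tmp) / "does-not-exist"))
```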
Wandb digest mismatch#
The pipeline automatically patches wandb's digest verification for local file references. If you still see errors, ensure you're using the latest version of the pipeline code.
Running without wandb#
Set artifacts.wandb=false on the command line or remove the wandb = true line from [artifacts] in env.toml. Manifest tracking works independently.
Further Reading#
Artifact Types: Pydantic artifact classes and custom types
OmegaConf Resolvers: ${art:...} interpolations
W&B Integration: credentials and configuration
Execution & env.toml: execution profiles and env.toml reference