Continuous Checkpoint Evaluation#
During model training, you often want to evaluate checkpoints as they appear - without manually kicking off each run. The nel-watch command polls a directory for new checkpoint subdirectories and automatically submits evaluation jobs using your existing launcher config.
This is especially useful for:
Long training runs where checkpoints land every few hours and you want evaluation results as early as possible.
Overnight / unattended campaigns where you do not want to wait until morning to start evals.
Choosing the best checkpoint by evaluating all of them and comparing scores afterward.
Note
nel-watch currently requires SLURM execution. Evaluation configs must have an execution section with type: slurm.
Before You Start#
Ensure you have:
NeMo Evaluator Launcher installed (
pip install nemo-evaluator-launcher[all]).A working eval config with SLURM execution (
nel run --config <eval-config.yaml> --dry-runcompletes without errors).SSH access to the cluster login node from where you run
nel-watch.Good understanding of your checkpoint naming pattern (e.g.
step_1000/,iter_500/) and directory structure (to configure directory readiness markers).
Quick Start#
nel-watch is driven by a watch config YAML that combines cluster connection settings, monitoring parameters, an optional checkpoint conversion step, and one or more evaluation configs:
nel-watch --config my-watch-config.yaml
A minimal watch config looks like this:
# my-watch-config.yaml
defaults:
- cluster_config: default
- monitoring_config: default
- _self_
cluster_config:
hostname: my-cluster-login.example.com
account: my-slurm-account
output_dir: /shared/results/watch-run
monitoring_config:
directories:
- /path/to/training/checkpoints
evaluation_configs:
- /path/to/my-eval-config.yaml
This will:
Connect to the cluster via SSH.
Scan
/path/to/training/checkpointsfor subdirectories matchingiter_*orstep_*.Check that each checkpoint is ready (contains
metadata.jsonorconfig.yaml).Submit a SLURM evaluation job for each new checkpoint, setting
deployment.checkpoint_pathautomatically.Sleep 5 minutes (default), then repeat.
Press Ctrl+C to stop gracefully — the current cycle finishes before exiting.
Watch Config Reference#
cluster_config#
Shared settings for all SLURM jobs submitted by nel-watch (monitoring, conversion, and evaluation).
cluster_config:
hostname: my-cluster-login.example.com # Use 'localhost' when running on the login node directly
username: ${oc.env:USER} # Defaults to $USER
account: my-slurm-account
partition: batch # Default: batch
output_dir: /shared/results/watch-run # Per-checkpoint subdirectories are created here
sbatch_extra_flags: # Additional #SBATCH directives for every job
gres: "gpu:8"
time: "04:00:00"
monitoring_config#
Controls how directories are scanned for new checkpoints.
monitoring_config:
directories:
- /path/to/training/checkpoints # Multiple directories are supported
interval: 300 # Seconds between scans; set to null to scan once and exit
ready_markers: # Checkpoint is ready when ANY of these files exist inside it
- metadata.json
- config.yaml
- config.json
checkpoint_patterns: # Only subdirectories matching ANY pattern are considered
- "iter_*"
- "step_*"
order: last # 'last' = highest name first (step_10000 before step_1000)
# 'first' = lowest name first
conversion_config (optional)#
If you would like to evaluate your checkpoints in a different format than the one used for training, add a conversion step. A conversion SLURM job is submitted first; the evaluation job is held with afterok dependency until it completes.
conversion_config:
container: my-training-container:latest
command_pattern: "convert -i {{ input_path }} -o {{ output_path }}"
Note
command_pattern must contain {{ input_path }} and {{ output_path }} placeholders.
Omit conversion_config (or set it to null) to use the raw checkpoint path directly in the evaluation.
See the Hydra config group templates in configs/watcher/conversion_config/ for ready-to-use conversion presets (e.g. mbridge).
evaluation_configs#
A list of paths to standard NEL evaluation config files. Each config is run for every discovered checkpoint. nel-watch automatically sets deployment.checkpoint_path and execution.output_dir in each config before submission and there is no need to pre-define them.
evaluation_configs:
- /path/to/eval-config-A.yaml
- /path/to/eval-config-B.yaml
Using Hydra Config Groups#
Watch config fields can be factored out using Hydra defaults lists. The package ships templates in configs/watcher/:
# my-watch-config.yaml
defaults:
- cluster_config: default
- monitoring_config: default
- conversion_config: mbridge # conversion template for Megatron-Bridge -> Hugging Face
- _self_
cluster_config:
hostname: my-cluster.example.com
account: my-account
output_dir: /shared/results
monitoring_config:
directories:
- /path/to/checkpoints
conversion_config:
container: /path/to/mbridge.sqsh
command_params:
hf_model: /path/to/reference-hf-model
tp: 8
evaluation_configs:
- /path/to/eval.yaml
Troubleshooting#
No checkpoints found: Verify that subdirectories match checkpoint_patterns and contain at least one ready_markers file. Run with --dry-run to see what the watcher detects.
SSH errors: Ensure your SSH key is loaded (ssh-add) and you can manually ssh <username>@<hostname> without a password prompt. When running directly on the login node, set hostname: localhost in cluster_config.
Evaluation config validation error: Each eval config must have both deployment and execution sections. execution.type must be slurm. Do not set execution.output_dir — nel-watch manages it per checkpoint.
Re-evaluating a specific checkpoint: Remove its entry from the state file, or use --resubmit-previous-sessions to resubmit all previously seen checkpoints.
See Also#
Dry Run — Preview resolved configuration before running.
Python API — Programmatic access to launcher functionality.
CLI — Commandline interface.