# CI Regression Gate

Block merge requests that cause evaluation regressions.
## GitLab CI

Include the eval template in your `.gitlab-ci.yml`:

```yaml
include:
  - local: deploy/gitlab-ci-eval.yml
```
Or add the stages directly:

```yaml
stages:
  - eval
  - regression

eval:candidate:
  stage: eval
  image: nemo-evaluator:latest
  script:
    - nel eval run --bench gsm8k --repeats 2 --max-problems 50 -o results/candidate --no-progress
  artifacts:
    paths: [results/candidate/]
  rules:
    - if: $CI_MERGE_REQUEST_IID

eval:baseline:
  stage: eval
  image: nemo-evaluator:latest
  script:
    # MR pipelines clone only the source branch, so fetch the target branch first
    - git fetch origin $CI_MERGE_REQUEST_TARGET_BRANCH_NAME
    - git checkout $CI_MERGE_REQUEST_TARGET_BRANCH_NAME
    - pip install -e ".[scoring]"
    - nel eval run --bench gsm8k --repeats 2 --max-problems 50 -o results/baseline --no-progress
  artifacts:
    paths: [results/baseline/]
  rules:
    - if: $CI_MERGE_REQUEST_IID

regression:check:
  stage: regression
  image: nemo-evaluator:latest
  needs: [eval:candidate, eval:baseline]
  script:
    - nel compare results/baseline/eval-*.json results/candidate/eval-*.json --max-drop 0.05 --strict
  artifacts:
    paths: [results/regression.json]
  rules:
    - if: $CI_MERGE_REQUEST_IID
```
## How it works

```mermaid
sequenceDiagram
    participant MR as Merge Request
    participant B as eval:baseline
    participant C as eval:candidate
    participant R as regression:check
    MR->>B: Trigger (target branch)
    MR->>C: Trigger (MR branch)
    B-->>R: results/baseline/eval-*.json
    C-->>R: results/candidate/eval-*.json
    R->>R: compare_runs()
    alt delta > max_drop
        R-->>MR: ❌ Pipeline failed
    else delta ≤ max_drop
        R-->>MR: ✅ Pipeline passed
    end
```
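At its core, the gate's pass/fail decision is a threshold comparison on the score delta. A minimal sketch of that logic (`gate` and its signature are hypothetical, for illustration only; the real `nel compare` also applies the significance testing described below):

```python
def gate(baseline_score: float, candidate_score: float, max_drop: float = 0.05) -> bool:
    """Return True (pipeline passes) unless the candidate's score drops
    by more than max_drop relative to the baseline.

    Hypothetical helper illustrating the comparison, not the nel compare API.
    """
    delta = baseline_score - candidate_score
    return delta <= max_drop
```

In CI, a failing gate maps to a nonzero exit code, which fails the `regression:check` job and blocks the merge.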
## Statistical significance

With scipy installed (`pip install nemo-evaluator[stats]`), `nel compare` includes McNemar significance testing, effect-size confidence intervals, and power analysis. These distinguish meaningful regressions from benchmark noise. See Comparing Evaluation Runs for details on interpreting the statistical output.
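For intuition, an exact McNemar test needs only the two discordant counts: problems where exactly one of the two runs passed. A stdlib-only sketch (illustrative; `nel compare`'s actual implementation may differ):

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value on paired per-problem outcomes.

    b: problems the baseline passed but the candidate failed
    c: problems the candidate passed but the baseline failed
    Under H0 (no real change), discordant outcomes are Binomial(b + c, 0.5),
    so the p-value is twice the lower binomial tail, capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of change either way
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

For example, 8 baseline-only passes against 1 candidate-only pass gives p ≈ 0.039, a significant regression at the 0.05 level, while a symmetric 5-vs-5 split gives p = 1.0.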
Threshold tuning#
Scenario |
Threshold |
Repeats |
Max problems |
|---|---|---|---|
Quick smoke test |
0.10 |
1 |
50 |
Standard gate |
0.05 |
2 |
100 |
High-confidence |
0.03 |
4 |
full dataset |
Higher repeats reduce noise in pass@k estimation, and more problems reduce sampling variance. Significance testing requires at least two samples per run, so use `--repeats 2` or higher when you need p-values.
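To sanity-check a threshold against noise, consider the standard error of mean accuracy under an idealized independent-Bernoulli-trials assumption (repeated runs on the same problems are correlated in practice, so this understates the true variance):

```python
from math import sqrt

def accuracy_stderr(p: float, n_problems: int, repeats: int) -> float:
    """Binomial standard error of mean accuracy over n_problems * repeats
    trials, assuming each trial is an independent Bernoulli(p) draw."""
    n = n_problems * repeats
    return sqrt(p * (1 - p) / n)

# At 50% accuracy, the smoke-test setting is noisier than the standard gate:
smoke = accuracy_stderr(0.5, 50, 1)      # ~0.071
standard = accuracy_stderr(0.5, 100, 2)  # ~0.035
```

A threshold much below roughly twice this standard error will trip on noise alone, which is why the smoke-test row pairs the smallest sample with the loosest threshold (0.10) and the high-confidence row pairs the largest sample with the tightest (0.03).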
## GitHub Actions

```yaml
jobs:
  eval-baseline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { ref: main }
      - run: pip install -e ".[scoring]"
      - run: nel eval run --bench gsm8k --repeats 2 --max-problems 50 -o results/baseline --no-progress
      - uses: actions/upload-artifact@v4
        with: { name: baseline, path: results/baseline/ }

  eval-candidate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e ".[scoring]"
      - run: nel eval run --bench gsm8k --repeats 2 --max-problems 50 -o results/candidate --no-progress
      - uses: actions/upload-artifact@v4
        with: { name: candidate, path: results/candidate/ }

  regression:
    needs: [eval-baseline, eval-candidate]
    runs-on: ubuntu-latest
    steps:
      # Checkout is needed so the editable install below can find the package
      - uses: actions/checkout@v4
      # Without a `name`, download-artifact@v4 downloads every artifact into a
      # directory named after it: baseline/ and candidate/
      - uses: actions/download-artifact@v4
      - run: pip install -e ".[scoring]"
      - run: nel compare baseline/eval-*.json candidate/eval-*.json --strict
```