Kubernetes Deployment

Single evaluation job

kubectl apply -f deploy/k8s/eval-job.yaml

Manifest: deploy/k8s/eval-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: nel-eval-gsm8k
spec:
  template:
    spec:
      containers:
        - name: eval
          image: nemo-evaluator:latest
          args: ["eval", "run", "--bench", "gsm8k", "--repeats", "4",
                 "--output-dir", "/data/results", "--no-progress"]
          env:
            - name: NEMO_API_KEY
              valueFrom:
                secretKeyRef:
                  name: nemo-api
                  key: api_key
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: eval-data
      restartPolicy: Never
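
The Job above references a Secret named nemo-api (key api_key) and a PersistentVolumeClaim named eval-data, so both need to exist in the namespace before the Job is applied. A minimal sketch of those prerequisites follows. The placeholder key value, the 10Gi request, and the access mode are assumptions for illustration and are not taken from the repository.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: nemo-api
type: Opaque
stringData:
  api_key: "<your-api-key>"   # placeholder, replace before applying
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: eval-data
spec:
  accessModes:
    - ReadWriteOnce   # assumption; pods on several nodes sharing the claim need ReadWriteMany
  resources:
    requests:
      storage: 10Gi   # assumption; size this to fit your results
```

Using stringData lets the key be written in plain text; Kubernetes stores it base64-encoded under data.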

Distributed evaluation (Indexed Job)

flowchart TB
    IJ["Indexed Job<br/>completions=8"] --> P0["Pod 0<br/>NEL_SHARD_IDX=0"]
    IJ --> P1["Pod 1<br/>NEL_SHARD_IDX=1"]
    IJ --> P7["Pod 7<br/>NEL_SHARD_IDX=7"]
    P0 --> PVC["Shared PVC<br/>shard_0/ ... shard_7/"]
    P1 --> PVC
    P7 --> PVC
    PVC --> MJ["Merge Job"]
    MJ --> RESULT["merged/eval-*.json"]

Apply deploy/k8s/eval-indexed-job.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: nel-eval-sharded
spec:
  completions: 8
  parallelism: 8
  completionMode: Indexed
  template:
    spec:
      containers:
        - name: eval
          image: nemo-evaluator:latest
          args: ["eval", "run", "--bench", "gsm8k", "--repeats", "8",
                 "--output-dir", "/data/shard_$(NEL_SHARD_IDX)", "--no-progress"]
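          env:
            # Expose the Job's completion index (0-7) to the container.
            - name: NEL_SHARD_IDX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: NEL_TOTAL_SHARDS
              value: "8"
          # Assumption: shards write to the same eval-data claim as the single job.
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: eval-data
      # Job pod templates must set restartPolicy to Never or OnFailure.
      restartPolicy: Never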

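Once all eight shards have completed, run the merge job, which combines the shard directories into merged/eval-*.json: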

kubectl apply -f deploy/k8s/eval-merge.yaml

Persistent serving

For Gym training integration, deploy the evaluator as a long-running service:

kubectl apply -f deploy/k8s/serve-deployment.yaml

This creates a Deployment and a ClusterIP Service, reachable from inside the cluster at nel-serve.default.svc:9090.

The Deployment includes readiness and liveness probes against /health on port 9090:

readinessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 15
  periodSeconds: 30
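
A ClusterIP Service that resolves to nel-serve.default.svc:9090 could look roughly like the sketch below. The name, namespace, and port follow from that address. The app: nel-serve selector label is an assumption and must match whatever labels the Deployment's pod template actually uses.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nel-serve
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: nel-serve      # assumption: must match the Deployment's pod labels
  ports:
    - port: 9090        # port that clients connect to
      targetPort: 9090  # container port, the same one the probes use
```

Because the Service only routes to pods that pass the readiness probe, requests are held back until /health responds successfully.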

From Gym training pods, point the resource server at the Service endpoint:

resource_servers:
  nemo_evaluator:
    endpoint: http://nel-serve.default.svc:9090