Container Direct#

Best for: Users who prefer container-based workflows

The Container Direct approach gives you full control over the container environment: custom volume mounts, environment variable management, and direct integration into Docker-based CI/CD pipelines.

Prerequisites#

  • Docker with GPU support

  • An OpenAI-compatible model endpoint and its API key (the examples below use NVIDIA's hosted endpoint with an NGC API key)
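
To confirm GPU passthrough before pulling the evaluation image, you can run a quick smoke test. This is a minimal sketch: it assumes the NVIDIA Container Toolkit is installed, and the CUDA image tag is illustrative.

# Verify that Docker can see the GPU
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi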

Quick Start#

# 1. Pull evaluation container
docker pull nvcr.io/nvidia/eval-factory/simple-evals:25.09

# 2. Run container interactively
docker run --rm -it nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash

# 3. Inside container - set up environment
export NGC_API_KEY=nvapi-your-key-here
export HF_TOKEN=hf_your-token-here  # If using gated datasets

# 4. Run evaluation
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name NGC_API_KEY \
    --output_dir /tmp/results \
    --overrides 'config.params.limit_samples=10'  # Remove to run on full benchmark
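
# 5. Inspect the output directory (the exact file layout depends on the evaluation harness)
ls -la /tmp/results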

Complete Container Workflow#

Here’s a complete example with volume mounting and advanced configuration:

# 1. Create local directories for persistent storage
mkdir -p ./results ./cache ./logs

# 2. Run container with volume mounts
docker run --rm -it \
    -v "$(pwd)/results:/workspace/results" \
    -v "$(pwd)/cache:/workspace/cache" \
    -v "$(pwd)/logs:/workspace/logs" \
    -e NGC_API_KEY=nvapi-your-key-here \
    -e HF_TOKEN=hf_your-token-here \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash

# 3. Inside container - run evaluation
nemo-evaluator run_eval \
    --eval_type mmlu_pro \
    --model_id meta/llama-3.1-8b-instruct \
    --model_url https://integrate.api.nvidia.com/v1/chat/completions \
    --model_type chat \
    --api_key_name NGC_API_KEY \
    --output_dir /workspace/results \
    --overrides 'config.params.limit_samples=3'  # Remove to run on full benchmark

# 4. Exit container and check results
exit
ls -la ./results/

One-Liner Container Execution#

For automated workflows, you can run everything in a single command:

# Set your API key in the shell
NGC_API_KEY=nvapi-your-key-here

# Run evaluation directly in container
docker run --rm \
    -v "$(pwd)/results:/workspace/results" \
    -e NGC_API_KEY="${NGC_API_KEY}" \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09 \
    nemo-evaluator run_eval \
        --eval_type mmlu_pro \
        --model_id meta/llama-3.1-8b-instruct \
        --model_url https://integrate.api.nvidia.com/v1/chat/completions \
        --model_type chat \
        --api_key_name NGC_API_KEY \
        --output_dir /workspace/results

Key Features#

Full Container Control#

  • Direct access to container environment

  • Custom volume mounting strategies

  • Environment variable management

  • GPU resource allocation
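
For GPU allocation specifically, Docker's standard --gpus flag applies when starting the container. A minimal sketch (note that the examples above call a remote endpoint, so local GPUs are only needed when the model is hosted inside the container):

# Expose all GPUs, or a single device, to the container
docker run --rm -it --gpus all \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash
docker run --rm -it --gpus device=0 \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash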

CI/CD Integration#

  • Single-command execution for automation

  • Docker Compose compatibility

  • Kubernetes deployment ready

  • Pipeline integration capabilities
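
Because docker run exits with the container's exit status, the one-liner pattern above drops straight into a pipeline step. A minimal sketch (assuming nemo-evaluator returns a nonzero exit code on failure and NGC_API_KEY is injected as a pipeline secret):

#!/bin/bash
# ci_eval.sh - fail the pipeline step if the evaluation fails
set -euo pipefail

docker run --rm \
    -v "$(pwd)/results:/workspace/results" \
    -e NGC_API_KEY="${NGC_API_KEY}" \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09 \
    nemo-evaluator run_eval \
        --eval_type mmlu_pro \
        --model_id meta/llama-3.1-8b-instruct \
        --model_url https://integrate.api.nvidia.com/v1/chat/completions \
        --model_type chat \
        --api_key_name NGC_API_KEY \
        --output_dir /workspace/results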

Persistent Storage#

  • Volume mounting for results persistence

  • Cache directory management

  • Log file preservation

  • Custom configuration mounting
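
For example, configuration files can be mounted read-only alongside the writable results and cache directories (the :ro suffix is standard Docker volume syntax; /workspace/configs mirrors the Docker Compose example below):

docker run --rm -it \
    -v "$(pwd)/results:/workspace/results" \
    -v "$(pwd)/cache:/workspace/cache" \
    -v "$(pwd)/configs:/workspace/configs:ro" \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09 bash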

Environment Isolation#

  • Clean, reproducible environments

  • Dependency management handled

  • Version pinning through container tags (or digests; see the sketch below)

  • No local Python environment conflicts
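
Registry tags such as 25.09 can move over time, so for fully reproducible runs you can pin the image by digest instead. A sketch (run this after docker pull; the digest in the final reference is a placeholder):

# Resolve the digest of the tag you validated against
docker inspect --format '{{index .RepoDigests 0}}' \
    nvcr.io/nvidia/eval-factory/simple-evals:25.09

# Then reference the image by digest in scripts and CI:
#   nvcr.io/nvidia/eval-factory/simple-evals@sha256:<digest>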

Advanced Container Patterns#

Docker Compose Integration#

# docker-compose.yml
services:
  nemo-eval:
    image: nvcr.io/nvidia/eval-factory/simple-evals:25.09
    volumes:
      - ./results:/workspace/results
      - ./cache:/workspace/cache
      - ./configs:/workspace/configs
    environment:
      - MY_API_KEY=${NGC_API_KEY}
    command: >
      nemo-evaluator run_eval
      --eval_type mmlu_pro
      --model_id meta/llama-3.1-8b-instruct
      --model_url https://integrate.api.nvidia.com/v1/chat/completions
      --model_type chat
      --api_key_name MY_API_KEY
      --output_dir /workspace/results
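
With NGC_API_KEY exported in your shell (or defined in a local .env file, which Docker Compose reads automatically for variable substitution), the service runs as a one-shot job:

docker compose up
# Or run it once without keeping the service around:
docker compose run --rm nemo-eval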

Batch Processing Script#

#!/bin/bash
# batch_eval.sh

BENCHMARKS=("mmlu_pro" "gpqa_diamond" "humaneval")
NGC_API_KEY=nvapi-your-key-here
HF_TOKEN=hf_your-token-here  # Needed for GPQA-Diamond (gated dataset)

for benchmark in "${BENCHMARKS[@]}"; do
    echo "Running evaluation for $benchmark..."

    docker run --rm \
        -v "$(pwd)/results:/workspace/results" \
        -e MY_API_KEY="$NGC_API_KEY" \
        -e HF_TOKEN="$HF_TOKEN" \
        nvcr.io/nvidia/eval-factory/simple-evals:25.09 \
        nemo-evaluator run_eval \
            --eval_type "$benchmark" \
            --model_id meta/llama-3.1-8b-instruct \
            --model_url https://integrate.api.nvidia.com/v1/chat/completions \
            --model_type chat \
            --api_key_name MY_API_KEY \
            --output_dir "/workspace/results/$benchmark" \
            --overrides 'config.params.limit_samples=10'

    echo "Completed $benchmark evaluation"
done

echo "All evaluations completed. Results in ./results/"

Next Steps#

  • Integrate into your CI/CD pipelines

  • Explore Docker Compose for multi-service setups

  • Consider Kubernetes deployment for scale

  • Try NeMo Evaluator Launcher for simplified workflows

  • See NeMo Evaluator Core for programmatic API and advanced adapter features