SGLang Deployment#

SGLang is a serving framework for large language models. This deployment type launches SGLang servers using the lmsysorg/sglang Docker image.

Configuration#

Required Settings#

See the complete configuration structure in configs/deployment/sglang.yaml.

deployment:
  type: sglang
  image: lmsysorg/sglang:latest
  checkpoint_path: /path/to/model  # Path to model (local or HuggingFace model ID)
  served_model_name: your-model-name
  port: 8000

Required Fields:

  • checkpoint_path: Model path or HuggingFace model ID (e.g., meta-llama/Llama-3.1-8B-Instruct)

  • served_model_name: Name under which the model is served; clients pass this value in the model field of API requests
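
Both forms of checkpoint_path work; the local directory below is a hypothetical path:

# HuggingFace model ID, resolved through the HF cache:
checkpoint_path: meta-llama/Llama-3.1-8B-Instruct

# Local checkpoint directory (hypothetical path):
checkpoint_path: /data/checkpoints/llama-3.1-8b-instruct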

Optional Settings#

deployment:
  tensor_parallel_size: 8    # Default: 8
  data_parallel_size: 1      # Default: 1
  extra_args: ""             # Extra SGLang server arguments
  env_vars: {}               # Environment variables (key: value dict)

Configuration Fields:

  • tensor_parallel_size: Number of GPUs for tensor parallelism (default: 8)

  • data_parallel_size: Number of data-parallel replicas (default: 1); the total GPU count is tensor_parallel_size × data_parallel_size (see the sketch after this list)

  • extra_args: Extra command-line arguments passed verbatim to the SGLang server (appended at the end of the launch command)

  • env_vars: Environment variables for the container
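
Tensor and data parallelism multiply: tensor_parallel_size: 4 with data_parallel_size: 2 runs two replicas of a 4-way-sharded model and occupies 8 GPUs. A minimal sketch, with an illustrative (not prescriptive) extra_args value:

deployment:
  type: sglang
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 4                   # each replica shards weights across 4 GPUs
  data_parallel_size: 2                     # two replicas: 4 x 2 = 8 GPUs total
  extra_args: "--mem-fraction-static 0.85"  # standard sglang.launch_server flag, shown as an example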

API Endpoints#

The SGLang deployment exposes OpenAI-compatible endpoints:

endpoints:
  chat: /v1/chat/completions
  completions: /v1/completions
  health: /health
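
Once /health responds, the server accepts standard OpenAI-style requests. A minimal smoke test, assuming the server is reachable at localhost:8000 and serves the model name used in the example below:

# Health probe.
curl -s http://localhost:8000/health

# Chat request; "model" must match served_model_name.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'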

Example Configuration#

defaults:
  - execution: slurm/default
  - deployment: sglang
  - _self_

deployment:
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 4
  data_parallel_size: 1
  extra_args: ""
  env_vars:
    HF_HOME: "/cache/huggingface"  # HuggingFace cache; requires access to the gated meta-llama/Llama-3.1-8B-Instruct checkpoint

execution:
  account: your-account
  output_dir: /path/to/output
  walltime: 02:00:00

evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND  # GPQA-Diamond is gated; supply a token (or rely on a populated HF_HOME cache)
    - name: ifeval
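
To run this example, save it in a Hydra config directory and point the launcher at it; the directory and file names here are placeholders, and the invocation assumes the standard nemo-evaluator-launcher CLI:

# Assumes the config above is saved as configs/my_sglang_eval.yaml.
nemo-evaluator-launcher run --config-dir configs --config-name my_sglang_eval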

Command Template#

The launcher uses the following command template to start the SGLang server (from configs/deployment/sglang.yaml):

python3 -m sglang.launch_server \
  --model-path ${oc.select:deployment.hf_model_handle,/checkpoint} \
  --host 0.0.0.0 \
  --port ${deployment.port} \
  --served-model-name ${deployment.served_model_name} \
  --tp ${deployment.tensor_parallel_size} \
  --dp ${deployment.data_parallel_size} \
  ${deployment.extra_args}

Note

The ${oc.select:deployment.hf_model_handle,/checkpoint} syntax uses OmegaConf’s select resolver: it resolves to deployment.hf_model_handle when that key is set, and otherwise falls back to the literal /checkpoint. In practice, set checkpoint_path to your model path or HuggingFace model ID.
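
With the example configuration above, the template renders to roughly the following command (a sketch; exact quoting and argument order may differ):

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name llama-3.1-8b-instruct \
  --tp 4 \
  --dp 1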

Reference#

Configuration File:

  • Source: packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/configs/deployment/sglang.yaml

Related Documentation: