# Create LLM Endpoints
Learn how to deploy LLM endpoints on DGX Cloud Lepton.
In this guide, we'll show you how to create dedicated endpoints for LLMs using the vLLM and SGLang inference engines to serve models from Hugging Face.
## Create LLM Endpoints with vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving. To create a dedicated endpoint with vLLM, follow these steps:
- Go to the Endpoints page and click Create Endpoint.
- Select Create LLM Endpoint.
- Select vLLM as the inference engine.
- For Endpoint name, enter `vllm-endpoint` or any name you prefer.
- For Model, click Load from Hugging Face and search by keyword, for example `meta-llama/Llama-3.1-8B-Instruct`. If the model is gated, provide a Hugging Face token: create one in your Hugging Face account and save it as a secret in your workspace.
- For Resource, choose an appropriate resource based on the model size.
- For Image configuration, leave the defaults as is. To add custom arguments, expand the command-line arguments section and add your own; see the vLLM documentation for the full list of supported arguments.
- Leave the other configurations at their defaults, or refer to the endpoint configurations guide for details.
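As an illustration, a few commonly used vLLM engine arguments are shown below. The values are examples for a single-GPU deployment, not recommendations; consult the vLLM documentation before tuning them.

```shell
# Example vLLM command-line arguments (illustrative values)
--max-model-len 8192            # cap the context window to fit GPU memory
--gpu-memory-utilization 0.90   # fraction of GPU memory vLLM may allocate
--tensor-parallel-size 1        # number of GPUs to shard the model across
--dtype auto                    # let vLLM pick the weight precision
```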
We recommend setting up an access token for your endpoint instead of making it public.
Once created, the endpoint appears on the Endpoints page. View logs for each replica by clicking the Logs button in the Replica section.
Test the endpoint in the playground by clicking the endpoint you created.
To access the endpoint via API, click the API tab on the endpoint detail page to find the API key and endpoint URL.
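Because vLLM serves an OpenAI-compatible API, the endpoint can be called with a standard chat-completions request. The sketch below builds such a request using only the standard library; the endpoint URL and token are placeholders you would copy from the API tab.

```python
import json
import urllib.request

# Placeholders -- copy the real values from the endpoint's API tab.
ENDPOINT_URL = "https://your-endpoint.example.com/v1/chat/completions"
API_TOKEN = "your-access-token"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request for the endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",  # endpoint access token
        },
        method="POST",
    )

req = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# To send it against a live endpoint:
# response = urllib.request.urlopen(req)
```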
## Create LLM Endpoints with SGLang
SGLang is a fast serving framework for large language models and vision language models. It enables faster, more controllable interactions by co-designing the backend runtime and frontend language. To create a dedicated endpoint with SGLang, follow these steps:
- Go to the Endpoints page and click Create Endpoint.
- Select Create LLM Endpoint.
- Select SGLang as the inference engine.
- For Endpoint name, enter `sglang-endpoint` or any name you prefer.
- For Model, click Load from Hugging Face and search by keyword, for example `meta-llama/Llama-3.1-8B-Instruct`. If the model is gated, provide a Hugging Face token: create one in your Hugging Face account and save it as a secret in your workspace.
- For Resource, choose an appropriate resource based on the model size.
- For Image configuration, leave the defaults as is. To add custom arguments, expand the command-line arguments section and add your own; see the SGLang documentation for the full list of supported arguments.
- Leave the other configurations at their defaults, or refer to the endpoint configurations guide for details.
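As with vLLM, a few commonly used SGLang server arguments are sketched below. The values are illustrative for a single-GPU deployment; consult the SGLang documentation before changing them.

```shell
# Example SGLang command-line arguments (illustrative values)
--tp 1                      # tensor-parallel degree (number of GPUs)
--mem-fraction-static 0.85  # fraction of GPU memory for weights and KV cache
--context-length 8192       # override the model's maximum context length
```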
We recommend setting up an access token for your endpoint instead of making it public.
Once created, the endpoint appears on the Endpoints page. View logs for each replica by clicking the Logs button in the Replica section.
Test the endpoint in the playground by clicking the endpoint you created.
To access the endpoint via API, click the API tab on the endpoint detail page to find the API key and endpoint URL.
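SGLang endpoints are also OpenAI-compatible, so streamed responses arrive as server-sent events: lines of the form `data: {json}` terminated by `data: [DONE]`. The sketch below parses such a stream into the generated text; the sample lines are illustrative, not real server output.

```python
import json

def collect_stream_content(sse_lines):
    """Extract the generated text from OpenAI-style streaming chunks."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and keep-alives
        data = line[len("data: "):].strip()
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Illustrative sample of what a streamed chat completion looks like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_content(sample))  # -> Hello, world
```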