# Create LLM Endpoints
Learn how to deploy LLM endpoints on DGX Cloud Lepton.
In this guide, we'll show you how to create dedicated endpoints for LLMs using the vLLM and SGLang inference engines to serve models from Hugging Face.
## Create LLM Endpoints with vLLM
vLLM is a fast and easy-to-use library for LLM inference and serving. To create a dedicated endpoint with vLLM, follow these steps:
- Go to the Endpoints page and click Create Endpoint.
- Select Create LLM Endpoint.
- Select vLLM as the inference engine.
- For Endpoint name, enter `vllm-endpoint` or any name you prefer.
- For Model, click Load from Hugging Face and search by keyword, for example `meta-llama/Llama-3.1-8B-Instruct`. If the model is gated, provide a Hugging Face token: create one in your Hugging Face account and save it as a secret in your workspace.
- For Resource, choose an appropriate resource based on the model size.
- For Image configuration, leave the defaults as is. To add custom arguments, expand the command-line arguments section and add your own; see the vLLM documentation for the full list of supported arguments.
- Leave the other configurations at their defaults, or refer to the endpoint configurations guide for details.
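As an illustration, a few commonly used vLLM engine arguments are shown below. The values are examples for a single-GPU deployment, not recommendations; consult the vLLM documentation before tuning them.

```shell
# Example vLLM command-line arguments (illustrative values)
--max-model-len 8192            # cap the context window to fit GPU memory
--gpu-memory-utilization 0.90   # fraction of GPU memory vLLM may allocate
--tensor-parallel-size 1        # number of GPUs to shard the model across
--dtype auto                    # let vLLM pick the weight precision
```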
We recommend setting up an access token for your endpoint instead of making it public.
Once created, the endpoint appears on the Endpoints page. View logs for each replica by clicking the Logs button in the Replica section.
Test the endpoint in the playground by clicking the endpoint you created.
To access the endpoint via API, click the API tab on the endpoint detail page to find the API key and endpoint URL.
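Because vLLM serves an OpenAI-compatible API, the endpoint can be called with a standard chat-completions request. The sketch below builds such a request using only the standard library; the endpoint URL and token are placeholders you would copy from the API tab.

```python
import json
import urllib.request

# Placeholders -- copy the real values from the endpoint's API tab.
ENDPOINT_URL = "https://your-endpoint.example.com/v1/chat/completions"
API_TOKEN = "your-access-token"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request for the endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",  # endpoint access token
        },
        method="POST",
    )

req = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# To send it against a live endpoint:
# response = urllib.request.urlopen(req)
```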
## Create LLM Endpoints with SGLang
SGLang is a fast serving framework for large language models and vision language models. It enables faster, more controllable interactions by co-designing the backend runtime and frontend language. To create a dedicated endpoint with SGLang, follow these steps:
- Go to the Endpoints page and click Create Endpoint.
- Select Create LLM Endpoint.
- Select SGLang as the inference engine.
- For Endpoint name, enter `sglang-endpoint` or any name you prefer.
- For Model, click Load from Hugging Face and search by keyword, for example `meta-llama/Llama-3.1-8B-Instruct`. If the model is gated, provide a Hugging Face token: create one in your Hugging Face account and save it as a secret in your workspace.
- For Resource, choose an appropriate resource based on the model size.
- For Image configuration, leave the defaults as is. To add custom arguments, expand the command-line arguments section and add your own; see the SGLang documentation for the full list of supported arguments.
- Leave the other configurations at their defaults, or refer to the endpoint configurations guide for details.
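As with vLLM, a few commonly used SGLang server arguments are sketched below. The values are illustrative for a single-GPU deployment; consult the SGLang documentation before changing them.

```shell
# Example SGLang command-line arguments (illustrative values)
--tp 1                      # tensor-parallel degree (number of GPUs)
--mem-fraction-static 0.85  # fraction of GPU memory for weights and KV cache
--context-length 8192       # override the model's maximum context length
```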
We recommend setting up an access token for your endpoint instead of making it public.
Once created, the endpoint appears on the Endpoints page. View logs for each replica by clicking the Logs button in the Replica section.
Test the endpoint in the playground by clicking the endpoint you created.
To access the endpoint via API, click the API tab on the endpoint detail page to find the API key and endpoint URL.
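SGLang endpoints are also OpenAI-compatible, so streamed responses arrive as server-sent events: lines of the form `data: {json}` terminated by `data: [DONE]`. The sketch below parses such a stream into the generated text; the sample lines are illustrative, not real server output.

```python
import json

def collect_stream_content(sse_lines):
    """Extract the generated text from OpenAI-style streaming chunks."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank lines and keep-alives
        data = line[len("data: "):].strip()
        if data == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Illustrative sample of what a streamed chat completion looks like.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print(collect_stream_content(sample))  # -> Hello, world
```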