Create LLM Endpoints with vLLM and SGLang

In this guide, we'll show you how to create dedicated LLM endpoints with the vLLM and SGLang inference engines to serve models from Hugging Face.

Create LLM Endpoints with vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. To create a dedicated endpoint with vLLM, you can follow the steps below:

  1. Go to the Endpoints tab and click Create Endpoint
  2. Select Create LLM Endpoint
  3. Select vLLM as the inference engine
  4. For Endpoint name, you can set it to vllm-endpoint or any other name you like
  5. For Model to be used, click the Load from Hugging Face button and type a keyword to search for the model you want to use; for example, meta-llama/Llama-3.1-8B-Instruct. If the model is gated, you need to provide a Hugging Face token to access it: create a new token in your Hugging Face account and save it as a secret in your workspace.
  6. For Resource, choose a resource appropriate for your model's size. As a rough rule of thumb, an FP16/BF16 model needs about 2 GB of GPU memory per billion parameters for the weights alone, plus headroom for the KV cache, so an 8B model such as Llama-3.1-8B-Instruct calls for a GPU with at least 24 GB of memory.
  7. For Image configuration, you can leave the defaults. If you'd like to pass extra arguments, expand the Command line arguments section and add your own (see the example after this list). You can find all vLLM server arguments in the vLLM documentation.
  8. For other configurations, you can keep the defaults. Refer to endpoint configurations for more details.
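
For example, a plausible starting point for a single-GPU deployment might look like the following. These are real vLLM server flags, but the values are assumptions that depend on your model and hardware, so adjust them accordingly:

```
--max-model-len 8192 --gpu-memory-utilization 0.90 --tensor-parallel-size 1
```
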
Note

It is recommended to set up an access token for your endpoint instead of making it public.

Once the endpoint is created, you can see it in the Endpoints tab. You can also view the logs from each replica by clicking the logs button in the Replicas section.

You can test it out directly in the playground by clicking on the endpoint you created.

If you'd like to access the endpoint via API, click the API tab on the endpoint detail page, where you can find the API key and the endpoint URL.
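
As a minimal sketch, assuming the endpoint exposes the OpenAI-compatible API that vLLM serves by default: substitute the endpoint URL and API key from the API tab (the values below are placeholders), and make sure the openai Python package is installed.

```python
# Minimal sketch: query the endpoint through vLLM's OpenAI-compatible API.
# ENDPOINT_URL and API_KEY are placeholders; copy the real values from the
# API tab on the endpoint detail page.
from openai import OpenAI

ENDPOINT_URL = "https://your-endpoint-url/v1"  # placeholder
API_KEY = "your-api-key"                       # placeholder

client = OpenAI(base_url=ENDPOINT_URL, api_key=API_KEY)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```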

Create LLM Endpoints with SGLang

SGLang is a fast serving framework for large language models and vision language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language. To create a dedicated endpoint with SGLang, you can follow the steps below:

  1. Go to the Endpoints tab and click Create Endpoint
  2. Select Create LLM Endpoint
  3. Select SGLang as the inference engine
  4. For Endpoint name, you can set it to sglang-endpoint or any other name you like
  5. For Model to be used, click the Load from Hugging Face button and type a keyword to search for the model you want to use; for example, meta-llama/Llama-3.1-8B-Instruct. If the model is gated, you need to provide a Hugging Face token to access it: create a new token in your Hugging Face account and save it as a secret in your workspace.
  6. For Resource, choose a resource appropriate for your model's size; the same sizing rule of thumb from the vLLM section applies.
  7. For Image configuration, you can leave the defaults. If you'd like to pass extra arguments, expand the Command line arguments section and add your own (see the example after this list). You can find all SGLang server arguments in the SGLang documentation.
  8. For other configurations, you can keep the defaults. Refer to endpoint configurations for more details.
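
For instance, a plausible starting point for a single-GPU deployment might be the flags below. These are real SGLang server flags, but the values are assumptions that depend on your model and hardware:

```
--context-length 8192 --mem-fraction-static 0.85 --tp-size 1
```
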
Note

It is recommended to set up an access token for your endpoint instead of making it public.

Once the endpoint is created, you can see it in the Endpoints tab. You can also view the logs from each replica by clicking the logs button in the Replicas section.

You can test it out directly in the playground by clicking on the endpoint you created.

If you'd like to access the endpoint via API, click the API tab on the endpoint detail page, where you can find the API key and the endpoint URL.
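
As a minimal sketch, assuming the endpoint exposes SGLang's OpenAI-compatible /v1/chat/completions route: substitute the endpoint URL and API key from the API tab (the values below are placeholders).

```python
# Minimal sketch: call the endpoint's OpenAI-compatible chat route with
# plain HTTP. ENDPOINT_URL and API_KEY are placeholders; copy the real
# values from the API tab on the endpoint detail page.
import requests

ENDPOINT_URL = "https://your-endpoint-url"  # placeholder
API_KEY = "your-api-key"                    # placeholder

resp = requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```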

Copyright © 2025, NVIDIA Corporation.