Endpoint
An Endpoint corresponds to a running instance of an AI model, exposing itself as an HTTP server.
DGX Cloud Lepton allows you to deploy various types of AI models as endpoints, making them accessible via RESTful APIs with high performance and scalability.
Create an Endpoint
Navigate to the create LLM endpoint page.
Select vLLM as the LLM engine, and load a model from Hugging Face in the Model section. In this case, we will use the nvidia/Nemotron-Research-Reasoning-Qwen-1.5B model.
Then in the Resource section, select the node group and your desired resource shape. In this case, we will use H100-80GB-HBM3 x 1 from node group h100.

Click on the Create button to create an endpoint that:
- Uses one H100 GPU from node group h100
- Deploys the nvidia/Nemotron-Research-Reasoning-Qwen-1.5B model with vLLM
Note: you need a node group with available nodes in your workspace before you can create an endpoint.
Use the Endpoint
By default, the endpoint we created is public, meaning anyone with the URL can access it. Refer to the endpoint configurations for managing access control.
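For example, if you enable access tokens for the endpoint, every API request must carry an Authorization header. This is a minimal sketch, assuming a Bearer token scheme; <your-access-token> is a placeholder for your token:

curl '${your-endpoint-url}/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <your-access-token>'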
Playground
After your dedicated endpoint is created, you will see a chat playground on the endpoint details page, where you can start chatting with the model you just deployed on DGX Cloud Lepton.

API Request
You can also use the generated endpoint URL to make API requests. You can find more details under the API tab of the endpoint details page.
For example, you can use the following command to list the available models in the endpoint.
curl -X 'GET' \
  '${your-endpoint-url}/v1/models' \
  -H 'accept: application/json'
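Since vLLM serves an OpenAI-compatible API, you can also send a chat completion request to the deployed model. This is a minimal sketch, assuming the model is registered under its Hugging Face ID (as returned by the /v1/models call above):

curl -X 'POST' \
  '${your-endpoint-url}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Nemotron-Research-Reasoning-Qwen-1.5B",
    "messages": [{"role": "user", "content": "Hello! What can you do?"}],
    "max_tokens": 128
  }'

The value of the "model" field must match one of the model IDs listed by the /v1/models call.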
Next Steps
For more endpoint features, refer to the following documents: