Endpoint

An Endpoint corresponds to a running instance of an AI model, exposed as an HTTP server.

DGX Cloud Lepton allows you to deploy various types of AI models as endpoints, making them accessible via RESTful APIs with high performance and scalability.

Create an Endpoint

Navigate to the create LLM endpoint page.

Select vLLM as the LLM engine, and load a model from Hugging Face in the Model section. In this case, we will use the nvidia/Nemotron-Research-Reasoning-Qwen-1.5B model.

Then in the Resource section, select the node group and your desired resource shape. In this case, we will use H100-80GB-HBM3 x 1 from node group h100.

(Screenshot: create endpoint page)

Click the Create button to create an endpoint that:

  • Uses one H100 GPU from the node group h100
  • Serves the nvidia/Nemotron-Research-Reasoning-Qwen-1.5B model with vLLM
Note

You need to have a node group with available nodes in your workspace first.

Use the Endpoint

Note

By default, the endpoint we created is public, and anyone with the URL can access it. Refer to the endpoint configurations to manage access control.
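
If you enable access control, API requests must include the endpoint's access token. The following is a minimal sketch, assuming token-based access control with a Bearer token; the ${your-endpoint-token} placeholder is hypothetical, so substitute the token configured for your endpoint.

# Assumes token-based access control; ${your-endpoint-token} is a placeholder
curl '${your-endpoint-url}/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer ${your-endpoint-token}'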

Playground

After your dedicated endpoint is created, you can see a chat playground on the endpoint details page, where you can start chatting with the model you just deployed on DGX Cloud Lepton.

(Screenshot: endpoint playground)

API Request

You can also use the generated endpoint URL to make API requests. You can find more details under the API tab of the endpoint details page.

For example, you can use the following command to list the models available on the endpoint.

curl -X 'GET' \
  '${your-endpoint-url}/v1/models' \
  -H 'accept: application/json'
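
Because the endpoint is served with vLLM, it exposes an OpenAI-compatible API, so you can also send chat completion requests. The following is a minimal sketch; the model identifier typically matches the Hugging Face name of the deployed model, which you can confirm with the /v1/models call above, and the prompt and parameters here are illustrative.

# Uses the OpenAI-compatible chat completions route served by vLLM;
# the message content and max_tokens value are illustrative
curl -X 'POST' \
  '${your-endpoint-url}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Nemotron-Research-Reasoning-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 128
  }'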

Next Steps

For more endpoint features, refer to the related endpoint documentation.
