Endpoint

An Endpoint corresponds to a running instance of an AI model, exposed as an HTTP server.

DGX Cloud Lepton allows you to deploy various types of AI models as endpoints, making them accessible via RESTful APIs with high performance and scalability.

Create an Endpoint

Navigate to the create LLM endpoint page.

Select vLLM as the LLM engine, and load a model from Hugging Face in the Model section. In this case, we will use the nvidia/Nemotron-Research-Reasoning-Qwen-1.5B model.

Then in the Resource section, select the node group and your desired resource shape. In this case, we will use H100-80GB-HBM3 x 1 from node group h100.

(Screenshot: create endpoint page)

Click the Create button to create an endpoint that:

  • Uses one H100 GPU from the node group h100
  • Serves the nvidia/Nemotron-Research-Reasoning-Qwen-1.5B model with vLLM
Note

You need to have a node group with available nodes in your workspace first.

Use the Endpoint

Note

By default, the endpoint we created is public, and anyone with the URL can access it. Refer to the endpoint configurations to manage access control.
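
If you enable access control, API requests must include the endpoint's access token. The following is a minimal sketch, assuming token-based access control with a Bearer token; the ${your-endpoint-token} placeholder is hypothetical, so substitute the token configured for your endpoint.

# Assumes token-based access control; ${your-endpoint-token} is a placeholder
curl '${your-endpoint-url}/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer ${your-endpoint-token}'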

Playground

After your dedicated endpoint is created, you can see a chat playground on the endpoint details page, where you can start chatting with the model you just deployed on DGX Cloud Lepton.

(Screenshot: endpoint playground)

API Request

You can also use the generated endpoint URL to make API requests. You can find more details under the API tab of the endpoint details page.

For example, you can use the following command to list the models available on the endpoint.

curl -X 'GET' \
  '${your-endpoint-url}/v1/models' \
  -H 'accept: application/json'
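
Because the endpoint is served with vLLM, it exposes an OpenAI-compatible API, so you can also send chat completion requests. The following is a minimal sketch; the model identifier typically matches the Hugging Face name of the deployed model, which you can confirm with the /v1/models call above, and the prompt and parameters here are illustrative.

# Uses the OpenAI-compatible chat completions route served by vLLM;
# the message content and max_tokens value are illustrative
curl -X 'POST' \
  '${your-endpoint-url}/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/Nemotron-Research-Reasoning-Qwen-1.5B",
    "messages": [
      {"role": "user", "content": "Hello! What can you do?"}
    ],
    "max_tokens": 128
  }'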

Next Steps

For more endpoint features, refer to the related endpoint documentation.
