Endpoints
An endpoint corresponds to a running instance of an AI model, exposed as an HTTP server. Any service can be run as a dedicated endpoint; the most common use case is deploying an AI model exposed with an OpenAPI interface.
Create Endpoint
An endpoint can be created in four different ways; check out the following guides for more details:
Autoscaling
By default, DGX Cloud Lepton creates your endpoint with a single replica and automatically scales it down to zero after 1 hour of inactivity.
You can override this behavior with the three other autoscaling options and their related flags.
- Scale replicas to zero based on no-traffic timeout: specify the initial number of replicas and the no-traffic timeout (in seconds).
- Autoscale replicas based on traffic QPM: specify the minimum and maximum number of replicas and the target QPM (queries per minute). You can also specify which query methods and query paths to include in the traffic metrics.
- Autoscale replicas based on GPU utilization: specify the minimum and maximum number of replicas and the target GPU utilization.

We do not currently support scaling up from zero replicas. If a deployment is scaled down to zero replicas, it will not be able to serve any requests until it is scaled up again.
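The QPM-based option above can be pictured with a minimal sketch: size the replica count so that each replica serves roughly the target QPM, clamped to the configured bounds. This is an illustration of the general idea only, not the platform's actual scaling algorithm, and all names here are made up:

```python
import math

def desired_replicas(current_qpm: float, target_qpm: float,
                     min_replicas: int, max_replicas: int) -> int:
    # Conceptual sketch: scale so each replica handles ~target_qpm queries
    # per minute, clamped to [min_replicas, max_replicas].
    if target_qpm <= 0:
        return min_replicas
    want = math.ceil(current_qpm / target_qpm)
    return max(min_replicas, min(max_replicas, want))
```

For example, with a target of 30 QPM and bounds of 1 to 5 replicas, 100 QPM of traffic would call for 4 replicas, while no traffic would fall back to the minimum of 1.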
Access tokens
DGX Cloud Lepton provides a built-in access control system for your endpoints. You can create two types of access control policies:
- Public access: Allow anyone to access your endpoint.
- Use endpoint token: Allow access to your endpoint only with a valid endpoint token.
Enable public access
If you want to allow public access to your endpoint, simply select the Enable public access option.
Use endpoint token
If you want to use an endpoint token to access your endpoint, select the Use endpoint token option.
Then click the Add endpoint token button to create a new endpoint token; DGX Cloud Lepton will generate a token for you.
You can add multiple endpoint tokens at the same time, and you can also replace a generated token value with one of your own.
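On the client side, an endpoint token is typically sent as a Bearer token in the Authorization header (verify the exact scheme in your workspace docs). A small sketch, where the URL and token are placeholders:

```python
def endpoint_headers(token: str) -> dict:
    # Assumption: endpoint tokens are passed as a standard Bearer token.
    return {"Authorization": f"Bearer {token}"}

# Example usage (URL is a placeholder for your endpoint's address):
# requests.post("https://<your-endpoint-url>/run",
#               json={"inputs": "hello"},
#               headers=endpoint_headers("MY_ENDPOINT_TOKEN"))
```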

Environment variables and secrets
Environment variables are key-value pairs passed to the deployment. All variables are automatically injected into the deployment container, so the runtime can refer to them as needed.
Secret values are similar to environment variables, but their values are pre-stored in the platform so they are not exposed in the development environment. You can learn more about secrets here.
You can also store multiple secret values and specify which one to use with the --secret
flag.
Inside the deployment, the secret value will be available as an environment variable with the same name as the secret name.
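Inside the container, reading an attached secret is therefore just an environment lookup. A minimal sketch, where the secret name MY_API_KEY is hypothetical:

```python
import os

def read_secret(name: str) -> str:
    # A secret attached with --secret NAME is exposed inside the deployment
    # as an environment variable with that same name.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} is not available as an env var")
    return value

# api_key = read_secret("MY_API_KEY")  # hypothetical secret name
```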
Your defined environment variables should not start with the prefix LEPTON_, as this prefix is reserved for predefined environment variables.
The following environment variables are predefined and will be available in the deployment:
- LEPTON_DEPLOYMENT_NAME: The name of the deployment
- LEPTON_PHOTON_NAME: The name of the photon used to create the deployment
- LEPTON_PHOTON_ID: The ID of the photon used to create the deployment
- LEPTON_WORKSPACE_ID: The ID of the workspace where the deployment is created
- LEPTON_WORKSPACE_TOKEN: The workspace token of the deployment, if --include-workspace-token is passed
- LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the deployment
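Code running inside the deployment can inspect these predefined variables like any other environment variable. A small sketch:

```python
import os

def lepton_metadata() -> dict:
    # Collect the platform-injected LEPTON_* variables visible to this process,
    # e.g. LEPTON_DEPLOYMENT_NAME or LEPTON_WORKSPACE_ID.
    return {k: v for k, v in os.environ.items() if k.startswith("LEPTON_")}
```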
Storage
Mount storage for the deployment container, refer to this guide for more details.
Advanced configurations
Visibility
- Public: Visible to all the team members in your workspace.
- Private: Visible to the user who created it and the admin.
Shared memory
The size of the shared memory that will be allocated to the container.
Healthcheck initial delay seconds
By default, there are two types of probes configured:
- Readiness Probe: Starts with an initial delay of 5 seconds and checks every 5 seconds. It requires 1 successful check to mark the container as ready, but will mark the container as not ready after 10 consecutive failures. This probe ensures the service is ready to accept traffic.
- Liveness Probe: Has a longer initial delay of 600 seconds (10 minutes) and checks every 5 seconds. It requires 1 successful check to mark the container as healthy, but will only mark the container as unhealthy after 12 consecutive failures. This probe ensures the service remains healthy during operation.
As some endpoints need a longer time to start the container and initialize the model, you can specify a custom initial delay to meet your requirements: simply select the Custom option and enter the delay in seconds.
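The default probe settings above map naturally onto Kubernetes-style probe fields; a sketch of the equivalent configuration (the field names follow the Kubernetes convention and are illustrative, not necessarily Lepton's own spec):

```yaml
readinessProbe:
  initialDelaySeconds: 5    # start checking 5 seconds after startup
  periodSeconds: 5          # check every 5 seconds
  successThreshold: 1       # 1 success marks the container ready
  failureThreshold: 10      # 10 consecutive failures mark it not ready
livenessProbe:
  initialDelaySeconds: 600  # 10 minutes, allowing slow model initialization
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 12      # 12 consecutive failures mark it unhealthy
```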
Require approval to make replicas ready
Specify whether approval is required before replicas become ready. By default, replicas become ready immediately.
Pulling metrics from replica
Specify whether to pull metrics from the replicas. By default, the metrics will be pulled from the replicas.
Header-based replica routing
Route requests to specific replicas based on request headers. By default, requests are load balanced across all replicas.
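The idea behind header-based routing is that requests carrying the same header value consistently land on the same replica. The platform's actual header name and algorithm are not documented here; a conceptual sketch of such a mapping:

```python
import hashlib

def pick_replica(header_value: str, replicas: list) -> str:
    # Conceptual sketch: deterministically map a routing-header value to one
    # replica, so repeated requests with the same header hit the same replica.
    digest = hashlib.sha256(header_value.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```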
Log Collection
Whether to collect logs from the replicas. By default, the option is synced with the workspace setting.