Endpoints
An endpoint corresponds to a running instance of an AI model, exposed as an HTTP server. Any service can be run as a dedicated endpoint; the most common use case is deploying an AI model exposed with an OpenAPI interface.
Create Endpoint
An endpoint can be created in four different ways; check out the following guides for more details:
Autoscaling
By default, DGX Cloud Lepton creates your endpoint with a single replica and automatically scales it down to zero after 1 hour of inactivity.
You can override this behavior with the three other autoscaling options and their related flags.
- Scale replicas to zero based on no-traffic timeout: specify the initial number of replicas and the no-traffic timeout (in seconds).
- Autoscale replicas based on traffic QPM: specify the minimum and maximum number of replicas and the target QPM (queries per minute). You can also specify which query methods and query paths to include in the traffic metrics.
- Autoscale replicas based on GPU utilization: specify the minimum and maximum number of replicas and the target GPU utilization.

We do not currently support scaling up from zero replicas. If a deployment is scaled down to zero replicas, it will not be able to serve any requests until it is scaled up again.
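The QPM-based option above can be pictured with a minimal sketch: size the replica count so that each replica serves roughly the target QPM, clamped to the configured bounds. This is an illustration of the general idea only, not the platform's actual scaling algorithm, and all names here are made up:

```python
import math

def desired_replicas(current_qpm: float, target_qpm: float,
                     min_replicas: int, max_replicas: int) -> int:
    # Conceptual sketch: scale so each replica handles ~target_qpm queries
    # per minute, clamped to [min_replicas, max_replicas].
    if target_qpm <= 0:
        return min_replicas
    want = math.ceil(current_qpm / target_qpm)
    return max(min_replicas, min(max_replicas, want))
```

For example, with a target of 30 QPM and bounds of 1 to 5 replicas, 100 QPM of traffic would call for 4 replicas, while no traffic would fall back to the minimum of 1.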
Access tokens
DGX Cloud Lepton provides a built-in access control system for your endpoints. You can create two types of access control policies:
- Public access: Allow anyone to access your endpoint.
- Use endpoint token: Allow access to your endpoint only with a valid endpoint token.
Enable public access
If you want to allow public access to your endpoint, simply select the Enable public access option.
Use endpoint token
If you want to use an endpoint token to access your endpoint, select the Use endpoint token option.
Then click the Add endpoint token button to create a new endpoint token; DGX Cloud Lepton will generate a token for you.
You can add multiple endpoint tokens at the same time, and you can also replace a generated token value with one of your own.
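On the client side, an endpoint token is typically sent as a Bearer token in the Authorization header (verify the exact scheme in your workspace docs). A small sketch, where the URL and token are placeholders:

```python
def endpoint_headers(token: str) -> dict:
    # Assumption: endpoint tokens are passed as a standard Bearer token.
    return {"Authorization": f"Bearer {token}"}

# Example usage (URL is a placeholder for your endpoint's address):
# requests.post("https://<your-endpoint-url>/run",
#               json={"inputs": "hello"},
#               headers=endpoint_headers("MY_ENDPOINT_TOKEN"))
```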

Environment variables and secrets
Environment variables are key-value pairs passed to the deployment. All variables are automatically injected into the deployment container, so the runtime can refer to them as needed.
Secret values are similar to environment variables, but their values are pre-stored in the platform so they are not exposed in the development environment. You can learn more about secrets here.
You can also store multiple secret values and specify which one to use with the --secret
flag.
Inside the deployment, the secret value will be available as an environment variable with the same name as the secret name.
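Inside the container, reading an attached secret is therefore just an environment lookup. A minimal sketch, where the secret name MY_API_KEY is hypothetical:

```python
import os

def read_secret(name: str) -> str:
    # A secret attached with --secret NAME is exposed inside the deployment
    # as an environment variable with that same name.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name!r} is not available as an env var")
    return value

# api_key = read_secret("MY_API_KEY")  # hypothetical secret name
```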
Your defined environment variables should not start with the prefix LEPTON_, as this prefix is reserved for predefined environment variables.
The following environment variables are predefined and will be available in the deployment:
- LEPTON_DEPLOYMENT_NAME: The name of the deployment
- LEPTON_PHOTON_NAME: The name of the photon used to create the deployment
- LEPTON_PHOTON_ID: The ID of the photon used to create the deployment
- LEPTON_WORKSPACE_ID: The ID of the workspace where the deployment is created
- LEPTON_WORKSPACE_TOKEN: The workspace token of the deployment, if --include-workspace-token is passed
- LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the deployment
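Code running inside the deployment can inspect these predefined variables like any other environment variable. A small sketch:

```python
import os

def lepton_metadata() -> dict:
    # Collect the platform-injected LEPTON_* variables visible to this process,
    # e.g. LEPTON_DEPLOYMENT_NAME or LEPTON_WORKSPACE_ID.
    return {k: v for k, v in os.environ.items() if k.startswith("LEPTON_")}
```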
Storage
Mount storage for the deployment container, refer to this guide for more details.
Advanced configurations
Visibility
- Public: Visible to all the team members in your workspace.
- Private: Visible to the user who created it and the admin.
Shared memory
The size of the shared memory that will be allocated to the container.
Healthcheck initial delay seconds
By default, there are two types of probes configured:
- Readiness Probe: Starts with an initial delay of 5 seconds and checks every 5 seconds. It requires 1 successful check to mark the container as ready, but will mark the container as not ready after 10 consecutive failures. This probe ensures the service is ready to accept traffic.
- Liveness Probe: Has a longer initial delay of 600 seconds (10 minutes) and checks every 5 seconds. It requires 1 successful check to mark the container as healthy, but will only mark the container as unhealthy after 12 consecutive failures. This probe ensures the service remains healthy during operation.
As some endpoints need a longer time to start the container and initialize the model, you can specify a custom initial delay to meet your requirements: simply select the Custom option and enter the delay in seconds.
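The default probe settings above map naturally onto Kubernetes-style probe fields; a sketch of the equivalent configuration (the field names follow the Kubernetes convention and are illustrative, not necessarily Lepton's own spec):

```yaml
readinessProbe:
  initialDelaySeconds: 5    # start checking 5 seconds after startup
  periodSeconds: 5          # check every 5 seconds
  successThreshold: 1       # 1 success marks the container ready
  failureThreshold: 10      # 10 consecutive failures mark it not ready
livenessProbe:
  initialDelaySeconds: 600  # 10 minutes, allowing slow model initialization
  periodSeconds: 5
  successThreshold: 1
  failureThreshold: 12      # 12 consecutive failures mark it unhealthy
```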
Require approval to make replicas ready
Specify whether approval is required before replicas become ready. By default, replicas become ready immediately.
Pulling metrics from replica
Specify whether to pull metrics from the replicas. By default, the metrics will be pulled from the replicas.
Header-based replica routing
Route requests to specific replicas based on request headers. By default, requests are load balanced across all replicas.
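The idea behind header-based routing is that requests carrying the same header value consistently land on the same replica. The platform's actual header name and algorithm are not documented here; a conceptual sketch of such a mapping:

```python
import hashlib

def pick_replica(header_value: str, replicas: list) -> str:
    # Conceptual sketch: deterministically map a routing-header value to one
    # replica, so repeated requests with the same header hit the same replica.
    digest = hashlib.sha256(header_value.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]
```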
Log Collection
Whether to collect logs from the replicas. By default, the option is synced with the workspace setting.