lep endpoint

Manage endpoints (formerly called deployments) on DGX Cloud Lepton.

An endpoint is a running service instance (previously referred to as a deployment). Endpoints are created with the lep endpoint create command and typically expose one or more HTTP routes that users can call, either via a RESTful API or through the Python client provided by leptonai.client.
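
For example, a minimal sketch of calling an endpoint with the Python client; the workspace ID, endpoint name, and token below are placeholders:

from leptonai.client import Client

# Placeholders: replace with your workspace ID, endpoint name, and token.
client = Client("my-workspace", "my-endpoint", token="MY_TOKEN")

# List the HTTP routes the endpoint exposes; each route becomes a
# callable method on the client.
print(client.paths())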

The endpoint commands let you create, list, update, and remove endpoints on DGX Cloud Lepton.

Usage

lep endpoint [OPTIONS] COMMAND [ARGS]...

Options

  • --help : Show this message and exit.

Commands

  • create : Creates an endpoint from either a photon or container image.
  • events : Lists events of an endpoint.
  • get : Shows endpoint detail and optionally saves its spec JSON.
  • list : Lists all endpoints in the current workspace.
  • log : Gets the log of an endpoint.
  • remove : Removes an endpoint.
  • restart : Restarts an endpoint.
  • status : Gets the status of an endpoint.
  • update : Updates an endpoint.

lep endpoint create

Creates an endpoint from either a photon or container image.

Usage

lep endpoint create [OPTIONS]

Options

  • -n, --name TEXT : Name of the endpoint being created.
  • -f, --file FILE : If specified, load the endpoint spec from this JSON file before applying additional CLI overrides. The file should contain a LeptonDeploymentUserSpec JSON produced by lep endpoint get -p.
  • -p, --photon TEXT : Name of the photon to run.
  • -i, --photon-id TEXT : Specific version id of the photon to run. If not specified, we will run the most recent version of the photon.
  • --container-image TEXT : Container image to run.
  • --container-port INTEGER : Guest OS port to listen to in the container. Defaults to 8080 if not specified.
  • --container-command TEXT : Command to run in the container. Your command should listen to the port specified by --container-port.
  • --resource-shape TEXT : Resource shape for the endpoint. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
  • --min-replicas INTEGER : (Will be deprecated soon) Minimum number of replicas.
  • --max-replicas INTEGER : (Will be deprecated soon) Maximum number of replicas.
  • --mount TEXT : Persistent storage to be mounted to the endpoint, in the format STORAGE_PATH:MOUNT_PATH or STORAGE_PATH:MOUNT_PATH:MOUNT_FROM.
  • -e, --env TEXT : Environment variables to pass to the endpoint, in the format NAME=VALUE.
  • -s, --secret TEXT : Secrets to pass to the endpoint, in the format NAME=SECRET_NAME. If secret name is also the environment variable name, you can omit it and simply pass SECRET_NAME.
  • --public : If specified, the photon will be publicly accessible. See docs for details on access control.
  • --tokens TEXT : Additional tokens that can be used to access the photon. See docs for details on access control.
  • --no-traffic-timeout INTEGER : (Will be deprecated soon) If specified, the endpoint will be scaled down to 0 replicas after the specified number of seconds without traffic. Minimum is 60 seconds if set. Note that the actual timeout may be up to 30 seconds longer than the specified value.
  • --target-gpu-utilization INTEGER : (Will be deprecated soon) If min and max replicas are set, the autoscaler will scale up the replicas when GPU utilization is higher than the target, and scale down when it is lower. The value should be between 0 and 99.
  • --initial-delay-seconds INTEGER : If specified, the endpoint will allow the specified number of seconds for the photon to initialize before it starts serving. Usually you should not need this. If you have an endpoint that takes a long time to initialize, set this to a larger value.
  • --include-workspace-token : If specified, the workspace token will be included as an environment variable. This is used when the photon code uses Lepton SDK capabilities such as queue, KV, object store, etc. Note that you should trust the code in the photon, as it will have access to the workspace token.
  • --rerun : If specified, shut down the endpoint of the same name and rerun it. Note that this may cause downtime if the endpoint is serving production traffic, so use with caution. In a production environment, you should use lep photon create, lep photon push, and lep endpoint update instead.
  • --public-photon : If specified, get the photon from the public photon registry. This is only supported for remote execution.
  • --image-pull-secrets TEXT : Secrets to use for pulling images.
  • -ng, --node-group TEXT : Node group for the endpoint. If not set, on-demand resources are used. You can repeat this flag to choose multiple node groups; however, the multiple node group option is not yet supported (coming soon for enterprise users), and only the first node group will be used for now.
  • --visibility TEXT : Visibility of the endpoint. Can be 'public' or 'private'. If private, the endpoint will only be viewable by the creator and workspace admin.
  • -r, -replicas, --replicas-static INTEGER : Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of replicas to 2, you can use --replicas-static 2 or -r 2.
  • -ad, --autoscale-down TEXT : Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600 (Note: Do not include spaces around the comma.)
  • -agu, --autoscale-gpu-util TEXT : Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50 (Note: Do not include spaces around the comma.)

If the GPU utilization is higher than the target GPU utilization, the autoscaler will scale up the replicas. If the GPU utilization is lower than the target GPU utilization, the autoscaler will scale down the replicas. The threshold value should be between 0 and 99.

  • -aq, --autoscale-qpm TEXT : Use this option to set a threshold for QPM and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a threshold of 2.5 QPM, use: --autoscale-qpm 1,3,2.5 (Note: Do not include spaces around the comma.)

This sets up autoscaling based on queries per minute, scaling between 1 and 3 replicas when QPM per replica exceeds 2.5.

  • -lg, --log-collection BOOLEAN : Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
  • -ni, --node-id TEXT : Node for the endpoint. You can repeat this flag multiple times to choose multiple nodes. You must also specify the node group when using this option.
  • --shared-memory-size INTEGER : Specify the shared memory size for this endpoint, in MiB.
  • --help : Show this message and exit.
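
For example, to create an endpoint from a container image with a fixed number of replicas; the endpoint name, image, and environment variable below are illustrative, and all flags are described above:

lep endpoint create -n my-endpoint \
  --container-image nginx:latest \
  --container-port 8080 \
  --resource-shape cpu.small \
  --replicas-static 2 \
  -e LOG_LEVEL=info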

lep endpoint list

Lists all endpoints in the current workspace.

Usage

lep endpoint list [OPTIONS]

Options

  • -p, --pattern TEXT : Regular expression pattern to filter endpoint names.
  • --help : Show this message and exit.
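
For example, to list only endpoints whose names start with prod- (the pattern is illustrative):

lep endpoint list -p '^prod-'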

lep endpoint restart

Restarts an endpoint.

Usage

lep endpoint restart [OPTIONS]

Options

  • -n, --name TEXT : The endpoint name to restart. [required]
  • --help : Show this message and exit.

lep endpoint remove

Removes an endpoint.

Usage

lep endpoint remove [OPTIONS]

Options

  • -n, --name TEXT : The endpoint name to remove. [required]
  • --help : Show this message and exit.

lep endpoint status

Gets the status of an endpoint.

Usage

lep endpoint status [OPTIONS]

Options

  • -n, --name TEXT : The endpoint name to get the status of. [required]
  • -t, --show-tokens : Show tokens for the endpoint. Use with caution as this displays the tokens in plain text, and may be visible to others if you log the output.
  • -d, --detail : Show the endpoint detail.
  • --help : Show this message and exit.

lep endpoint log

Gets the log of an endpoint. If replica is not specified, the first replica is selected. Otherwise, the log of the specified replica is shown. To get the list of replicas, use lep endpoint status.

Usage

lep endpoint log [OPTIONS]

Options

  • -n, --name TEXT : The endpoint name to get logs for. [required]
  • -r, --replica TEXT : The replica name to get logs from.
  • --help : Show this message and exit.
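
For example, to view the logs of a specific replica (the endpoint and replica names are illustrative), first find the replica name with lep endpoint status, then:

lep endpoint log -n my-endpoint -r my-endpoint-replica-0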

lep endpoint update

Updates an endpoint. Note that for all the update options, changes are made as replacements, not increments. For example, if you specify --tokens, the old tokens are replaced by the new set of tokens.

Usage

lep endpoint update [OPTIONS]

Options

  • -n, --name TEXT : The endpoint name to update. [required]
  • -i, --id TEXT : The new photon id to update to. Use latest for the latest id.
  • --min-replicas INTEGER : Number of replicas to update to. Pass 0 to scale the number of replicas to zero, in which case the status page will show the endpoint as not ready until you scale it back up with a positive number of replicas.
  • --resource-shape TEXT : Resource shape for the pod. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
  • --public / --no-public : If --public is specified, the endpoint will be made public. If --no-public is specified, the endpoint will be made non-public, with access tokens being the workspace token and the tokens specified by --tokens. If neither is specified, no change will be made to the access control of the endpoint.
  • --tokens TEXT : Access tokens that can be used to access the endpoint. See docs for details on access control. If no tokens are specified, the endpoint's tokens are left unchanged. If you want to remove all additional tokens, use --remove-tokens.
  • --remove-tokens : If specified, all additional tokens will be removed, and the endpoint will be either public (if --public is specified) or only accessible with the workspace token (if --public is not specified).
  • --no-traffic-timeout INTEGER : If specified, the endpoint will be scaled down to 0 replicas after the specified number of seconds without traffic. Set to 0 to explicitly change the endpoint to have no timeout.
  • --visibility TEXT : Visibility of the endpoint. Can be 'public' or 'private'. If private, the endpoint will only be viewable by the creator and workspace admin.
  • -r, -replicas, --replicas-static INTEGER : Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of replicas to 2, you can use --replicas-static 2 or -r 2.
  • -ad, --autoscale-down TEXT : Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600 (Note: Do not include spaces around the comma.)
  • -agu, --autoscale-gpu-util TEXT : Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50 (Note: Do not include spaces around the comma.)

If the GPU utilization is higher than the target GPU utilization, the autoscaler will scale up the replicas. If the GPU utilization is lower than the target GPU utilization, the autoscaler will scale down the replicas. The threshold value should be between 0 and 99.

  • -aq, --autoscale-qpm TEXT : Use this option to set a threshold for QPM and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a threshold of 2.5 QPM, use: --autoscale-qpm 1,3,2.5 (Note: Do not include spaces around the comma.)

This sets up autoscaling based on queries per minute, scaling between 1 and 3 replicas when QPM per replica exceeds 2.5.

  • -lg, --log-collection BOOLEAN : Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
  • --shared-memory-size INTEGER : Update the shared memory size for this endpoint, in MiB.
  • --help : Show this message and exit.
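
For example, to move an endpoint to the latest photon version and replace its access tokens in one call (the endpoint name and token value are illustrative; remember that --tokens replaces any previously set tokens):

lep endpoint update -n my-endpoint -i latest --tokens NEW_TOKEN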

lep endpoint events

Lists events of an endpoint.

Usage

lep endpoint events [OPTIONS]

Options

  • -n, --name TEXT : The endpoint name to get events for. [required]
  • --help : Show this message and exit.

lep endpoint get

Shows endpoint detail and optionally saves its spec JSON.

Usage

lep endpoint get [OPTIONS]

Options

  • -n, --name TEXT : Endpoint name. [required]
  • -p, --path PATH : Optional local path to save the endpoint spec JSON. Directory or full filename accepted. If a directory is provided, the file will be saved as endpoint-spec-[name].json.
  • --help : Show this message and exit.
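
For example, to save an endpoint's spec and recreate it under a new name using lep endpoint create -f (the endpoint names and path are illustrative):

lep endpoint get -n my-endpoint -p ./endpoint-spec.json
lep endpoint create -n my-endpoint-copy -f ./endpoint-spec.json
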
Copyright © 2025, NVIDIA Corporation.