lep endpoint
Manage endpoints (formerly called deployments) on DGX Cloud Lepton.
An endpoint is a running instance of a photon or container image (previously referred to as a deployment). Endpoints are created with the lep endpoint create command and typically expose one or more HTTP routes that users can call, either via a RESTful API or through the Python client provided by leptonai.client.
The endpoint commands let you create, list, inspect, update, and remove endpoints in the current workspace.
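As a sketch of the client-side call path, a deployed endpoint can be reached through the Python client. The workspace name, endpoint name, and token below are placeholders, and the run route is assumed to be one the endpoint actually exposes:

```python
# Sketch only: workspace, endpoint, and token values are placeholders.
from leptonai.client import Client

# Connect to the endpoint "my-endpoint" in workspace "my-workspace".
client = Client("my-workspace", "my-endpoint", token="MY_WORKSPACE_TOKEN")

# Each HTTP route the endpoint exposes becomes a method on the client;
# here we assume the underlying photon defines a route named "run".
print(client.run(inputs="hello"))
```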
Usage
lep endpoint [OPTIONS] COMMAND [ARGS]...
Options
--help
: Show this message and exit.
Commands
create
: Creates an endpoint from either a photon or container image.
events
: Lists events of the endpoint.
get
: Shows endpoint detail and optionally saves its spec JSON.
list
: Lists all endpoints in the current workspace.
log
: Gets the log of an endpoint.
remove
: Removes an endpoint.
restart
: Restarts an endpoint.
status
: Gets the status of an endpoint.
update
: Updates an endpoint.
lep endpoint create
Creates an endpoint from either a photon or container image.
Usage
lep endpoint create [OPTIONS]
Options
-n, --name TEXT
: Name of the endpoint being created.
-f, --file FILE
: If specified, load the endpoint spec from this JSON file before applying additional CLI overrides. The file should contain a LeptonDeploymentUserSpec JSON produced by lep endpoint get -p.
-p, --photon TEXT
: Name of the photon to run.
-i, --photon-id TEXT
: Specific version id of the photon to run. If not specified, the most recent version of the photon is run.
--container-image TEXT
: Container image to run.
--container-port INTEGER
: Guest OS port to listen on in the container. Defaults to 8080 if not specified.
--container-command TEXT
: Command to run in the container. The command should listen on the port specified by --container-port.
--resource-shape TEXT
: Resource shape for the endpoint. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
--min-replicas INTEGER
: (Will be deprecated soon) Minimum number of replicas.
--max-replicas INTEGER
: (Will be deprecated soon) Maximum number of replicas.
--mount TEXT
: Persistent storage to be mounted to the endpoint, in the format STORAGE_PATH:MOUNT_PATH or STORAGE_PATH:MOUNT_PATH:MOUNT_FROM.
-e, --env TEXT
: Environment variables to pass to the endpoint, in the format NAME=VALUE.
-s, --secret TEXT
: Secrets to pass to the endpoint, in the format NAME=SECRET_NAME. If the secret name is also the environment variable name, you can omit it and simply pass SECRET_NAME.
--public
: If specified, the photon will be publicly accessible. See docs for details on access control.
--tokens TEXT
: Additional tokens that can be used to access the photon. See docs for details on access control.
--no-traffic-timeout INTEGER
: (Will be deprecated soon) If specified, the endpoint will be scaled down to 0 replicas after the specified number of seconds without traffic. Minimum is 60 seconds if set. Note that the actual timeout may be up to 30 seconds longer than the specified value.
--target-gpu-utilization INTEGER
: (Will be deprecated soon) If min and max replicas are set, the autoscaler scales up the replicas when GPU utilization is higher than the target, and scales down when it is lower. The value should be between 0 and 99.
--initial-delay-seconds INTEGER
: If specified, the endpoint will allow the specified number of seconds for the photon to initialize before it starts the service. Usually you should not need this. If you have an endpoint that takes a long time to initialize, set this to a larger value.
--include-workspace-token
: If specified, the workspace token will be included as an environment variable. This is used when the photon code uses Lepton SDK capabilities such as queue, KV, objectstore, etc. Note that you should trust the code in the photon, as it will have access to the workspace token.
--rerun
: If specified, shut down the endpoint of the same name and rerun it. Note that this may cause downtime of the photon if it is in production use, so use with caution. In a production environment, you should do photon create, push, and lep endpoint update instead.
--public-photon
: If specified, get the photon from the public photon registry. This is only supported for remote execution.
--image-pull-secrets TEXT
: Secrets to use for pulling images.
-ng, --node-group TEXT
: Node group for the endpoint. If not set, use on-demand resources. You can repeat this flag multiple times to choose multiple node groups. The multiple node group option is not yet supported but is coming soon for enterprise users; only the first node group will be used if you specify several.
--visibility TEXT
: Visibility of the endpoint. Can be 'public' or 'private'. If private, the endpoint will only be viewable by the creator and workspace admin.
-r, -replicas, --replicas-static INTEGER
: Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of replicas to 2, use: --replicas-static 2 or -r 2
-ad, --autoscale-down TEXT
: Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600 (Note: Do not include spaces around the comma.)
-agu, --autoscale-gpu-util TEXT
: Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50 (Note: Do not include spaces around the comma.) If GPU utilization is higher than the target, the autoscaler scales up the replicas; if it is lower, the autoscaler scales down. The threshold value should be between 0 and 99.
-aq, --autoscale-qpm TEXT
: Use this option to set a threshold for queries per minute (QPM) and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 2.5 QPM threshold, use: --autoscale-qpm 1,3,2.5 (Note: Do not include spaces around the comma.) This scales between 1 and 3 replicas when QPM per replica exceeds 2.5.
-lg, --log-collection BOOLEAN
: Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
-ni, --node-id TEXT
: Node for the endpoint. You can repeat this flag multiple times to choose multiple nodes. Please specify the node group when using this option.
--shared-memory-size INTEGER
: Specify the shared memory size for this endpoint, in MiB.
--help
: Show this message and exit.
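As an illustrative sketch, the following creates a small container-based endpoint with a fixed replica count; the endpoint name, image, and environment variable values are placeholders:

```shell
# Sketch: create a container-based endpoint named "my-api" (placeholder values).
lep endpoint create \
  --name my-api \
  --container-image nginx:latest \
  --container-port 8080 \
  --resource-shape cpu.small \
  --replicas-static 2 \
  -e LOG_LEVEL=info
```

Using --replicas-static here pins the endpoint to two replicas and disables autoscaling; swap in one of the --autoscale-* options instead if you want the replica count managed for you.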
lep endpoint list
Lists all endpoints in the current workspace.
Usage
lep endpoint list [OPTIONS]
Options
-p, --pattern TEXT
: Regular expression pattern to filter endpoint names.
--help
: Show this message and exit.
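For example, to list every endpoint and then narrow to a naming convention (the "prod-" prefix below is a placeholder pattern):

```shell
# List all endpoints, then only those whose names start with "prod-".
lep endpoint list
lep endpoint list --pattern '^prod-'
```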
lep endpoint restart
Restarts an endpoint.
Usage
lep endpoint restart [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to restart. [required]
--help
: Show this message and exit.
lep endpoint remove
Removes an endpoint.
Usage
lep endpoint remove [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to remove. [required]
--help
: Show this message and exit.
lep endpoint status
Gets the status of an endpoint.
Usage
lep endpoint status [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to get the status of. [required]
-t, --show-tokens
: Show tokens for the endpoint. Use with caution: this displays the tokens in plain text, which may be visible to others if you log the output.
-d, --detail
: Show the endpoint detail.
--help
: Show this message and exit.
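A quick sketch, with "my-api" as a placeholder endpoint name:

```shell
# Basic status, then a detailed view that also reveals access tokens.
lep endpoint status -n my-api
lep endpoint status -n my-api --detail --show-tokens  # prints tokens in plain text
```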
lep endpoint log
Gets the log of an endpoint. If replica is not specified, the first replica is selected. Otherwise, the log of the specified replica is shown. To get the list of replicas, use lep endpoint status.
Usage
lep endpoint log [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to get the log of. [required]
-r, --replica TEXT
: The replica name to get the log of.
--help
: Show this message and exit.
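For instance, with "my-api" as a placeholder endpoint name:

```shell
# Logs from the first replica, then from a specific replica.
# Replica names can be found in the output of `lep endpoint status`.
lep endpoint log -n my-api
lep endpoint log -n my-api -r my-api-replica-0  # placeholder replica name
```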
lep endpoint update
Updates an endpoint. Note that all update options apply as replacements, not increments. For example, if you specify --tokens, old tokens are replaced by the new set of tokens.
Usage
lep endpoint update [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to update. [required]
-i, --id TEXT
: The new photon id to update to. Use latest for the latest id.
--min-replicas INTEGER
: Number of replicas to update to. Pass 0 to scale the number of replicas to zero, in which case the deployment status page will show the endpoint as not ready until you scale it back with a positive number of replicas.
--resource-shape TEXT
: Resource shape for the pod. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
--public / --no-public
: If --public is specified, the endpoint will be made public. If --no-public is specified, the endpoint will be made non-public, with access tokens being the workspace token and the tokens specified by --tokens. If neither is specified, no change will be made to the access control of the endpoint.
--tokens TEXT
: Access tokens that can be used to access the endpoint. See docs for details on access control. If no tokens are specified, the endpoint's tokens are left unchanged. If you want to remove all additional tokens, use --remove-tokens.
--remove-tokens
: If specified, all additional tokens will be removed, and the endpoint will be either public (if --public is specified) or only accessible with the workspace token (if --public is not specified).
--no-traffic-timeout INTEGER
: If specified, the endpoint will be scaled down to 0 replicas after the specified number of seconds without traffic. Set to 0 to explicitly give the endpoint no timeout.
--visibility TEXT
: Visibility of the endpoint. Can be 'public' or 'private'. If private, the endpoint will only be viewable by the creator and workspace admin.
-r, -replicas, --replicas-static INTEGER
: Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of replicas to 2, use: --replicas-static 2 or -r 2
-ad, --autoscale-down TEXT
: Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600 (Note: Do not include spaces around the comma.)
-agu, --autoscale-gpu-util TEXT
: Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50 (Note: Do not include spaces around the comma.) If GPU utilization is higher than the target, the autoscaler scales up the replicas; if it is lower, the autoscaler scales down. The threshold value should be between 0 and 99.
-aq, --autoscale-qpm TEXT
: Use this option to set a threshold for queries per minute (QPM) and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 2.5 QPM threshold, use: --autoscale-qpm 1,3,2.5 (Note: Do not include spaces around the comma.) This scales between 1 and 3 replicas when QPM per replica exceeds 2.5.
-lg, --log-collection BOOLEAN
: Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
--shared-memory-size INTEGER
: Update the shared memory size for this endpoint, in MiB.
--help
: Show this message and exit.
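A sketch of switching autoscaling modes on an existing endpoint ("my-api" is a placeholder name); remember that update options replace prior settings rather than merging with them:

```shell
# Enable GPU-utilization autoscaling between 1 and 3 replicas at a 50% target.
lep endpoint update -n my-api --autoscale-gpu-util 1,3,50

# Later, pin the endpoint to a fixed replica count, which turns autoscaling off.
lep endpoint update -n my-api --replicas-static 2
```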
lep endpoint events
Lists events of the endpoint.
Usage
lep endpoint events [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to get events for. [required]
--help
: Show this message and exit.
lep endpoint get
Shows endpoint detail and optionally saves its spec JSON.
Usage
lep endpoint get [OPTIONS]
Options
-n, --name TEXT
: Endpoint name. [required]
-p, --path PATH
: Optional local path to save the endpoint spec JSON. A directory or a full filename is accepted. If a directory is provided, the file will be saved as endpoint-spec-[name].json.
--help
: Show this message and exit.
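For example, with "my-api" as a placeholder endpoint name:

```shell
# Save the spec to a directory; with a directory path the file is written as
# endpoint-spec-my-api.json. The saved spec can later be reused with
# `lep endpoint create -f`.
lep endpoint get -n my-api -p ./specs/
```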