lep endpoint
Manage endpoints (formerly called deployments) on DGX Cloud Lepton.
Usage
lep endpoint [OPTIONS] COMMAND [ARGS]...
Options
--help
: Show this message and exit.
Commands
create
: Creates an endpoint from either a photon or container image.
events
: Lists events of the endpoint.
get
: Shows endpoint detail and optionally saves its spec JSON.
list
: Lists all endpoints in the current workspace.
log
: Gets the log of an endpoint.
remove
: Removes an endpoint.
restart
: Restarts an endpoint.
status
: Gets the status of an endpoint.
update
: Updates an endpoint.
lep endpoint create
Creates an endpoint from either a photon or container image.
Usage
lep endpoint create [OPTIONS]
Options
-n, --name TEXT
: Name of the endpoint being created. [required]
-f, --file FILE
: If provided, load the endpoint spec from this JSON file before applying CLI overrides. The file can be obtained from the dashboard's UI → CLI → 'Use spec file', or by running: lep endpoint get -i <endpoint_id> --path <download_path>.
-p, --photon TEXT
: Name of the photon to run.
-i, --photon-id TEXT
: Specific version id of the photon to run. If not specified, the most recent version of the photon will be run.
--container-image TEXT
: Container image to run.
--container-port INTEGER
: Guest OS port to listen on in the container. Defaults to 8080 if not specified.
--container-command TEXT
: Command to run in the container. Your command should listen on the port specified by --container-port.
--resource-shape TEXT
: Resource shape for the endpoint. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
--min-replicas INTEGER
: (Will be deprecated soon) Minimum number of replicas.
--max-replicas INTEGER
: (Will be deprecated soon) Maximum number of replicas.
--mount TEXT
: Persistent storage to be mounted to the endpoint, in the format STORAGE_PATH:MOUNT_PATH or STORAGE_PATH:MOUNT_PATH:MOUNT_FROM.
-e, --env TEXT
: Environment variables to pass to the endpoint, in the format NAME=VALUE.
-s, --secret TEXT
: Secrets to pass to the endpoint, in the format NAME=SECRET_NAME. If the secret name is also the environment variable name, you can omit it and simply pass SECRET_NAME.
--public
: If specified, the endpoint will be accessible from any IP address. This is equivalent to --ip-whitelist with an empty list. Mutually exclusive with --ip-whitelist.
--ip-whitelist TEXT
: IP addresses or CIDR ranges that are allowed to access the endpoint. Can be specified multiple times or as comma-separated values. Examples: --ip-whitelist 192.168.1.1,10.0.0.0/8 or --ip-whitelist 192.168.1.1 --ip-whitelist 10.0.0.0/8. Mutually exclusive with --public. This sets the IP allowlist in the deployment's authentication configuration. Note: --tokens are completely independent of IP access control.
--tokens TEXT
: Additional tokens that can be used to access the endpoint. See the docs for details on access control. These are completely independent of IP access control (--public and --ip-whitelist).
--no-traffic-timeout INTEGER
: (Will be deprecated soon) If specified, the endpoint will be scaled down to 0 replicas after the specified number of seconds without traffic. Minimum is 60 seconds if set. Note that the actual timeout may be up to 30 seconds longer than the specified value.
--target-gpu-utilization INTEGER
: (Will be deprecated soon) If min and max replicas are set, the autoscaler scales up the replicas when GPU utilization is higher than the target, and scales them down when it is lower. The value should be between 0 and 99.
--initial-delay-seconds INTEGER
: If specified, the endpoint will allow the specified number of seconds for the photon to initialize before it starts the service. Usually you should not need this. If you have an endpoint that takes a long time to initialize, set it to a longer value.
--include-workspace-token
: If specified, the workspace token will be included as an environment variable. This is used when the photon code uses Lepton SDK capabilities such as queue, KV, objectstore, etc. Note that you should trust the code in the photon, as it will have access to the workspace token.
--rerun
: If specified, shut down the endpoint of the same name and rerun it. Note that this may cause downtime of the photon if it is in production use, so use with caution. In a production environment, you should do photon create, push, and lep endpoint update instead.
--public-photon
: If specified, get the photon from the public photon registry. This is only supported for remote execution.
--image-pull-secrets TEXT
: Secrets to use for pulling images.
-ng, --node-group TEXT
: Node group for the endpoint. If not set, on-demand resources are used. You can repeat this flag multiple times to choose multiple node groups. The multiple node group option is currently not supported but is coming soon for enterprise users; only the first node group will be used if you specify several at this time.
--visibility TEXT
: Visibility of the endpoint. Can be 'public' or 'private'. If private, the endpoint will only be viewable by the creator and workspace admin.
-r, -replicas, --replicas-static INTEGER
: Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of 2 replicas, use: --replicas-static 2 or -r 2.
-ad, --autoscale-down TEXT
: Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600. (Note: Do not include spaces around the comma.)
-agu, --autoscale-gpu-util TEXT
: Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50. (Note: Do not include spaces around the commas.) If the GPU utilization is higher than the target, the autoscaler scales up the replicas; if it is lower, the autoscaler scales down. The threshold value should be between 0 and 99.
-aq, --autoscale-qpm TEXT
: Use this option to set a threshold for queries per minute (QPM) and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) at 2.5 QPM, use: --autoscale-qpm 1,3,2.5. (Note: Do not include spaces around the commas.) This sets up autoscaling based on queries per minute, scaling between 1 and 3 replicas when QPM per replica exceeds 2.5.
-lg, --log-collection BOOLEAN
: Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
-ni, --node-id TEXT
: Node for the endpoint. You can repeat this flag multiple times to choose multiple nodes. Please specify the node group when using this option.
-qp, --queue-priority TEXT
: Set the priority for this endpoint (dedicated node groups only).
-cbp, --can-be-preempted
: Allow this endpoint to be preempted by higher-priority workloads.
-cp, --can-preempt
: Allow this endpoint to preempt lower-priority workloads.
--shared-memory-size INTEGER
: Specify the shared memory size for this endpoint, in MiB.
--with-reservation TEXT
: Assign the endpoint to a specific reserved compute resource using a reservation ID (only applicable to dedicated node groups).
--allow-burst-to-other-reservation
: If set, the endpoint can temporarily use free resources from nodes reserved by other reservations. Be aware that when a new workload bound to those reservations starts, your endpoint may be evicted.
--replica-spread TEXT
: Controls how endpoint replicas are distributed across different nodes to improve availability, though it may lead to resource fragmentation. Preferred (p): attempts to spread replicas across different nodes when possible. Required (r): enforces strict replica spreading where each replica must be scheduled on a different node; replicas cannot start if there are not enough nodes. Usage examples: --replica-spread required (or r) for strict spreading, --replica-spread preferred (or p) for soft spreading.
--help
: Show this message and exit.
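As an illustration, the hypothetical commands below sketch two common ways to create an endpoint; the endpoint names, image, secret, and storage paths are placeholders, not values from your workspace.

```shell
# Create an endpoint from a container image, with an environment variable,
# a secret, a mounted storage path, and QPM-based autoscaling (1-3 replicas).
lep endpoint create -n my-api \
  --container-image nvcr.io/nvidia/example:latest \
  --container-port 8080 \
  --resource-shape gpu.a10 \
  -e LOG_LEVEL=info \
  -s HF_TOKEN \
  --mount /workspace/models:/models \
  --autoscale-qpm 1,3,2.5

# Create an endpoint from the most recent version of a photon,
# restricted to a private network range.
lep endpoint create -n my-photon-api -p my-photon \
  --ip-whitelist 10.0.0.0/8
```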
lep endpoint list
Lists all endpoints in the current workspace.
Usage
lep endpoint list [OPTIONS]
Options
-n, --name TEXT
: Filter endpoints by name (case-insensitive substring). Can be specified multiple times.
--help
: Show this message and exit.
lep endpoint restart
Restarts an endpoint.
Usage
lep endpoint restart [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to restart. [required]
--help
: Show this message and exit.
lep endpoint remove
Removes an endpoint.
Usage
lep endpoint remove [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to remove. [required]
--help
: Show this message and exit.
lep endpoint status
Gets the status of an endpoint.
Usage
lep endpoint status [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to get the status of. [required]
-t, --show-tokens
: Show tokens for the endpoint. Use with caution, as this displays the tokens in plain text, and they may be visible to others if you log the output.
-d, --detail
: Show the endpoint detail.
--help
: Show this message and exit.
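For example, the following hypothetical command (the endpoint name is a placeholder) shows the endpoint's status with full detail, and also reveals its access tokens in plain text:

```shell
# Show status with detail; -t prints access tokens, so avoid logging this output.
lep endpoint status -n my-api -d -t
```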
lep endpoint log
Gets the log of an endpoint. If --replica is not specified, the first replica is selected; otherwise, the log of the specified replica is shown. To get the list of replicas, use lep endpoint status.
Usage
lep endpoint log [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to get the log of. [required]
-r, --replica TEXT
: The replica name to get the log of.
--help
: Show this message and exit.
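As a sketch, the endpoint name below is a placeholder and the replica name is left as an elided value you would copy from the status output:

```shell
# List replicas first, then fetch the log of one of them.
lep endpoint status -n my-api
lep endpoint log -n my-api -r <replica-name>
```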
lep endpoint update
Updates an endpoint. Note that for all the update options, changes are made as replacements, not increments. For example, if you specify --tokens, the old tokens are replaced by the new set of tokens.
Usage
lep endpoint update [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to update. [required]
-i, --id TEXT
: The new photon id to update to. Use latest for the latest id.
--min-replicas INTEGER
: Number of replicas to update to. Pass 0 to scale the number of replicas to zero, in which case the deployment status page will show the endpoint as not ready until you scale it back with a positive number of replicas.
--resource-shape TEXT
: Resource shape for the pod. Available types are: 'cpu.small', 'cpu.medium', 'cpu.large', 'gpu.a10', 'gpu.a10.6xlarge', 'gpu.a100-40gb', 'gpu.2xa100-40gb', 'gpu.4xa100-40gb', 'gpu.8xa100-40gb', 'gpu.a100-80gb', 'gpu.2xa100-80gb', 'gpu.4xa100-80gb', 'gpu.8xa100-80gb', 'gpu.h100-sxm', 'gpu.2xh100-sxm', 'gpu.4xh100-sxm', 'gpu.8xh100-sxm'.
--public / --no-public
: If --public is specified, the endpoint will be made public. If --no-public is specified, the endpoint will be made non-public, with access tokens being the workspace token and the tokens specified by --tokens. If neither is specified, no change will be made to the access control of the endpoint.
--ip-whitelist TEXT
: IP addresses or CIDR ranges that are allowed to access the endpoint. Can be specified multiple times or as comma-separated values. Examples: --ip-whitelist 192.168.1.1,10.0.0.0/8 or --ip-whitelist 192.168.1.1 --ip-whitelist 10.0.0.0/8. Mutually exclusive with --public. This sets the IP allowlist in the deployment's authentication configuration.
--tokens TEXT
: Access tokens that can be used to access the endpoint. See the docs for details on access control. If no tokens are specified, the tokens of the endpoint are left unchanged. If you want to remove all additional tokens, use --remove-tokens.
--remove-tokens
: If specified, all additional tokens will be removed, and the endpoint will be either public (if --public is specified) or only accessible with the workspace token (if --public is not specified).
--visibility TEXT
: Visibility of the endpoint. Can be 'public' or 'private'. If private, the endpoint will only be viewable by the creator and workspace admin.
-r, -replicas, --replicas-static INTEGER
: Use this option if you want a fixed number of replicas and want to turn off autoscaling. For example, to set a fixed number of 2 replicas, use: --replicas-static 2 or -r 2.
-ad, --autoscale-down TEXT
: Use this option if you want to have replicas but scale down after a specified time of no traffic. For example, to set 2 replicas and scale down after 3600 seconds of no traffic, use: --autoscale-down 2,3600s or --autoscale-down 2,3600. (Note: Do not include spaces around the comma.)
-agu, --autoscale-gpu-util TEXT
: Use this option to set a threshold for GPU utilization and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) with a 50% threshold, use: --autoscale-gpu-util 1,3,50% or --autoscale-gpu-util 1,3,50. (Note: Do not include spaces around the commas.) If the GPU utilization is higher than the target, the autoscaler scales up the replicas; if it is lower, the autoscaler scales down. The threshold value should be between 0 and 99.
-aq, --autoscale-qpm TEXT
: Use this option to set a threshold for queries per minute (QPM) and enable the system to scale between a minimum and maximum number of replicas. For example, to scale between 1 (min_replica) and 3 (max_replica) at 2.5 QPM, use: --autoscale-qpm 1,3,2.5. (Note: Do not include spaces around the commas.) This sets up autoscaling based on queries per minute, scaling between 1 and 3 replicas when QPM per replica exceeds 2.5.
-lg, --log-collection BOOLEAN
: Enable or disable log collection (true/false). If not provided, the workspace setting will be used.
--shared-memory-size INTEGER
: Update the shared memory size for this endpoint, in MiB.
--replica-spread TEXT
: Controls how endpoint replicas are distributed across different nodes to improve availability, though it may lead to resource fragmentation. Preferred (p): attempts to spread replicas across different nodes when possible. Required (r): enforces strict replica spreading where each replica must be scheduled on a different node; replicas cannot start if there are not enough nodes. Usage examples: --replica-spread required (or r) for strict spreading, --replica-spread preferred (or p) for soft spreading.
--help
: Show this message and exit.
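The hypothetical commands below (endpoint and token names are placeholders) illustrate the replacement semantics described above and a scale-to-zero update:

```shell
# Replace the endpoint's tokens (replacement, not incremental)
# and switch to a fixed replica count of 2 with autoscaling off.
lep endpoint update -n my-api --tokens NEW_TOKEN --replicas-static 2

# Scale the endpoint down to zero replicas; it will show as not ready
# until scaled back up with a positive replica count.
lep endpoint update -n my-api --min-replicas 0
```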
lep endpoint events
Lists events of the endpoint
Usage
lep endpoint events [OPTIONS]
Options
-n, --name TEXT
: The endpoint name to get events for. [required]
--help
: Show this message and exit.
lep endpoint get
Shows Endpoint detail and optionally saves its spec JSON.
Usage
lep endpoint get [OPTIONS]
Options
-n, --name TEXT
: Endpoint name. [required]
-p, --path PATH
: Optional local path to save the endpoint spec JSON. A directory or a full filename is accepted. If a directory is provided, the file will be saved as endpoint-spec-[name].json.
--help
: Show this message and exit.
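One workflow this enables is exporting an existing endpoint's spec and reusing it as a template for a new endpoint via create's --file option; the endpoint names below are placeholders:

```shell
# Save the spec JSON to the current directory
# (written as endpoint-spec-my-api.json when a directory is given).
lep endpoint get -n my-api -p .

# Recreate a similar endpoint from the saved spec, overriding the name.
lep endpoint create -n my-api-copy -f endpoint-spec-my-api.json
```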