Creating a job

A job corresponds to a one-off task that runs to completion and then stops.

This page walks through the basics of creating a job in Lepton and the configurable options available to you: environment variables, secrets, file system mounts, and more.

Create Job in Dashboard

Navigate to the create job page; the screenshot below shows the options available.

[Screenshot: create job page]

Configure Options

Resource

  • Node group: The node group that the job will be launched on; defaults to the shared node group.
  • Resource shape: The instance type that the job will run on; choose from a variety of CPU and GPU shapes.
  • Number of workers: The number of workers that will be used for the job; defaults to 1.
Note

Please reach out to your Technical Account Manager (TAM) if you need more resource shapes and node groups.
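The resource options above can also be set when creating a job from the lep CLI. The sketch below is a non-authoritative example: the flag names are assumptions that mirror the dashboard fields, so verify them with lep job create --help before relying on them.

#!/usr/bin/env bash
# Hypothetical CLI equivalent of the dashboard resource options.
# Flag names are assumptions; confirm with `lep job create --help`.
lep job create \
    --name my-job \
    --node-group shared \
    --resource-shape gpu.a10 \
    --num-workers 2 \
    --command "python train.py"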

Container

  • Image: The container image that will be used to create the job. You can choose from the default image list or use your own custom image.
  • Private image registry auth (optional): If you are using a private image, you must provide the registry's authentication credentials.
  • Run Command: The command to run when the container starts (see the sketch after this list).
  • Container Ports: The ports that the container will listen on.
  • Log Collection: Whether to collect logs from the container; follows the workspace-level setting by default.
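As an example of how Run Command and Container Ports fit together: if you declare port 8080, the run command should start a long-running process bound to that port. A minimal sketch, with the port and server as placeholders:

#!/usr/bin/env bash
# Placeholder run command: serve HTTP on the declared container port.
# Port 8080 must match the Container Ports setting.
python3 -m http.server 8080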

Advanced

  • Environment Variables: Environment variables are key-value pairs that are passed to the job. They are automatically set as environment variables in the job container, so the runtime can read them as needed.
    Note

    Your defined environment variables should not start with the prefix LEPTON_, as it is reserved for predefined environment variables. The following environment variables are predefined and will be available in the job:

    • LEPTON_JOB_NAME: The name of the job
    • LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the job
  • Shared Memory: The amount of shared memory that will be allocated to the container.
  • Max replica failure retry: Maximum number of times to retry a failed replica; defaults to zero.
  • Max job failure retry: Maximum number of times the entire job will be restarted after a failure (see the sketch after this list).
  • Visibility: Controls who can access the job. If set to private, only the creator can access the job; if set to public, all users in the workspace can.
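As an example of combining the retry options with the predefined variables, a run command can inspect LEPTON_JOB_FAILURE_COUNT to resume from a checkpoint on retried attempts. A minimal sketch, assuming a hypothetical checkpoint mount at /mnt/checkpoints and a train.py that accepts the flags shown:

#!/usr/bin/env bash
# Hypothetical: resume from a checkpoint when the job is retried.
CKPT_DIR=/mnt/checkpoints  # hypothetical file system mount

if [ "${LEPTON_JOB_FAILURE_COUNT:-0}" -gt 0 ] && [ -d "$CKPT_DIR" ]; then
    # Retried attempt: pick up where the previous run left off.
    echo "Attempt ${LEPTON_JOB_FAILURE_COUNT}: resuming from checkpoint"
    python train.py --resume-from "$CKPT_DIR"
else
    # First attempt: start fresh and write checkpoints for later retries.
    python train.py --checkpoint-dir "$CKPT_DIR"
fi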

Environment Variables

There are several environment variables that are predefined and will be available in the job:

| Env Variable Name | Meaning | Sample Value |
| --- | --- | --- |
| LEPTON_RESOURCE_ACCELERATOR_NUM | Number of hardware accelerators allocated | 1 |
| LEPTON_JOB_WORKER_HOSTNAME_PREFIX | Prefix used for naming worker hostnames | worker |
| LEPTON_WORKSPACE_ID | Identifier for the current workspace | prod01awsuswest2wssys |
| LEPTON_RESOURCE_ACCELERATOR_TYPE | Type of hardware accelerator used | NVIDIA-A100-80GB |
| LEPTON_WORKER_ID | Unique identifier for the current worker | env-job-98bw-0-2nm7s |
| LEPTON_JOB_FAILURE_COUNT | Number of failed job attempts | 0 |
| LEPTON_JOB_TOTAL_WORKERS | Total number of workers assigned to the job | 1 |
| LEPTON_JOB_WORKER_INDEX | Index of the current worker within the job | 0 |
| LEPTON_SUBDOMAIN | Subdomain name assigned to the job service | env-job-98bw-job-svc |
| LEPTON_JOB_SERVICE_PREFIX | Prefix used for naming services related to the job | env-job-98bw |
| LEPTON_JOB_NAME | Name assigned to the job | env-job-98bw |
| LEPTON_VIRTUAL_ENV | Path to the Python virtual environment | /opt/lepton/venv |

By leveraging these environment variables, you can customize your job to meet your specific needs. The following script shows a common pattern for launching a distributed job via torch.distributed.run:

#!/usr/bin/env bash

# Set up the service prefix and subdomain, falling back to values derived from the job name
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-$LEPTON_JOB_NAME}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-$LEPTON_JOB_NAME-job-svc}"

# Set up the master address (worker 0's in-cluster hostname), port, world size and node rank
export MASTER_ADDR=${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local
export MASTER_PORT=29400
export WORLD_SIZE=${LEPTON_JOB_TOTAL_WORKERS}
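# WORKER_ADDRS is a comma-separated list of the remaining workers'
# in-cluster addresses (indices 1 through N-1); it is empty for a
# single-worker job.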
export WORKER_ADDRS=$(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) | xargs -I {} echo ${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local | paste -sd ',' -)
export NODE_RANK=${LEPTON_JOB_WORKER_INDEX}

# Run the distributed training script. 
python -m torch.distributed.run \
    --nnodes=$WORLD_SIZE \
    --nproc_per_node=$LEPTON_RESOURCE_ACCELERATOR_NUM \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /path/to/train.py
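Before launching training, it can help to confirm that the derived hostnames actually resolve inside the cluster. A small optional sketch reusing the variables above (getent queries the cluster DNS):

# Optional sanity check: make sure the master and worker hostnames resolve.
getent hosts "$MASTER_ADDR" || echo "master not resolvable yet"
for addr in ${WORKER_ADDRS//,/ }; do
    getent hosts "$addr" || echo "$addr not resolvable yet"
done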

Examples

For job creation, job failure diagnosis, and more, refer to the following examples:

Distributed training with PyTorch

Run jobs with conda environment

Job failure diagnosis
