Creating a job

A job corresponds to a one-off task that runs to completion and then stops.

This page walks through the basics of creating a job in Lepton and the configurable options available to you: environment variables, secrets, file system mounts, and more.

Create Job in Dashboard

Navigate to the create job page; the screenshot below shows the options available.

[Screenshot: create job page]

Configure Options

Resource

  • Node group: The node group that the job will be launched on; defaults to the shared node group.
  • Resource shape: The instance type that the job will run on; choose from a variety of CPU and GPU shapes.
  • Number of workers: The number of workers that will be used for the job; defaults to 1.
Note

Please reach out to your Technical Account Manager (TAM) if you need more resource shapes and node groups.
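The resource options above can also be set when creating a job from the lep CLI. The sketch below is a non-authoritative example: the flag names are assumptions that mirror the dashboard fields, so verify them with lep job create --help before relying on them.

#!/usr/bin/env bash
# Hypothetical CLI equivalent of the dashboard resource options.
# Flag names are assumptions; confirm with `lep job create --help`.
lep job create \
    --name my-job \
    --node-group shared \
    --resource-shape gpu.a10 \
    --num-workers 2 \
    --command "python train.py"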

Container

  • Image: The container image that will be used to create the job. You can choose from the default image list or use your own custom image.
  • Private image registry auth (optional): If you are using a private image, you must provide the registry's authentication credentials.
  • Run Command: The command to run when the container starts (see the sketch after this list).
  • Container Ports: The ports that the container will listen on.
  • Log Collection: Whether to collect logs from the container; follows the workspace-level setting by default.
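As an example of how Run Command and Container Ports fit together: if you declare port 8080, the run command should start a long-running process bound to that port. A minimal sketch, with the port and server as placeholders:

#!/usr/bin/env bash
# Placeholder run command: serve HTTP on the declared container port.
# Port 8080 must match the Container Ports setting.
python3 -m http.server 8080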

Advanced

  • Environment Variables: Environment variables are key-value pairs that are passed to the job. They are automatically set as environment variables in the job container, so the runtime can read them as needed.
    Note

    Your defined environment variables should not start with the prefix LEPTON_, as it is reserved for predefined environment variables. The following environment variables are predefined and will be available in the job:

    • LEPTON_JOB_NAME: The name of the job
    • LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the job
  • Shared Memory: The amount of shared memory that will be allocated to the container.
  • Max replica failure retry: Maximum number of times to retry a failed replica; defaults to zero.
  • Max job failure retry: Maximum number of times the entire job will be restarted after a failure (see the sketch after this list).
  • Visibility: Controls who can access the job. If set to private, only the creator can access the job; if set to public, all users in the workspace can.
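As an example of combining the retry options with the predefined variables, a run command can inspect LEPTON_JOB_FAILURE_COUNT to resume from a checkpoint on retried attempts. A minimal sketch, assuming a hypothetical checkpoint mount at /mnt/checkpoints and a train.py that accepts the flags shown:

#!/usr/bin/env bash
# Hypothetical: resume from a checkpoint when the job is retried.
CKPT_DIR=/mnt/checkpoints  # hypothetical file system mount

if [ "${LEPTON_JOB_FAILURE_COUNT:-0}" -gt 0 ] && [ -d "$CKPT_DIR" ]; then
    # Retried attempt: pick up where the previous run left off.
    echo "Attempt ${LEPTON_JOB_FAILURE_COUNT}: resuming from checkpoint"
    python train.py --resume-from "$CKPT_DIR"
else
    # First attempt: start fresh and write checkpoints for later retries.
    python train.py --checkpoint-dir "$CKPT_DIR"
fi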

Environment Variables

There are several environment variables that are predefined and will be available in the job:

| Env Variable Name | Meaning | Sample Value |
| --- | --- | --- |
| LEPTON_RESOURCE_ACCELERATOR_NUM | Number of hardware accelerators allocated | 1 |
| LEPTON_JOB_WORKER_HOSTNAME_PREFIX | Prefix used for naming worker hostnames | worker |
| LEPTON_WORKSPACE_ID | Identifier for the current workspace | prod01awsuswest2wssys |
| LEPTON_RESOURCE_ACCELERATOR_TYPE | Type of hardware accelerator used | NVIDIA-A100-80GB |
| LEPTON_WORKER_ID | Unique identifier for the current worker | env-job-98bw-0-2nm7s |
| LEPTON_JOB_FAILURE_COUNT | Number of failed job attempts | 0 |
| LEPTON_JOB_TOTAL_WORKERS | Total number of workers assigned to the job | 1 |
| LEPTON_JOB_WORKER_INDEX | Index of the current worker within the job | 0 |
| LEPTON_SUBDOMAIN | Subdomain name assigned to the job service | env-job-98bw-job-svc |
| LEPTON_JOB_SERVICE_PREFIX | Prefix used for naming services related to the job | env-job-98bw |
| LEPTON_JOB_NAME | Name assigned to the job | env-job-98bw |
| LEPTON_VIRTUAL_ENV | Path to the Python virtual environment | /opt/lepton/venv |

By leveraging these environment variables, you can customize your job to meet your specific needs. The following script shows a common pattern for launching a distributed job via torch.distributed.run:

#!/usr/bin/env bash

# Set up the service prefix and subdomain, falling back to values derived from the job name
SERVICE_PREFIX="${LEPTON_JOB_SERVICE_PREFIX:-$LEPTON_JOB_NAME}"
SUBDOMAIN="${LEPTON_SUBDOMAIN:-$LEPTON_JOB_NAME-job-svc}"

# Set up the master address (worker 0's in-cluster hostname), port, world size and node rank
export MASTER_ADDR=${SERVICE_PREFIX}-0.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local
export MASTER_PORT=29400
export WORLD_SIZE=${LEPTON_JOB_TOTAL_WORKERS}
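# WORKER_ADDRS is a comma-separated list of the remaining workers'
# in-cluster addresses (indices 1 through N-1); it is empty for a
# single-worker job.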
export WORKER_ADDRS=$(seq 1 $((LEPTON_JOB_TOTAL_WORKERS - 1)) | xargs -I {} echo ${SERVICE_PREFIX}-{}.${SUBDOMAIN}.ws-${LEPTON_WORKSPACE_ID}.svc.cluster.local | paste -sd ',' -)
export NODE_RANK=${LEPTON_JOB_WORKER_INDEX}

# Run the distributed training script. 
python -m torch.distributed.run \
    --nnodes=$WORLD_SIZE \
    --nproc_per_node=$LEPTON_RESOURCE_ACCELERATOR_NUM \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    /path/to/train.py
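Before launching training, it can help to confirm that the derived hostnames actually resolve inside the cluster. A small optional sketch reusing the variables above (getent queries the cluster DNS):

# Optional sanity check: make sure the master and worker hostnames resolve.
getent hosts "$MASTER_ADDR" || echo "master not resolvable yet"
for addr in ${WORKER_ADDRS//,/ }; do
    getent hosts "$addr" || echo "$addr not resolvable yet"
done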

Examples

For job creation, job failure diagnosis, and more, refer to the following examples:

Distributed training with PyTorch

Run jobs with conda environment

Job failure diagnosis
