Batch Job Configurations

When creating a batch job, you can configure its resources, container, and advanced options.

Resource

  • Node Group: Select one or more node groups to determine which resources the job can use.
  • Priority: The priority of the job, defaults to Medium (4). If the selected node group has limited resources, set a higher priority to give the job precedence in resource allocation.
  • Resource shape: The instance type that the job will be running on. Select from a variety of CPU and GPU shapes. Refer to Node Group Shapes for more details.
  • Nodes: By default, no specific nodes are selected, but you can specify the nodes on which to launch the job.
  • Can preempt lower priority workload: Whether the job can preempt lower priority workload, defaults to false.
  • Can be preempted by higher priority workload: Whether the job can be preempted by higher priority workload, defaults to false.
  • Workers: The number of workers to launch for the job, defaults to 1.
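The defaults and preemption flags above can be summarized in a short sketch. Everything here is illustrative: `ResourceSettings` and `may_preempt` are hypothetical names, not part of any Lepton API, and the rule that a larger number means a higher priority is an assumption made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceSettings:
    """Hypothetical container for the Resource options; not a Lepton API type."""
    node_groups: list[str]                 # one or more node groups (required)
    resource_shape: str                    # a CPU or GPU shape name
    priority: int = 4                      # documented default: Medium (4)
    nodes: list[str] = field(default_factory=list)  # empty = no specific nodes
    can_preempt: bool = False              # documented default: false
    can_be_preempted: bool = False         # documented default: false
    workers: int = 1                       # documented default: 1

    def __post_init__(self) -> None:
        if not self.node_groups:
            raise ValueError("select at least one node group")
        if self.workers < 1:
            raise ValueError("workers must be at least 1")


def may_preempt(candidate: ResourceSettings, running: ResourceSettings) -> bool:
    """Assumed rule: both flags must allow it and the candidate must rank higher.

    Assumes a larger number means a higher priority.
    """
    return (
        candidate.can_preempt
        and running.can_be_preempted
        and candidate.priority > running.priority
    )


# "group-a" and "example-shape" are placeholder values.
high = ResourceSettings(["group-a"], "example-shape", priority=7, can_preempt=True)
low = ResourceSettings(["group-a"], "example-shape", can_be_preempted=True)
print(may_preempt(high, low))  # True
print(may_preempt(low, high))  # False
```

With both flags left at their default of false, a job neither preempts other workloads nor gets preempted, regardless of its priority.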

Container

  • Image: The container image that will be used to create the job. You can choose from the default image lists or use your own custom image.
  • Private image registry auth (optional): If you are using a private image, you need to specify the image registry auth.
  • Run Command: The command to run when the container starts.
  • Container Ports: The ports that the container will listen on. You can add multiple ports, each specified with a protocol (TCP, UDP, or SCTP) and a port number.
  • Log Collection: Whether to collect logs from the container. By default, this follows the workspace-level setting.
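As an illustration of the Container Ports field, the sketch below validates a list of port entries before submitting them. `ContainerPort` is a hypothetical helper, not a Lepton type, and the TCP default is an assumption made only for this example. The 1 to 65535 range is the standard range of valid port numbers.

```python
from dataclasses import dataclass

ALLOWED_PROTOCOLS = {"TCP", "UDP", "SCTP"}  # protocols listed for Container Ports


@dataclass(frozen=True)
class ContainerPort:
    """Hypothetical port entry; the real form or API may name these differently."""
    port: int
    protocol: str = "TCP"  # assumed default, for this sketch only

    def __post_init__(self) -> None:
        normalized = self.protocol.upper()
        if normalized not in ALLOWED_PROTOCOLS:
            raise ValueError(f"unsupported protocol: {self.protocol}")
        if not 1 <= self.port <= 65535:
            raise ValueError(f"port out of range: {self.port}")
        # Store the upper-case form; a frozen dataclass needs object.__setattr__.
        object.__setattr__(self, "protocol", normalized)


# Multiple ports can be declared, each with its own protocol.
ports = [ContainerPort(8080), ContainerPort(5000, "udp")]
for entry in ports:
    print(entry.protocol, entry.port)
```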

Advanced

  • Environment Variables: Key-value pairs that are passed to the job. They are automatically set as environment variables in the job container, so the runtime can refer to them as needed. Refer to this guide for more details.
    Note

    Environment variables that you define should not start with the prefix LEPTON_, as this prefix is reserved for predefined environment variables. The following environment variables are predefined and available in the job:

    • LEPTON_JOB_NAME: The name of the job
    • LEPTON_RESOURCE_ACCELERATOR_TYPE: The resource accelerator type of the job
  • Storages: Mount storage for the job container. Refer to this guide for more details.
  • Shared Memory: The size of the shared memory that will be allocated to the container.
  • Max replica failure retry: Maximum number of times to retry a failed replica, defaults to 0.
  • Max job failure retry: Maximum number of times to restart the entire job after a failure.
  • Disable retry when program error occurs: If enabled, the job won't be retried if a program error is detected in the logs.
  • Archive time: The time to keep the job's logs and artifacts after the job is completed, defaults to 3 days.
  • Visibility: Specifies the visibility of the job. If set to private, only the creator can access the job. If set to public, all users in the workspace can access the job.
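For the Environment Variables option, a job script might read the predefined variables and check user-defined names against the reserved prefix, roughly as sketched below. The helper names `check_user_env` and `describe_job` are hypothetical, and the "unknown" fallback is an assumption so that the script also runs outside a job container.

```python
import os

RESERVED_PREFIX = "LEPTON_"  # reserved for predefined variables


def check_user_env(user_env: dict[str, str]) -> list[str]:
    """Return the user-defined keys that collide with the reserved prefix."""
    return [key for key in user_env if key.startswith(RESERVED_PREFIX)]


def describe_job() -> str:
    """Read the two predefined variables listed above, with a fallback."""
    job_name = os.environ.get("LEPTON_JOB_NAME", "unknown")
    accelerator = os.environ.get("LEPTON_RESOURCE_ACCELERATOR_TYPE", "unknown")
    return f"job={job_name} accelerator={accelerator}"


# BATCH_SIZE is fine; LEPTON_MODE would collide with the reserved prefix.
conflicts = check_user_env({"BATCH_SIZE": "32", "LEPTON_MODE": "fast"})
print(conflicts)
print(describe_job())
```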
Copyright @ 2025, NVIDIA Corporation.