SDK Reference

Reference guide for the fine-tuning SDK

The fine-tuning feature on DGX Cloud Lepton includes SDK support for interacting with fine-tuning jobs.

Installation and Authentication

First, install the Python SDK and authenticate with your workspace. Install the SDK with:
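For example (the SDK is assumed here to be published on PyPI as `leptonai`; adjust the package name if your workspace documentation specifies otherwise):

```shell
# Install (or upgrade) the DGX Cloud Lepton Python SDK and its `lep` CLI
pip install -U leptonai
```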

Next, authenticate with your workspace:
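The CLI installed alongside the SDK provides a login command (command name assumed from the `lep` CLI referenced later in this guide):

```shell
# Starts the interactive authentication flow described below
lep login
```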

This prompts you to authenticate with your DGX Cloud Lepton workspace. If you're in a GUI-supported environment such as a desktop, a browser will open to the credentials page in your workspace. Otherwise, a URL will be displayed. Open this URL in a browser.

On the credentials page, create an authentication token by following the prompts. The page displays a secret token used for authentication. Copy the combined workspace ID and token shown in the second field and paste it into your terminal; the format looks like xxxxxx:**************************. You should now be authenticated with DGX Cloud Lepton. You only need to authenticate once locally, as long as your credentials remain valid.


Validate Installation

After authentication, validate the installation by running:
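A command that performs this check (the subcommand name is an assumption based on the `lep` CLI's workspace commands):

```shell
# Lists the workspaces your credentials can access
lep workspace list
```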

This lists your available workspaces; if authentication was successful, the output should look similar to the following:

SDK Usage

The Python SDK includes support for launching, deleting, and retrieving information for fine-tuning jobs.

Launching a Fine-tuning Job

To launch a fine-tuning job from the Python SDK, create a short Python script. This section uses the OpenMathReasoning example as a guide for launching the job from the SDK.

Following the linked guide, the script to launch the fine-tuning job would look like the following:
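The sketch below assembles the documented parameters into a job specification; the model, dataset, node group, and storage names are illustrative placeholders, and the submission call at the end uses assumed SDK names rather than confirmed API — follow the linked OpenMathReasoning guide for the authoritative script.

```python
# Sketch of a fine-tuning launch script. All values are illustrative;
# replace them with your workspace's node group, storage volume, model,
# and dataset before running.

job_spec = {
    "name": "openmathreasoning-sft",
    "resource_shape": "gpu.8xh200",
    "node_groups": ["my-node-group"],       # must be a list of strings
    "num_workers": 1,                       # >1 launches a distributed job
    "mount": {
        "path": "/",                        # location on the storage volume
        "mount_path": "/workspace",         # mount point inside the container
        "from": "node-nfs:my-volume",       # must start with node-nfs: or node-local:
    },
    "checkpoint_directory": "/workspace/checkpoints",  # must be on mounted storage
    "model_uri": "Qwen/Qwen2.5-7B-Instruct",      # HF repo ID or absolute model path
    "dataset_uri": "nvidia/OpenMathReasoning",    # HF repo ID or absolute .jsonl path
    "dataset_split": "cot",
    "dataset_column_mapping": {"question": "problem", "answer": "generated_solution"},
    "validation_dataset_uri": "nvidia/OpenMathReasoning",
    "validation_dataset_split": "validation",
    "validation_dataset_column_mapping": {"question": "problem", "answer": "generated_solution"},
    "epochs": 1,
    "cp_size": 1,
    "val_every_steps": 50,
    "checkpoint_every_steps": 100,
    "global_batch_size": 128,
    "learning_rate": 1e-5,
    "min_lr": 1e-6,
}

# Submission call (assumed API surface -- not confirmed; see the linked guide):
# from leptonai.api.v1.client import APIClient
# client = APIClient()
# client.finetune.create(job_spec)
```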

Modify the script above to match your workspace settings, then run it with python3 <script.py>. This launches the fine-tuning job in your workspace; the job starts once resources are available.

Some of the high-level parameters in the script above are as follows:

  • name: A unique name that identifies the fine-tuning job.
  • resource_shape: Specify the resource shape to run the fine-tuning job on, such as gpu.8xh200.
  • node_groups: Enter the node group to run the fine-tuning job on. This must be a list of strings.
  • num_workers: Enter the number of workers to use for the fine-tuning job. Values greater than 1 will launch a distributed fine-tuning job.
  • mount: Specify the location for the storage to mount. path is the location on the storage to mount, mount_path is the directory to mount storage inside the container, and from is the name of the storage volume to mount. from must start with either node-nfs: or node-local:.
  • checkpoint_directory: Enter the directory to save the output checkpoints to. This must be in a mounted storage volume.
  • model_uri: Enter the Hugging Face repo ID for the base model to fine-tune or the absolute path to a model in safetensors format loaded on mounted storage.
  • dataset_uri: Enter the Hugging Face repo ID for the training dataset, or the absolute path to a preprocessed dataset in .jsonl format on mounted storage.
  • dataset_split: Enter the dataset split name if loading from Hugging Face, such as train.
  • dataset_column_mapping: Enter the column mapping for the training dataset if columns in the dataset have different names than the fine-tuning defaults. This should be in dictionary format. For example, if the input column in the dataset is named prompt, this should be mapped to question with {"question": "prompt"}. The key for each item in the dictionary must be one of the three input column names (context, question, and answer).
  • validation_dataset_uri: Enter the Hugging Face repo ID for the validation dataset, or the absolute path to a preprocessed dataset in .jsonl format on mounted storage.
  • validation_dataset_split: Enter the dataset split name if loading from Hugging Face, such as validation.
  • validation_dataset_column_mapping: Enter the column mapping for the validation dataset if columns in the dataset have different names than the fine-tuning defaults. This should be in dictionary format. For example, if the input column in the dataset is named prompt, this should be mapped to question with {"question": "prompt"}. The key for each item in the dictionary must be one of the three input column names (context, question, and answer).
  • epochs: Specify the number of passes (epochs) over the training dataset.
  • cp_size: Enter the context parallelism size for the fine-tuning job, useful for longer context inputs and outputs.
  • val_every_steps: Specify how frequently, in training steps, a validation pass is run.
  • checkpoint_every_steps: Specify how frequently, in training steps, a checkpoint is saved.
  • global_batch_size: Enter the global batch size to use during fine-tuning.
  • learning_rate: Enter the maximum learning rate for the optimizer during fine-tuning.
  • min_lr: Enter the minimum learning rate for the optimizer during fine-tuning.

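To illustrate the column-mapping behavior described above, the local sketch below shows the effect of a mapping on one hypothetical record; the actual remapping is performed by the fine-tuning pipeline, not by user code.

```python
# Illustration only: shows what a column mapping does to a dataset record.
def apply_column_mapping(record, mapping):
    # Mapping keys must be fine-tuning input columns
    # ("context", "question", "answer"); values are the dataset's own column names.
    return {target: record[source] for target, source in mapping.items()}

record = {"prompt": "What is 2 + 2?", "response": "4"}
mapping = {"question": "prompt", "answer": "response"}
print(apply_column_mapping(record, mapping))
# {'question': 'What is 2 + 2?', 'answer': '4'}
```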
Listing Fine-tuning Jobs

After fine-tuning jobs have been launched, they can be viewed from the SDK. This will provide high-level information on all of the fine-tuning jobs in the node group in tabular format. An example is as follows:
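A sketch of such a listing script; the module path, client class, and method names below are assumptions about the SDK surface, not confirmed API:

```python
# Assumed SDK surface -- names are illustrative and may differ in your version.
from leptonai.api.v1.client import APIClient

client = APIClient()  # reuses the credentials stored by `lep login`
for job in client.finetune.list_all():  # assumed method name
    print(job)
```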

This will display a list of jobs similar to the following:

Retrieving Fine-tuning Job Details

To get more detailed information on an existing fine-tuning job including training parameters, model and dataset names, job status, and more, the SDK provides an API to retrieve fine-tuning job metadata.

You will need the ID for a fine-tuning job which can be found using the lep finetune list CLI command or the SDK to list jobs as shown above. The ID for the job in the example above is openmathreasoning-x7x5, shown in the second line of the first box on the left.

To view the job's information, run:
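A sketch of the retrieval call; as above, the client and method names are assumptions:

```python
# Assumed SDK surface -- names are illustrative and may differ in your version.
from leptonai.api.v1.client import APIClient

client = APIClient()
job = client.finetune.get("openmathreasoning-x7x5")  # ID from the job listing
print(job)  # full training and job specification
```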

This will output a JSON object with all of the training and job specifications, similar to the following which has been truncated for brevity:

Deleting Fine-tuning Jobs

Jobs can be deleted with the SDK by providing the ID of the fine-tuning job to delete. The ID can be found with the lep finetune list CLI command or by using the SDK to list job IDs as shown above. In the earlier example, the ID is openmathreasoning-x7x5. To delete the job, run:
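A sketch of the deletion call; the client and method names are assumptions about the SDK surface:

```python
# Assumed SDK surface -- names are illustrative and may differ in your version.
from leptonai.api.v1.client import APIClient

client = APIClient()
client.finetune.delete("openmathreasoning-x7x5")  # assumed method name
```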

This will send a signal to the workspace to stop the job if it's currently running and delete it from the fine-tuning job list.

Copyright © 2025, NVIDIA Corporation.