CLI Reference
Reference guide for the fine-tuning CLI
The fine-tuning feature on DGX Cloud Lepton includes CLI support for interacting with fine-tuning jobs.
Installation and Authentication
First, install the Python SDK and authenticate with your workspace. Install the SDK with:
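```
# Install the DGX Cloud Lepton Python SDK
# (assumes the SDK's PyPI package name is leptonai)
pip install -U leptonai
```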
Next, authenticate with your workspace:
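```
# authenticate the CLI with your workspace
lep login
```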
This prompts you to authenticate with your DGX Cloud Lepton workspace. If you're in a GUI-supported environment, such as a desktop, a browser will open to the credentials page in your workspace. Otherwise, a URL will be displayed; open this URL in a browser.
On the credentials page, follow the prompts to create an authentication token. The page will display a secret token used for authentication. Copy the workspace ID and token shown in the second field and paste them back into your terminal. The format should look like xxxxxx:**************************. You should now be authenticated with DGX Cloud Lepton. You only need to authenticate once locally, as long as your credentials remain valid.
Validate Installation
After authentication, validate the installation by running:
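```
# list the workspaces your credentials can access
# (assumed subcommand; confirm with lep -h)
lep workspace list
```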
This lists your available workspaces and should look similar to the following if authentication was successful:
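(The exact table layout varies by CLI version; the values below are illustrative placeholders.)

```
┌────────┬──────────────┐
│ ID     │ Name         │
├────────┼──────────────┤
│ xxxxxx │ my-workspace │
└────────┴──────────────┘
```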
CLI Usage
The Python SDK includes CLI commands for launching, deleting, and retrieving information about fine-tuning jobs. High-level information on these commands can be viewed by running the following in your terminal:
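```
# top-level help for the fine-tuning commands
lep finetune -h
```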
Launching a Fine-tuning Job
To launch a fine-tuning job from the CLI, use the lep finetune create command. This section uses the OpenMathReasoning example as a guide for launching a job from the CLI.
Following the linked guide, the command to launch the fine-tuning job would look like the following:
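The sketch below is illustrative rather than the guide's exact command: the node group, storage mount, model URI, and hyperparameter values are placeholders, and the column mappings assume the dataset's input and output columns are named problem and generated_solution. Substitute the values from the OpenMathReasoning guide:

```
lep finetune create \
  --name openmathreasoning \
  --resource-shape gpu.8xh200 \
  --node-group <node-group-name> \
  --num-workers 1 \
  --mount /finetune:/mnt/finetune:node-nfs:<storage-name> \
  --checkpoint-directory /mnt/finetune/checkpoints \
  --model-uri <hf-repo-id-or-model-path> \
  --dataset-uri nvidia/OpenMathReasoning \
  --dataset-split train \
  --dataset-column-mapping question:problem \
  --dataset-column-mapping answer:generated_solution \
  --epochs 1 \
  --cp-size 1 \
  --val-every-steps 50 \
  --checkpoint-every-steps 100 \
  --global-batch-size 128 \
  --learning-rate 5e-6 \
  --min-lr 5e-7
```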
The parameters are as follows:
- name: The name of the fine-tuning job, used to distinguish it from other jobs.
- resource-shape: Specify the resource shape to run the fine-tuning job on, such as gpu.8xh200.
- node-group: Enter the node group to run the fine-tuning job on.
- num-workers: Enter the number of workers to use for the fine-tuning job. Values greater than 1 will launch a distributed fine-tuning job.
- mount: Specify the <storage path>:<mount path>:<storage name> for the storage to mount. <storage path> is the location on the storage to mount, <mount path> is the directory to mount the storage inside the container, and <storage name> is the name of the storage volume to mount. <storage name> must start with either node-nfs: or node-local:.
- checkpoint-directory: Enter the directory to save the output checkpoints to. This must be in a mounted storage volume.
- model-uri: Enter the Hugging Face repo ID for the base model to fine-tune, or the absolute path to a model in safetensors format on mounted storage.
- dataset-uri: Enter the Hugging Face repo ID for the training dataset, or the absolute path to a preprocessed dataset in .jsonl format on mounted storage.
- dataset-split: Enter the dataset split name if loading from Hugging Face, such as train.
- dataset-column-mapping: Enter the column mapping for the training dataset if columns in the dataset have different names than the fine-tuning defaults. For example, if the input column in the dataset is named prompt, map it to question with --dataset-column-mapping question:prompt. This argument can be repeated for all three columns (context, question, and answer).
- validation-dataset-uri: Enter the Hugging Face repo ID for the validation dataset, or the absolute path to a preprocessed dataset in .jsonl format on mounted storage.
- validation-dataset-split: Enter the dataset split name if loading from Hugging Face, such as validation.
- validation-dataset-column-mapping: Enter the column mapping for the validation dataset if columns in the dataset have different names than the fine-tuning defaults. For example, if the input column in the dataset is named prompt, map it to question with --validation-dataset-column-mapping question:prompt. This argument can be repeated for all three columns (context, question, and answer).
- epochs: Specify the number of epochs to fine-tune over the training dataset.
- cp-size: Enter the context parallelism size for the fine-tuning job, useful for longer-context inputs and outputs.
- val-every-steps: Specify how frequently a validation pass should be run.
- checkpoint-every-steps: Specify how often a checkpoint should be saved.
- global-batch-size: Enter the global batch size to use during fine-tuning.
- learning-rate: Enter the maximum learning rate for the optimizer during fine-tuning.
- min-lr: Enter the minimum learning rate for the optimizer during fine-tuning.
The complete list of parameters available for the CLI can be found with lep finetune create -h. For further detail on the values used in this example, refer to the OpenMathReasoning guide.
Running the command above will queue the fine-tuning job on the specified node group and the job will start once resources are available.
Listing Fine-tuning Jobs
After fine-tuning jobs have been launched, they can be viewed from the CLI using the lep finetune list command. This will provide high-level information on all of the fine-tuning jobs in the node group in tabular format. An example is as follows:
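```
lep finetune list
```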
This will display a table similar to the following:
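(The columns and layout are illustrative; the job's name and ID appear stacked in the first column.)

```
┌────────────────────────┬─────────┬─────────────────────┐
│ Name / ID              │ Status  │ Created At          │
├────────────────────────┼─────────┼─────────────────────┤
│ openmathreasoning      │ Running │ 2025-01-01 00:00:00 │
│ openmathreasoning-c7vv │         │                     │
└────────────────────────┴─────────┴─────────────────────┘
```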
A complete list of parameters can be viewed with:
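```
lep finetune list -h
```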
Retrieving Fine-tuning Job Details
To get more detailed information on an existing fine-tuning job including training parameters, model and dataset names, job status, and more, use the lep finetune get command. This will retrieve all of the metadata for a specific fine-tuning job.
You will need the ID of the fine-tuning job, which can be found using the lep finetune list command above. The ID for the job in the example above is openmathreasoning-c7vv, shown on the second line of the first box on the left.
To view the job's information, run:
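```
# --job-id is assumed here; confirm the exact flag with lep finetune get -h
lep finetune get --job-id openmathreasoning-c7vv
```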
This will output a JSON object with all of the training and job specifications, similar to the following which has been truncated for brevity:
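The exact schema is determined by the service; the heavily truncated sketch below is illustrative, and the field names are assumptions:

```
{
  "metadata": {
    "id": "openmathreasoning-c7vv",
    "name": "openmathreasoning"
  },
  "spec": {
    "model_uri": "<hf-repo-id-or-model-path>",
    "dataset_uri": "nvidia/OpenMathReasoning",
    "epochs": 1,
    "learning_rate": 5e-06,
    ...
  },
  "status": {
    "state": "Running",
    ...
  }
}
```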
A complete list of parameters can be viewed with:
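```
lep finetune get -h
```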
Deleting Fine-tuning Jobs
To delete a fine-tuning job, use the lep finetune delete command. This will stop and delete the specified job.
You will need the fine-tuning job's ID, which can be found with the lep finetune list command. In the earlier example, the ID is openmathreasoning-c7vv. To delete the job, run:
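```
# --job-id is assumed here; confirm the exact flag with lep finetune delete -h
lep finetune delete --job-id openmathreasoning-c7vv
```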
This will send a signal to the workspace to stop the job if it's currently running and delete it from the fine-tuning job list.
A complete list of parameters can be viewed with:
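```
lep finetune delete -h
```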