# Model Configurations
The models available to train with NeMo Customizer are configurable through the Helm chart. Refer to the Model Catalog for a list of supported models.
## Configure Models
To make specific models available in your deployment, you need to enable them in your Helm chart's values. While the `values.yaml` file includes default configurations for all supported models, you can choose which ones to activate.
For example, to enable just two models, `meta/llama-3.1-8b-instruct` and `mistralai/mistral-7b-instruct-v0.3`, add the following configuration to your values file. This uses their default settings while keeping all other models hidden from the API:
```yaml
customizerConfig:
  models:
    meta/llama-3.1-8b-instruct:
      enabled: true
    mistralai/mistral-7b-instruct-v0.3:
      enabled: true
```
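Conversely, if a model is enabled by default in `values.yaml` and you do not want to expose it, you can hide it from the API by switching its flag off. The following is a minimal sketch; the model key shown is a placeholder for illustration, not an actual entry from the chart:

```yaml
customizerConfig:
  models:
    # Placeholder key; replace with the model you want to hide from the API.
    some-org/some-model:
      enabled: false
```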
## Configure Training Methods
Each model can be configured with specific training methods and resource allocations. You can customize these settings by configuring the `training_options` in your values file.
Here's an example that configures a model with both PEFT (specifically LoRA) and full-parameter SFT (all-weights) training options:
```yaml
customizerConfig:
  models:
    meta/llama-3.1-8b-instruct:
      enabled: true
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
        - training_type: sft
          finetuning_type: all-weights
          num_gpus: 8
```
## Configure Resources
The Helm chart comes with default GPU configurations that have been tested for PEFT training with typical dataset sizes. While the default setup assumes 8 GPUs per node, you can adjust these settings based on your cluster’s capabilities and your specific needs. For example, if you need more computational power, you can scale up the GPU allocation. Here’s how to increase the resources for LoRA training from the default 4 GPUs to 16 GPUs (spread across 2 nodes):
```yaml
customizerConfig:
  models:
    meta/llama-3.1-8b-instruct:
      enabled: true
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 8
          num_nodes: 2
```
If your cluster has additional resources, you can optionally increase training parallelism by raising the number of GPUs. A common reason to do so is when training jobs fail with an out-of-memory error such as the following, captured from a job log:
```
[rank12]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 4 has a total capacity of 79.25 GiB of which 134.75 MiB is free. Process 1538798 has 79.10 GiB memory in use. Of the allocated memory 76.27 GiB is allocated by PyTorch, and 245.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
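If you hit this error, one option is to allocate more GPUs to the job, as shown above; another is to lower per-GPU memory pressure by reducing the micro batch size (see `micro_batch_size` in the table below). The following values are a sketch rather than tuned defaults, and they assume `micro_batch_size` is set at the model level as described in the configuration options table; adjust the numbers for your own hardware and dataset.

```yaml
customizerConfig:
  models:
    meta/llama-3.1-8b-instruct:
      enabled: true
      micro_batch_size: 1        # smallest setting; trades throughput for lower memory use
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 8            # more GPUs per job spreads the memory load
          num_nodes: 2
```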
## Configuration Options
| Parameter | Description |
|---|---|
| `enabled` | Set to `true` to enable the NeMo Customizer microservice. |
| `entityStoreURL` | URL of the Entity Store microservice for managing dataset and model entities. |
| `nemoDataStoreURL` | URL of the NeMo Data Store that Customizer connects to for dataset and model files. |
| `models` | List of models to expose for training with the NeMo Customizer microservice. |
| `models.<model-name>` | Model name as map key (must match NeMo Data Store and NVIDIA Inference Microservices (NIM) for Large Language Models (LLMs)). |
| `models.<model-name>.enabled` | Set to `true` to enable the model for training. |
| `model_uri` | URI to download the model from. |
| `model_path` | Directory for the model download. Can be an absolute path or a path relative to the models storage mount. |
| `micro_batch_size` | Micro batch size for training. Larger sizes improve efficiency but risk out-of-memory errors. Use 1 for local mode. |
| `training_options` | Training configuration settings for the model. |
| `training_type` | Training objective (currently supports Supervised Fine-Tuning (SFT)). |
| `finetuning_type` | Fine-tuning method (supports `lora` and `all-weights`). |
| `num_gpus` | Number of GPUs for training (must not exceed the GPUs available per node). |
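To tie these options together, the following sketch shows a single model entry that combines the download and training parameters from the table. The service URLs, model URI, and download path are placeholder assumptions rather than values shipped with the chart; substitute the ones that match your deployment.

```yaml
customizerConfig:
  entityStoreURL: http://nemo-entity-store:8000   # assumed in-cluster service URL
  nemoDataStoreURL: http://nemo-data-store:3000   # assumed in-cluster service URL
  models:
    meta/llama-3.1-8b-instruct:
      enabled: true
      model_uri: <uri-of-the-model-to-download>   # placeholder URI
      model_path: llama-3_1-8b-instruct           # placeholder download directory
      micro_batch_size: 1
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
```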