The benchmarks used in this sizing guide are not all-encompassing; they represent a typical workflow and serve as a starting point that you can build on for your environment. This sizing focuses specifically on a single-node deployment for the following workflow.

Large Language Models

Llamafactory supports many LLMs, which you can select using the Model name drop-down menu. Llamafactory obtains these models through Hugging Face. Some models are ungated, while others, like llama3-8b-instruct, are gated. If a model is gated, you need to request access to it. Once access is granted, you can generate an Access Token to use with AI Workbench to download the model. Which models you can run depends on the model's parameter count (7B, 8B, etc.) and the quantization you select (4-bit, 8-bit, etc.).
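To illustrate how parameter count and quantization interact, the following is a minimal sketch that estimates the memory needed for the model weights alone. The helper name `weight_memory_gib` is hypothetical and not part of Llamafactory or AI Workbench, and the figure it returns is a lower bound: fine-tuning also needs memory for gradients, optimizer state, and activations.

```python
def weight_memory_gib(params_billion: float, bits: int) -> float:
    """Approximate memory for the model weights only, in GiB.

    Assumption: every parameter is stored at the same bit width.
    Gradients, optimizer state, and activations are NOT included.
    """
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 2**30


# Compare an 8B-parameter model at three common bit widths.
for bits in (16, 8, 4):
    print(f"8B model at {bits}-bit: ~{weight_memory_gib(8, bits):.1f} GiB of weights")
```

Halving the bit width halves the weight footprint, which is why a lower-bit quantization lets a larger model fit in the same amount of vGPU memory.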

In most cases, a 24Q vGPU profile can support a model of up to 8B parameters at 16-bit precision. You can review Llamafactory’s guidance on what is supported here, but it is only general guidance: other factors, such as batch size and sequence length, also affect the amount of vGPU memory required. In Llamafactory, these parameters are Batch size and Cutoff length, respectively. If you receive ‘out of memory’ errors, lower these parameters or increase the vGPU profile to 32Q.
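The out-of-memory guidance above can be sketched as a simple fallback order. This is only one possible policy, not a documented Llamafactory or vGPU procedure: the function `next_settings`, the choice to halve Batch size before Cutoff length, and the `min_cutoff` floor of 512 are all assumptions made for illustration.

```python
def next_settings(batch_size: int, cutoff_length: int, profile: str,
                  min_cutoff: int = 512) -> tuple[int, int, str]:
    """Suggest the next settings to try after an 'out of memory' error.

    Hypothetical policy: halve Batch size first, then halve Cutoff length
    down to min_cutoff (an arbitrary floor), then move from 24Q to 32Q.
    """
    if batch_size > 1:
        return batch_size // 2, cutoff_length, profile
    if cutoff_length // 2 >= min_cutoff:
        return batch_size, cutoff_length // 2, profile
    if profile == "24Q":
        return batch_size, cutoff_length, "32Q"
    raise RuntimeError("No smaller settings or larger profile left to try")


# Walk through successive suggestions starting from an example configuration.
settings = (4, 1024, "24Q")
for _ in range(4):
    settings = next_settings(*settings)
    print(settings)
```

Lowering Batch size first keeps the full sequence length available to the model; if long sequences matter less for your data than throughput, reversing the order is equally reasonable.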

For a list of all supported Large Language Models, along with their supported model sizes and the template to use, refer to this chart in the official Llamafactory GitHub repository.