Sweeping for Optimized TensorRT-LLM Engines#

Important

During Early Access (EA), sweeping will only work if your NVIDIA NGC org has been granted access permission.

Sweeping is a systematic exploration of various configurations to identify the optimal set for serving a particular machine learning model. In the context of TensorRT-LLM engines, sweeping involves testing different combinations of settings such as maximum sequence length, maximum batch size, quantization, and target GPUs to optimize for objectives such as throughput or latency. The process evaluates performance metrics for each configuration across multiple trials, allowing developers to effectively search the specified parameter space and improve the model's performance and efficiency.

Launching a Sweep#

A sweep can be launched with the trt-cloud sweep command in one of two ways:

  1. Automatic Sweep Configuration.

  2. User Created, Custom Sweep Configuration.

Automatic Sweep Configuration#

Running the trt-cloud sweep command with a minimal set of input and load profile arguments automatically generates a sweep configuration tailored to your needs.

Minimal Argument Groups

  1. Input Sources for Sweeping

  2. Supported GPUs (--gpu)

  3. Load profiles:

    1. --concurrency (optional)

    2. --input-sequence-length

    3. --output-sequence-length

  4. Optional: Optimization objective

    1. --optimization-objective {throughput,latency}

For example:

trt-cloud sweep \
--src-hf-repo meta-llama/Llama-3.2-1B-Instruct \
--gpu L40S \
--concurrency 50 \
--input-sequence-length 1000 \
--output-sequence-length 1000 \
--optimization-objective throughput

Anything that is not provided will be selected automatically. Run trt-cloud sweep -h for more information.
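For instance, a minimal sketch that omits the optional --concurrency and --optimization-objective arguments (values for omitted arguments are then selected automatically):

trt-cloud sweep \
--src-hf-repo meta-llama/Llama-3.2-1B-Instruct \
--gpu L40S \
--input-sequence-length 1000 \
--output-sequence-length 1000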

User Created, Custom Sweep Configuration#

An alternative way to launch a sweep is from a configuration file in JSON format. This configuration contains, among other things:

  • the LLM model to sweep over

  • the TensorRT-LLM version to use

  • the various parameters that are part of the search space

  • benchmark parameters

To generate a sweep configuration file, run trt-cloud sweep and provide the following additional arguments:

--save-config SAVE_CONFIG_FILE.json --save-config-only

By default, the token is removed from the generated configuration. If you want to keep the token, add --save-config-with-token. For example:

--save-config SAVE_CONFIG_FILE.json --save-config-with-token

For example:

$ trt-cloud sweep --src-hf-repo meta-llama/Llama-3.2-1B-Instruct --gpu L40S --trtllm-version 0.17.0 --max-trials 4 --isl 4096 --osl 1024 --save-config my_sweep_config.json --save-config-only
[I] Generating sweep config file using the provided arguments.
[I] Saved generated sweep config to my_sweep_config.json.

The generated file can now be modified and launched:

$ trt-cloud sweep -c my_sweep_config.json

For details on how to customize your sweep config, refer to the Sweep Configuration Schema section and the Interactive Sweep Config documentation.

Sweep Configuration Schema#

The following code shows an example sweep configuration. It is a good starting point for customizing your own sweep config.

Replace <YOUR_HF_TOKEN> with your actual Hugging Face token, with read access to the meta-llama/Llama-3.2-1B-Instruct repo.

{
    "sweep_config": {
        "build_inputs": [
            {
                "source": {
                    "id": "meta-llama/Llama-3.2-1B-Instruct",
                    "source_type": "huggingface_repo",
                    "token": "<YOUR_HF_TOKEN>"
                },
                "type": "huggingface_checkpoint"
            }
        ],
        "hardware": {
            "gpu": "H100"
        },
        "search_strategy": {
            "batch_size": 16,
            "max_trials": 32,
            "model_architecture": "meta-llama/Llama-3.2-1B",
            "modifier": {
                "heuristics": [
                    "default_filters",
                    "max_batchsize_by_kvcache",
                    "limit_gpus_to_single_node"
                ]
            },
            "name": "heuristic",
            "optimization_objective": "throughput"
        },
        "trtllm_build": {
            "dtype": [ "float16", "bfloat16" ],
            "gather_all_token_logits": false,
            "gather_context_logits": false,
            "gather_generation_logits": false,
            "gemm_plugin": [ "auto", "disable" ],
            "max_num_tokens": [ 512, 1024, 8192, 16384, 24576 ],
            "multidevice": [ {
                    "pp_size": 1, "tp_size": 1
                }, {
                    "pp_size": 1, "tp_size": 2
                }, {
                    "pp_size": 1, "tp_size": 4
                }, {
                    "pp_size": 1, "tp_size": 8
                }
            ],
            "quantization": [ {
                    "kv_cache_dtype": null,
                    "qformat": "fp8"
                }, {
                    "kv_cache_dtype": null,
                    "qformat": "full_prec"
                }
            ],
            "recipe_name": "trtllm_trtcloud",
            "trtllm_model_family": "llama",
            "trtllm_version": "0.17.0",
            "use_paged_context_fmha": true,
            "user_buffer": [ true, false ]
        },
        "benchmark": {
            "dynamic_runtime_config": {
                "cuda_graph_cache_size": 1,
                "cuda_graph_mode": true,
                "enable_chunked_context": true
            },
            "perf_configs": [
                {
                    "requests_config": {
                        "concurrency": 5,
                        "input_tokens_mean": 1000,
                        "max_num_request": 2000,
                        "output_tokens_mean": 1000,
                        "request_ratio": 3
                    }
                }
            ]
        }
    }
}

Sweep Search Space#

The search space to sweep over can be thought of as a filtered Cartesian product of most list items in the trtllm_build portion of the sweep configuration. For example, to sweep over max_seq_len and max_batch_size, modify the sweep configuration as follows:

{
    "sweep_config": {
        ...
        "trtllm_build": {
            "max_seq_len": [4096, 8192],
            "max_batch_size": [32, 64, 128]
        },
        ...
    }
}

Any TensorRT-LLM flag that you would like to sweep over can be specified as a list with the options you think are worth considering based on your load profile.
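Scalar values stay fixed across trials, while lists expand the search space. As a sketch, the following (using flags from the schema example above) fixes one flag and sweeps another:

{
    "sweep_config": {
        ...
        "trtllm_build": {
            "use_paged_context_fmha": true,
            "gemm_plugin": ["auto", "disable"]
        },
        ...
    }
}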

The maximum number of trials is, ideally, the size of the Cartesian space, though it might be pared down internally to remove infeasible combinations (for example, if the requested GPU type does not support certain features or does not have enough memory). However, since this number can be very large, it can also be limited manually as follows:

{
    "sweep_config": {
        ...
        "search_strategy": {
            "max_trials": N,
            "batch_size": M
        },
        ...
    }
}

Note

  • Specifying max_trials is optional. If not provided, the default is 64.

  • TensorRT-Cloud runs trials in batches. The batch_size parameter defines how many trials are in one batch. If not provided, batch_size equals max_trials, which means all trials run in one batch.

  • There is an execution timeout of 3 days for sweeps, no matter the number of trials.
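For instance, with the following search_strategy (matching the schema example above), a 32-trial sweep runs as 2 batches of 16 trials each:

{
    "sweep_config": {
        ...
        "search_strategy": {
            "max_trials": 32,
            "batch_size": 16
        },
        ...
    }
}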

Sweep Overview#

TensorRT-Cloud provides a Generate sweep overview prompt that shows, before the sweep proceeds, how many trials and batches will run and how the search space is filtered. For example:

$ trt-cloud sweep ...
Generate sweep overview (could take a few minutes)? [y/N]: y
[I] Unfiltered search space: 320 configs.
Filtered search space: 240 configs left.
    80 configs filtered out: `user_buffer` is only supported for llama model family and fp8
Total number of trials to run: 32 (2 batches with batch size of 16).
WARNING: Sweep runs 10+ parallel trials, consider reducing `batch_size`

...

This is an optional feature that does not affect the sweep. If skipped at the prompt, the sweep still computes the same overview internally before running the trials.

Limitation: For a very large search space, generating the sweep overview can take a long time and even time out. In this case, skip generating the overview.

Input Sources for Sweeping#

Sweeps accept a variety of different input types sourced in different ways.

Hugging Face Repository#

Hugging Face is a popular mechanism for sourcing models. A Hugging Face model can be specified as follows:

Example CLI command:

--src-hf-repo <HF_REPO_PATH> --src-token <READ_ONLY_HF_TOKEN>

Express this in the sweep configuration file as follows:

{
    "sweep_config": {
        "build_inputs": [
            {
                "type": "huggingface_checkpoint",
                "source": {
                    "source_type": "huggingface_repo",
                    "id": "google/gemma-2b-it",
                    "revision": "main",
                    "token": "<READ_ONLY_HF_TOKEN>"
                }
            }
        ],
        ...
    }
}

NGC Private Registry#

The same Hugging Face checkpoint can also be provided via the NGC Private Registry. Refer to the NGC Private Registry User Guide for how to create and upload an NGC private registry resource.

Example CLI command:

--src-ngc=<org>/[<team>/]<resource-name[:version]> --src-type="huggingface_checkpoint" --src-token=<NVAPI_TOKEN> --model-family=<MODEL_FAMILY>

Express this in the sweep configuration file as follows:

{
    "sweep_config": {
        "build_inputs": [
            {
                "type": "huggingface_checkpoint", # or another type
                "source": {
                    "source_type": "ngc_private_registry",
                    "artifact_type": "model",  # or "resource"
                    "target": <org>/[<team>/]<resource-name[:version]>,
                    "nvapi_token": "<YOUR_NVAPI_TOKEN>"
                }
            }
        ],
        ...
        "trtllm_build": {
            "trtllm_model_family": "<MODEL_FAMILY>"
        }
    }
}

For models supplied using the NGC private registry, the model family must be explicitly provided since it cannot be easily inferred.

Note

The nvapi_token must match the org/team specified in target and have the Private Registry role.

Pre-signed URL#

The checkpoint can also be provided as a URL (for example, a pre-signed S3 URL).

To keep access limited in scope and time, follow these best practices when generating a pre-signed URL. The artifact must be a ZIP archive of the Hugging Face checkpoint.

Example CLI command:

--src-url <PRESIGNED_URL> --src-type "huggingface_checkpoint" --model-family=<MODEL_FAMILY>

Express this in the sweep configuration file as follows:

{
    "sweep_config": {
        "build_inputs": [
            {
                "type": "huggingface_checkpoint",
                "source": {
                    "source_type": "url",
                    "url": <URL>
                }
            }
        ],
        ...
        "trtllm_build": {
            "trtllm_model_family": "<MODEL_FAMILY>"
        }
    }
}

For models supplied using a pre-signed URL, the model family must be explicitly provided since it cannot be easily inferred.

Local File Path#

Checkpoints can also be provided from a local copy of the checkpoint directory. This is useful for quickly iterating on local weights.

Example CLI command:

--src-path <LOCAL_PATH> --src-type "huggingface_checkpoint" --model-family=<MODEL_FAMILY>

Note

For models supplied using a local file path, you cannot specify the local path directly in a sweep config file. Instead, the CLI uploads your local files as an NGC private registry checkpoint and updates the config file to reference that upload. The CLI tracks changes to your local files using a hash stored locally on your system to decide whether a new upload is needed.

Input Types#

Multiple input types are supported for sweeping.

Hugging Face Checkpoint#

As discussed previously, an input to a sweep can be a Hugging Face checkpoint.

TensorRT-LLM Checkpoint#

Sweeping from a TensorRT-LLM checkpoint can be useful in the following scenarios:

  • The target GPU is too small to accommodate the original model weights, but can fit them once they are quantized.

    • Quantize on a larger GPU and bring the quantized checkpoint to the target GPU.

  • The engine must be built for custom or pre-quantized weights.

This can be achieved by first building the quantized checkpoint as described in Building a TensorRT-LLM Quantized Checkpoint, and then using the generated checkpoint as the source of your sweep.

Assuming the quantization outputs are uploaded to the NGC Private Registry, the CLI command will look similar to the following:

--src-ngc=<org>/[<team>/]<resource-name[:version]> --src-type="trtllm_checkpoint" --src-token=<NVAPI_TOKEN> --model-family=<MODEL_FAMILY> --tokenizer-hf-repo=<HF_REPO_PATH>

The build_inputs section of the sweep configuration can also be modified directly as follows:

{
    "sweep_config": {
        "build_inputs": [
            {
                "type": "trtllm_checkpoint",
                "source": {
                    "source_type": "ngc_private_registry",
                    "artifact_type": "model",  # or "resource"
                    "target": <org>/[<team>/]<model-name>[:version],
                    "nvapi_token": "<NVAPI_TOKEN>"
                }
            },
            {
                "type": "tokenizer",
                "source": {
                    "source_type": "huggingface_repo",
                    "id": "google/gemma-2b-it",
                    "token": "<HF_TOKEN>",
                    "allowed_extensions": [".json"]
                }
            }
        ],
        ...
        "trtllm_build": {
            "trtllm_model_family": "<MODEL_FAMILY>"
        }
    }
}

Note

  • A tokenizer must also be provided as an input, since it is required for certain benchmarking steps and is not captured as part of the TensorRT-LLM quantized checkpoint.

  • The allowed_extensions field (for example, [".json"]) is an optimization that restricts the build to downloading only the matching files from the Hugging Face repository. For many models, the tokenizer information is encapsulated in the JSON files.

  • The other input sources detailed in the Input Sources for Sweeping section can be used as well.

Hugging Face Weightless Checkpoint#

There are scenarios where you may not want to use the model's actual weights for building engines:

  1. IP protection.

  2. Avoiding the time spent downloading large model weights.

In such situations, TensorRT-Cloud supports building and performance benchmarking with randomly initialized weights.

Example CLI command:

--src-hf-repo <HF_REPO_PATH> --src-token <READ_ONLY_HF_TOKEN> --weightless

This can also be done by switching the type in the previous examples from huggingface_checkpoint to huggingface_checkpoint_weightless in the sweep config.

{
    "sweep_config": {
        "build_inputs": [
            {
                "type": "huggingface_checkpoint_weightless",
                "source": {
                    "source_type": "huggingface_repo",
                    "id": "google/gemma-2b-it",
                    "revision": "main",
                    "token": "<YOUR_HF_TOKEN>"
                }
            }
        ],
        ...
    }
}

Limitations:

  • This is only supported when huggingface_repo is the source_type.

  • Only models of type transformers.AutoModelForCausalLM are currently supported for building with randomly initialized weights. Refer to Auto Classes for more information.

GPU Type and Parallelism#

The GPU name and the number of GPUs to use for a sweep form a sweepable pair of parameters.

They can be specified via the CLI:

--gpu=<GPU_NAME> --tp-size=<TP_SIZE> --pp-size=<PP_SIZE>

To customize further, they can also be specified in the sweep configuration JSON file:

{
    "sweep_config": {
        "build_inputs": [...],
        "trtllm_build": {
            "multidevice": {"tp_size": 2, "pp_size": 1}
        },
        ...
    }
}

The product of tp_size and pp_size is the number of GPUs used for a particular build; for example, {"tp_size": 2, "pp_size": 1} uses two GPUs. To sweep over different combinations of tp_size and pp_size, the multidevice field in the example above can be a list instead:

{
    "sweep_config": {
        "build_inputs": [...],
        "trtllm_build": {
            "multidevice": [
                {"tp_size": 1, "pp_size": 1},
                {"tp_size": 2, "pp_size": 1},
                ...
            ],
        },
        ...
    }
}

The GPU on which to execute the sweep can be defined as:

{
    "sweep_config": {
        "build_inputs": [...],
        "trtllm_build": {...},
        "hardware": {
            "gpu": "H100"
        }
    },
    ...
}

Note

  • In the above example, multidevice is a sweepable parameter.

  • The hardware portion is a non-sweepable parameter.

  • The available GPU types can be listed using trt-cloud info. For more information, refer to the Supported Models, Hardware, and OS section.

Quantization#

Quantization allows you to run at lower precision to improve performance. Sweeping over quantization lets you profile the performance versus accuracy trade-off across the different quantization profiles.

The quantization parameters can be defined as:

{
    "sweep_config": {
        "build_inputs": [...],
        "trtllm_build": {
            "quantization": [
                {"qformat": "full_prec"},
                {"qformat": "fp8", "kv_cache_dtype": "fp8"},
                {"qformat": "int8_sq", "kv_cache_dtype": "int8"},
                ...
            ],
            ...
        },
        ...
    }
}

For more information, refer to the Quantization Modes Supported by Model Family section.

Specifying Benchmark Options#

There are two portions to how TensorRT-LLM builds are benchmarked during a sweep: the dynamic and static portions.

The dynamic portion is considered part of the search space and can therefore be swept over. It includes parameters that affect the runtime behavior of an engine; the fraction of GPU memory reserved for the KV cache and the scheduler policy are two examples.

The static portion defines what runtime parameters to use when running performance benchmarks.

{
    "sweep_config": {
        "build_inputs": [...],
        "trtllm_build": {...},
        "benchmark": {
            # This is the static component. There can be multiple items to request
            # multiple performance benchmarks for each build (for example, different
            # concurrency profiles).
            "perf_configs": [
                {
                    "requests_config": {
                        "concurrency": 500,
                        "input_tokens_mean": 1000,
                        "output_tokens_mean": 1000
                    }
                }
            ],
            "dynamic_runtime_config": {
                # The below means the scheduler policy will be part of the search space
                # with these two values.
                "scheduler_policy": ["guaranteed_no_evict", "max_utilization"],
            }
        }
    }
}

The objective to optimize for can be specified in the search_strategy part of the configuration:

{
    "sweep_config": {
        "build_inputs": [...],
        "trtllm_build": {...},
        "search_strategy": {
            # Can be either "throughput" or "latency".
            "optimization_objective": "throughput"
        }
    }
}

Querying Sweep Status#

The trt-cloud sweep status command allows a user to query the status of a sweep. The -w/--watch flag periodically polls the status for the user.

$ trt-cloud sweep status <sweep_uuid>
┌──────────────────────────────────────────────────────────────────────────────
│ User Command: sweep status <sweep_uuid>
│ Completed - 14 min 37 sec
│ Total Trials: 32, Start: 2025-03-14T02:45:24, End: 2025-03-14T03:00:01, Tags:
└──────────────────────────────────────────────────────────────────────────────
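
To poll the status periodically instead of querying once, add the watch flag:

$ trt-cloud sweep status <sweep_uuid> --watch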

The --show-trials flag allows users to see the status of individual builds within a sweep.

$ trt-cloud sweep status <sweep_uuid> --show-trials
┌──────────────────────────────────────────────────────────────────────────────
│ User Command: sweep status <sweep_uuid> --show-trials
│ Completed - 14 min 37 sec
│ Total Trials: 32, Start: 2025-03-14T02:45:24, End: 2025-03-14T03:00:01, Tags:
│     Trial 0 - DONE
│     Trial 1 - DONE
│     [... skipped for brevity ...]
│     Trial 31 - DONE
└──────────────────────────────────────────────────────────────────────────────

You can query a given trial's status using the -i/--trial-id option. After that particular trial starts being processed, the last few log lines of the trial are printed to the console:

$ trt-cloud sweep status <sweep_uuid> --trial-id 0
┌──────────────────────────────────────────────────────────────────────────────
│ User Command: sweep status <sweep_uuid> --trial-id 0
│ Completed - 12 min 37 sec
│ Total Trials: 2, Start: 2025-03-14T02:45:24, End: 2025-03-14T02:45:24, Tags:
│     Trial 0 - DONE - 2.8 min
├──────────────────────────────────────────────────────────────────────────────
│ Latest 5 lines of trial log:
│    2025-03-14T02:53:27.472413 [INF...] p99_request_queueing_latency(ms) 22.62
│    2025-03-14T02:53:27.472487 [INF...K] p90_request_queueing_latency(ms) 6.32
│    2025-03-14T02:53:27.472561 [INF...K] p50_request_queueing_latency(ms) 3.61
│    2025-03-14T02:53:27.472632 [INFO] [TRT Cloud]
│    2025-03-14T02:53:27.472704 [INF...M][INFO] Refreshed the MPI local session
└──────────────────────────────────────────────────────────────────────────────

Limitation: --show-trials has no effect if --trial-id is also provided.

Cancelling a Sweep#

The trt-cloud sweep cancel command allows users to cancel a running sweep. For example:

$ trt-cloud sweep cancel <sweep_uuid>
[I] Cancelled sweep session with sweep_id: <sweep_uuid>

The sweep terminates immediately on cancellation, and its status is updated accordingly:

$ trt-cloud sweep status <sweep_uuid>
┌──────────────────────────────────────────────────────────────────────────────
│ User Command: sweep status <sweep_uuid>
│ Cancelled
│ Total Trials: 0, Start: <start_time>, End: <time_of_cancellation>, Tags: []
└──────────────────────────────────────────────────────────────────────────────

Obtaining Sweep Results#

The trt-cloud sweep results command allows users to obtain the sweep’s benchmark results.

$ trt-cloud sweep results <sweep_uuid>
[I] Benchmark results (valid for 1:59:59): <pre-signed S3 URL>

The pre-signed URL printed to the console can be used to download a ZIP archive containing the results of the sweep. Each trial in the sweep has a corresponding directory called trial_i, where i is between 0 and N. The trial directory contains the following information:

  • The exact configuration corresponding to this build in a trtc_config.json file.

  • Logs for the various stages involved in building the engine and benchmarking it under the build_result directory.

  • Performance metrics in metrics.json under the build_result directory.

In the event of an unsuccessful build, we do our best to populate each build subdirectory with the error message encountered during the build process.

In addition, two summary files are generated: one CSV file and one HTML report. The summaries include a row for each trial containing the status, the build and benchmark times, the varying sweep parameters, the performance metrics, and a command to build the engine of that trial.

The produced archive has the following structure:

<sweep_uuid>/
├── summary.csv
├── summary.html
├── trial_0/
│   ├── trtc_config.json
│   ├── build_result/
│   │   ├── trt_cloud.log
│   │   ├── metrics.json
│   │   ├── ...
├── trial_1/
│   ├── ...
├── trial_2/
│   ├── ...
├── ...
├── trial_N/
│   ├── ...
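
For example, after downloading the archive (the file name sweep_results.zip here is hypothetical), a trial's configuration and metrics can be inspected with standard tools:

$ unzip sweep_results.zip -d sweep_results
$ cat sweep_results/<sweep_uuid>/trial_0/trtc_config.json
$ cat sweep_results/<sweep_uuid>/trial_0/build_result/metrics.json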

Building the Optimal Engine for a Sweep#

After a sweep has completed, the optimal engine for it can be built using the trt-cloud sweep build command.

$ trt-cloud sweep build <sweep_uuid>
Do you want to proceed? [y/N]: y
[I] Building optimized engine for sweep started. Build ID: <build_uuid>.
[I] trt-cloud build status <build_uuid>

The optimal engine is built based on the provided optimization_objective.

A new build_uuid is generated and can be used with the various trt-cloud build commands.

To use this build for inference, refer to the NVIDIA TensorRT-LLM Quick Start Guide.

To build the engine of a specific trial, use the -i/--trial-id option. For example:

$ trt-cloud sweep build <sweep_uuid> --trial-id <trial-id>
Do you want to proceed? [y/N]: y
[I] Building engine for trial <trial-id> started. Build ID: <build_uuid>.
[I] trt-cloud build status <build_uuid>

Retrying Trials#

The trt-cloud sweep retry command allows users to retry failed, cancelled, or completed trials within a sweep. Run without additional flags, it retries all failed trials in the sweep.

$ trt-cloud sweep retry <sweep_uuid>
Do you want to proceed? [y/N]: y
[I] Retrying all failed builds for sweep_id: <sweep_uuid> started.

To retry specific failed, cancelled, or completed trials, provide a comma-separated list of trial IDs with the -i/--trial-id flag.

$ trt-cloud sweep retry <sweep_uuid> --trial-id 6,15
Do you want to proceed? [y/N]: y
[I] Retrying builds [6, 15] for sweep_id: <sweep_uuid> started.

This can be particularly useful if there are intermittent failures in a sweep.

Listing Sweeps#

The list of sweeps for the NGC organization can be obtained using the trt-cloud sweep list command.

Sweeps specific to the current user can be requested using the --me option. By default, only the 25 most recent sweeps are shown. This number can be modified using the --limit option.

Alternatively, a time range can be supplied to filter results with finer control. If only --since is provided, all sweeps from that time until the current time are shown. Additionally, --until can be provided along with --since to show only sweeps between the two times (inclusive).
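
For example (a sketch; the accepted timestamp format is an assumption):

$ trt-cloud sweep list --me --limit 10
$ trt-cloud sweep list --since 2025-03-01 --until 2025-03-14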