core.inference.model_inference_wrappers.inference_wrapper_config#
Module Contents#
Classes#
InferenceWrapperConfig: Config for the model inference wrapper
API#
- class core.inference.model_inference_wrappers.inference_wrapper_config.InferenceWrapperConfig#
Config for the model inference wrapper
NOTE: All the arguments here are obtained from the arguments.py file
- hidden_size: int#
None
Hidden size of the model. Activations received between pipeline-parallel (PP) stages have shape [seq_len, batch_size, hidden_size]
- params_dtype: torch.dtype#
None
Can be torch.float, or torch.half if --fp16 is used, or torch.bfloat16 if --bf16 is used
- inference_batch_times_seqlen_threshold: int#
None
If (batch-size * sequence-length) is smaller than this threshold, the batch will not be pipelined.
- padded_vocab_size: int#
None
The final padded vocab size (padded to make it divisible by the --make-vocab-size-divisible-by value)
- inference_max_requests: int#
8
Maximum number of requests for inference (prefill & decode). Necessary for CUDA graphs.
- inference_max_seq_length: int#
2560
Maximum sequence length for inference (prefill & decode). Necessary for CUDA graphs.
- fp32_residual_connection: bool#
False
Move residual connections to fp32. Obtained from arguments.py
- nccl_all_reduce_for_prefill: bool#
False
When using symmetric all-reduce kernels, keep the default NCCL all-reduces for prefill. This can be more efficient for large prefill sizes.
- fp8: Optional[str]#
None
If set, enables the use of FP8 precision through Transformer Engine. There are two predefined choices: (1) 'e4m3' uniformly uses e4m3 for all FP8 tensors, (2) 'hybrid' uses e4m3 for all FP8 activation and weight tensors and e5m2 for all FP8 output activation gradient tensors.
- moe_pad_experts_for_cuda_graph_inference: bool#
False
Some MoE routers have a D2H sync that will break CUDA graphs. If this flag is set, the router switches to dropping and padding during decode time, which does not have a D2H sync. The capacity factor is set to the maximum an expert could see during inference, so no tokens are actually dropped.
- add_attributes(attribute_value_pair: dict)#
Utility to add more attributes to inference params
Use this method to pass in a custom dictionary that adds more configs to an existing instance. Use as follows: c = InferenceWrapperConfig(...); c.add_attributes({'precision': 'fp32'})
- Parameters:
attribute_value_pair (dict) – A dictionary containing attributes as the key names and corresponding values.
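To illustrate how add_attributes extends an instance, here is a minimal, self-contained sketch. The dataclass below is a hypothetical stand-in carrying only a subset of the fields documented above (the real class also takes a torch.dtype for params_dtype, among others), and the setattr-based body is a plausible implementation of the described behavior, not the verified source:

```python
from dataclasses import dataclass


@dataclass
class InferenceWrapperConfig:
    """Illustrative stand-in for the real config class (subset of fields)."""

    hidden_size: int
    padded_vocab_size: int
    inference_max_requests: int = 8
    inference_max_seq_length: int = 2560

    def add_attributes(self, attribute_value_pair: dict):
        """Attach each key/value pair in the dict as a new attribute."""
        for key, value in attribute_value_pair.items():
            setattr(self, key, value)


# Usage mirrors the docstring above:
config = InferenceWrapperConfig(hidden_size=4096, padded_vocab_size=128256)
config.add_attributes({"precision": "fp32"})
print(config.precision)  # fp32
```

Because the stand-in is a dataclass, the documented defaults (8 requests, 2560 sequence length) apply automatically, while add_attributes lets callers bolt on extra settings after construction.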