recipe.h

Functions handling FP8 recipes.

Functions

void nvte_delayed_scaling_recipe_amax_and_scale_update(const NVTETensor amax_history, const NVTETensor scale, NVTETensor updated_amax_history, NVTETensor updated_scale, const char *amax_compute_algo, NVTEDType fp8_dtype, float margin, cudaStream_t stream)

Update FP8 scaling factors with delayed scaling recipe.

The amax history is rotated by -1 (e.g. the first entry shifts to the last, the last entry shifts to the second to last) and the first entry is set to zero. The scaling factor is estimated so the FP8 tensor’s maximum absolute value is \( 2^{-\text{margin}} \text{max}_\text{fp8\_dtype} \).

Parameters:

amax_history – [in] History of maximum absolute values. Shape: [history_length, num_scales]
scale – [in] Scaling factor for casting to FP8. Shape: [num_scales]
updated_amax_history – [out] Updated history of maximum absolute values. Shape: [history_length, num_scales]
updated_scale – [out] Updated scaling factor for casting to FP8. Shape: [num_scales]
amax_compute_algo – [in] Method to reduce amax history. Options are "max" and "most_recent".
fp8_dtype – [in] FP8 datatype.
margin – [in] Scaling factor margin.
stream – [in] CUDA stream.
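
The rotation and scale rules described above can be sketched as a small CPU reference. This is only an illustration under stated assumptions, not the library's kernel: the function name `reference_amax_and_scale_update`, the `DelayedScalingResult` struct, and the use of `std::vector<float>` in row-major [history_length, num_scales] layout are hypothetical. The caller passes the largest representable FP8 value directly (448 for E4M3, 57344 for E5M2) instead of an `NVTEDType`. Keeping the previous scale when the reduced amax is zero or non-finite is also an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CPU reference, not part of recipe.h.
struct DelayedScalingResult {
  std::vector<float> amax_history;  // row-major [history_length, num_scales]
  std::vector<float> scale;         // [num_scales]
};

DelayedScalingResult reference_amax_and_scale_update(
    const std::vector<float>& amax_history, const std::vector<float>& scale,
    std::size_t history_length, std::size_t num_scales,
    const std::string& amax_compute_algo, float fp8_max, float margin) {
  DelayedScalingResult out{std::vector<float>(amax_history.size()), scale};
  // Target maximum absolute value of the FP8 tensor: 2^-margin * fp8_max.
  const float target = fp8_max * std::exp2(-margin);

  for (std::size_t col = 0; col < num_scales; ++col) {
    // Reduce the original history: row 0 holds the most recent amax.
    float amax = amax_history[col];
    if (amax_compute_algo == "max") {
      for (std::size_t row = 1; row < history_length; ++row) {
        amax = std::max(amax, amax_history[row * num_scales + col]);
      }
    }

    // Rotate by -1: entry i+1 moves to i, and the first entry moves to the end.
    for (std::size_t row = 0; row < history_length; ++row) {
      const std::size_t src = (row + 1) % history_length;
      out.amax_history[row * num_scales + col] = amax_history[src * num_scales + col];
    }
    // The first entry is then cleared for the next step.
    out.amax_history[col] = 0.0f;

    // Assumption: keep the old scale if amax is zero or non-finite.
    if (std::isfinite(amax) && amax > 0.0f) {
      out.scale[col] = target / amax;
    }
  }
  return out;
}
```

For example, with a history of [[1, 8], [4, 2], [2, 16]], `fp8_max = 448`, `margin = 0`, and the "max" algorithm, the sketch yields scales 448/4 = 112 and 448/16 = 28, and the updated history is [[0, 0], [2, 16], [1, 8]].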

void nvte_delayed_scaling_recipe_amax_and_scale_update_after_reduction(const NVTETensor amax_reduction_buffer, std::vector<NVTETensor> amax_histories, std::vector<NVTETensor> scales, const char *amax_compute_algo, NVTEDType fp8_dtype, float margin, cudaStream_t stream)

Bulk-update FP8 scaling factors with delayed scaling recipe after amax reduction.

The operations performed are: updating the most recent amax history entry with the relevant segment of the global reduction buffer if it is not 0, rotating the amax history according to the rule below, and updating the scales.

The amax history is rotated by -1 (e.g. the first entry shifts to the last, the last entry shifts to the second to last) and the first entry is set to zero. The scaling factor is estimated so the FP8 tensor’s maximum absolute value is \( 2^{-\text{margin}} \text{max}_\text{fp8\_dtype} \).

Parameters:

amax_reduction_buffer – [in] The contiguous buffer used for amax reduction. Shape: [num_scales * num_tensors]
amax_histories – [inout] List of histories of maximum absolute values (amax), one per tensor. Shape: num_tensors x [history_length, num_scales]
scales – [inout] List of scaling factors for casting to FP8. Shape: num_tensors x [num_scales]
amax_compute_algo – [in] Method to reduce amax history. Options are "max" and "most_recent".
fp8_dtype – [in] FP8 datatype.
margin – [in] Scaling factor margin.
stream – [in] CUDA stream.
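
The bulk update can be sketched on the CPU in the same spirit. This is a hedged illustration rather than the library's implementation: `Fp8Meta` and `reference_bulk_update_after_reduction` are hypothetical names, and the histories and scales are modelled as `std::vector<float>` updated in place. Three points are assumptions: tensor `t` reads its segment at offset `t * num_scales` in the reduction buffer, "if it's not 0" is interpreted per element, and the previous scale is kept when the reduced amax is zero or non-finite.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical CPU reference, not part of recipe.h.
struct Fp8Meta {
  std::vector<float> amax_history;  // row-major [history_length, num_scales]
  std::vector<float> scale;         // [num_scales]
};

void reference_bulk_update_after_reduction(
    const std::vector<float>& amax_reduction_buffer, std::vector<Fp8Meta>& tensors,
    std::size_t history_length, std::size_t num_scales,
    const std::string& amax_compute_algo, float fp8_max, float margin) {
  // Target maximum absolute value of the FP8 tensor: 2^-margin * fp8_max.
  const float target = fp8_max * std::exp2(-margin);

  for (std::size_t t = 0; t < tensors.size(); ++t) {
    std::vector<float>& hist = tensors[t].amax_history;
    for (std::size_t col = 0; col < num_scales; ++col) {
      // Assumption: segment t starts at t * num_scales; a zero leaves row 0 as is.
      const float reduced = amax_reduction_buffer[t * num_scales + col];
      if (reduced != 0.0f) hist[col] = reduced;

      // Reduce the history: row 0 holds the most recent amax.
      float amax = hist[col];
      if (amax_compute_algo == "max") {
        for (std::size_t row = 1; row < history_length; ++row) {
          amax = std::max(amax, hist[row * num_scales + col]);
        }
      }

      // Rotate by -1 in place, then clear the first entry.
      const float most_recent = hist[col];
      for (std::size_t row = 0; row + 1 < history_length; ++row) {
        hist[row * num_scales + col] = hist[(row + 1) * num_scales + col];
      }
      hist[(history_length - 1) * num_scales + col] = most_recent;
      hist[col] = 0.0f;

      // Assumption: keep the old scale if amax is zero or non-finite.
      if (std::isfinite(amax) && amax > 0.0f) {
        tensors[t].scale[col] = target / amax;
      }
    }
  }
}
```

With two tensors whose histories are [1, 4] and [2, 16] (history_length 2, num_scales 1), a reduction buffer of [8, 0], `fp8_max = 448`, `margin = 0`, and the "max" algorithm, the first tensor takes the reduced value 8 (scale 56, history [0, 8]), while the second keeps its local amax (scale 28, history [0, 2]).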