CPU Instruction Extensions#
AVX2 and AVX-512#
The libraries SIMDe and SSE2NEON are designed as drop-in replacements for x86_64 SIMD instruction set extensions like SSE and AVX, through emulation via ARM NEON and SVE. These libraries provide a good starting point for porting existing x86_64 SIMD code to ARM.
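For example, a minimal sketch of the SIMDe drop-in approach (the macro and header path follow SIMDe's documented usage; the wrapper function name is ours):
// Keep the original _mm* intrinsic names so existing AVX2 code compiles unchanged.
#define SIMDE_ENABLE_NATIVE_ALIASES
#include <simde/x86/avx2.h>

// Existing AVX2 code now builds for AArch64, emulated via NEON/SVE:
static inline __m256i add8_epi32(__m256i a, __m256i b)
{
    return _mm256_add_epi32(a, b);
}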
With the newer ARM extensions SVE and SVE2, many x86_64 intrinsics that have no direct NEON counterpart and therefore require emulation via multiple instructions now have straightforward replacements. We recommend porting them directly to SVE. This applies mainly to gather, scatter, predicated, and SM4 encryption instructions.
ARM v9.2 also supports the optional bit-permute extension FEAT_SVE_BitPerm, which introduces the BEXT, BDEP, and BGRP instructions.
BDEP is an SVE2 alternative to PDEP from the Intel BMI2 extension.
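As a quick illustration (a sketch assuming FEAT_SVE_BitPerm is available; the function name is ours), BDEP can expand a packed bitmask into a byte mask, much like PDEP does for scalars:
#include <arm_sve.h>

// Deposit bit i of each 64-bit lane into bit position 8*i, turning the low
// eight bits of every lane into eight 0x00/0x01 bytes (requires SVE2 BitPerm).
svuint64_t expand_bits_to_bytes(svuint64_t packed_bits)
{
    return svbdep_n_u64(packed_bits, 0x0101010101010101ULL);
}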
When using SVE, you can choose between a variable and a fixed vector length, and therefore register size. If you already use ASIMD/NEON or libraries such as SIMDe or SSE2NEON, add SVE code on a case-by-case basis and fix the vector length to 128 bits for compatibility and performance. Mixing NEON and SVE incurs no additional overhead.
SVE supports both vector-length agnostic (VLA) and vector-length specific (VLS) programming. VLA is portable, so it works on hardware with any vector register bit-width. VLS sacrifices this portability to make the vector width known to the compiler and therefore enables further optimizations.
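For illustration, a minimal VLA loop (a sketch, not taken from the original code; the function name is ours) uses predication so that the same binary runs on any vector length:
#include <stdint.h>
#include <arm_sve.h>

// Vector-length agnostic element-wise add: svcntw() lanes per iteration,
// with svwhilelt producing the partial predicate for the final iteration.
void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, int64_t n)
{
    for (int64_t i = 0; i < n; i += svcntw())
    {
        svbool_t pg = svwhilelt_b32_s64(i, n);
        svint32_t va = svld1_s32(pg, a + i);
        svint32_t vb = svld1_s32(pg, b + i);
        svst1_s32(pg, dst + i, svadd_s32_x(pg, va, vb));
    }
}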
VLA: Scalable Vectors#
The vector length is not fixed by default; this default can also be requested explicitly via the compiler flag -msve-vector-bits=scalable.
Leaving the vector length unknown at compile time can degrade performance, because:
The compiler may not unroll the loop.
Partial use of the SVE register requires appropriate predication.
Run-time checks are required to map AVX types and operations onto SVE.
For example, C and C++ do not allow data structs to contain sizeless types such as svuint32x2_t.
These run-time checks also reduce instruction-level parallelism (ILP) and out-of-order (OoO) execution capacity.
#ifndef __ARM_FEATURE_SVE_BITS
// Vector length unknown at compile time: map the 256-bit AVX register onto a
// tuple of two scalable vectors (sizeless types cannot be struct members).
typedef svint32x2_t __m256i;

__m256i _mm256_i32gather_epi32(const int* base_addr, __m256i vindex, const int scale)
{
    // Gather the lanes that fit into the first tuple element; the byte-granule
    // predicate covers at most 32 bytes (eight 32-bit lanes).
    // scale is already a byte multiplier (1, 2, 4, or 8), as in the AVX2 intrinsic.
    __m256i result = svcreate2_s32(
        svld1_gather_s32offset_s32(
            svwhilelt_b8_s32(0, 32),
            base_addr,
            svmul_n_s32_x(svptrue_b32(), svget2_s32(vindex, 0), scale)),
        svdup_n_s32(0));
    // Run-time check: with a vector length below 256 bits, the remaining lanes
    // live in the second tuple element and need a second gather.
    if (svcntb() < 32) {
        result = svset2_s32(
            result,
            1,
            svld1_gather_s32offset_s32(
                svwhilelt_b8_s32(16, 32),
                base_addr,
                svmul_n_s32_x(svptrue_b32(), svget2_s32(vindex, 1), scale)));
    }
    return result;
}
#endif
VLS: Fixed-Length Vectors#
You can fix the vector length with the compiler flag -msve-vector-bits=<N>.
The pre-processor macro __ARM_FEATURE_SVE_BITS is then set to the specified value.
Otherwise, it is not defined.
Passing 128 makes the code compatible with all ARM SVE-enabled cores and allows the compiler to incorporate this knowledge into optimization decisions.
The following sections describe SVE implementations of common AVX2 and AVX-512 intrinsics.
AVX2 Gather and Scatter#
The gather and scatter intrinsics of the format _mm[register size]_[index data type]gather_[data type] should be changed
to a pair of consecutive svld1_gather_[offset data type]offset_[data type] calls (or svst1_scatter_[offset data type]offset_[data type] for scatter), one per 128-bit fragment.
If the scaling factor is known at compile time and equals sizeof(data_type), the multiply can be removed and the index form, svld1_gather_[index data type]index_[data type], used instead.
The svptrue_b[element size] intrinsic should be passed as the svbool_t pg argument (see the code snippet below):
#if __ARM_FEATURE_SVE_BITS==128
typedef svint32_t __m128i __attribute__((arm_sve_vector_bits(128)));
typedef struct __m256i { __m128i val[2]; } __m256i;
typedef struct __m512i { __m128i val[4]; } __m512i;  // used by the AVX-512 examples below
#define AVX_REGSIZE_DQ (256/128)
#define AVX512_REGSIZE_DQ (512/128)

__m256i _mm256_i32gather_epi32(const int* base_addr, __m256i vindex, const int scale)
{
    __m256i result;
    svbool_t ptrue = svptrue_b32();
    for (int i = 0; i < AVX_REGSIZE_DQ; i++)
    { // compiler unrolls the loop
        // scale is already a byte multiplier (1, 2, 4, or 8), as in the AVX2 intrinsic
        result.val[i] = svld1_gather_s32offset_s32(
            ptrue,
            base_addr,
            svmul_n_s32_x(ptrue, vindex.val[i], scale));
    }
    return result;
}
#endif
AVX2 Masked Gather and Scatter#
Porting masked operations from AVX2 is straightforward; however, SVE requires an extra multiplication (to apply the scale factor), a compare operation (because AVX2 takes the mask from the most significant bit of each element of a vector register, while SVE uses predicate registers), and a select operation (to merge with the third vector, src). The same approach applies to porting scatter instructions.
__m256i _mm256_mask_i32gather_epi32(__m256i src, int const* base_addr, __m256i vindex, __m256i mask, const int scale)
{
    __m256i result;
    svbool_t ptrue = svptrue_b32();
    for (int i = 0; i < AVX_REGSIZE_DQ; i++)
    { // compiler unrolls the loop
        // AVX2 takes the mask from the most significant bit of each 32-bit element
        svbool_t pmask32 = svcmplt_n_s32(ptrue, mask.val[i], 0);
        result.val[i] = svsel_s32(
            pmask32,
            svld1_gather_s32offset_s32(
                pmask32,
                base_addr,
                svmul_n_s32_x(ptrue, vindex.val[i], scale)),
            src.val[i]);
    }
    return result;
}
AVX-512 __mmask and SVE svbool_t#
In an svbool_t, each predicate bit corresponds to one byte of the scalable vector register, whereas in the AVX-512 __mmask each bit corresponds to an entire lane.
For example, an svbool_t uses four predicate bits per dword (32-bit) lane, and only the least significant of the four is considered by a predicated operation.
Although BDEP (a vector operation), the alternative to the scalar x86_64 PDEP, was introduced with SVE2 (FEAT_SVE_BitPerm),
it only helps with converting the __mmask bitmask to a byte mask in the case where reinterpret_cast is not an option: the __mmask can be expanded to a byte mask and then turned into an svbool_t with an SVE compare operation.
The following code splits a __mmask16 into four svbool_t values, one for each of four 128-bit svint32_t vectors:
// Assumed supporting typedefs (not provided on AArch64; pred_t requires -msve-vector-bits=128):
typedef uint16_t __mmask16;
typedef svbool_t pred_t __attribute__((arm_sve_vector_bits(128)));

#if REINTERPRET_CAST_IS_NOT_ALLOWED
static svbool_t convert_mmask16_to_svbool8(__mmask16 k)
{
    uint64_t bit2bytemask = 0x0101010101010101;
    // Broadcast the 16-bit mask, then spread its two bytes across the register:
    // byte lanes 0-7 hold the low mask byte, lanes 8-15 hold the high mask byte.
    svuint8_t bitmask = svreinterpret_u8_u64(svdup_n_u64(k));
    svuint8_t bytemask = svreinterpret_u8_u64(svbdep_n_u64(
        svreinterpret_u64_u8(svtbl_u8(bitmask, svreinterpret_u8_u64(svdupq_n_u64(0, 1)))),
        bit2bytemask));
    // BDEP deposited mask bit j into byte lane j; compare against 1 to form the predicate.
    svbool_t pmaskb8 = svcmpeq_n_u8(svptrue_b8(), bytemask, 1);
    return pmaskb8;
}
#endif
static inline void convert_mmask16_to_svbool32(__mmask16 k, pred_t (&pmaskb32)[4])
{
#if REINTERPRET_CAST_IS_NOT_ALLOWED
    svbool_t pmaskb8 = convert_mmask16_to_svbool8(k);
#else
    // With 128-bit vectors, svbool_t holds exactly 16 predicate bits, matching __mmask16.
    svbool_t pmaskb8 = *reinterpret_cast<svbool_t*>(&k);
#endif
    // Widen the byte-granule predicate twice (b8 -> b16 -> b32), four bits of k per result.
    svbool_t pmaskb16lo = svunpklo_b(pmaskb8);
    pmaskb32[0] = svunpklo_b(pmaskb16lo);
    pmaskb32[1] = svunpkhi_b(pmaskb16lo);
    svbool_t pmaskb16hi = svunpkhi_b(pmaskb8);
    pmaskb32[2] = svunpklo_b(pmaskb16hi);
    pmaskb32[3] = svunpkhi_b(pmaskb16hi);
}
There is no intrinsic SVE MOV operation "because the ACLE intrinsic calls do not imply a particular register allocation and so the code generator must decide for itself when move instructions are required" (Arm C Language Extensions documentation).
AVX-512 Masked Gather and Scatter#
Porting the masked gather or scatter intrinsics from AVX-512 to SVE is similar to porting from AVX2, but mask register (k register) values such as __mmask16
must first be converted to per-fragment svbool_t predicates (using the convert_mmask16_to_svbool32 function above).
__m512i _mm512_mask_i32gather_epi32(__m512i src, __mmask16 k, __m512i vindex, int const* base_addr, const int scale)
{
    __m512i result;
    pred_t pmaskb32[4];
    convert_mmask16_to_svbool32(k, pmaskb32);
    svbool_t ptrue = svptrue_b32();
    for (int i = 0; i < AVX512_REGSIZE_DQ; i++)
    { // compiler unrolls the loop
        result.val[i] = svsel_s32(
            pmaskb32[i],
            svld1_gather_s32offset_s32(
                pmaskb32[i],
                base_addr,
                svmul_n_s32_x(ptrue, vindex.val[i], scale)),
            src.val[i]);
    }
    return result;
}
The reinterpret_cast is used because there are no SVE mov intrinsics, as described above.
AVX-512 Other Masked Operations#
Most masked AVX-512 operations translate directly to SVE. For example, intrinsics of the format _mm[register_size]_[mask/maskz]_[operation]_[data type] map directly to SVE alternatives,
because both AVX-512 and SVE provide predicate (mask) registers.
The porting snippet below applies to numerous arithmetic, logic, comparison, and bit-manipulation AVX-512 intrinsics with mask or maskz suffixes:
__m512i _mm512_maskz_add_epi32(__mmask16 k, __m512i a, __m512i b)
{
    __m512i result;
    pred_t pmaskb32[4];
    convert_mmask16_to_svbool32(k, pmaskb32);
    for (int i = 0; i < AVX512_REGSIZE_DQ; i++)
    { // compiler unrolls the loop
        // Zeroing predication (_z) matches the AVX-512 maskz semantics.
        result.val[i] = svadd_s32_z(pmaskb32[i], a.val[i], b.val[i]);
    }
    return result;
}
The same approach works for the mask-suffixed (merging) intrinsics, but it needs a longer workaround: SVE merging predication (_m) merges with the first source operand, whereas the AVX-512 mask form merges with the separate src operand, so an additional svsel is required.
However, porting to SVE2 still has advantages over NEON, particularly because the __mmask16 bitmask can be converted to a usable mask with fewer instructions:
__m512i _mm512_mask_add_epi32(__m512i src, __mmask16 k, __m512i a, __m512i b)
{
    __m512i result;
    pred_t pmaskb32[4];
    convert_mmask16_to_svbool32(k, pmaskb32);
    for (int i = 0; i < AVX512_REGSIZE_DQ; i++)
    { // compiler unrolls the loop
        result.val[i] = svsel_s32(
            pmaskb32[i],
            svadd_s32_x(pmaskb32[i], a.val[i], b.val[i]),
            src.val[i]);
    }
    return result;
}
AVX-512 Reduce Operations#
Replace intrinsics like _mm512_reduce_add_epi32 with SVE horizontal reduction intrinsics like svaddv_s32.
Relying on auto-vectorization for reductions is not recommended; use the explicit reduction intrinsics, as in the sketch below.
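For example, a minimal sketch assuming the 128-bit VLS __m512i struct and AVX512_REGSIZE_DQ constant used in the earlier examples:
int _mm512_reduce_add_epi32(__m512i a)
{
    int64_t sum = 0;
    svbool_t ptrue = svptrue_b32();
    for (int i = 0; i < AVX512_REGSIZE_DQ; i++)
    { // compiler unrolls the loop
        // svaddv_s32 horizontally adds the active 32-bit lanes of one 128-bit fragment.
        sum += svaddv_s32(ptrue, a.val[i]);
    }
    return (int)sum;
}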
SM4 and SM4KEY Encryption#
The base NEON instruction set does not provide counterparts for the x86_64 SM4 intrinsics _mm_sm4rnds4_epi32/_mm256_sm4rnds4_epi32 and _mm_sm4key4_epi32/_mm256_sm4key4_epi32.
However, with the SVE2 SM4 extension (FEAT_SVE_SM4), these encryption intrinsics can be directly replaced by svsm4e_u32 and svsm4ekey_u32, respectively.
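A minimal sketch for 128-bit VLS vectors (the wrapper names and the fixed-length typedef are ours; word ordering between the two ISAs should be validated before relying on bit-exact equivalence):
#if __ARM_FEATURE_SVE_BITS==128 && defined(__ARM_FEATURE_SVE2_SM4)
typedef svuint32_t fixed_u32x4_t __attribute__((arm_sve_vector_bits(128)));

// Four SM4 rounds on one 128-bit block: 'data' holds the state and
// 'round_keys' holds the four round keys, mirroring the x86_64 operand roles.
static inline fixed_u32x4_t sm4_rounds4(fixed_u32x4_t data, fixed_u32x4_t round_keys)
{
    return svsm4e_u32(data, round_keys);
}

// Generate the next four round keys from the current key state and constants.
static inline fixed_u32x4_t sm4_next_keys(fixed_u32x4_t key_state, fixed_u32x4_t constants)
{
    return svsm4ekey_u32(key_state, constants);
}
#endif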