Answers: Neon equivalent to SSE intrinsics

【Question title】: Neon equivalent to SSE intrinsics
【Posted】: 2012-07-02 19:41:56
【Question】:

I am trying to convert C code into optimized code using NEON intrinsics.

Here is the C code, which operates on 2 operands rather than on vectors of operands.

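uint16_t mult_z216(uint16_t a, uint16_t b)
{
    /* Cast first: a*b would otherwise be evaluated as signed int and can overflow. */
    unsigned int c1 = (unsigned int)a * b;
    if (c1)
    {
        int c1h = c1 >> 16;
        int c1l = c1 & 0xffff;
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}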
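One way to read the scalar routine, offered as an interpretation rather than something the question states: it behaves like multiplication modulo 65537 (2^16 + 1), with the 16-bit value 0 standing in for 65536. The sketch below repeats the routine next to a hypothetical reference, `mult_mod_65537`, so the two can be compared. The reference function, its name, and the modular reading are assumptions made for illustration.

```c
#include <stdint.h>

/* Scalar routine from the question; the cast keeps the product unsigned. */
static uint16_t mult_z216(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int32_t c1h = (int32_t)(c1 >> 16);
        int32_t c1l = (int32_t)(c1 & 0xffff);
        return (uint16_t)((c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

/* Hypothetical reference (an assumption, not from the question):
 * multiply modulo 65537, reading 0 as 65536 on input and on output. */
static uint16_t mult_mod_65537(uint16_t a, uint16_t b)
{
    uint64_t x = a ? a : 65536u;
    uint64_t y = b ? b : 65536u;
    return (uint16_t)((x * y) % 65537u);  /* a result of 65536 truncates to 0 */
}
```

For instance, both functions map (65535, 65535) to 4, because 65535 is congruent to -2 modulo 65537.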

An SSE-optimized version of this operation has already been implemented:

#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128((a), (b));            /* t0 = a | b */ \
    (c) = _mm_mullo_epi16((a), (b));         /* c = low 16 bits of each 16x16 product */ \
    (a) = _mm_mulhi_epu16((a), (b));         /* a = high 16 bits of each unsigned product */ \
    (b) = _mm_subs_epu16((c), (a));          /* b = c - a, unsigned saturating (0 when c <= a) */ \
    (b) = _mm_cmpeq_epi16((b), C_0x0_XMM);   /* b = 0xFFFF where b == 0, else 0x0 */ \
    (b) = _mm_srli_epi16((b), 15);           /* shift right by 15 bits: 0xFFFF becomes 1 */ \
    (c) = _mm_sub_epi16((c), (a));           /* c = c - a, wrapping */ \
    (a) = _mm_cmpeq_epi16((c), C_0x0_XMM);   /* a = 0xFFFF where c == 0, else 0x0 */ \
    (c) = _mm_add_epi16((c), (b));           /* c = c + b */ \
    t0  = _mm_and_si128(t0, (a));            /* t0 = t0 & a */ \
    (c) = _mm_sub_epi16((c), t0);            /* c = c - t0 */
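To see what each step of the macro contributes, the sequence can be replayed on a single 16-bit lane in plain C. This is only a sketch: `mult_z216_lane` and its local variable names are made up here, and each line imitates the intrinsic named in its comment.

```c
#include <stdint.h>

/* One 16-bit lane of the SSE macro, replayed step by step (illustrative). */
static uint16_t mult_z216_lane(uint16_t a, uint16_t b)
{
    uint32_t p = (uint32_t)a * b;

    uint16_t t0 = (uint16_t)(a | b);                     /* _mm_or_si128    */
    uint16_t lo = (uint16_t)p;                           /* _mm_mullo_epi16 */
    uint16_t hi = (uint16_t)(p >> 16);                   /* _mm_mulhi_epu16 */

    uint16_t sat   = (lo > hi) ? (uint16_t)(lo - hi) : 0; /* _mm_subs_epu16  */
    uint16_t carry = (sat == 0) ? 0xFFFF : 0;            /* _mm_cmpeq_epi16 */
    carry = (uint16_t)(carry >> 15);                     /* _mm_srli_epi16: 1 when lo <= hi */

    uint16_t c    = (uint16_t)(lo - hi);                 /* _mm_sub_epi16, wraps */
    uint16_t mask = (c == 0) ? 0xFFFF : 0;               /* _mm_cmpeq_epi16 */
    c  = (uint16_t)(c + carry);                          /* _mm_add_epi16   */
    t0 = (uint16_t)(t0 & mask);                          /* _mm_and_si128   */
    return (uint16_t)(c - t0);                           /* _mm_sub_epi16   */
}
```

The mask is only set when the low and high halves are equal, which is how the macro handles the zero-operand branch of the scalar code without a conditional.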

I have almost finished converting this using NEON intrinsics:

#define MULT_Z216_NEON(a, b, out) \
    temp = vorrq_u16(*a, *b); \
    /* ?? (equivalent of _mm_mullo_epi16, result in *out) */ \
    /* ?? (equivalent of _mm_mulhi_epu16, result in *a) */ \
    *b = vqsubq_u16(*out, *a); \
    *b = vceqq_u16(*b, vdupq_n_u16(0x0000)); \
    *b = vshrq_n_u16(*b, 15); \
    *out = vsubq_u16(*out, *a); \
    *a = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
    *out = vaddq_u16(*out, *b); \
    temp = vandq_u16(temp, *a); \
    *out = vsubq_u16(*out, temp);

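I am only missing the NEON equivalents of _mm_mullo_epi16((a), (b)); and _mm_mulhi_epu16((a), (b));. Either I have misunderstood something or there are no such intrinsics in NEON. If there is no equivalent, how can these steps be achieved with NEON intrinsics?

Update:

I forgot to stress the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535). One answer suggested using the intrinsic vqdmulhq_s16(). Using it does not match the given implementation, because that multiply intrinsic interprets the vectors as signed values and produces wrong output.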

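【Question comments】:

If your values are > 32767 then you need to use the widening multiply suggested below (vmull_u16). If you know that your values will all be

Tags: c arm sse multiplication neon

【Solution 1】:

You can use: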

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t)

It returns a vector of 32-bit products. If you want to split the result into high and low halves, you can use the NEON unzip intrinsics.
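A minimal sketch of that suggestion follows, assuming a little-endian lane layout so that the even 16-bit lanes of each 32-bit product are the low halves. The helper names `mul_lo_hi_u16x8` and `mul_lo_hi_u16` are hypothetical; the array wrapper falls back to scalar code when NEON is not available so the example also builds on other targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Low and high 16 bits of eight unsigned 16x16 products (illustrative). */
static void mul_lo_hi_u16x8(uint16x8_t a, uint16x8_t b,
                            uint16x8_t *lo, uint16x8_t *hi)
{
    /* vmull_u16 works on 4 lanes at a time, so multiply each half separately. */
    uint32x4_t p0 = vmull_u16(vget_low_u16(a),  vget_low_u16(b));
    uint32x4_t p1 = vmull_u16(vget_high_u16(a), vget_high_u16(b));

    /* Unzip: even 16-bit lanes are the low halves, odd lanes the high halves. */
    uint16x8x2_t z = vuzpq_u16(vreinterpretq_u16_u32(p0),
                               vreinterpretq_u16_u32(p1));
    *lo = z.val[0];
    *hi = z.val[1];
}
#endif

/* Array wrapper: NEON when available, plain C otherwise. */
static void mul_lo_hi_u16(const uint16_t a[8], const uint16_t b[8],
                          uint16_t lo[8], uint16_t hi[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t vlo, vhi;
    mul_lo_hi_u16x8(vld1q_u16(a), vld1q_u16(b), &vlo, &vhi);
    vst1q_u16(lo, vlo);
    vst1q_u16(hi, vhi);
#else
    for (int i = 0; i < 8; ++i) {
        uint32_t p = (uint32_t)a[i] * b[i];
        lo[i] = (uint16_t)p;
        hi[i] = (uint16_t)(p >> 16);
    }
#endif
}
```

Here `lo` plays the role of _mm_mullo_epi16 and `hi` the role of _mm_mulhi_epu16.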

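【Discussion】:

That instruction is a 16x16=32 multiply (widening output). There are more specific instructions (see my answer).
@BitBank: The OP needs both the high 16 bits and the low 16 bits, so he needs the full 32-bit result. A doubling/saturating multiply is not a substitute, because you lose precision.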
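Putting the pieces together, one possible NEON port of the whole SSE macro is sketched below. It is not taken from any answer: `mult_z216_neon` and `mult_z216_x8` are hypothetical names, and instead of the unzip intrinsics it narrows the 32-bit products with vmovn_u32 (low halves) and vshrn_n_u32 (high halves). The wrapper uses the scalar routine from the question when NEON is not available.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Eight lanes of mult_z216, following the SSE macro step by step (illustrative). */
static uint16x8_t mult_z216_neon(uint16x8_t a, uint16x8_t b)
{
    const uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0 = vorrq_u16(a, b);

    /* Full 32-bit products, then narrow to low and high 16-bit halves. */
    uint32x4_t p0 = vmull_u16(vget_low_u16(a),  vget_low_u16(b));
    uint32x4_t p1 = vmull_u16(vget_high_u16(a), vget_high_u16(b));
    uint16x8_t lo = vcombine_u16(vmovn_u32(p0), vmovn_u32(p1));
    uint16x8_t hi = vcombine_u16(vshrn_n_u32(p0, 16), vshrn_n_u32(p1, 16));

    /* carry = 1 where the saturating difference lo - hi is 0 (lo <= hi). */
    uint16x8_t carry = vshrq_n_u16(vceqq_u16(vqsubq_u16(lo, hi), zero), 15);
    uint16x8_t c     = vsubq_u16(lo, hi);       /* wrapping lo - hi */
    uint16x8_t mask  = vceqq_u16(c, zero);      /* all ones where lo == hi */
    c = vaddq_u16(c, carry);
    return vsubq_u16(c, vandq_u16(t0, mask));
}
#endif

/* Array wrapper: NEON when available, the question's scalar logic otherwise. */
static void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_u16(out, mult_z216_neon(vld1q_u16(a), vld1q_u16(b)));
#else
    for (int i = 0; i < 8; ++i) {
        uint32_t p = (uint32_t)a[i] * b[i];
        if (p) {
            uint32_t hi = p >> 16, lo = p & 0xffff;
            out[i] = (uint16_t)(lo - hi + (lo < hi ? 1u : 0u));
        } else {
            out[i] = (uint16_t)(1 - a[i] - b[i]);
        }
    }
#endif
}
```

All intermediate vectors stay unsigned, so no reinterpret casts are needed and the signed-interpretation problem mentioned in the question's update does not arise.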

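【Solution 2】:

vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16; the closest instruction is vqdmulhq_s16(), which is "saturating, doubling, multiply, return high part". It operates only on signed 16-bit values, and you need to divide the input or the output by 2 to cancel the doubling.

【Discussion】:

Since vqdmulhq_s16() takes signed inputs, GCC complains about wrong argument types... How can I convert from uint16x8_t to int16x8_t efficiently?
There are casting macros; use vreinterpretq_s16_u16()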
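The limits of this approach can be checked without NEON hardware by modelling one lane in plain C. Everything below is a scalar stand-in written for illustration, not the intrinsics themselves: `as_s16` imitates the bit reinterpretation done by vreinterpretq_s16_u16, `vqdmulh_s16_model` imitates one lane of vqdmulhq_s16 as described in the answer, and `mulhi_via_vqdmulh` halves the result to undo the doubling. All four function names are hypothetical.

```c
#include <stdint.h>

/* What _mm_mulhi_epu16 returns for one lane: high half of the unsigned product. */
static uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Scalar stand-in for vreinterpretq_s16_u16: same bits, read as signed. */
static int16_t as_s16(uint16_t v)
{
    return (int16_t)(v >= 32768u ? (int32_t)v - 65536 : (int32_t)v);
}

/* Scalar model of one lane of vqdmulhq_s16: high half of the doubled,
 * saturated signed product, rounded toward minus infinity. */
static int16_t vqdmulh_s16_model(int16_t a, int16_t b)
{
    int64_t d = 2 * (int64_t)a * (int64_t)b;
    if (d > INT32_MAX)
        d = INT32_MAX;                       /* only a = b = -32768 saturates */
    int64_t h = (d >= 0) ? d / 65536 : -((-d + 65535) / 65536);
    return (int16_t)h;
}

/* Attempt to recover the unsigned high half by halving the result. */
static uint16_t mulhi_via_vqdmulh(uint16_t a, uint16_t b)
{
    uint16_t h = (uint16_t)vqdmulh_s16_model(as_s16(a), as_s16(b));
    return (uint16_t)(h >> 1);
}
```

In this model the halved result agrees with the unsigned high half for inputs such as 1000 and 2000, but for 40000 and 2 the true high half is 1 while the signed model yields -2, which is the mismatch described in the question's update.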