[Posted]: 2017-06-28 23:19:45
[Question]:
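I found this post, which explains how to transpose an 8x8 byte matrix with 24 operations; a few scrolls further down is the code that implements the transpose. However, that method does not exploit the fact that the 8x8 transpose can be blocked into four 4x4 transposes, each of which can be done with a single shuffle instruction (this post is the reference). So I came up with this solution: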
#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
#include <smmintrin.h> /* SSE4.1: _mm_extract_epi64 (x86-64 only) */

void TransposeBlock8x8(uint8_t *src, uint8_t *dst, int srcStride, int dstStride) {
    /* The masks are locals so that the code also builds as C, where a
       file-scope initializer must be a constant expression. */
    const __m128i transpose4x4mask = _mm_set_epi8(15, 11, 7, 3, 14, 10, 6, 2, 13, 9, 5, 1, 12, 8, 4, 0);
    const __m128i shuffle8x8Mask = _mm_setr_epi8(0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15);

    /* Load two 8-byte rows into each 128-bit register (lower row in the low half). */
    __m128i load0 = _mm_set_epi64x(*(uint64_t*)(src + 1 * srcStride), *(uint64_t*)(src + 0 * srcStride));
    __m128i load1 = _mm_set_epi64x(*(uint64_t*)(src + 3 * srcStride), *(uint64_t*)(src + 2 * srcStride));
    __m128i load2 = _mm_set_epi64x(*(uint64_t*)(src + 5 * srcStride), *(uint64_t*)(src + 4 * srcStride));
    __m128i load3 = _mm_set_epi64x(*(uint64_t*)(src + 7 * srcStride), *(uint64_t*)(src + 6 * srcStride));

    /* Regroup each register so columns 0-3 of both rows sit in the low half
       and columns 4-7 of both rows sit in the high half. */
    __m128i shuffle0 = _mm_shuffle_epi8(load0, shuffle8x8Mask);
    __m128i shuffle1 = _mm_shuffle_epi8(load1, shuffle8x8Mask);
    __m128i shuffle2 = _mm_shuffle_epi8(load2, shuffle8x8Mask);
    __m128i shuffle3 = _mm_shuffle_epi8(load3, shuffle8x8Mask);

    /* Gather the four 4x4 quadrants: top-left, top-right, bottom-left, bottom-right. */
    __m128i block0 = _mm_unpacklo_epi64(shuffle0, shuffle1);
    __m128i block1 = _mm_unpackhi_epi64(shuffle0, shuffle1);
    __m128i block2 = _mm_unpacklo_epi64(shuffle2, shuffle3);
    __m128i block3 = _mm_unpackhi_epi64(shuffle2, shuffle3);

    /* Transpose each 4x4 quadrant with a single byte shuffle. */
    __m128i transposed0 = _mm_shuffle_epi8(block0, transpose4x4mask);
    __m128i transposed1 = _mm_shuffle_epi8(block1, transpose4x4mask);
    __m128i transposed2 = _mm_shuffle_epi8(block2, transpose4x4mask);
    __m128i transposed3 = _mm_shuffle_epi8(block3, transpose4x4mask);

    /* Interleave the 4-byte rows of the quadrants into 8-byte output rows. */
    __m128i store0 = _mm_unpacklo_epi32(transposed0, transposed2);
    __m128i store1 = _mm_unpackhi_epi32(transposed0, transposed2);
    __m128i store2 = _mm_unpacklo_epi32(transposed1, transposed3);
    __m128i store3 = _mm_unpackhi_epi32(transposed1, transposed3);

    /* Store one 8-byte output row at a time. */
    *((uint64_t*)(dst + 0 * dstStride)) = _mm_extract_epi64(store0, 0);
    *((uint64_t*)(dst + 1 * dstStride)) = _mm_extract_epi64(store0, 1);
    *((uint64_t*)(dst + 2 * dstStride)) = _mm_extract_epi64(store1, 0);
    *((uint64_t*)(dst + 3 * dstStride)) = _mm_extract_epi64(store1, 1);
    *((uint64_t*)(dst + 4 * dstStride)) = _mm_extract_epi64(store2, 0);
    *((uint64_t*)(dst + 5 * dstStride)) = _mm_extract_epi64(store2, 1);
    *((uint64_t*)(dst + 6 * dstStride)) = _mm_extract_epi64(store3, 0);
    *((uint64_t*)(dst + 7 * dstStride)) = _mm_extract_epi64(store3, 1);
}
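Excluding the load/store operations, this procedure consists of only 16 instructions (8 shuffles and 8 unpacks) instead of 24.
What am I missing?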
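One way to sanity-check the data movement in the 16 shuffle/unpack steps, independent of any SSE hardware, is a scalar model. The sketch below is only an illustration under stated assumptions: `shuffle16`, `unpack16`, `transpose8x8_model` and `model_matches_reference` are hypothetical helper names, not part of any library, and the model covers only shuffle indices 0..15 (the zeroing behaviour of `_mm_shuffle_epi8` for indices with the high bit set is not modelled). It says nothing about speed, only about where each byte ends up.

```c
#include <stdint.h>
#include <string.h>

/* The two masks in memory order (lowest byte first). _mm_set_epi8 lists
   bytes from highest to lowest, so the first table is the question's
   transpose4x4mask argument list reversed. */
static const uint8_t kTranspose4x4[16] = {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15};
static const uint8_t kShuffle8x8[16]   = {0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15};

/* Hypothetical scalar stand-in for _mm_shuffle_epi8 (indices 0..15 only). */
static void shuffle16(uint8_t out[16], const uint8_t in[16], const uint8_t idx[16]) {
    for (int i = 0; i < 16; ++i) out[i] = in[idx[i]];
}

/* Hypothetical scalar stand-in for the unpack intrinsics with lanes of
   w bytes: w = 8 models epi64, w = 4 models epi32. hi != 0 interleaves
   the upper halves of a and b, hi == 0 the lower halves. */
static void unpack16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16], int w, int hi) {
    int base = hi ? 8 : 0;
    for (int lane = 0; lane < 8 / w; ++lane) {
        memcpy(out + 2 * lane * w,     a + base + lane * w, (size_t)w);
        memcpy(out + 2 * lane * w + w, b + base + lane * w, (size_t)w);
    }
}

/* Same sequence of steps as TransposeBlock8x8, on plain byte arrays. */
void transpose8x8_model(const uint8_t *src, uint8_t *dst, int srcStride, int dstStride) {
    uint8_t load[4][16], shuf[4][16], block[4][16], tr[4][16], store[4][16];
    for (int i = 0; i < 4; ++i) {
        /* Rows 2i and 2i+1 go into the low and high half of one vector. */
        memcpy(load[i],     src + (2 * i)     * srcStride, 8);
        memcpy(load[i] + 8, src + (2 * i + 1) * srcStride, 8);
        shuffle16(shuf[i], load[i], kShuffle8x8);
    }
    /* Quadrants: top-left, top-right, bottom-left, bottom-right. */
    unpack16(block[0], shuf[0], shuf[1], 8, 0);
    unpack16(block[1], shuf[0], shuf[1], 8, 1);
    unpack16(block[2], shuf[2], shuf[3], 8, 0);
    unpack16(block[3], shuf[2], shuf[3], 8, 1);
    for (int i = 0; i < 4; ++i) shuffle16(tr[i], block[i], kTranspose4x4);
    unpack16(store[0], tr[0], tr[2], 4, 0);
    unpack16(store[1], tr[0], tr[2], 4, 1);
    unpack16(store[2], tr[1], tr[3], 4, 0);
    unpack16(store[3], tr[1], tr[3], 4, 1);
    for (int i = 0; i < 4; ++i) {
        memcpy(dst + (2 * i)     * dstStride, store[i],     8);
        memcpy(dst + (2 * i + 1) * dstStride, store[i] + 8, 8);
    }
}

/* Returns 1 if the model agrees with a plain double-loop transpose of the
   top-left 8x8 block of an image whose rows are `stride` bytes apart,
   0 on mismatch, -1 if stride is outside the supported 8..128 range. */
int model_matches_reference(int stride) {
    uint8_t src[8 * 128] = {0}, dst[8 * 128] = {0};
    if (stride < 8 || stride > 128) return -1;
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            src[r * stride + c] = (uint8_t)(16 * r + c); /* unique per cell */
    transpose8x8_model(src, dst, stride, stride);
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            if (dst[c * stride + r] != src[r * stride + c]) return 0;
    return 1;
}
```

Calling `model_matches_reference(8)` or `model_matches_reference(128)` exercises the tightly packed case and the "8x8 window inside a 128-byte-wide image" case respectively.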
[Comments]:
-
You are missing the 4 128-bit operations that load the vectors and the 4 128-bit operations that store them.
-
__m128i load0 = _mm_set_epi64x(*(uint64_t*)(src + 1 * srcStride), *(uint64_t*)(src + 0 * srcStride)); is equal to __m128i load0 = _mm_loadu_si128((__m128i*)(src + 0 * srcStride));
-
*((uint64_t*)(dst + 0 * dstStride)) = _mm_extract_epi64(store0, 0); *((uint64_t*)(dst + 1 * dstStride)) = _mm_extract_epi64(store0, 1); is equal to _mm_storeu_si128((__m128i*)(dst + 0 * dstStride), store0);
-
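@ErmIg: Loads/stores should be excluded from the operation count. The post I linked, the one with 24 operations, does not count them either. You have to load the data anyway, don't you?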
-
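@ErmIg: They are equivalent only when srcStride is 8 and dstStride is 8, that is, when src and dst are each an 8x8-byte matrix. Otherwise they differ, for example with 128x128-byte matrices where I am really just "zooming in" on an 8x8 block; in that case srcStride and dstStride will both be 128.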
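The stride condition in this last comment can be illustrated with a small scalar sketch. It is an illustration only: `load_two_rows`, `load_contiguous` and `loads_agree` are hypothetical names, and `memcpy` stands in for the two 8-byte loads of the question and for the single 16-byte `_mm_loadu_si128` suggested in the comments.

```c
#include <stdint.h>
#include <string.h>

/* Two 8-byte row loads, as in the question: rows 0 and 1 of a block
   whose rows are `stride` bytes apart. */
static void load_two_rows(uint8_t out[16], const uint8_t *src, int stride) {
    memcpy(out,     src,          8);
    memcpy(out + 8, src + stride, 8);
}

/* One contiguous 16-byte load: the scalar picture of _mm_loadu_si128. */
static void load_contiguous(uint8_t out[16], const uint8_t *src) {
    memcpy(out, src, 16);
}

/* Returns 1 when both ways of loading yield the same 16 bytes for this
   stride, 0 when they differ, -1 if stride is outside 8..128. */
int loads_agree(int stride) {
    uint8_t image[256], a[16], b[16];
    if (stride < 8 || stride > 128) return -1;
    for (int i = 0; i < 256; ++i) image[i] = (uint8_t)i; /* every byte distinct */
    load_two_rows(a, image, stride);
    load_contiguous(b, image);
    return memcmp(a, b, 16) == 0;
}
```

With this setup `loads_agree(8)` returns 1 and `loads_agree(128)` returns 0: once the rows are more than 8 bytes apart, the second half of a contiguous 16-byte load picks up columns 8-15 of row 0 instead of row 1.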
Tags: c matrix optimization sse simd