gcc won't vectorize simple loop

【Question title】: gcc won't vectorize simple loop
【Posted】: 2016-06-03 23:14:33
【Question】:

I'm trying to vectorize a simplified version of example 4 from the gcc auto-vectorize documentation. For the life of me, I can't figure out how to do it:

typedef int aint __attribute__ ((__aligned__(16)));
void foo1 (int n, aint * restrict px, aint *restrict qx) {

  /* feature: support for (aligned) pointer accesses.  */
  int *__restrict p = __builtin_assume_aligned (px, 16);
  int *__restrict q = __builtin_assume_aligned (qx, 16);

  while (n--){
    //*p++ += *q++; <- this is vectorized
    p[n] += q[n]; // This isn't!
  }
}

I'm running gcc 4.7.2 with: gcc -o apps/craft_dbsplit.o -c -Wall -g -ggdb -O3 -msse2 -funsafe-math-optimizations -ffast-math -ftree-vectorize -ftree-vectorizer-verbose=5 -funsafe-loop-optimizations -std=c99

It responds with:

Analyzing loop at apps/craft_dbsplit.c:388

388: dependence distance  = 0.
388: dependence distance == 0 between *D.9363_14 and *D.9363_14
388: dependence distance  = 0.
388: accesses have the same alignment.
388: dependence distance modulo vf == 0 between *D.9363_14 and *D.9363_14
388: vect_model_load_cost: unaligned supported by hardware.
388: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
388: vect_model_store_cost: unaligned supported by hardware.
388: vect_get_data_access_cost: inside_cost = 2, outside_cost = 0.
388: Alignment of access forced using peeling.
388: Vectorizing an unaligned access.
388: vect_model_load_cost: aligned.
388: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
388: vect_model_load_cost: unaligned supported by hardware.
388: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
388: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
388: not vectorized: relevant stmt not supported: *D.9363_14 = D.9367_20;

apps/craft_dbsplit.c:382: note: vectorized 0 loops in function.

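【Comments】:

"I'm running gcc 4.7.2": you may want to update it; it is quite old. Newer versions do vectorize the loop.
gcc's built-in vector extensions are pretty bad, but you could give them a try.
For what it's worth: given while(n--), *p++ += *q++; is not equivalent to p[n] += q[n];. The second version iterates backwards.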
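The difference in traversal order described in the last comment can be made visible with a small sketch. The helper names order_indexed and order_pointer are hypothetical and not part of the question; each records the element offsets that the corresponding loop form touches, in order. Both assume n is non-negative and that visited has room for n entries.

```c
#include <stddef.h>

/* Hypothetical helpers (not from the question): each records the order in
   which one loop form touches the elements of an array of length n.
   Both assume n >= 0 and that `visited` can hold n entries. */

/* Index form from the question: with while (n--), p[n] visits
   offsets n-1, n-2, ..., 0, i.e. from high addresses to low. */
size_t order_indexed(int n, size_t *visited)
{
  size_t count = 0;
  while (n--)
    visited[count++] = (size_t)n;
  return count;
}

/* Pointer form: with while (n--), *p++ visits offsets
   0, 1, ..., n-1, i.e. from low addresses to high. */
size_t order_pointer(int n, const int *base, size_t *visited)
{
  const int *p = base;
  size_t count = 0;
  while (n--)
    visited[count++] = (size_t)(p++ - base);
  return count;
}
```

For n = 4 the first helper records 3, 2, 1, 0 and the second records 0, 1, 2, 3. The loop counter counts down in both cases, but only the index form walks memory backwards.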

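Tags: c gcc auto-vectorization

【Solution 1】:

The loop runs from high addresses to low addresses. Your gcc treats vector operations as running from low addresses to high addresses, so it doesn't realize that it can vectorize the loop. Your "optimization" of writing the loop as while (n--) is actually preventing the more relevant optimization. Try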

#include <stddef.h>

void foo1 (size_t n, int *restrict px, int const *restrict qx)
{
  int *restrict p = __builtin_assume_aligned(px, 16);
  int const *restrict q = __builtin_assume_aligned(qx, 16);
  size_t i = 0;
  while (i < n)
    {
      p[i] += q[i];
      i++;
    }
}
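Because the forward and backward loops only differ in the order of independent element updates, they should produce the same array contents. The following is a minimal sketch of a check for that, not part of the original answer: add_forward, add_backward, and same_result are hypothetical names, and the alignment hints are left out so the functions accept any int arrays.

```c
#include <stddef.h>

/* Forward-iterating form, in the spirit of the answer
   (alignment hints omitted; this is an assumption for portability). */
void add_forward(size_t n, int *restrict p, int const *restrict q)
{
  for (size_t i = 0; i < n; i++)
    p[i] += q[i];
}

/* Backward-iterating form from the question, kept for comparison. */
void add_backward(size_t n, int *restrict p, int const *restrict q)
{
  while (n--)
    p[n] += q[n];
}

/* Hypothetical check: returns 1 if both forms give identical results. */
int same_result(void)
{
  enum { N = 37 };  /* deliberately not a multiple of the vector width */
  int a[N], b[N], q[N];

  for (size_t i = 0; i < N; i++)
    {
      a[i] = b[i] = (int)i;
      q[i] = (int)(2 * i + 1);
    }

  add_forward(N, a, q);
  add_backward(N, b, q);

  /* Each element should now hold i + (2*i + 1) = 3*i + 1 in both arrays. */
  for (size_t i = 0; i < N; i++)
    if (a[i] != b[i] || a[i] != (int)(3 * i + 1))
      return 0;
  return 1;
}
```

This only confirms that the rewrite preserves behavior; whether either loop is vectorized still has to be read from the compiler's report. On the gcc 4.7.2 used in the question that is the -ftree-vectorizer-verbose output shown above. On newer GCC releases, -fopt-info-vec is the option I would expect to use instead, so treat that as an assumption about your toolchain.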

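【Discussion】:

Why not write a loop that isn't obfuscated: for (size_t i=0; i<n; i++).
@Lundin: There are many ways to skin a cat.
Yes, you can use a cat-skinning knife that is well known and instantly recognized by all skinners, or you can use something else :)
Lundin, EOF, thanks for your input. I tried the for loop, and gcc says: not vectorized: unsupported data-type. I noticed that while (--n) is the pattern in example 3 of the gcc auto-vectorization documentation. I suspect the ancient version is the culprit. I'll try to get a newer one and see what happens.
@freddofrog: If you actually read that example carefully, you'll notice that it does not iterate backwards. It doesn't start at p[n-1] and end at p[0]; it starts at p[0] and ends at p[n-1]. Do you see the difference?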