Why is tan slower in context than when isolated? Answers

【Question title】: Why is tan slower in context than when isolated?
【Posted】: 2022-01-06 08:46:24
【Question description】:

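When I run the attached sample program, the function tan appears to take about twice as long in context as it does in isolation. This is the output on my machine: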

justtan(): ~16.062430 ns/iter
notan():   ~30.852820 ns/iter
withtan(): ~60.703100 ns/iter
empty():   ~0.355270 ns/iter

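Given justtan and notan combined, I would expect withtan() to take about 45 ns or less.

I am running macOS 11.5.2 on an Intel i7-4980HQ CPU. My cc --version is Apple clang version 13.0.0 (clang-1300.0.29.3). I have checked that the disassembly of withtan and notan is identical apart from the call to tan, and that clang auto-vectorizes the loop using VEX instructions. I also checked in a debugger that the version of tan called at runtime uses VEX instructions too, so there should be no SSE-AVX transition penalty.

I compiled and ran the program in a Linux VM and got similar results (in the debugger, tan also uses AVX/VEX there). I also ran it through cachegrind and found essentially no L1 cache misses in any function (0.00%); however, when run through cachegrind, all the timings do add up correctly.

This is how I build and run the executable: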

cc -Wall -O3 -mavx2 -o main main.c && ./main

Here is main.c:

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <math.h>

// ---------------------------------------------------------------------
// -------------------- benchmarking harness ---------------------------
int64_t ITERS = 100000000;

double black_box(double x) {
    asm("" : : "r"(&x) : "memory");
    return x;
}

uint64_t nanosec() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

double bench(double (*f)()) {
    // Warmup
    for (int i = 0; i < ITERS / 10; i++) {
        black_box(f());
    }

    uint64_t start = nanosec();
    for (int i = 0; i < ITERS; i++) {
        black_box(f());
    }
    uint64_t end = nanosec();

    return (double)(end - start) / (double)ITERS;
}
// -------------------- end benchmarking harness -----------------------
// ---------------------------------------------------------------------

#define LEN 32
#define SUM_LEN 24

double VALS[LEN] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
};

__attribute__ ((noinline))
double sum24(double* ptr) {
    double sum = 0.;
    for (int i = 0; i < 24; i++) {
        sum += ptr[i];
    }
    return sum;
}

__attribute__ ((noinline))
double withtan() {
    double a = sum24(VALS);
    double b = sum24(VALS + 1);
    double c = sum24(VALS + 2);
    double d = sum24(VALS + 3);

    return tan(a + b + c + d);
}

__attribute__ ((noinline))
double notan() {
    double a = sum24(VALS);
    double b = sum24(VALS + 1);
    double c = sum24(VALS + 2);
    double d = sum24(VALS + 3);

    return a + b + c + d;
}

__attribute__ ((noinline))
double justtan() {
    return tan(black_box(96));
}

__attribute__ ((noinline))
double empty() {
    return 1.;
}

int main() {
    printf("justtan(): ~%f ns/iter\n", bench(justtan));
    printf("notan():   ~%f ns/iter\n", bench(notan));
    printf("withtan(): ~%f ns/iter\n", bench(withtan));
    printf("empty():   ~%f ns/iter\n", bench(empty));
}

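Why is tan slower in context than when isolated?

【Comments】:

Start by ruling out cache and other effects: duplicate the whole main block. Whatever you do first will be slower.
It depends on your measurement technique. My all-time favorite result caused by measurement: nature.com/articles/nature.2011.9393
@RobertHarvey: Its refutation and the update of the paper are well known.
Maybe remove the neutrinos?
By the way, at a small scale like this, and on superscalar OoO exec CPUs in general, adding up costs doesn't work. (What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?). I think you expect withtan() to have the same throughput cost as justtan() plus notan(), but you didn't say so; you only seem to assume it implicitly. Usually, putting two different things in one loop lets them overlap when they are independent, so the combined cost is usually equal or lower.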
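To illustrate why costs do not simply add up on an out-of-order core, here is a minimal sketch contrasting one serial dependency chain with four independent ones. The function names `sum_one_chain` and `sum_four_chains` are hypothetical and do not appear in the question. Without -ffast-math the compiler may not reassociate floating-point adds, so the first loop is likely to stay bound by add latency, while the second gives the core four chains it can overlap.

```c
#include <stddef.h>

/* One accumulator: each add has to wait for the previous add's result,
 * so the loop is limited by floating-point add latency. */
double sum_one_chain(const double *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += p[i];
    }
    return s;
}

/* Four accumulators: the four chains do not depend on each other,
 * so an out-of-order core can keep several adds in flight at once. */
double sum_four_chains(const double *p, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += p[i];
        s1 += p[i + 1];
        s2 += p[i + 2];
        s3 += p[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        s0 += p[i];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Both functions do the same number of additions, but they can differ in run time because only the second exposes independent work. Note that the two may round differently for general inputs, since the order of additions differs.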

Tags: c performance x86 clang avx

【Solution 1】:

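The same behavior also shows up on a MacBook M1, where the respective numbers are 4 vs 14 vs 28 vs 0.3.

Using withtan = tan(black_box(96)) + a + b + c + d gives 20 ns/iter, which suggests to me that tan(a+b+c+d) creates a dependency that the OoO unit cannot break, whereas computing sum_a, sum_b, sum_c, sum_d and tan(96) are all independent tasks that can run out of order.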

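Part of the problem must also be that tan is long enough that the OoO unit cannot look ahead into the next independent iteration.

【Discussion】:

This is essentially what Peter Cordes said in the comments.
Naturally I can't extrapolate from the Arm M1 to the details of Intel's microarchitecture, but there aren't that many algorithms for OoO.
I think the numbers 15/7/30 still fit the theory that a+a+a+... can execute independent loops out of order, that a+b+c+d... can also execute out of order, and so can tan+tan+tan..., but that tan(whatever) cannot, because tan is too long. The hypothesis can of course be tested by using a shorter function instead, e.g. sqrt() (with -ffast-math): the times do add up, which suggests that the next iteration of a+b+... can start while the sqrt is still executing.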
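In fact, the sqrt(a+b+c+d) timings of 1.1 / 13.9 / 14.1 don't just add up; they come in "short" of the sum (if that is even a thing), showing some of the expected performance improvement.
sqrt isn't just "a shorter function"; it inlines to a single instruction (with -fno-math-errno). On the M1, I think it is very heavily pipelined. At least integer division on the M1 is apparently pipelined at 2c throughput (vs. 6c or 10c on Ice Lake). SKL/ICL sqrtsd has 4.5c throughput, so about one per nanosecond on a 4.5 GHz machine. The 1.1 ns on your M1 (at a lower clock speed) suggests it has better per-clock throughput for double sqrt, but sqrt is probably still the bottleneck, rather than front-end call/return and store overhead.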