clock_gettime takes longer to execute when program run from terminal

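【Question title】: clock_gettime takes longer to execute when program run from terminal
【Posted】: 2020-11-23 21:48:55
【Question description】:

I was trying to measure the time of a code snippet and noticed that the timings were about 50 ns faster when I ran the program from inside my editor, QtCreator, than when I started it from a bash shell in a gnome-terminal. I'm using Ubuntu 20.04 as the OS.

A small program that reproduces my problem: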

#include <stdio.h>
#include <time.h>

struct timespec now() {
  struct timespec now;
  clock_gettime(CLOCK_MONOTONIC, &now);
  return now;
}

long interval_ns(struct timespec tick, struct timespec tock) {
  return (tock.tv_sec - tick.tv_sec) * 1000000000L
      + (tock.tv_nsec - tick.tv_nsec);
}

int main() {
    // sleep(1);
    for (size_t i = 0; i < 10; i++) {
        struct timespec tick = now();
        struct timespec tock = now();
        long elapsed = interval_ns(tick, tock);
        printf("It took %ld ns\n", elapsed);  // %ld: elapsed is a signed long
    }
    return 0;
}

Output when run from within QtCreator:

It took 84 ns
It took 20 ns
It took 20 ns
It took 21 ns
It took 21 ns
It took 21 ns
It took 22 ns
It took 21 ns
It took 20 ns
It took 21 ns

When run from a shell inside the terminal:

$ ./foo 
It took 407 ns
It took 136 ns
It took 74 ns
It took 73 ns
It took 77 ns
It took 79 ns
It took 74 ns
It took 81 ns
It took 74 ns
It took 78 ns

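Things I've tried that made no difference

Letting QtCreator start the program in a terminal
Using rdtsc and rdtscp calls instead of clock_gettime (same relative difference in run time)
Clearing the environment from the terminal by running the program under env -i
Starting the program using sh instead of bash

I have verified that the same binary is called in all cases. I have verified that the nice value of the program is 0 in all cases.

Question

Why does starting the program from my shell make a difference? Any suggestions on what to try?

Update

If I add a sleep(1) call at the beginning of main, both the QtCreator and the gnome-terminal/bash invocations report the longer execution times.
If I add a system("ps -H") call at the beginning of main, but drop the aforementioned sleep(1): both invocations report the short execution times (~20 ns).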

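【Question discussion】:

Aside: the code risks overflow when long is 32 bits. Suggest long long interval_ns(struct timespec tick, struct timespec tock) { return (tock.tv_sec - tick.tv_sec) * 1000000000LL + (tock.tv_nsec - tick.tv_nsec); } (type change and LL)

Tags: c linux x86 microbenchmark cpu-cycles

【Solution 1】:

Just add more iterations to give the CPU time to ramp up to max clock speed. Your "slow" times are with the CPU at its low-power idle clock speed.

QtCreator apparently uses enough CPU time before your program runs to make this happen, or else you're compiling + running and the compilation process serves as the warm-up. (By contrast, bash's fork/execve is much lighter weight.)

See Idiomatic way of performance evaluation? for more about doing warm-up runs when benchmarking, and also Why does this delay-loop start to run faster after several iterations with no sleep?

On my i7-6700k (Skylake) running Linux, increasing the loop iteration count to 1000 is sufficient to get the final iterations running at full clock speed, even after the first couple of iterations handle page faults and warm up the iTLB, uop cache, data caches, and so on.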

$ ./a.out      
It took 244 ns
It took 150 ns
It took 73 ns
It took 76 ns
It took 75 ns
It took 71 ns
It took 72 ns
It took 72 ns
It took 69 ns
It took 75 ns
...
It took 74 ns
It took 68 ns
It took 69 ns
It took 72 ns
It took 72 ns        # 382 "slow" iterations in this test run (copy/paste into wc to check)
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 16 ns
It took 16 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 15 ns
It took 14 ns
It took 16 ns
...

On my system, energy_performance_preference is set to balance_performance, so the hardware P-state governor is not as aggressive as with performance. Use grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference to check it, and use sudo to change it:

sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_performance > "$i";done'

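Even running it under perf stat ./a.out is enough to ramp up to max clock speed very quickly, though; it really doesn't take much. But bash's command parsing after you press return is very cheap: not much CPU work gets done before it calls execve and execution reaches main in your new process.

The printf with line-buffered output is what takes most of the CPU time in your program, by the way. That's why it takes so few iterations to ramp up to speed. For example, if you run perf stat --all-user -r10 ./a.out, you'll see that the user-space core clock cycles per second are only about 0.4 GHz; the rest of the time is spent in the kernel, in write system calls.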

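【Discussion】:

Thanks for the great answer. I added the x86 tag hoping to summon you. :-) I thought the longer times of the first 2-3 iterations represented the warm-up time, i.e. that the processor had ramped up to full clock speed after that. Apparently it needs more than that. Embarrassingly, I didn't think about changing the performance governor, but on the other hand I learned something new, which is always valuable.
@DanielNäslund: The first couple of iterations are other kinds of warm-up effects, unrelated to clock speed, such as page faults. If you run anything else on another core that keeps the clock speed high leading up to this run (e.g. a simple infinite loop), you'll see the first couple of intervals being shorter. Or if you use perf stat -r10 ./a.out to run it 10 times back to back. There's a lot you can do other than changing the governor, especially since Intel "client" chips (non-server) run all cores at the same clock speed, so one core being at max means the others are at max too.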