Why is clock_gettime so erratic?

【Question title】: Why is clock_gettime so erratic?
【Posted】: 2011-10-12 11:44:07
【Question description】:

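Introduction

The Old Question section contains the initial question (Further Investigation and Conclusion were added later).
Skip to the Further Investigation section below for a detailed comparison of the different timing methods (rdtsc, clock_gettime and QueryThreadCycleTime).
I believe the erratic behaviour of CGT can be attributed to either a buggy kernel or a buggy CPU (see the Conclusion section).
The code used for testing is at the bottom of this question (see the Appendix section).
Apologies for the length.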

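Old Question

In short: I am using clock_gettime to measure the execution time of many code segments. I am seeing very inconsistent measurements between separate runs. The method has an extremely high standard deviation compared to other methods (see Explanation below).

Question: Is there a reason why clock_gettime would give such inconsistent measurements compared to other methods? Is there an alternative method with the same resolution that accounts for thread idle time?

Explanation: I am trying to profile a number of small parts of C code. The execution time of each code segment is no more than a couple of microseconds. In a single run, each code segment executes some hundreds of times, which produces runs × hundreds of measurements.

I also have to measure only the time the thread actually spends executing (which is why rdtsc is not suitable). I also need high resolution (which is why times is not suitable).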
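To make the measurement pattern described above concrete, here is a minimal sketch of timing one code segment with the per-thread CPU clock. The helper names `timespec_diff_ns`, `measure_thread_cpu_ns` and `example_segment` are hypothetical and do not come from the original test code; the sketch only assumes a POSIX system that provides CLOCK_THREAD_CPUTIME_ID.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std=c99/c11 */
#endif
#include <stdint.h>
#include <time.h>

/* Nanoseconds from *start to *end (end is assumed not to precede start). */
static int64_t timespec_diff_ns(const struct timespec *start,
                                const struct timespec *end) {
    return ((int64_t) end->tv_sec - (int64_t) start->tv_sec) * 1000000000LL
           + ((int64_t) end->tv_nsec - (int64_t) start->tv_nsec);
}

/* Hypothetical stand-in for a code segment that runs for a few microseconds. */
static void example_segment(void) {
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; ++i) sink += i;
}

/* CPU time the calling thread spent inside segment(), or -1 on error.
   The result still includes the overhead of one clock_gettime call. */
static int64_t measure_thread_cpu_ns(void (*segment)(void)) {
    struct timespec t0, t1;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0) != 0) return -1;
    segment();
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1) != 0) return -1;
    return timespec_diff_ns(&t0, &t1);
}
```

Calling `measure_thread_cpu_ns(example_segment)` repeatedly and storing the results gives the kind of sample set analysed in the rest of the question. On older glibc versions the program may need to be linked with -lrt.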

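I have tried the following methods:

rdtsc (on Linux and on Windows),
clock_gettime (with CLOCK_THREAD_CPUTIME_ID; on Linux), and
QueryThreadCycleTime (on Windows).

Methodology: The analysis was performed on 25 runs. In each run, the separate code segments are repeated 101 times. Therefore I have 2525 measurements. Then I look at a histogram of the measurements and also calculate some basic statistics (mean, standard deviation, median, mode, min and max).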
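The basic statistics listed above could be computed along the following lines. This is only a sketch: the `struct summary` type and the `summarize` function are hypothetical names, not the code used to produce the numbers in this question. It reports the population variance and leaves the square root to the caller, so that it does not depend on the math library.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct summary {
    int64_t min, max, median, mode;
    double mean, variance;   /* standard deviation = sqrt(variance) */
};

static int cmp_i64(const void *a, const void *b) {
    int64_t x = *(const int64_t *) a, y = *(const int64_t *) b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and fills *out. Returns 0, or -1 if n == 0. */
static int summarize(int64_t *samples, size_t n, struct summary *out) {
    if (n == 0) return -1;
    qsort(samples, n, sizeof samples[0], cmp_i64);
    out->min = samples[0];
    out->max = samples[n - 1];
    out->median = samples[(n - 1) / 2];   /* lower median when n is even */

    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += (double) samples[i];
    out->mean = sum / (double) n;

    double sq = 0.0;
    size_t best_len = 0, run_len = 0;
    for (size_t i = 0; i < n; ++i) {
        double d = (double) samples[i] - out->mean;
        sq += d * d;
        /* Equal values are adjacent after sorting; the longest run is the mode.
           On ties the smallest value wins. */
        run_len = (i > 0 && samples[i] == samples[i - 1]) ? run_len + 1 : 1;
        if (run_len > best_len) { best_len = run_len; out->mode = samples[i]; }
    }
    out->variance = sq / (double) n;   /* population variance */
    return 0;
}
```

For heavily skewed data such as the clock_gettime readings below, the median and mode from such a summary are usually more telling than the mean and variance.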

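I do not present how I measured the "similarity" of the three methods, but it simply involved comparing the proportion of time spent on each code segment ("proportion" means that the times are normalised). I then looked at the plain differences in these proportions. This comparison showed that 'rdtsc', 'QTCT' and 'CGT' all measure the same proportions when averaged over the 25 runs. However, the results below show that 'CGT' has a very large standard deviation. This makes it unusable in my use case.

Results:

A comparison of clock_gettime with rdtsc for the same code segment (25 runs of 101 measurements = 2525 readings):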

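clock_gettime:
- 1881 measurements of 11 ns,
- 595 measurements (almost normally distributed) between 3369 and 3414 ns,
- 2 measurements of 11680 ns,
- 1 measurement of 1506022 ns, and
- the rest between 900 and 5000 ns.
- Min: 11 ns
- Max: 1506022 ns
- Mean: 1471.862 ns
- Median: 11 ns
- Mode: 11 ns
- Std dev: 29991.034
rdtsc (note: no context switches occurred during this run; when one does occur, it usually results in just a single measurement of 30000 ticks or so):
- 1178 measurements between 274 and 325 ticks,
- 306 measurements between 326 and 375 ticks,
- 910 measurements between 376 and 425 ticks,
- 129 measurements between 426 and 990 ticks,
- 1 measurement of 1240 ticks, and
- 1 measurement of 1256 ticks.
- Min: 274 ticks
- Max: 1256 ticks
- Mean: 355.806 ticks
- Median: 333 ticks
- Mode: 376 ticks
- Std dev: 83.896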

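Discussion:

rdtsc gives very similar results on both Linux and Windows. It has an acceptable standard deviation; it is actually quite consistent and stable. However, it does not account for thread idle time. Therefore, context switches make the measurements erratic (on Windows I have observed this quite often: a code segment that averages 1000 ticks or so will take about 30000 ticks every now and then, almost certainly because of pre-emption).
QueryThreadCycleTime gives very consistent measurements, i.e. a much lower standard deviation compared to rdtsc. When no context switches happen, this method is almost identical to rdtsc.
clock_gettime, on the other hand, produces extremely inconsistent results (not just between runs, but also between measurements). The standard deviations are extreme (compared to rdtsc).

I hope the statistics are okay. But what could be the reason for such a discrepancy in the measurements between the two methods? Of course, there is caching, CPU/core migration, and other things. But none of this should be responsible for such differences between 'rdtsc' and 'clock_gettime'. What is going on?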

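Further Investigation

I have investigated this a bit further. I have done two things:

Measured the overhead of just calling clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) (see code 1 in the Appendix), and
called clock_gettime in a plain loop and stored the readings in an array (see code 2 in the Appendix). I measure the delta times (the differences between successive readings, which should roughly correspond to the overhead of the clock_gettime call).

I have measured this on two different computers with two different Linux kernel versions: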

CGT：

CPU: Core 2 Duo L9400 @ 1.86GHz

Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386

Results:

Estimated clock_gettime overhead: between 690 and 710 ns

Delta times:

Average: 815.22 ns
Median: 713 ns
Mode: 709 ns
Min: 698 ns
Max: 23359 ns

Histogram (left-out ranges have a frequency of 0):

      Range       |  Frequency
------------------+-----------
  697 < x ≤ 800   ->     78111  <-- cached?
  800 < x ≤ 1000  ->     16412
 1000 < x ≤ 1500  ->         3
 1500 < x ≤ 2000  ->      4836  <-- uncached?
 2000 < x ≤ 3000  ->       305
 3000 < x ≤ 5000  ->       161
 5000 < x ≤ 10000 ->       105
10000 < x ≤ 15000 ->        53
15000 < x ≤ 20000 ->         8
20000 < x         ->         5

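CPU: 4 × Dual Core AMD Opteron Processor 275

Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Results:

Estimated clock_gettime overhead: between 279 and 283 ns

Delta times:

Average: 320.00 ns
Median: 1 ns
Mode: 1 ns
Min: 1 ns
Max: 3495529 ns

Histogram (left-out ranges have a frequency of 0):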

      Range         |  Frequency
--------------------+-----------
          x ≤ 1     ->     86738  <-- cached?
    282 < x ≤ 300   ->     13118  <-- uncached?
    300 < x ≤ 440   ->        78
   2000 < x ≤ 5000  ->        52
   5000 < x ≤ 30000 ->         5
3000000 < x         ->         8

RDTSC：

Related code: rdtsc_delta.c and rdtsc_overhead.c.

CPU: Core 2 Duo L9400 @ 1.86GHz

Kernel: Linux 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386

Results:

Estimated overhead: between 39 and 42 ticks

Delta times:

Average: 52.46 ticks
Median: 42 ticks
Mode: 42 ticks
Min: 35 ticks
Max: 28700 ticks

Histogram (left-out ranges have a frequency of 0):

      Range       |  Frequency
------------------+-----------
   34 < x ≤ 35    ->     16240  <-- cached?
   41 < x ≤ 42    ->     63585  <-- uncached? (small difference)
   48 < x ≤ 49    ->     19779  <-- uncached?
   49 < x ≤ 120   ->       195
 3125 < x ≤ 5000  ->       144
 5000 < x ≤ 10000 ->        45
10000 < x ≤ 20000 ->         9
20000 < x         ->         2

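CPU: 4 × Dual Core AMD Opteron Processor 275

Kernel: Linux 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Results:

Estimated overhead: between 13.7 and 17.0 ticks

Delta times:

Average: 35.44 ticks
Median: 16 ticks
Mode: 16 ticks
Min: 14 ticks
Max: 16372 ticks

Histogram (left-out ranges have a frequency of 0):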

      Range       |  Frequency
------------------+-----------
   13 < x ≤ 14    ->       192
   14 < x ≤ 21    ->     78172  <-- cached?
   21 < x ≤ 50    ->     10818
   50 < x ≤ 103   ->     10624  <-- uncached?
 5825 < x ≤ 6500  ->        88
 6500 < x ≤ 8000  ->        88
 8000 < x ≤ 10000 ->        11
10000 < x ≤ 15000 ->         4
15000 < x ≤ 16372 ->         2

QTCT：

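Related code: qtct_delta.c and qtct_overhead.c.

CPU: Core 2 6700 @ 2.66GHz

Kernel: Windows 7 64-bit

Results:

Estimated overhead: between 890 and 940 ticks

Delta times:

Average: 1057.30 ticks
Median: 890 ticks
Mode: 890 ticks
Min: 880 ticks
Max: 29400 ticks

Histogram (left-out ranges have a frequency of 0):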

      Range       |  Frequency
------------------+-----------
  879 < x ≤ 890   ->     71347  <-- cached?
  895 < x ≤ 1469  ->       844
 1469 < x ≤ 1600  ->     27613  <-- uncached?
 1600 < x ≤ 2000  ->        55
 2000 < x ≤ 4000  ->        86
 4000 < x ≤ 8000  ->        43
 8000 < x ≤ 16000 ->        10
16000 < x         ->         1

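Conclusion

I believe the answer to my question is a buggy implementation on my machine (the one with AMD CPUs and an old Linux kernel).

The CGT results of the AMD machine with the old kernel show some extreme readings. If we look at the delta times, we see that the most frequent delta is 1 ns. This would mean that a call to clock_gettime took about a nanosecond, far less than the 279 to 283 ns overhead estimated on the same machine! Moreover, it also produced a number of extraordinarily large deltas (of more than 3000000 ns)! This seems to be erroneous behaviour. (Maybe unaccounted core migrations?)

Remarks:

The overhead of CGT and QTCT is quite big.
It is also difficult to account for their overhead, because CPU caching seems to make quite a big difference.
Maybe sticking to RDTSC, locking the process to one core, and assigning real-time priority is the most accurate way to tell how many cycles a piece of code used...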
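The last remark (lock the process to one core and give it real-time priority) might look roughly like this on Linux. This is a sketch under assumptions, not something that was run as part of the measurements above: `pin_to_cpu` and `set_fifo_priority` are hypothetical helper names, and switching to SCHED_FIFO normally requires root or CAP_SYS_NICE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for cpu_set_t, CPU_SET, sched_setaffinity, sched_getcpu */
#endif
#include <sched.h>

/* Restrict the calling thread to a single CPU.
   Returns 0 on success, -1 on error (errno is set). */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* pid 0 = calling thread */
}

/* Switch the caller to the SCHED_FIFO real-time policy at the given priority.
   Returns 0 on success, -1 on error (typically EPERM without privileges). */
static int set_fifo_priority(int priority) {
    struct sched_param param;
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param);
}
```

A benchmark would call `pin_to_cpu(0)` and then, for example, `set_fifo_priority(sched_get_priority_min(SCHED_FIFO))` before entering the timed loop. A busy SCHED_FIFO thread can starve other work on that core, so this is best kept to short runs.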

Appendix

Code 1: clock_gettime_overhead.c

#include <time.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <stdint.h>

/* Compiled & executed with:

    gcc clock_gettime_overhead.c -O0 -lrt -o clock_gettime_overhead
    ./clock_gettime_overhead 100000
*/

int main(int argc, char **args) {
    struct timespec tstart, tend, dummy;
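    int n, N;
    /* Require a positive iteration count; N is used as a divisor below. */
    if (argc < 2 || (N = atoi(args[1])) <= 0) {
        fprintf(stderr, "usage: %s <iterations>\n", args[0]);
        return 1;
    }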
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tstart);
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &dummy);
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &tend);
    /* Cast to long long so the argument matches %lld on every platform. */
    printf("Estimated overhead: %lld ns\n",
            (long long) (((int64_t) tend.tv_sec * 1000000000 + (int64_t) tend.tv_nsec
                    - ((int64_t) tstart.tv_sec * 1000000000
                            + (int64_t) tstart.tv_nsec)) / N / 10));
    return 0;
}

Code 2: clock_gettime_delta.c

#include <time.h>
#include <stdio.h>
#include <stdint.h>

/* Compiled & executed with:

    gcc clock_gettime_delta.c -O0 -lrt -o clock_gettime_delta
    ./clock_gettime_delta > results
*/

#define N 100000

int main(int argc, char **args) {
    struct timespec sample;
    static struct timespec results[N];   /* static: keeps the large array off the stack */
    int n;
    for (n = 0; n < N; ++n) {
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &sample);
        results[n] = sample;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
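    for (n = 1; n < N; ++n) {
        /* Convert both readings to nanoseconds, then print the reading and its delta. */
        int64_t cur  = (int64_t) results[n].tv_sec * 1000000000
                           + (int64_t) results[n].tv_nsec;
        int64_t prev = (int64_t) results[n-1].tv_sec * 1000000000
                           + (int64_t) results[n-1].tv_nsec;
        /* Cast to long long so the arguments match %lld on every platform. */
        printf("%lld\t%lld\n", (long long) cur, (long long) (cur - prev));
    }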
    return 0;
}

Code 3: rdtsc.h

#include <stdint.h>
#if defined(_MSC_VER)
#   include <intrin.h>   /* __rdtsc */
#endif

static uint64_t rdtsc(void) {
#if defined(__GNUC__)
#   if defined(__i386__)
    uint64_t x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
#   elif defined(__x86_64__)
    uint32_t hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)lo) | ((uint64_t)hi << 32);
#   else
#       error Unsupported architecture.
#   endif
#elif defined(_MSC_VER)
    return __rdtsc();
#else
#   error Other compilers not supported...
#endif
}

Code 4: rdtsc_delta.c

#include <stdio.h>
#include <stdint.h>
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_delta.c -O0 -o rdtsc_delta
    ./rdtsc_delta > rdtsc_delta_results

Windows:

    cl -Od rdtsc_delta.c
    rdtsc_delta.exe > windows_rdtsc_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    static uint64_t results[N];   /* static: 800 kB would nearly fill the default 1 MB stack on Windows */
    int n;
    for (n = 0; n < N; ++n) {
        results[n] = rdtsc();
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
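        /* The values are unsigned 64-bit, so print them with %llu. */
        printf("%llu\t%llu\n", (unsigned long long) results[n],
               (unsigned long long) (results[n] - results[n-1]));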
    }
    return 0;
}

Code 5: rdtsc_overhead.c

#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <stdint.h>
#include "rdtsc.h"

/* Compiled & executed with:

    gcc rdtsc_overhead.c -O0 -lrt -o rdtsc_overhead
    ./rdtsc_overhead 1000000 > rdtsc_overhead_results

Windows:

    cl -Od rdtsc_overhead.c
    rdtsc_overhead.exe 1000000 > windows_rdtsc_overhead_results
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, dummy;
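    int n, N;
    /* Require a positive iteration count; N is used as a divisor below. */
    if (argc < 2 || (N = atoi(args[1])) <= 0) {
        fprintf(stderr, "usage: %s <iterations>\n", args[0]);
        return 1;
    }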
    tstart = rdtsc();
    for (n = 0; n < N; ++n) {
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
        dummy = rdtsc();
    }
    tend = rdtsc();
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}

Code 6: qtct_delta.c

#include <stdio.h>
#include <stdint.h>
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_delta.c
    qtct_delta.exe > windows_qtct_delta_results
*/

#define N 100000

int main(int argc, char **args) {
    uint64_t ticks;
    static uint64_t results[N];   /* static: 800 kB would nearly fill the default 1 MB stack on Windows */
    int n;
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        results[n] = ticks;
    }
    printf("%s\t%s\n", "Absolute time", "Delta");
    for (n = 1; n < N; ++n) {
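        /* The values are unsigned 64-bit, so print them with %llu. */
        printf("%llu\t%llu\n", (unsigned long long) results[n],
               (unsigned long long) (results[n] - results[n-1]));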
    }
    return 0;
}

Code 7: qtct_overhead.c

#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <stdint.h>
#include <Windows.h>

/* Compiled & executed with:

    cl -Od qtct_overhead.c
    qtct_overhead.exe 1000000
*/

int main(int argc, char **args) {
    uint64_t tstart, tend, ticks;
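    int n, N;
    /* Require a positive iteration count; N is used as a divisor below. */
    if (argc < 2 || (N = atoi(args[1])) <= 0) {
        fprintf(stderr, "usage: %s <iterations>\n", args[0]);
        return 1;
    }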
    QueryThreadCycleTime(GetCurrentThread(), &tstart);
    for (n = 0; n < N; ++n) {
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
        QueryThreadCycleTime(GetCurrentThread(), &ticks);
    }
    QueryThreadCycleTime(GetCurrentThread(), &tend);
    printf("%G\n", (double)(tend - tstart)/N/10);
    return 0;
}

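【Question comments】:

clock_gettime reports seconds and nanoseconds, not "ticks". It works very consistently for me. Are you sure you are interpreting the fields of struct timespec correctly? Can you show the code for it?
I truncate (int64_t)ts.tv_sec * 1000000000 + (int64_t)ts.tv_nsec to int32_t and pass it to the code that computes the difference. The latter accounts for overflows. The same code is used for rdtsc and QueryThreadCycleTime and a number of other methods on different architectures, all of which are well tested. I will have to ask you to assume that the latter code is not very likely to contain errors. You say it works for you. How and on what have you tested it? Long-running code? I am measuring very short segments.
@n.m. Maybe a better version of the question is: why is there such a big difference between clock_gettime and rdtsc?
I have used it on both long and short segments. Currently I have a test case where I just call clock_gettime twice with no code in between. I get a difference of about 1500 to 2000 ns on a single dual-core CPU and about 700 to 1000 ns on a more powerful machine with 2 dual-core Xeons, very consistently. More interesting is what happens with some computation and sleep(). With 2.6 kernels the reported time does not include the sleep() time, whereas with 2.4 it does. But in no case do I get wild inconsistencies.
Hmm, perhaps in 2.4 this thing measures wall time rather than CPU time, or the kernel is compiled with different RTC options. I don't know where to go from here.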
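The truncate-then-subtract scheme mentioned in the comments above can be sketched as follows. The asker's actual difference code is not shown, so this is only one plausible form of it: the function names are hypothetical, and it uses uint32_t rather than int32_t because unsigned arithmetic wraps modulo 2^32 by definition, whereas signed overflow is undefined in C.

```c
#include <stdint.h>
#include <time.h>

/* Keep only the low 32 bits of the nanosecond count. */
static uint32_t truncate_ns(const struct timespec *ts) {
    uint64_t ns = (uint64_t) ts->tv_sec * 1000000000u + (uint64_t) ts->tv_nsec;
    return (uint32_t) ns;
}

/* Difference that stays correct across a 32-bit wraparound.
   Only valid while the true interval is shorter than 2^32 ns (about 4.29 s). */
static uint32_t wrapping_delta(uint32_t start, uint32_t end) {
    return (uint32_t) (end - start);   /* arithmetic modulo 2^32 */
}
```

For segments that run for a few microseconds the 4.29 s limit is not a concern, so truncation on its own should not explain the erratic readings.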

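Tags: linux time profiling

【Solution 1】:

CLOCK_THREAD_CPUTIME_ID is implemented using rdtsc, so it will likely suffer from the same problems. The manual page for clock_gettime says:

The CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID clocks are realized on many platforms using timers from the CPUs (TSC on i386, AR.ITC on Itanium). These registers may differ between CPUs and as a consequence these clocks may return bogus results if a process is migrated to another CPU.

That sounds like it might explain your problems? Maybe you should lock your process to one CPU to get stable results?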

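【Comments】:

It is not clear what these statements mean. CLOCK_PROCESS_CPUTIME_ID is a per-process clock. It starts from 0 when the process starts, stops when the process is inactive, and resumes when the process resumes execution. How can migration to a different CPU affect this?
If both use the same approach, why is there such a big difference?
@n.m. Even though CLOCK_PROCESS_CPUTIME_ID is per-process, a process (or thread) can migrate from one CPU to another. Since rdtsc measures (approximately) the number of clock cycles since the processor was started, and since not all processors start at the same instant during boot, time will appear discontinuous between calls to rdtsc if the kernel moves your process to a different CPU.
@BDatRivenhill Newer CPUs have constant_tsc set. This means "The constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward." And RDTSCP can be used to see whether a core migration has occurred. See: download.intel.com/design/processor/manuals/253668.pdf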
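The RDTSCP idea from the last comment could be sketched like this. It is an illustration under assumptions, not tested against the machines in the question: it assumes GCC or Clang on an x86 CPU that supports RDTSCP, the helper names are hypothetical, and the meaning of the auxiliary value is up to the operating system (Linux typically stores the CPU number in its low bits).

```c
#include <stdint.h>

#if !defined(__GNUC__) || !(defined(__x86_64__) || defined(__i386__))
#error This sketch assumes GCC or Clang on x86.
#endif

/* Read the TSC and the IA32_TSC_AUX value with a single instruction. */
static uint64_t read_tscp(uint32_t *aux) {
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtscp" : "=a"(lo), "=d"(hi), "=c"(*aux));
    return ((uint64_t) lo) | ((uint64_t) hi << 32);
}

/* Hypothetical stand-in for the code segment being measured. */
static void example_segment(void) {
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; ++i) sink += i;
}

/* Stores the tick delta in *ticks and returns 0 if both reads reported the
   same auxiliary value; returns -1 if it changed (likely a migration). */
static int measure_ticks(void (*segment)(void), uint64_t *ticks) {
    uint32_t aux0, aux1;
    uint64_t t0 = read_tscp(&aux0);
    segment();
    uint64_t t1 = read_tscp(&aux1);
    if (aux0 != aux1) return -1;
    *ticks = t1 - t0;
    return 0;
}
```

Comparing the two auxiliary values cannot detect a thread that migrated away and back between the reads, so a discarded sample is a hint rather than a guarantee.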

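【Solution 2】:

When you have a highly skewed distribution that cannot be negative, you will see large discrepancies between the mean, median and mode. The standard deviation is fairly meaningless for such a distribution.

It is usually a good idea to log-transform it. That will make it "more normal".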
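One cheap way to apply a log transform to timing samples is to bucket them by their integer base-2 logarithm. The answer does not specify a base or a method, so this is an assumption made for illustration; `log2_bucket` and `log2_histogram` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define LOG2_BUCKETS 64

/* floor(log2(x)) for x > 0; x == 0 is also placed in bucket 0. */
static unsigned log2_bucket(uint64_t x) {
    unsigned b = 0;
    while (x >>= 1) ++b;
    return b;
}

/* counts[b] receives the number of samples with 2^b <= x < 2^(b+1). */
static void log2_histogram(const uint64_t *samples, size_t n,
                           size_t counts[LOG2_BUCKETS]) {
    for (size_t b = 0; b < LOG2_BUCKETS; ++b) counts[b] = 0;
    for (size_t i = 0; i < n; ++i) ++counts[log2_bucket(samples[i])];
}
```

With the clock_gettime readings from the question, the 11 ns cluster lands in bucket 3, the cluster around 3400 ns in bucket 11, and the 1506022 ns outlier in bucket 20, so the separate populations are visible without the outlier dominating the scale.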

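【Comments】:

The problem is that we have an extremely skewed distribution. Looking at the ad-hoc "histogram" alone is sufficient. I included the rest just in case. I admit I should have said "highly erratic" instead of "having a high std. dev.".
@sinharaj: When the data is highly skewed, a histogram tells you that it is highly skewed, but not much more. I suggest that each time you collect a reading, you take its logarithm. Then build the histogram and compute the statistics on those values. It will be more informative.
I have calculated log(x) for each individual measurement and plotted the values in various ways, but to no avail. In conclusion, the measurements are far too inconsistent for my use case. I have to stick with RDTSC. I am giving up. Thanks anyway!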