Formulas in perf stat: answers

【Question title】: Formulas in perf stat
【Posted】: 2018-03-09 01:05:46
【Question】:

I would like to know the formulas that perf stat uses to compute its derived figures from the raw counts.

perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp

    1080267.226401      task-clock (msec)         #   19.062 CPUs utilized          
 1,592,123,216,789      cycles                    #    1.474 GHz                      (50.00%)
   871,190,006,655      instructions              #    0.55  insn per cycle           (75.00%)
     3,697,548,810      cache-references          #    3.423 M/sec                    (75.00%)
       459,457,321      cache-misses              #   12.426 % of all cache refs      (75.00%)

In this case, how is the M/sec figure computed from cache-references?

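【Comments】:

I'm not sure I understand the question correctly. It's simply cache-references / task-clock, isn't it?
@Zulan Duh! Of course it is... I thought it would be more complicated.
No worries ;-). The complicated part is the counter multiplexing indicated by the (75%), but that is hidden behind the scenes.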

Tags: performance profiling performance-testing perf measurement

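【Answer 1】:

The formulas do not appear to be implemented in builtin-stat.c (where the default event sets for perf stat are defined). They are probably computed (and averaged with stddev) in perf_stat__print_shadow_stats(), with some of the statistics collected into arrays in perf_stat__update_shadow_stats():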

http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626

When HW_INSTRUCTIONS is counted: "instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS

if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
    total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
    if (total) {
        ratio = avg / total;
        print_metric(ctxp, NULL, "%7.2f ",
                "insn per cycle", ratio);
    } else {
        print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
    }
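
As a rough illustration of the ratio in the excerpt above, here is a minimal standalone sketch. The function name insn_per_cycle is hypothetical and is not part of perf; it only mirrors the "avg / total" division and the zero-cycles fallback.

```c
/* Hypothetical helper (not a perf function): instructions per cycle.
 * Mirrors the excerpt's fallback of reporting 0 when no cycles were counted. */
static double insn_per_cycle(double instructions, double cycles)
{
    return cycles != 0.0 ? instructions / cycles : 0.0;
}
```

With the counts from the question, insn_per_cycle(871190006655.0, 1592123216789.0) is about 0.547, which the "%7.2f" format rounds to the 0.55 shown in the output.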

Branch misses come from print_branch_misses and are computed as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS.

perf_stat__print_shadow_stats() also contains several cache miss ratio calculations, such as HW_CACHE_MISSES / HW_CACHE_REFERENCES, plus some more detailed ones (perf stat -d mode).
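
The cache miss percentage in the question's output can be reproduced with a sketch like the following. The name cache_miss_percent is an assumption made for illustration, not a function from the perf sources.

```c
/* Hypothetical helper (not a perf function): cache misses as a percentage
 * of all cache references. Returns 0.0 if no references were counted. */
static double cache_miss_percent(double misses, double references)
{
    return references != 0.0 ? 100.0 * misses / references : 0.0;
}
```

For the question's numbers, cache_miss_percent(459457321.0, 3697548810.0) is about 12.426, matching "12.426 % of all cache refs".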

Stalled percentages are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES.

GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between the two is still not known, as discussed on LKML in 2010 and on SO in 2014):

if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
    perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
    update_stats(&runtime_nsecs_stats[cpu], count[0]);
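
Because the runtime is kept in nanoseconds, dividing cycles by it gives cycles per nanosecond, which is numerically the same as GHz. A minimal sketch, with the hypothetical name cycles_ghz (not a perf function):

```c
/* Hypothetical helper (not a perf function): cycles per nanosecond of
 * task-clock/cpu-clock runtime, which equals the frequency in GHz. */
static double cycles_ghz(double cycles, double runtime_ns)
{
    return runtime_ns != 0.0 ? cycles / runtime_ns : 0.0;
}
```

The question reports task-clock as 1080267.226401 msec, i.e. 1080267226401 ns, so cycles_ghz(1592123216789.0, 1080267226401.0) is about 1.474, as printed.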

There are also several formulas for transactions (perf stat -T mode).

"CPUs utilized" is task-clock or cpu-clock divided by walltime_nsecs_stats, where the wall time is measured by perf stat itself (in userspace) using the wall clock (astronomical time):

static inline unsigned long long rdclock(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

...

static int __run_perf_stat(int argc, const char **argv)
{    
...
    /*
     * Enable counters and exec the command:
     */
    t0 = rdclock();
    clock_gettime(CLOCK_MONOTONIC, &ref_time);
    if (forks) {
        ....
    }
    t1 = rdclock();

    update_stats(&walltime_nsecs_stats, t1 - t0);
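
The "CPUs utilized" ratio can be sketched as below. The function name cpus_utilized is hypothetical, and the wall time used in the note that follows is inferred from the question's output rather than shown in it.

```c
/* Hypothetical helper (not a perf function): CPU time consumed by the
 * workload divided by elapsed wall time, both in nanoseconds. */
static double cpus_utilized(double task_clock_ns, double walltime_ns)
{
    return walltime_ns != 0.0 ? task_clock_ns / walltime_ns : 0.0;
}
```

Working backwards from the question, 1080267.226401 msec / 19.062 implies an elapsed time of roughly 56.67 s, so a workload that accumulates about 1080 s of task-clock in about 56.67 s of wall time is reported as 19.062 CPUs utilized.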

There are also some estimations from the top-down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake, IDF2015; #22 in Gregg's Methodology List), described by Andi Kleen in 2016 at https://lwn.net/Articles/688335/ "Add top down metrics to perf stat" (perf stat --topdown -I 1000 cmd mode).

Finally, if there is no exact formula for the event being printed, there is a generic "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845 The count of any such event is divided by the runtime in nsec (from the task-clock or cpu-clock events, if they are present in the perf stat event set):

} else if (runtime_nsecs_stats[cpu].n != 0) {
    char unit = 'M';
    char unit_buf[10];

    total = avg_stats(&runtime_nsecs_stats[cpu]);

    if (total)
        ratio = 1000.0 * avg / total;
    if (ratio < 0.001) {
        ratio *= 1000;
        unit = 'K';
    }
    snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
    print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}
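
To connect this back to the question, here is a standalone sketch of the same scaling logic. The names struct rate, generic_rate and print_rate are hypothetical; only the arithmetic follows the excerpt above. Multiplying events per nanosecond by 1000 gives events per microsecond, which is the same number as millions of events per second.

```c
#include <stdio.h>

/* Hypothetical result type: the scaled rate and its unit prefix. */
struct rate {
    double value;
    char unit; /* 'M' or 'K' */
};

/* Hypothetical helper mirroring the excerpt: scale to M/sec, and fall
 * back to K/sec when the M/sec value would be below 0.001. */
static struct rate generic_rate(double count, double runtime_ns)
{
    struct rate r = { 0.0, 'M' };

    if (runtime_ns != 0.0)
        r.value = 1000.0 * count / runtime_ns; /* events per microsecond */
    if (r.value < 0.001) {
        r.value *= 1000.0;
        r.unit = 'K';
    }
    return r;
}

/* Prints the rate in the same "%8.3f" format as the excerpt. */
static void print_rate(struct rate r)
{
    printf("%8.3f %c/sec\n", r.value, r.unit);
}
```

For the question's cache-references line, generic_rate(3697548810.0, 1080267226401.0) yields about 3.423 with unit 'M', which is the 3.423 M/sec in the output: cache-references divided by task-clock, as suggested in the comments.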

【Comments】: