C# calling native code is faster than native calling native

【Question title】: C# calling native code is faster than native calling native
【Posted】: 2017-10-03 16:57:08
【Question】:

While doing some performance testing, I ran into a situation that I cannot seem to explain.

I wrote the following C code:

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

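I used gcc to compile it together with a test driver into a single binary. I also used gcc to compile it on its own into a shared object, which I call from C# via P/Invoke. The goal is to measure the performance overhead of calling native code from C#.

In both C and C#, I create equal-length input arrays of random values and then measure how long multi_arr takes to run. In both cases I use the POSIX clock_gettime() call for timing. The timing calls sit immediately before and after the call to multi_arr, so input preparation time and the like do not affect the results. I run 100 iterations and report both the average and the minimum times.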
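
The measurement loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original test driver: the names timing_stats, now_ns and time_multi_arr are hypothetical, and CLOCK_MONOTONIC is an assumed choice because the question does not say which clock ID was used.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime() under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Same kernel as in the question. */
void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
        res[i] = x[i] * y[i];
}

/* Hypothetical result type: not part of the original test driver. */
typedef struct {
    uint64_t min_ns;
    uint64_t mean_ns;
} timing_stats;

/* Current time in nanoseconds (clock ID is an assumption). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: times `iterations` calls of multi_arr, bracketing
   only the call itself so that input preparation is excluded. */
timing_stats time_multi_arr(int32_t *x, int32_t *y, int32_t *res,
                            int32_t len, int iterations)
{
    assert(iterations > 0);
    timing_stats s = { UINT64_MAX, 0 };
    uint64_t total = 0;
    for (int i = 0; i < iterations; ++i) {
        uint64_t start = now_ns();
        multi_arr(x, y, res, len);
        uint64_t elapsed = now_ns() - start;
        if (elapsed < s.min_ns)
            s.min_ns = elapsed;
        total += elapsed;
    }
    s.mean_ns = total / (uint64_t)iterations;
    return s;
}
```

Calling time_multi_arr(x, y, res, len, 100) would correspond to the 100 iterations mentioned above. Reporting the minimum alongside the mean helps separate the cost of the kernel itself from scheduling noise.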

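Even though C and C# execute exactly the same function, C# comes out ahead about 50% of the time, usually by a sizable margin. For example, for a len of 1,048,576, the C# minimum is 768,400 ns versus 1,344,105 ns for C. The C# average is 1,018,865 ns versus 1,852,880 ns for C. I've charted some of the numbers in this graph (note the log scale):

These results seem very wrong to me, but the artifact is consistent across multiple test runs. I have checked the asm and the IL to verify correctness. The bitness is the same. I have no idea what could affect performance to this degree. I've put a minimal reproduction example here.

All of these tests were run on Linux (KDE neon, based on Ubuntu Xenial) with dotnet-core 2.0.0 and gcc 5.0.4.

Has anyone seen this before?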

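【Comments】:

Cannot reproduce (using Mono). Both versions run in essentially the same time. Compiled on Ubuntu with mcs -optimize+ -unsafe, mcs 3.2.8.0.
@KerrekSB Interesting. I also tried Mono, and the problem is reduced. It still happens at 131072 (average 53,380 ns vs. 128,280 ns) and at a few other sizes, but most of the numbers are closer.
In my tests, C# ran statistically slightly slower. The difference is actually very small. The test methodology may be flawed.
@PeterJ_01 I posted the code, and suggestions for improvement are welcome.
More research turned up this, which suggests that some of the performance difference comes from the effect of alignment on caching. Maybe the arrays that C# allocates happen to be slightly better aligned?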

Tags: c# c performance gcc .net-core

【Solution 1】:

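As you already suspect, it comes down to alignment. The allocator returns memory aligned so that the compiler can use it for structures without causing unnecessary faults when storing or retrieving data types such as doubles or integers, but it makes no guarantee about how the memory block fits into the cache.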
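
To see where a given buffer actually lands, one could inspect its address relative to a cache-line boundary. The following is a minimal sketch, assuming a 64-byte cache line (typical for x86_64, but not guaranteed); offset_within and cache_lines_touched are hypothetical helper names, not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size; 64 bytes is typical for x86_64 processors. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical helper: byte offset of p within its `boundary`-sized block.
   boundary must be a power of two; 0 means p sits exactly on a boundary. */
size_t offset_within(const void *p, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (size_t)((uintptr_t)p & (boundary - 1));
}

/* Hypothetical helper: number of cache lines covered by [p, p + nbytes). */
size_t cache_lines_touched(const void *p, size_t nbytes)
{
    if (nbytes == 0)
        return 0;
    uintptr_t first = (uintptr_t)p / CACHE_LINE_BYTES;
    uintptr_t last = ((uintptr_t)p + nbytes - 1) / CACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

Printing offset_within(x, CACHE_LINE_BYTES) for each input array in the C driver, and the equivalent value for the pinned arrays on the C# side, would show whether the two runs really start from differently aligned buffers.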

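How much this matters depends on the hardware you test on. Assuming you are talking about x86_64 here, that means the particular Intel or AMD processor and the relative speeds of its cache and main-memory accesses.

You can simulate this by testing with a variety of alignments.

Here is a sample program that I cobbled together. On my i7 I see large variations, but the first, least-aligned access is reliably slower than the more aligned versions.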

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void multi_arr(int32_t *x, int32_t *y, int32_t *res, int32_t len)
{
    for (int32_t i = 0; i < len; ++i)
    {
        res[i] = x[i] * y[i];
    }
}

// Returns the CPU time consumed by this process so far, in nanoseconds.
uint64_t getnsec(void)
{
  struct timespec n;

  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &n);
  return (uint64_t) n.tv_sec * 1000000000 + n.tv_nsec;
}

// Number of int32_t elements in 16 MiB; also used below as a byte-address mask.
#define CACHE_SIZE (16 * 1024 * 1024 / sizeof(int32_t))
int main()
{
  int32_t *memory;
  int32_t *unaligned;
  int32_t *x;
  int32_t *y;
  int count;
  uint64_t start, elapsed;
  int32_t len = 1024 * 16;
  int64_t aligned = 1;

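  // 64 MiB of zeroed memory, enough room for every placement tried below.
  memory = calloc(4 * CACHE_SIZE, sizeof(int32_t));
  if (memory == NULL)
  {
    perror("calloc");
    return 1;
  }

  // Move to a 4 MiB boundary inside the block, then set every low address
  // bit so the base is as misaligned as possible, e.g. 0b...1111111111111111.
  // Note: the resulting odd addresses are not valid int32_t alignments in
  // ISO C; x86_64 tolerates them, which is what this experiment relies on.

  unaligned = (int32_t *) (((intptr_t) memory + CACHE_SIZE) & ~(CACHE_SIZE - 1));
  printf("memory starts at %p, aligned %p\n", (void *) memory, (void *) unaligned);
  unaligned = (int32_t *) ((intptr_t) unaligned | (CACHE_SIZE - 1));
  printf("memory starts at %p, unaligned %p\n", (void *) memory, (void *) unaligned);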

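  // Try alignments of 1, 2, 4, ... bytes: each step clears one more low
  // bit of the start address of x.
  for (aligned = 1; aligned < CACHE_SIZE; aligned <<= 1)
  {
    x = (int32_t *) (((intptr_t) unaligned + CACHE_SIZE) & ~(aligned - 1));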

    start = getnsec();
    for (count = 1; count < 1000; count++)
    {
      multi_arr(x, x + len, x + len + len, len);
    }
    elapsed = getnsec() - start;
    printf("memory starts at %p, aligned %08"PRIx64" to > cache = %p elapsed=%"PRIu64"\n", (void *) unaligned, (uint64_t) (aligned - 1), (void *) x, elapsed);
  }

  exit(0);
}
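
In the program above, the first iterations use start addresses that are not multiples of sizeof(int32_t), which x86_64 tolerates but ISO C does not permit. A variant that keeps every pointer correctly aligned for int32_t while still varying its position relative to a larger boundary might look like the sketch below. It relies on POSIX posix_memalign; alloc_at_offset is a hypothetical helper name, not something taken from the answer.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes posix_memalign() under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns an int32_t buffer of `len` elements whose
   address lies `offset_elems * sizeof(int32_t)` bytes past a
   `boundary`-byte boundary. *base_out receives the pointer that must
   later be passed to free(). Returns NULL on allocation failure. */
int32_t *alloc_at_offset(size_t len, size_t boundary, size_t offset_elems,
                         void **base_out)
{
    /* posix_memalign requires a power of two that is a multiple of
       sizeof(void *). */
    assert(boundary >= sizeof(void *) && (boundary & (boundary - 1)) == 0);

    void *base = NULL;
    if (posix_memalign(&base, boundary,
                       (len + offset_elems) * sizeof(int32_t)) != 0)
        return NULL;

    *base_out = base;
    return (int32_t *)base + offset_elems;
}
```

For example, alloc_at_offset(len, 4096, k, &base) for k = 0, 1, 2, ... would place the array 0, 4, 8, ... bytes past a page boundary, so the timing loop can be repeated for each placement and the buffer released with free(base) afterwards.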

【Comments】: