How to find the largest element of an array of known size? Answers

【Question title】: How to find the largest element of an array of known size?
【Posted】: 2015-12-17 10:03:06
【Question】:

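I need to find the largest element in an array of exactly 16 integers. I am considering two possible implementations. First, the sensible implementation: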

int largest = array[0];
for (int i = 1; i < 16; i++) {
  const int val = array[i];
  if (val > largest) {
    largest = val;
  }
}

And a slightly crazy implementation that exploits the fact that the array size is known:

const int max_value =
  max(
    max(
      max(
        max(array[0], array[1]),
        max(array[2], array[3])),
      max(
        max(array[4], array[5]),
        max(array[6], array[7]))),
    max(
      max(
        max(array[8], array[9]),
        max(array[10], array[11])),
      max(
        max(array[12], array[13]),
        max(array[14], array[15]))));

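Which is the better implementation? Is max typically implemented in hardware?

【Comments on the question】:

The number of calls to max in the crazy implementation probably does not make it particularly efficient, but I doubt you would notice unless your array is very large. Also, you would be limited to finding the maximum of an array of that one specific size, which is not great. I would stick with the sensible option.
The compiler will optimize your loop. Your job is to write readable code. Its job is to make it fast. Especially in such a trivial case. Your crazy implementation is probably slower; see another case: stackoverflow.com/a/9601625/1207195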

Tags: c

【Answer 1】:

Try this function:

int max_array(int a[], int count) {
   int i,
    max = a[0];

   for (i = 1; i < count; i++) {
     if (a[i] > max) {
        max = a[i];
     }
   }

   return max;
}

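Edit:

Sorry, I did not see that you had already tried this. But anyway, this is the better implementation; the second one you proposed is just horrible. If you want to keep your code clean, this is what you should aim for.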
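For the fixed-size case in the question, the count does not have to be hard-coded as 16 at the call site. Here is a minimal sketch, assuming the data lives in a true array rather than behind a pointer; `ARRAY_LEN`, `largest_of_sample` and the sample values are made up for illustration, and `max_array` is repeated (with a `const` parameter) only so that the snippet compiles on its own.

```c
/* Hypothetical helper: number of elements in a true array (not a pointer). */
#define ARRAY_LEN(arr) ((int)(sizeof(arr) / sizeof((arr)[0])))

/* Same linear scan as max_array above, repeated so this sketch is self-contained. */
int max_array(const int a[], int count) {
    int max = a[0];
    for (int i = 1; i < count; i++) {
        if (a[i] > max) {
            max = a[i];
        }
    }
    return max;
}

/* Example with the asker's fixed size of 16; the values are arbitrary. */
int largest_of_sample(void) {
    static const int sample[16] = {3, 9, -4, 12, 7, 0, 25, 8,
                                   1, 14, 6, 2, 19, 5, 11, 10};
    return max_array(sample, ARRAY_LEN(sample));
}
```

Note that `ARRAY_LEN` gives the wrong answer if it is applied to a pointer (for example, to an array that has been passed into a function), so it should only be used where the array itself is declared.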

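【Comments】:

How is this different from the OP's "sensible implementation"?
@P0W Sorry, I did not notice; please check my edit.

【Answer 2】:

Obviously the first one; it is more readable and robust. max() is probably not implemented in hardware.

Here is the implementation of max in C++: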

template <class T> const T& max (const T& a, const T& b) {
    return (a<b)?b:a;     // or: return comp(a,b)?b:a; for version (2)
}

And in C with gcc-4.9.2, max can be defined as a macro (using GNU extensions):

#define max(a,b) \
   ({ typeof (a) _a = (a); \
       typeof (b) _b = (b); \
     _a > _b ? _a : _b; })

So it is best to use the first one, although for sizes below 3 you could consider the second.

【Comments】:

typeof is not part of standard C.
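If the GNU extensions are a concern, a plain function gives the same single-evaluation behaviour in standard C99. A minimal sketch, assuming only `int` operands are needed (the name `max_int` is made up for illustration):

```c
/* Standard C99 alternative to the GNU typeof/statement-expression macro.
   max_int is a hypothetical name; it handles int only. */
static inline int max_int(int a, int b) {
    /* Each argument is evaluated exactly once because this is a
       function call, not a textual macro expansion. */
    return a > b ? a : b;
}
```

The trade-off is that a function is tied to one type, whereas the macro works for any operands that can be compared with `>`.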

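【Answer 3】:

As far as I know, there is no max or min function in the C or GNU standard library. The first one is the better choice. Also, you can compare array[i] directly against the current largest value: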

int largest = array[0];
for (int i = 1; i < 16; i++) {
    if (array[i]>largest)
        largest=array[i];  
}

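【Comments】:

【Answer 4】:

The first one is clearly the most straightforward implementation.

However, this question is related to the concept of Sorting Networks, a rather sophisticated theory about sorting data sets of fixed size.

【Comments】:

And yes, max is implemented in hardware, usually in the form of the cmp instruction on x86 and compatible processors :)

【Answer 5】:

Let's compile them and see what we get!

First, AFAIK, there is no "max" function/macro defined in the C standard. So I added one (it looks complicated because it avoids double evaluation of its inputs).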

#define max(a,b) ({ \
    const __typeof__ (a) _a = (a); \
    const __typeof__ (b) _b = (b); \
    _a > _b ? _a : _b; \
})

int __attribute__ ((noinline)) test1(const int* array) {
    int largest = array[0];
    for (int i = 1; i < 16; i++) {
      const int val = array[i];
      if (val > largest) {
        largest = val;
      }
    }
    return largest;
}

int __attribute__ ((noinline)) test2(const int* array) {
    const int max_value =
      max(
        max(
          max(
            max(array[0], array[1]),
            max(array[2], array[3])),
          max(
            max(array[4], array[5]),
            max(array[6], array[7]))),
        max(
          max(
            max(array[8], array[9]),
            max(array[10], array[11])),
          max(
            max(array[12], array[13]),
            max(array[14], array[15]))));
    return max_value;
}
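To illustrate the double-evaluation problem that the macro above avoids, here is a small sketch using the simpler textbook macro. `NAIVE_MAX` and `count_naive_evaluations` are hypothetical names added for contrast; they are not part of the code that was compiled and measured.

```c
/* NAIVE_MAX is a hypothetical textbook macro, shown only for contrast. */
#define NAIVE_MAX(a, b) ((a) > (b) ? (a) : (b))

/* Returns how many times the first macro argument was evaluated. */
int count_naive_evaluations(void) {
    int calls = 0;
    int limit = 0;
    /* ++calls is substituted textually twice: once in the comparison
       and once more in the branch that gets selected. */
    int result = NAIVE_MAX(++calls, limit);
    (void)result;
    return calls; /* 2, not 1 */
}
```

With the statement-expression version, each argument is first copied into a local (`_a`, `_b`), so a side effect such as `++calls` would happen only once.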

My gcc version, which is relevant when talking about optimizations:

tmp$ gcc --version
gcc (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

-O2 to optimize, -S to output assembly, -o - to write to standard output.

tmp$ gcc -std=c99 -O2 -S test.c -o -
    .file   "test.c"
    .text
    .p2align 4,,15
    .globl  test1
    .type   test1, @function
test1:
.LFB0:
    .cfi_startproc
    movl    (%rdi), %eax
    xorl    %edx, %edx
    .p2align 4,,10
    .p2align 3
.L3:
    movl    4(%rdi,%rdx), %ecx
    cmpl    %ecx, %eax
    cmovl   %ecx, %eax
    addq    $4, %rdx
    cmpq    $60, %rdx
    jne .L3
    rep ret
    .cfi_endproc
.LFE0:
    .size   test1, .-test1
    .p2align 4,,15
    .globl  test2
    .type   test2, @function
test2:
.LFB1:
    .cfi_startproc
    movl    (%rdi), %edx
    cmpl    %edx, 4(%rdi)
    cmovge  4(%rdi), %edx
    movl    8(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    12(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    16(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    20(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    24(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    28(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    32(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    36(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    40(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    44(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    48(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    52(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    56(%rdi), %eax
    cmpl    %eax, %edx
    cmovl   %eax, %edx
    movl    60(%rdi), %eax
    cmpl    %eax, %edx
    cmovge  %edx, %eax
    ret
    .cfi_endproc
.LFE1:
    .size   test2, .-test2
    .ident  "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4"
    .section    .note.GNU-stack,"",@progbits

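OK, so test2() certainly looks longer. However, it does not branch at all. It needs only about 3 instructions per element (memory load, compare, conditional move). test1() needs 6 instructions per element (memory load, compare, conditional move, loop counter increment, loop counter compare, conditional branch). There are a lot of branches in test1, which can be troublesome (depending on how good your architecture's branch prediction is). On the other hand, test2 increases code size, which will necessarily push other things out of the instruction cache. And there are lots of data hazards in test2 (well, and in test1 too...); maybe we could rewrite it to use a few extra registers to reduce the number of pipeline stalls?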
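One way the "extra registers" idea could look in C is to keep several independent running maxima and merge them at the end, so that consecutive compares do not all depend on the same value. This is only a sketch under that assumption: `max16_split` is a made-up name, the variant was not part of the measurements in this answer, and whether it is actually faster would have to be benchmarked.

```c
/* Sketch: four independent running maxima shorten the dependency chain.
   max16_split is a hypothetical name, not from the original answer. */
int max16_split(const int *array) {
    int m0 = array[0], m1 = array[1], m2 = array[2], m3 = array[3];

    /* Each accumulator only depends on its own previous value. */
    for (int i = 4; i < 16; i += 4) {
        if (array[i]     > m0) m0 = array[i];
        if (array[i + 1] > m1) m1 = array[i + 1];
        if (array[i + 2] > m2) m2 = array[i + 2];
        if (array[i + 3] > m3) m3 = array[i + 3];
    }

    /* Merge the four partial results. */
    if (m1 > m0) m0 = m1;
    if (m3 > m2) m2 = m3;
    return m0 > m2 ? m0 : m2;
}
```

The compiler may or may not keep all four accumulators in registers or turn the comparisons into conditional moves, so the generated assembly should be checked in the same way as for test1 and test2.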

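So, as you can probably see by now, this is not an easy question to answer.

The only way to really know is to measure it. And even then, the result will vary with the internal implementation, optimizations and cache size of each CPU model.

So I wrote a small benchmark: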

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <stdint.h>

#define N (1000000)

int main() {
    printf("    %12s %12s  %12s %12s\n", "test1 time", "test2 time", "test1 out", "test2 out");
    int* data = malloc(N * 16 * sizeof(int));
    srand(1);
    for (int i=0; i<16*N; ++i) {
        data[i] = rand();
    }

    const int* a;
    struct timespec t1, t2, t3;
    for (int attempt=0; attempt<10; ++attempt) {
        uint32_t sum1 = 0;
        uint32_t sum2 = 0;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
        a = data;
        for (int i=0; i<N; ++i) {
            sum1 += test1(a);
            a += 16;
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);
        a = data;
        for (int i=0; i<N; ++i) {
            sum2 += test2(a);
            a += 16;
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3);
        uint64_t nanos1 = (t2.tv_sec - t1.tv_sec) * 1000000000L + (t2.tv_nsec - t1.tv_nsec);
        uint64_t nanos2 = (t3.tv_sec - t2.tv_sec) * 1000000000L + (t3.tv_nsec - t2.tv_nsec);
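        /* uint64_t is not guaranteed to be unsigned long, so cast for a portable format. */
        printf("%2d: %12llu %12llu  %12u %12u\n", attempt+1,
               (unsigned long long)nanos1, (unsigned long long)nanos2, sum1, sum2);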
    }
    return 0;
}

Results (times in nanoseconds):

tmp$ gcc -std=gnu99 -O2 test.c -o test
tmp$ ./test 
      test1 time   test2 time     test1 out    test2 out
 1:     16251659     10431322    4190722540   4190722540
 2:     16796884     10639081    4190722540   4190722540
 3:     16443265     10314624    4190722540   4190722540
 4:     17194795     10337678    4190722540   4190722540
 5:     16966405     10380047    4190722540   4190722540
 6:     16803840     10556222    4190722540   4190722540
 7:     16795989     10871508    4190722540   4190722540
 8:     16389862     11511950    4190722540   4190722540
 9:     16304850     11704787    4190722540   4190722540
10:     16309371     11269446    4190722540   4190722540
tmp$ gcc -std=gnu99 -O3 test.c -o test
tmp$ ./test 
      test1 time   test2 time     test1 out    test2 out
 1:      9090364      8813462    4190722540   4190722540
 2:      8745093      9394730    4190722540   4190722540
 3:      8942015      9839356    4190722540   4190722540
 4:      8849960      8834056    4190722540   4190722540
 5:      9567597      9195950    4190722540   4190722540
 6:      9130245      9115883    4190722540   4190722540
 7:      9680596      8930225    4190722540   4190722540
 8:      9268440      9998824    4190722540   4190722540
 9:      8851503      8960392    4190722540   4190722540
10:      9767021      8875165    4190722540   4190722540
tmp$ gcc -std=gnu99 -Os test.c -o test
tmp$ ./test 
      test1 time   test2 time     test1 out    test2 out
 1:     17569606     10447512    4190722540   4190722540
 2:     17755450     10811861    4190722540   4190722540
 3:     17718714     10372411    4190722540   4190722540
 4:     17743248     10378728    4190722540   4190722540
 5:     18747440     10306748    4190722540   4190722540
 6:     17877105     10782263    4190722540   4190722540
 7:     17787171     10522498    4190722540   4190722540
 8:     17771172     10445461    4190722540   4190722540
 9:     17683935     10430900    4190722540   4190722540
10:     17670540     10543926    4190722540   4190722540
tmp$ gcc -std=gnu99 -O2 -funroll-loops test.c -o test
tmp$ ./test 
      test1 time   test2 time     test1 out    test2 out
 1:      9840366     10008656    4190722540   4190722540
 2:      9826522     10529205    4190722540   4190722540
 3:     10208039     10363219    4190722540   4190722540
 4:      9863467     10284608    4190722540   4190722540
 5:     10473329     10054511    4190722540   4190722540
 6:     10298968     10520570    4190722540   4190722540
 7:      9846157     10595723    4190722540   4190722540
 8:     10340026     10041021    4190722540   4190722540
 9:     10434750     10404669    4190722540   4190722540
10:      9982403     10592842    4190722540   4190722540

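Conclusion: on my Intel Core i7-3517U with 4 MB of cache, the max() version is faster at -O2 and -Os (I will not claim more than that, because results may vary with microarchitecture).

Also, -funroll-loops, or the extra-aggressive (and less safe) optimizations enabled by -O3, really do make a huge difference for the test1 case, essentially making it equal to test2 in time. test1 with -funroll-loops may even be slightly better, but it is close enough that we cannot draw a confident conclusion from the numbers I got. It would probably be interesting to look at the assembly of test1 in those cases, but I will leave that as an exercise for the reader. ;)

So, I guess the answer is "it depends".

【Comments】:

But as others have pointed out, test1 is easier to read, so you should probably use it until you can verify that this comparison is actually critical to your program's performance. Readable, flexible code is better than saving a few milliseconds, if those milliseconds are in a part of the program where they do not matter anyway. :)
For someone who claims that there is no "max" function/macro defined in the C standard, it looks odd to use non-standard constructs (such as typeof and statement expressions) to implement max().
In my opinion, it is very strange to check loop performance with -O2 (which does no loop unrolling). You should at least include -funroll-loops.
-funroll-loops may help the loop case, yes. However, this part of the gcc manual is important: "This option makes code larger, and may or may not make it run faster". Of course, for this simple loop it is most likely a big benefit. For others, maybe not. But honestly, I just did not think of it. Will edit to add a run with that flag. :)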