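What is the best algorithm for this array-comparison problem? (Answers)

【Question Title】: What is the best algorithm for this array-comparison problem?
【Posted】: 2011-02-16 01:47:35
【Question】:

What is the most efficient algorithm, in terms of speed, for solving the following problem?

Given 6 arrays, D1, D2, D3, D4, D5 and D6, each containing 6 numbers, like: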

D1[0] = number              D2[0] = number      ......       D6[0] = number
D1[1] = another number      D2[1] = another number           ....
.....                       ....                ......       ....
D1[5] = yet another number  ....                ......       ....

Given a second array ST1, containing 1 number:

ST1[0] = 6

Given a third array ans, containing 6 numbers:

ans[0] = 3, ans[1] = 4, ans[2] = 5, ......ans[5] = 8

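Using as index for the arrays D1, D2, D3, D4, D5 and D6 the numbers from 0 to the number stored in ST1[0] minus one (6 in this example, so from 0 to 6 - 1), compare the ans array against each D array. The result should be 0 if one or more ans numbers are not found in any D at the same index, and 1 if all ans numbers are found in some D at the same index. That is, return 0 if some ans[i] does not equal any DN[i], and return 1 if every ans[i] equals some DN[i].

My algorithm so far is:
I have tried to keep everything as unlooped as possible.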

EML  := ST1[0]   //number contained in ST1[0]
ELM1 := 0        //start index for the arrays D

While ELM1 < EML
   if D1[ELM1] = ans[0] 
     goto two
   if D2[ELM1] = ans[0] 
     goto two
   if D3[ELM1] = ans[0] 
     goto two
   if D4[ELM1] = ans[0] 
     goto two
   if D5[ELM1] = ans[0] 
     goto two
   if D6[ELM1] = ans[0] 
     goto two

   ELM1 = ELM1 + 1

return 0     //If the ans[0] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers


two:

ELM1 := 0      //start index for the D arrays
While ELM1 < EML
   if D1[ELM1] = ans[1] 
     goto three
   if D2[ELM1] = ans[1] 
     goto three
   if D3[ELM1] = ans[1] 
     goto three
   if D4[ELM1] = ans[1] 
     goto three
   if D5[ELM1] = ans[1] 
     goto three
   if D6[ELM1] = ans[1] 
     goto three
   ELM1 = ELM1 + 1

return 0    //If the ans[1] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

three:

ELM1 := 0      //start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[2] 
     goto four
   if D2[ELM1] = ans[2] 
     goto four
   if D3[ELM1] = ans[2] 
     goto four
   if D4[ELM1] = ans[2] 
     goto four
   if D5[ELM1] = ans[2] 
     goto four
   if D6[ELM1] = ans[2] 
     goto four
   ELM1 = ELM1 + 1

return 0   //If the ans[2] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

four:

ELM1 := 0      //start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[3] 
     goto five
   if D2[ELM1] = ans[3] 
     goto five
   if D3[ELM1] = ans[3] 
     goto five
   if D4[ELM1] = ans[3] 
     goto five
   if D5[ELM1] = ans[3] 
     goto five
   if D6[ELM1] = ans[3] 
     goto five
   ELM1 = ELM1 + 1

return 0 //If the ans[3] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers


five:

ELM1 := 0      //start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[4] 
     goto six
   if D2[ELM1] = ans[4] 
     goto six
   if D3[ELM1] = ans[4] 
     goto six
   if D4[ELM1] = ans[4] 
     goto six
   if D5[ELM1] = ans[4] 
     goto six
   if D6[ELM1] = ans[4] 
     goto six
   ELM1 = ELM1 + 1

return 0  //If the ans[4] number is not found in either D1[0-6], D2[0-6]....  D6[0-6]  return 0 which will then exclude ans[0-6] numbers

six:

ELM1 := 0      //start index for the D arrays

While ELM1 < EML
   if D1[ELM1] = ans[5] 
     return 1            //If the ans[5] number is found in any of D1[0-5] ... D6[0-5],
   if D2[ELM1] = ans[5]  //return 1, which will then include the ans[0-5] numbers
     return 1
   if D3[ELM1] = ans[5] 
     return 1
   if D4[ELM1] = ans[5] 
     return 1
   if D5[ELM1] = ans[5] 
     return 1
   if D6[ELM1] = ans[5] 
     return 1
   ELM1 = ELM1 + 1

return 0

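Preferred language: plain C.

【Comments】:

I think your programming skills are very basic. Most likely what you want to do can be done much more easily. Please elaborate on what you want to do with this code (what the arrays represent and what information you want to extract from them); that might clarify things and bring more answers.
Oh come on, guys. For a first-time user, he has clearly put a lot of effort into formatting and phrasing his question as well as he could. +1
Agree with Lieven... Even for beginners, we don't want anyone to feel uncomfortable asking a question, especially a legitimate one, even if it is for education/learning. How else can people become stronger developers if not by connecting with real-world developers?
@mark: I would like to apologize for the way the Stackoverflow police dept. has treated you.
Can you explain what else you want? In the algorithm you gave, I think only the first two loops can ever run, because in every loop either the loop ends and the code returns, or the loop reaches goto two and jumps to the second one. Also, when you say "compare each res array against each D array", how should the program handle the comparison? Do you want to print a series of strings "greater than", "less than", etc., or exit when equal numbers are encountered, or something else?

Tags: c algorithm optimization cuda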

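【Answer 1】:

I did a straightforward C implementation of the algorithm provided by the original poster. It is here.

As others have suggested, the first thing to do is to roll up the code. Unrolling is not really good for speed, as it leads to code cache misses. I began by rolling up the inner loops and got this. Then I rolled up the outer loop, removed the now-useless gotos, and got the code below.

EDIT: I changed the C code several times because, simple as it is, there seem to be problems when it is JIT-compiled or executed with CUDA (and CUDA does not seem to be very verbose about errors). That is why the code below uses globals... and this is just the trivial implementation. We are not yet going for speed. It says a lot about premature optimization: why bother making it fast if we cannot even make it work? I guess there are still issues, since CUDA seems to impose many restrictions on the code you can make work, if I believe the Wikipedia article. Also, maybe we should use float instead of int?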

#include <stdio.h>

int D1[6] = {3, 4, 5, 6, 7, 8};
int D2[6] = {3, 4, 5, 6, 7, 8};
int D3[6] = {3, 4, 5, 6, 7, 8};
int D4[6] = {3, 4, 5, 6, 7, 8};
int D5[6] = {3, 4, 5, 6, 7, 8};
int D6[6] = {3, 4, 5, 6, 7, 9};
int ST1[1] = {6};
int ans[6] = {1, 4, 5, 6, 7, 9};
int * D[6] = { D1, D2, D3, D4, D5, D6 };

/* beware D is passed through globals */
int algo(int * ans, int ELM){
    int a, e, p;

    for (a = 0 ; a < 6 ; a++){ 
        for (e = 0 ; e < ELM ; e++){
            for (p = 0 ; p < 6 ; p++){
                if (D[p][e] == ans[a]){
                    goto cont;
                }
            }
        }
        return 0; //bad row of numbers found
    cont:;
    }
    return 1;
}

int main(){
    int res;
    res = algo(ans, ST1[0]);
    printf("algo returned %d\n", res);
}

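Now this is interesting, because we can understand what the code is doing. By the way, while rolling the code up I corrected a couple of oddities in the original question. I believe they were typos, as they made no sense in the overall context:
- goto always jumped to two (it should have progressed)
- the last test checked ans[0] instead of ans[5]

Please, Mark, correct me if I am wrong in the above assumptions about what the original code should do and your original algorithm is typo-free.

What the code does, for each value in ans, is check that it is present in a two-dimensional array. If any number is missing, it returns 0. If all numbers are found, it returns 1.

To get really fast code, what I would do is implement it not in C but in another language such as Python (or C++), where set is a basic data structure provided by the standard library. Then I would build a set with all the values of the arrays (that is O(n)) and check whether the searched numbers are present in the set (that is O(1)). These bounds are the average case for a hash-based set such as Python's; a tree-based set such as C++'s std::set costs O(log n) per lookup instead. The final implementation should be faster than the existing code, at least from an algorithmic point of view.

Below is a Python example, as it is really simple (it prints True/False instead of 1/0, but you get the idea):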

ans_set = set(ans)
print(len(set(D1+D2+D3+D4+D5+D6).intersection(ans_set)) == len(ans_set))

Here is a possible C++ implementation using a set:

#include <iostream>
#include <set>

int algo(int * D1, int * D2, int * D3, int * D4, int * D5, int * D6, int * ans, int ELM){
    int e, p;
    int * D[6] = { D1, D2, D3, D4, D5, D6 };
    std::set<int> ans_set(ans, ans+6);

    int lg = ans_set.size();

    for (e = 0 ; e < ELM ; e++){
        for (p = 0 ; p < 6 ; p++){
            if (0 == (lg -= ans_set.erase(D[p][e]))){
                // we found all elements of ans_set
                return 1;
            }
        }
    }
    return 0; // some items in ans are missing
}

int main(){
    int D1[6] = {3, 4, 5, 6, 7, 8};
    int D2[6] = {3, 4, 5, 6, 7, 8};
    int D3[6] = {3, 4, 5, 6, 7, 8};
    int D4[6] = {3, 4, 5, 6, 7, 8};
    int D5[6] = {3, 4, 5, 6, 7, 8};
    int D6[6] = {3, 4, 5, 6, 7, 1};

    int ST1[1] = {6};

    int ans[] = {1, 4, 5, 6, 7, 8};

    int res = algo(D1, D2, D3, D4, D5, D6, ans, ST1[0]);
    std::cout << "algo returned " << res << "\n";
}

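We make some performance assumptions: the content of ans should already be sorted (range construction of a std::set is linear only for sorted input) or the set should be built some other way, and we assume the content of D1..D6 changes between calls to algo. Hence we do not bother constructing a set for it (set construction is at least O(n) anyway, so we would gain nothing if D1..D6 keep changing). But if we call algo several times with the same D1..D6 and it is ans that changes, we should do the opposite and transform D1..D6 into one larger set that we keep available.

If I stick to C, I could do it as follows:

copy all the numbers in D1..D6 into one single array (using memcpy for each row)
sort the content of this array
use a binary search to check whether a number is available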
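The three steps above can be sketched in plain C with the standard library's qsort() and bsearch(). This is only a minimal sketch under stated assumptions: the names algo_sorted, cmp_int, NUM_ARRAYS and MAX_LEN are made up for illustration, and the arrays are passed as a table of pointers rather than through globals.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_ARRAYS 6   /* D1..D6 (hypothetical name) */
#define MAX_LEN    6   /* capacity of each D array (hypothetical name) */

/* Comparison callback shared by qsort() and bsearch().
   Written without subtraction so it cannot overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Returns 1 if every ans[0..ans_len-1] occurs among the first elm
   elements of some array in d, 0 otherwise. */
int algo_sorted(const int *const d[NUM_ARRAYS], const int *ans,
                int ans_len, int elm)
{
    int all[NUM_ARRAYS * MAX_LEN];
    size_t n;
    int i;

    assert(elm >= 0 && elm <= MAX_LEN);
    n = (size_t)NUM_ARRAYS * (size_t)elm;

    /* Step 1: copy each row into one contiguous array. */
    for (i = 0; i < NUM_ARRAYS; i++)
        memcpy(&all[i * elm], d[i], (size_t)elm * sizeof(int));

    /* Step 2: sort the combined array. */
    qsort(all, n, sizeof(int), cmp_int);

    /* Step 3: binary-search each wanted number. */
    for (i = 0; i < ans_len; i++)
        if (bsearch(&ans[i], all, n, sizeof(int), cmp_int) == NULL)
            return 0;   /* ans[i] is missing */
    return 1;           /* all numbers were found */
}
```

If the same D1..D6 are queried with many different ans arrays, steps 1 and 2 could be done once and only step 3 repeated per query.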

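Since the amount of data is really small here, we could also try micro-optimizations. They might pay off better here. Not sure.

EDIT2: There are hard constraints on the subset of C supported by CUDA. The most restrictive one is that we should not use pointers to main memory. That will have to be taken into account. It explains why the current code does not work. The simplest change is probably to call it for each array D1..D6 in turn. To keep it short and avoid function call cost, we could use a macro or an inline function. I will post a proposal.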
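One way the "call it for each array in turn" idea might look in plain C is sketched below. The names contains, algo_flat and ANS_LEN are hypothetical, and the sketch only removes the int * D[6] table of pointers; whether this form actually satisfies CUDA's restrictions has not been checked here.

```c
#include <assert.h>

#define ANS_LEN 6   /* number of values in ans (hypothetical name) */

/* Returns 1 if value occurs in d[0..elm-1], else 0.
   Declared inline so a compiler may avoid the call overhead. */
static inline int contains(const int *d, int elm, int value)
{
    int e;
    for (e = 0; e < elm; e++)
        if (d[e] == value)
            return 1;
    return 0;
}

/* Same result as algo() in the C listing above, but each array is
   scanned in turn instead of going through a table of pointers. */
int algo_flat(const int *d1, const int *d2, const int *d3,
              const int *d4, const int *d5, const int *d6,
              const int *ans, int elm)
{
    int a;

    assert(elm >= 0);
    for (a = 0; a < ANS_LEN; a++) {
        /* || short-circuits: later arrays are scanned only if needed. */
        if (!(contains(d1, elm, ans[a]) || contains(d2, elm, ans[a]) ||
              contains(d3, elm, ans[a]) || contains(d4, elm, ans[a]) ||
              contains(d5, elm, ans[a]) || contains(d6, elm, ans[a])))
            return 0;   /* ans[a] was not found in any array */
    }
    return 1;
}
```

The scan order differs from the rolled-up loop (array by array rather than index by index), but the returned value is the same.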

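【Comments】:

I am not very familiar with C++, so I don't understand "where set is a basic data structure provided by the standard library. Then I would build a set with all the values of the arrays (that is O(n)) and check whether the searched numbers are present in the set (that is O(1))."
@Mark: I'll post the Python solution now, as it is a piece of cake. It will give you the idea; C++ is not much more complicated. C would need more work if we have no set library (a library implementing the set concept) readily available.
@Mark: OK, I got your comments about what you are trying to do in CUDA. It clarifies the context. Trying the C++ set could be interesting, since you have a C++ compiler available. Let us know whether it is faster or slower than the current C code; there are other possibilities not yet tried (such as micro-optimizations to avoid jumps).
@kriss: Unfortunately it doesn't work, probably because int * D[6] = { D1, D2, D3, D4, D5, D6 }; and D[p][e] are two-dimensional while the arrays above are not.
@Mark: no, the syntax above is fine. Are you talking about the C++ or the C code? I did not really try the C code (so it may contain typos), but I compiled and ran the C++ code and tested it with several values of D1..D6. I'm pretty sure it is totally fine, but it may be a compiler difference. What is the actual compiler message?

【Answer 2】:

I am a bit confused by your question, but I think I have enough to help you get started.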

#define ROW 6
#define COL 6

int D[ROW][COL]; // This is all of your D arrays in one 2 dimensional array.

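Next you should probably use nested for loops. Each loop will correspond to one dimension of D. Remember that the order of the indexes matters. The easiest way to keep it straight in C is to remember that D[i] is a valid expression even when D has more than one dimension (it denotes a row, i.e. a sub-array, which decays to a pointer to that row's first element).

If you cannot change the independent D arrays into one multidimensional array, you could easily make an array of pointers whose members point to the head of each of those arrays and achieve the same effect.

Once you have determined that the current D[i] does not match ans, you can use the break statement to jump out of the inner loop.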
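A minimal sketch of that nested-loop layout follows, assuming the reading used in Answer 1 (each ans value may appear anywhere in the first elm entries of any row). Under that reading break fires on a match rather than on a mismatch, which deviates from the wording above. The function name all_found and the elm and ans_len parameters are hypothetical.

```c
#include <assert.h>

#define ROW 6
#define COL 6

/* Returns 1 if every ans[0..ans_len-1] is found somewhere in the
   first elm columns of D, 0 otherwise. */
int all_found(int D[ROW][COL], const int *ans, int ans_len, int elm)
{
    int a, i, j;

    assert(elm >= 0 && elm <= COL);
    for (a = 0; a < ans_len; a++) {
        int found = 0;
        for (i = 0; i < ROW && !found; i++) {   /* first dimension: rows */
            const int *row = D[i];              /* D[i] decays to a pointer to the row */
            for (j = 0; j < elm; j++) {         /* second dimension: columns */
                if (row[j] == ans[a]) {
                    found = 1;
                    break;                      /* leaves the inner loop only */
                }
            }
        }
        if (!found)
            return 0;   /* ans[a] does not occur in D */
    }
    return 1;
}
```

The found flag is what stops the row loop, since break on its own only exits the innermost loop.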

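【Comments】:

Well, I don't want to use a two-dimensional array; I need to have 6 different arrays, and as unlooped as possible.
What I am interested in is speeding up the code I typed in algorithm form, and it has to use one-dimensional arrays.
If you are using a compiler that can do loop unrolling, then with that optimization enabled it will probably produce something similar to what you are trying to achieve with goto, without upsetting your instructor.
@mark: gotos have another effect besides upsetting instructors: they empty the processor pipeline (that is true of any jump, so loops are no good in this respect either). But whenever you can express your program without any kind of goto/branch/jump, you will get a speed-up, even if you execute more instructions than strictly necessary. I believe there is a track to follow there (will try it).
Well, I tried not using goto and using bool variables and breaks instead, but it doesn't work, and besides it means more memory. Is goto a real problem in programming or just a matter of form?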

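【Answer 3】:

With only 36 values to compare, the most efficient approach would be not to use CUDA at all.

Just use a CPU loop.

If you change your inputs, I'll change my answer.

【Comments】:

No, this is just an example; there is much more to compare.
Do you want a boolean answer or an array of answers, one per element?
I give up; I'm now working on another project stackoverflow.com/questions/3017591/…

【Answer 4】:

If the range of the numbers is limited, it would probably be easier to make a bit array, like this: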

#include <assert.h>
#include <stdint.h>

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j) {
            assert(arrays[i][j] >= 0 && arrays[i][j] < 32); // range is limited
            bit_mask |= (uint32_t)1 << arrays[i][j]; // unsigned shift: 1 << 31 would overflow int
        }
    }
    // make a "list" of numbers that we have

    for(int i = 0; i < 6; ++ i) {
        if(((bit_mask >> ans[i]) & 1) == 0)
            return 0; // in ans, there is a number that is not present in arrays
    }
    return 1; // all of the numbers were found
}

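This will always run in O(6 * ST1 + 6). The disadvantage is that it first has to go through up to 36 array elements and only then checks the six values. If there is a strong precondition that the numbers will mostly be present, it is possible to reverse the test and provide an early exit: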

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    for(int i = 0; i < 6; ++ i) {
        assert(ans[i] >= 0 && ans[i] < 32); // range is limited
        bit_mask |= (uint32_t)1 << ans[i];
    }
    // make a "list" of numbers that we need to find

    for(int i = 0; i < 6; ++ i) {
        for(int j = 0; j < ST1; ++ j)
            bit_mask &= ~((uint32_t)1 << arrays[i][j]); // clear bits of the mask

        if(!bit_mask) // check if we have them all
            return 1; // all of the numbers were found
    }

    assert(bit_mask != 0);
    return 0; // there are some numbers remaining yet to be found
}

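The early exit pays off most when the first number in the first array already covers all of ans (that is, ans contains the same number six times). Note that the test for the bit mask being zero can be done either after each array (as it is now) or after each element (that way involves more checks, but also gives an earlier cutoff once all the numbers are found). In the context of CUDA, the first version of the algorithm would likely be faster, as it involves fewer branches and most of the loops (except the one over ST1) can be automatically unrolled.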
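For comparison, the per-element variant of the zero test mentioned above might look like the following sketch. The name IsPresentEarly is hypothetical, and the range assertion on the array values is an addition that the listing above does not have.

```c
#include <assert.h>
#include <stdint.h>

/* Variant of the early-exit version that tests the mask after every
   element instead of after every array. Values must lie in 0..31. */
int IsPresentEarly(int arrays[][6], const int ans[6], int ST1)
{
    uint32_t bit_mask = 0;
    int i, j;

    for (i = 0; i < 6; ++i) {
        assert(ans[i] >= 0 && ans[i] < 32);     /* range is limited */
        bit_mask |= (uint32_t)1 << ans[i];      /* numbers still to find */
    }

    for (i = 0; i < 6; ++i) {
        for (j = 0; j < ST1; ++j) {
            assert(arrays[i][j] >= 0 && arrays[i][j] < 32);
            bit_mask &= ~((uint32_t)1 << arrays[i][j]);
            if (bit_mask == 0)                  /* checked per element */
                return 1;                       /* all numbers were found */
        }
    }
    return 0;   /* some numbers in ans were never seen */
}
```

This trades one extra branch per element for the chance to stop as soon as the last missing number turns up.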

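However, if the range of the numbers is unlimited, we could do something else. Since there are at most 7 * 6 = 42 different numbers in ans and all the arrays together, it would be possible to map them to 42 different integers and use a 64-bit integer as the bit mask. But arguably, this mapping of numbers to integers is already enough for the test, and the bit-mask test could be skipped altogether.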
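A rough sketch of that mapping idea in C is shown below. The helper index_of, the function IsPresentMapped and the constant MAX_DISTINCT are hypothetical names, and the linear-search table is just one simple way to assign small indices to arbitrary values.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DISTINCT 42   /* 6 arrays * 6 numbers + 6 numbers in ans */

/* Returns the index of v in table[0..*count-1]. If v is absent it is
   appended when insert is non-zero; otherwise -1 is returned. */
static int index_of(int table[], int *count, int v, int insert)
{
    int k;
    for (k = 0; k < *count; k++)
        if (table[k] == v)
            return k;
    if (!insert)
        return -1;
    assert(*count < MAX_DISTINCT);
    table[*count] = v;
    return (*count)++;
}

int IsPresentMapped(int arrays[][6], const int ans[6], int ST1)
{
    int table[MAX_DISTINCT];
    int count = 0, i, j;
    uint64_t have = 0;   /* bit k set: table[k] occurs in the arrays */

    assert(ST1 >= 0 && ST1 <= 6);
    for (i = 0; i < 6; ++i)
        for (j = 0; j < ST1; ++j)
            have |= (uint64_t)1 << index_of(table, &count, arrays[i][j], 1);

    for (i = 0; i < 6; ++i) {
        int k = index_of(table, &count, ans[i], 0);   /* lookup only */
        if (k < 0 || ((have >> k) & 1) == 0)
            return 0;   /* ans[i] never appeared in the arrays */
    }
    return 1;
}
```

In this sketch only array values are ever inserted, so a failed lookup (k < 0) already answers the question and the mask is redundant, which is the point made at the end of the paragraph above.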

Another way to do it would be to sort the arrays and simply count the coverage of the individual numbers:

#include <algorithm> // std::sort, std::unique, std::lower_bound
#include <cstring>   // memcpy

int IsPresent(int arrays[][6], int ans[6], int ST1)
{
    int all_numbers[36], n = ST1 * 6;
    for(int i = 0; i < 6; ++ i)
        memcpy(&all_numbers[i * ST1], &arrays[i], ST1 * sizeof(int));
    // copy all of the numbers into a contiguous array

    std::sort(all_numbers, all_numbers + n);
    // or use "C" standard library qsort() or a bitonic sorting network on GPU
    // alternatively, sort each array of 6 separately and then merge the sorted
    // arrays (can also be done in parallel, to some level)

    n = std::unique(all_numbers, all_numbers + n) - all_numbers;
    // this way, we can also remove duplicate numbers, if they are
    // expected to occur frequently, and make the next test faster.
    // std::unique() shifts the unique elements to the front of the range
    // and returns an iterator (a pointer in this case) to one past
    // the last unique element - that gives us the number of
    // unique items.

    for(int i = 0; i < 6; ++ i) {
        int *p = std::lower_bound(all_numbers, all_numbers + n, ans[i]);
        // use binary search to find the number in question
        // or use "C" standard library bsearch()
        // or implement binary search yourself on GPU

        if(p == all_numbers + n)
            return 0; // not found
        // alternately, make all_numbers array of 37 and write
        // all_numbers[n] = -1; before this loop. that will act
        // as a sentinel and will save this one comparison (assuming
        // that there is a value that is guaranteed not to occur in ans)

        if(*p != ans[i])
            return 0; // another number found, not ans[i]
        // std::lower_bound looks for the given number, or for one that
        // is greater than it, so if the number was to be inserted there
        // (before the bigger one), the sequence would remain ordered.
    }

    return 1; // all the numbers were found
}

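This runs in O(n) for copying, O(n log n) for sorting (at most O(36 log 36)), optionally O(n) for unique (where n is 6 * ST1), and O(log n) for each of the six searches (if unique is used, n can be smaller than 6 * ST1). The whole algorithm therefore runs in O(n log n) time, dominated by the sort. Note that this involves no dynamic memory allocation, so it is suitable even for GPU platforms (one would have to implement sorting and port std::unique() and std::lower_bound(), but all of those are quite simple functions).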
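To give an idea of how small those ports are, here is a sketch of C equivalents for int arrays. The names unique_int, lower_bound_int and contains_sorted are hypothetical; sorting is left out and could be done with qsort() on the CPU or a sorting network on the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* Counterpart of std::unique for a sorted int array: keeps one copy of
   each value at the front and returns how many values were kept. */
size_t unique_int(int *a, size_t n)
{
    size_t r, w;

    if (n == 0)
        return 0;
    w = 1;                          /* a[0] is always kept */
    for (r = 1; r < n; r++)
        if (a[r] != a[w - 1])       /* differs from the last kept value */
            a[w++] = a[r];
    return w;
}

/* Counterpart of std::lower_bound: index of the first element that is
   not less than key, or n if every element is less than key. */
size_t lower_bound_int(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;

    assert(a != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;           /* answer lies to the right of mid */
        else
            hi = mid;               /* a[mid] is a candidate */
    }
    return lo;
}

/* Membership test built from lower_bound_int, mirroring the two
   checks in the loop of the listing above. */
int contains_sorted(const int *a, size_t n, int key)
{
    size_t p = lower_bound_int(a, n, key);
    return p < n && a[p] == key;
}
```

Both helpers use only a few local variables and no dynamic allocation.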

【Comments】: