随机访问时共享内存中的预期银行冲突数答案

【问题标题】：Expected number of bank conflicts in shared memory at random access随机访问时共享内存中的预期银行冲突数
【发布时间】：2012-10-10 15:58:33
【问题描述】：

让A 成为共享内存中正确对齐的 32 位整数数组。

如果单个 warp 尝试随机获取 A 的元素，预期的银行冲突数量是多少？

换句话说：

__shared__ int A[N];          //N is some big constant integer
...
int v = A[ random(0..N-1) ];  // <-- expected number of bank conflicts here?

请假设特斯拉或费米架构。我不想详述 Kepler 的 32 位和 64 位银行配置。另外，为简单起见，让我们假设所有随机数都是不同的（因此没有广播机制）。

我的直觉暗示了一个介于 4 到 6 之间的数字，但我想找到一些数学评估。

我相信这个问题可以从 CUDA 中抽象出来并呈现为一个数学问题。我搜索它作为生日悖论的扩展，但我在那里发现了非常可怕的公式，但没有找到最终公式。我希望有一个更简单的方法...

【问题讨论】：

好问题，+1。（我们需要对 cuda 标签上的好问题进行更多投票！）

标签： cuda

【解决方案1】：

在数学中，这被认为是一个“球在箱子里”的问题 - 32 个球被随机放入 32 个箱子中。您可以枚举可能的模式并计算它们的概率以确定分布。尽管模式的数量很大，但幼稚的方法是行不通的：（63！）/（32！）（31！）“几乎”是五分之一。

如果您以递归方式构建解决方案并使用条件概率，则可以解决此问题。

查找 Charles J. Corrado 的论文，题为“多项式/狄利克雷和多元超几何频率的最大值、最小值和范围的精确分布”。

在下面，我们从最左边的桶开始计算每个可能落入其中的球的概率。然后我们向右移动一个，并在给定已经使用的球和桶的数量的情况下，确定该桶中可能存在的每个球数的条件概率。

为 VBA 代码道歉，但是当我有动力回答时，我只有 VBA 可用:)。

Function nCr#(ByVal n#, ByVal r#)
    Static combin#()
    Static size#
    Dim i#, j#

    If n = r Then
        nCr = 1
        Exit Function
    End If

    If n > size Then
        ReDim combin(0 To n, 0 To n)
        combin(0, 0) = 1
        For i = 1 To n
            combin(i, 0) = 1
            For j = 1 To i
                combin(i, j) = combin(i - 1, j - 1) + combin(i - 1, j)
            Next
        Next
        size = n
    End If

    nCr = combin(n, r)
End Function

Function p_binom#(n#, r#, p#)
    p_binom = nCr(n, r) * p ^ r * (1 - p) ^ (n - r)
End Function

Function p_next_bucket_balls#(balls#, balls_used#, total_balls#, _
  bucket#, total_buckets#, bucket_capacity#)

    If balls > bucket_capacity Then
        p_next_bucket_balls = 0
    Else
        p_next_bucket_balls = p_binom(total_balls - balls_used, balls, 1 / (total_buckets - bucket + 1))
    End If
End Function

Function p_capped_buckets#(n#, cap#)
    Dim p_prior, p_update
    Dim bucket#, balls#, prior_balls#

    ReDim p_prior(0 To n)
    ReDim p_update(0 To n)

    p_prior(0) = 1

    For bucket = 1 To n
        For balls = 0 To n
            p_update(balls) = 0
            For prior_balls = 0 To balls
                p_update(balls) = p_update(balls) + p_prior(prior_balls) * _
                   p_next_bucket_balls(balls - prior_balls, prior_balls, n, bucket, n, cap)
            Next
        Next
        p_prior = p_update
    Next

    p_capped_buckets = p_update(n)
End Function

Function expected_max_buckets#(n#)
    Dim cap#

    For cap = 0 To n
        expected_max_buckets = expected_max_buckets + (1 - p_capped_buckets(n, cap))
    Next
End Function

Sub test32()
    Dim p_cumm#(0 To 32)
    Dim cap#

    For cap# = 0 To 32
        p_cumm(cap) = p_capped_buckets(32, cap)
    Next

    For cap = 1 To 32
        Debug.Print "    ", cap, Format(p_cumm(cap) - p_cumm(cap - 1), "0.000000")
    Next
End Sub

对于 32 个球和水桶，我得到的预期最大球数约为 3.532941。

与 ahmad 比较的输出：

           1            0.000000
           2            0.029273
           3            0.516311
           4            0.361736
           5            0.079307
           6            0.011800
           7            0.001417
           8            0.000143
           9            0.000012
           10           0.000001
           11           0.000000
           12           0.000000
           13           0.000000
           14           0.000000
           15           0.000000
           16           0.000000
           17           0.000000
           18           0.000000
           19           0.000000
           20           0.000000
           21           0.000000
           22           0.000000
           23           0.000000
           24           0.000000
           25           0.000000
           26           0.000000
           27           0.000000
           28           0.000000
           29           0.000000
           30           0.000000
           31           0.000000
           32           0.000000

【讨论】：

我忘了提，但应该插入一个很棒的博客，通过 harrism 在link 的帖子到达这里。
很棒的答案，谢谢 A. BTW，提到的博客是 John D Cook 的，最近正好有一篇与这个问题相关的帖子。

【解决方案2】：

我会尝试一个数学答案，虽然我还没有完全正确。

您基本上想知道，给定一个扭曲中的随机 32 位字索引到一个对齐的 __shared__ 数组中，“一个扭曲中映射到单个银行的最大地址数的 expected value 是多少？ "

如果我考虑类似于散列的问题，那么它与将散列到单个位置的预期最大项目数有关，以及this document shows an upper bound on that number of O(log n / log log n) 用于将 n 个项目散列到 n 个存储桶中。（数学很复杂！）。

对于 n = 32，计算结果约为 2.788（使用自然对数）。没关系，但这里我稍微修改了 ahmad 的程序，以凭经验计算预期的最大值（为了清晰起见，还简化了代码和修改了名称等，并修复了一些错误）。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <algorithm>
#define NBANK 32
#define WARPSIZE 32
#define NSAMPLE 100000

int main(){  
    int i=0,j=0;

    int *bank=(int*)malloc(sizeof(int)*NBANK);
    int *randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    int *maxCount=(int*)malloc(sizeof(int)*(NBANK+1));
    memset(maxCount, 0, sizeof(int)*(NBANK+1));

    for (int i=0; i<NSAMPLE; ++i) {
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }

        // check the bank conflict
        memset(bank, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            bank[randomNumber[j]]++;       
        }

        for(j=0; j<WARPSIZE; j++) 
            max_bank_conflict = std::max<int>(max_bank_conflict, bank[j]);

        // store statistic
        maxCount[max_bank_conflict]++;
    }

    // report statistic
    printf("Max conflict degree %% (%d random samples)\n", NSAMPLE);
    float expected = 0;
    for(i=1; i<NBANK+1; i++) {
        float prob = maxCount[i]/(float)NSAMPLE;
        printf("%02d -> %6.4f\n", i, prob);
        expected += prob * i;
    }
    printf("Expected maximum bank conflict degree = %6.4f\n", expected);
    return 0;
}

使用在程序中找到的百分比作为概率，预期最大值是乘积 sum(i * probability(i)) 的总和，对于 i 从 1 到 32。我计算出预期值为 3.529（与 ahmad 的数据匹配）。距离不是很远，但 2.788 应该是一个上限。由于上限是用大 O 表示法给出的，我想有一个常数因子被遗漏了。但目前就我所知。

开放式问题：这个常数因素足以解释它吗？是否可以计算 n = 32 的常数因子？协调这些，和/或找到一个封闭形式的解决方案来解决 32 个银行和 32 个并行线程的预期最大银行冲突程度会很有趣。

这是一个非常有用的话题，因为它有助于在共享内存寻址是有效随机的情况下建模和预测性能。

【讨论】：

【解决方案3】：

我假设 fermi 32-bank 共享内存，其中每 4 个后续字节存储在后续银行中。使用以下代码：

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define NBANK 32
#define N 7823
#define WARPSIZE 32


#define NSAMPLE 10000

int main(){
    srand ( time(NULL) );

    int i=0,j=0;
    int *conflictCheck=NULL;
    int *randomNumber=NULL;
    int *statisticCheck=NULL;

    conflictCheck=(int*)malloc(sizeof(int)*NBANK);
    randomNumber=(int*)malloc(sizeof(int)*WARPSIZE);
    statisticCheck=(int*)malloc(sizeof(int)*(NBANK+1));
    while(i<NSAMPLE){
        // generate a sample warp shared memory access
        for(j=0; j<WARPSIZE; j++){
            randomNumber[j]=rand()%NBANK;
        }
        // check the bank conflict
        memset(conflictCheck, 0, sizeof(int)*NBANK);
        int max_bank_conflict=0;
        for(j=0; j<WARPSIZE; j++){
            conflictCheck[randomNumber[j]]++;
            max_bank_conflict = max_bank_conflict<conflictCheck[randomNumber[j]]? conflictCheck[randomNumber[j]]: max_bank_conflict;
        }
        // store statistic
        statisticCheck[max_bank_conflict]++;

        // next iter
        i++;
    }
    // report statistic
    printf("Over %d random shared memory access, there found following precentages of bank conflicts\n");
    for(i=0; i<NBANK+1; i++){
        //
        printf("%d -> %6.4f\n",i,statisticCheck[i]/(float)NSAMPLE);
    }
    return 0;
}

我得到以下输出：

Over 0 random shared memory access, there found following precentages of bank conflicts
0 -> 0.0000
1 -> 0.0000
2 -> 0.0281
3 -> 0.5205
4 -> 0.3605
5 -> 0.0780
6 -> 0.0106
7 -> 0.0022
8 -> 0.0001
9 -> 0.0000
10 -> 0.0000
11 -> 0.0000
12 -> 0.0000
13 -> 0.0000
14 -> 0.0000
15 -> 0.0000
16 -> 0.0000
17 -> 0.0000
18 -> 0.0000
19 -> 0.0000
20 -> 0.0000
21 -> 0.0000
22 -> 0.0000
23 -> 0.0000
24 -> 0.0000
25 -> 0.0000
26 -> 0.0000
27 -> 0.0000
28 -> 0.0000
29 -> 0.0000
30 -> 0.0000
31 -> 0.0000
32 -> 0.0000

我们可以得出结论，随机访问最有可能发生 3 到 4 路冲突。您可以使用不同的N（数组中的元素数量）、NBANK（共享内存中的银行数量）、WARPSIZE（机器的扭曲大小）和NSAMPLE（随机共享内存的数量）来调整运行生成的访问以评估模型）。

【讨论】：

哈！在数学失败的地方，实验有效：）我仍然希望看到解决方案的一些数学公式，但如果我找不到任何东西，我会接受这种方法。
rand() 是一个非常糟糕的随机数生成器，它在 Windows 上只有 16 位，而在 Linux 上只有 32 位...
@harrism：我已经报告了 Linux 上的数字。我认为rand() 已经足够好了，因为我们需要rand()%32。
这个分析对我来说似乎非常明智和有用。作为一个小观察，我认为在 Fermi 的情况下，如果 2 个（或更多）线程访问数组中完全相同的位置，这不是银行冲突，它是通过广播进行整理的。机制。但从统计上看，我认为这不会对结果产生太大影响或根本不会影响结果。而且数组越大，这种情况发生的可能性就越小。
由于似乎很难得到一个数学公式，我只接受这个作为答案。但如果有人有更好的东西，请分享！