Calculating a 2D joint probability distribution: answers

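【Question title】: Calculating a 2D joint probability distribution
【Posted】: 2013-11-13 18:49:45
【Question】:

I have many points inside a square. I want to partition the square into many small rectangles and count how many points fall in each rectangle, i.e. I want to compute the joint probability distribution of the points. Below are a few common-sense approaches that use loops and are not very efficient: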

% Data
N = 1e5;    % number of points
xy = rand(N, 2);    % coordinates of points
xy(randi(2*N, 100, 1)) = 0;    % add some points on one side
xy(randi(2*N, 100, 1)) = 1;    % add some points on the other side
xy(randi(N, 100, 1), :) = 0;    % add some points on one corner
xy(randi(N, 100, 1), :) = 1;    % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1);    % add some points on one corner
inds= unique(randi(N, 100, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1);    % add some points on one corner

% Intervals for rectangles
K1 = ceil(sqrt(N/5));    % number of intervals along x
K2 = K1;    % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps];    % intervals along x
int_y = [0:(1 / K2):1, 1+eps];    % intervals along y

% First approach
tic
count_cells = zeros(K1 + 1, K2 + 1);
for k1 = 1:K1+1
  inds1 = (xy(:, 1) >= int_x(k1)) & (xy(:, 1) < int_x(k1 + 1));
  for k2 = 1:K2+1
    inds2 = (xy(:, 2) >= int_y(k2)) & (xy(:, 2) < int_y(k2 + 1));
    count_cells(k1, k2) = sum(inds1 .* inds2);
  end
end
toc
% Elapsed time is 46.090677 seconds.

% Second approach
tic
count_again = zeros(K1 + 2, K2 + 2);
for k1 = 1:K1+1
  inds1 = (xy(:, 1) >= int_x(k1));
  for k2 = 1:K2+1
    inds2 = (xy(:, 2) >= int_y(k2));
    count_again(k1, k2) = sum(inds1 .* inds2);
  end
end
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 22.903767 seconds.

% Check: the two solutions are equivalent
all(count_cells(:) == count_again_fix(:))

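How can I do this more efficiently in terms of time and memory, and possibly avoiding loops?

EDIT --> I also found this, which is the best solution I have found so far: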

tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
all(count_cells(:) == count_cells_hist(:))
% Elapsed time is 0.245298 seconds.

But it requires the Statistics Toolbox.
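
A possible toolbox-free alternative on newer releases: histcounts2 is part of base MATLAB (introduced, to my knowledge, in R2015b, so it did not exist when this was posted). The sketch below uses made-up toy points and plain edges without the extra 1+eps edge; it is an illustration, not a benchmarked replacement.

```matlab
% Toy data (hypothetical values), small enough to check by hand
xy = [0 0; 0.1 0.9; 0.5 0.5; 1 1; 1 0];
K1 = 2; K2 = 2;            % number of intervals along x and y
edges_x = 0:(1/K1):1;      % [0 0.5 1]
edges_y = 0:(1/K2):1;      % [0 0.5 1]

% counts(a, b) = number of points in x-interval a and y-interval b
counts = histcounts2(xy(:,1), xy(:,2), edges_x, edges_y);
disp(counts)               % [1 1; 1 2]
```

Note that histcounts2 closes the last bin on the right, so x = 1 lands in the last regular bin here. Passing int_x and int_y from above instead should reproduce the separate boundary bins of count_cells, though I have not timed it on the N = 1e5 data.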

EDIT --> Testing the solution suggested by chappjc

tic
xcomps = single(bsxfun(@ge,xy(:,1),int_x));
ycomps = single(bsxfun(@ge,xy(:,2),int_y));
count_again = xcomps.' * ycomps; % 144x144 = 144x1e5 * 1e5x144
count_again_fix = diff(diff(count_again')');
toc
% Elapsed time is 0.737546 seconds.
all(count_cells(:) == count_again_fix(:))

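【Comments】:

Possible duplicate of stackoverflow.com/questions/18639518/…
I am also checking stackoverflow.com/questions/16313949/… - I am not sure whether the same result can be obtained with hist3.
@LuisMendo - That is a very thorough answer to the other question, and it is rightly linked here. However, the other question was not specific and contained no code, so it was closed. I therefore think the question francesco asks here deserves a good attempt at a solution. Definitely +1 for your well-thought-out solution to the other question. Just my 2 cents. :)
@chappjc Yes, since the other question is closed, it makes sense to answer here.

Tags: matlab points plane probability-density accumarray

【Solution 1】:

I wrote a simple mex function that works well when N is large. It is cheating, of course, but still...

The function is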

#include "mex.h"

/* count_cells = joint_dist_points_2D(xy, m, n)
 * xy:          N-by-2 uint32 matrix of zero-based bin indices
 * m, n:        number of intervals along x and y
 * count_cells: (m+1)-by-(n+1) uint32 matrix of joint frequencies */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    size_t hh, ctrl;          /*  counters                       */
    size_t N, m, n;           /*  size of matrices               */
    const uint32_T *xy;       /*  data                           */
    uint32_T *count_cells;    /*  joint frequencies              */
    /*  matrices needed */
    mxArray *count_cellsArray;

    /*  Now we need to get the data. The pointers assume 32-bit elements,
        so anything that is not a uint32 matrix is rejected. */
    if (nrhs != 3 || !mxIsUint32(prhs[0])) {
        mexErrMsgIdAndTxt("joint_dist_points_2D:input",
                          "Usage: joint_dist_points_2D(uint32 xy, m, n)");
    }
    xy = (const uint32_T *) mxGetData(prhs[0]);
    N = mxGetM(prhs[0]);
    m = (size_t) mxGetScalar(prhs[1]);
    n = (size_t) mxGetScalar(prhs[2]);

    /*  Then build the matrices for the output (zero-initialized) */
    count_cellsArray = mxCreateNumericMatrix(m + 1, n + 1, mxUINT32_CLASS, mxREAL);
    count_cells = (uint32_T *) mxGetData(count_cellsArray);
    plhs[0] = count_cellsArray;

    /* for all points: column-major linear index of cell (xy[hh], xy[N + hh]) */
    for (hh = 0; hh < N; hh++) {
        ctrl = (m + 1) * xy[N + hh] + xy[hh];
        count_cells[ctrl] = count_cells[ctrl] + 1;
    }
}

It can be saved in a file "joint_dist_points_2D.c" and then compiled:

mex joint_dist_points_2D.c

Then check it:

% Data
N = 1e7;    % number of points
xy = rand(N, 2);    % coordinates of points
xy(randi(2*N, 1000, 1)) = 0;    % add some points on one side
xy(randi(2*N, 1000, 1)) = 1;    % add some points on the other side
xy(randi(N, 1000, 1), :) = 0;    % add some points on one corner
xy(randi(N, 1000, 1), :) = 1;    % add some points on one corner
inds= unique(randi(N, 1000, 1)); xy(inds, :) = repmat([0 1], numel(inds), 1);    % add some points on one corner
inds= unique(randi(N, 1000, 1)); xy(inds, :) = repmat([1 0], numel(inds), 1);    % add some points on one corner

% Intervals for rectangles
K1 = ceil(sqrt(N/5));    % number of intervals along x
K2 = ceil(sqrt(N/7));    % number of intervals along y
int_x = [0:(1 / K1):1, 1+eps];    % intervals along x
int_y = [0:(1 / K2):1, 1+eps];    % intervals along y

% Use Statistics Toolbox: hist3
tic
count_cells_hist = hist3(xy, 'Edges', {int_x int_y});
count_cells_hist(end, :) = []; count_cells_hist(:, end) = [];
toc
% Elapsed time is 4.414768 seconds.

% Use mex function
tic
xy2 = uint32(floor(xy ./ repmat([1 / K1, 1 / K2], N, 1)));
count_cells = joint_dist_points_2D(xy2, uint32(K1), uint32(K2));
toc
% Elapsed time is 0.586855 seconds.

% Check: the two solutions are equivalent
all(count_cells_hist(:) == count_cells(:))
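
For readers who want to see what the loop in the C function computes without compiling anything: the expression for ctrl is MATLAB's column-major linear index written out by hand. The sketch below repeats that arithmetic in plain MATLAB on made-up zero-based indices (m, n and xy2 here are toy values, not the data above), using accumarray for the counting step.

```matlab
% Toy grid (hypothetical): m = 2, n = 1, i.e. a 3-by-2 count matrix
m = 2; n = 1;
xy2 = [0 0; 2 1; 2 1; 1 0];     % zero-based bin indices, one row per point

% Same arithmetic as ctrl in the C loop, plus 1 for one-based indexing
lin = (m + 1) * xy2(:, 2) + xy2(:, 1) + 1;

% Count how often each linear index occurs, then restore the 2D shape
counts = reshape(accumarray(lin, 1, [(m + 1) * (n + 1), 1]), m + 1, n + 1);
disp(counts)                    % [1 0; 1 0; 0 2]
```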

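【Discussion】:

Nice contribution! But MEX is a bit of a cheat, yes. ;) Still, I used MEX files when building joint PDFs for my own research, so ultimately I agree that this is the way to go. However, for this N=1e7 test data my updated accumarray method takes 1.1 seconds on my PC, so it may be a good general-purpose alternative that needs no toolbox.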
I agree! I tested your accumarray solution, and it is fast even for N=3e7! Hats off!

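【Solution 2】:

Improving the code in the question

Your loops (and the nested dot products) can be eliminated with bsxfun and matrix multiplication, as follows: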

xcomps = bsxfun(@ge,xy(:,1),int_x);
ycomps = bsxfun(@ge,xy(:,2),int_y);
count_again = double(xcomps).'*double(ycomps); % 144x144 = 144x1e5 * 1e5x144
count_again_fix = diff(diff(count_again')');

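The multiplication step performs the AND and the summation done in sum(inds1 .* inds2), but without looping over the density matrix. EDIT: If you use single instead of double, execution time is nearly halved, but be sure to convert the result to double or whatever the rest of your code needs. On my computer this takes about 0.5 seconds.

Note: rot90(count_again/size(xy,1),2) gives a CDF, and rot90(count_again_fix/size(xy,1),2) gives a PDF.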
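
To make the multiplication trick concrete, here is a tiny worked example with three made-up points and two intervals per axis (the values and the name edges are toy assumptions). Entry (a,b) of the product counts the points with x >= edges(a) and y >= edges(b), and the double diff turns those cumulative counts into per-cell counts.

```matlab
xy = [0.1 0.2; 0.6 0.7; 0.9 0.1];       % three toy points
edges = [0 0.5 1+eps];                   % two intervals per axis

xcomps = bsxfun(@ge, xy(:,1), edges);    % 3x3 logical: x >= each edge
ycomps = bsxfun(@ge, xy(:,2), edges);    % 3x3 logical: y >= each edge

% S(a,b) = sum over points of (x >= edges(a)) AND (y >= edges(b))
S = double(xcomps).' * double(ycomps);   % [3 1 0; 2 1 0; 0 0 0]

% Differencing along both dimensions recovers the per-cell counts
counts = diff(diff(S')');                % [1 0; 1 1]
```

Reading off the result: the first point falls in cell (1,1), the second in cell (2,2) and the third in cell (2,1), which matches counts.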

Using accumarray

Another approach is to use accumarray to build the joint histogram after binning the data.

Starting with int_x, int_y, K1, xy, etc.:

% map (0,1) data onto [1 K1] and [1 K2], following A. Donda's approach for easy comparison
ii = floor(xy(:,1)*(K1-eps))+1; ii(ii<1) = 1; ii(ii>K1) = K1;
jj = floor(xy(:,2)*(K2-eps))+1; jj(jj<1) = 1; jj(jj>K2) = K2;

% create the histogram (fixed K1-by-K2 size) and normalize
H = accumarray([ii jj], 1, [K1 K2]);
PDF = H / size(xy,1); % for probabilities summing to 1

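On my computer this takes about 0.01 seconds.

The output is the same as A.Donda's after converting it from sparse to full (full(H)). Although, as A.Donda points out, the correct size is K1xK1 rather than the (K1+1)x(K1+1) of count_again_fix in the OP's code.

To get the CDF, I believe you can simply apply cumsum along each axis of the PDF.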
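
A minimal sketch of that cumsum idea, using a made-up 2-by-2 PDF rather than the histogram above:

```matlab
PDF = [0.1 0.2; 0.3 0.4];            % toy joint PDF, entries sum to 1

% Accumulate down the rows, then across the columns
CDF = cumsum(cumsum(PDF, 1), 2);     % approx. [0.1 0.3; 0.4 1.0]
```

The last entry CDF(end,end) should come out as 1 (up to rounding), which is a quick sanity check on the normalization.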

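【Discussion】:

+1 It works! Thanks! I was trying to do this with hist3.
Please note: I am not necessarily trying to solve the general joint probability distribution problem, but rather to change francesco's code to "do this more efficiently in terms of time and memory, and possibly avoiding loops". I think there is a fine line here, and it comes down to the scope and quality of the two questions. I'm going outside now. :p
If the Statistics Toolbox is available, hist3 seems to be the best option; otherwise the solution suggested by chappjc is the best alternative I have tested so far.
Your accumarray solution is really fast - it is comparable to my mex function! However, I also need the extreme values, 0 and 1, so I think the matrix should be of size (K1+1)x(K2+1) - given that I am using edges, not bins.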
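
If the extreme values need their own row and column, one option is to keep the accumarray idea but size the output (K1+1)x(K2+1), mirroring the floor-based binning used with the mex function. A minimal sketch with made-up points follows; note that floor(x*K1) can disagree with the edge vectors int_x and int_y for values that sit exactly on an interior edge, because of floating-point rounding.

```matlab
K1 = 2; K2 = 2;                            % toy number of intervals
xy = [0 0; 0.3 0.7; 1 1; 1 0.2];           % toy points, including x = 1 and y = 1

% One-based bin index; a coordinate equal to 1 maps to the extra bin K+1
ii = min(floor(xy(:,1) * K1), K1) + 1;     % [1; 1; 3; 3]
jj = min(floor(xy(:,2) * K2), K2) + 1;     % [1; 2; 3; 1]

% Fixed output size, so empty boundary bins still appear as zeros
H = accumarray([ii jj], 1, [K1 + 1, K2 + 1]);
disp(H)                                    % [1 1 0; 0 0 0; 1 0 1]
```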

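【Solution 3】:

chappjc's answer and using hist3 are both fine, but since I happened to want something like this a while ago and for some reason did not find hist3, I wrote my own, and I thought I would post it here as a bonus. It uses sparse to do the actual counting and returns the result as a sparse matrix, so it may be useful for multimodal distributions whose modes are far apart, or for anyone without the Statistics Toolbox.

Applied to francesco's data: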

K1 = ceil(sqrt(N/5));
[H, xs, ys] = hist2d(xy(:, 1), xy(:, 2), [K1 K1], [0, 1 + eps, 0, 1 + eps]);

Calling the function with output arguments only returns the result, without producing the color plot.

The function is as follows:

function [H, xs, ys] = hist2d(x, y, n, ax)

% plot 2d-histogram as an image
%
% hist2d(x, y, n, ax)
% [H, xs, ys] = hist2d(x, y, n, ax)
%
% x:    data for horizontal axis
% y:    data for vertical axis
% n:    how many bins to use for each axis, default is [100 100]
% ax:   axis limits for the plot, default is [min(x), max(x), min(y), max(y)]
% H:    2d-histogram as a sparse matrix, indices 1 & 2 correspond to x & y
% xs:   corresponding vector of x-values
% ys:   corresponding vector of y-values
%
% x and y have to be column vectors of the same size. Data points
% outside of the axis limits are allocated to the first or last bin,
% respectively. If output arguments are given, no plot is generated;
% it can be reproduced by "imagesc(xs, ys, H'); axis xy".


% defaults
if nargin < 3
    n = [100 100];
end
if nargin < 4
    ax = [min(x), max(x), min(y), max(y)];
end

% parameters
nx = n(1);
ny = n(2);
xl = ax(1 : 2);
yl = ax(3 : 4);

% generate histogram
i = floor((x - xl(1)) / diff(xl) * nx) + 1;
i(i < 1) = 1;
i(i > nx) = nx;
j = floor((y - yl(1)) / diff(yl) * ny) + 1;
j(j < 1) = 1;
j(j > ny) = ny;
H = sparse(i, j, ones(size(i)), nx, ny);

% generate axes
xs = (0.5 : nx) / nx * diff(xl) + xl(1);
ys = (0.5 : ny) / ny * diff(yl) + yl(1);

% possibly plot
if nargout == 0
    imagesc(xs, ys, H')
    axis xy
    clear H xs ys
end

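【Discussion】:

This function is great, but the result is not exactly the same - I guess the right-hand edge is handled differently. I am trying to work out whether it can be fixed accordingly.
Thanks! Maybe it is because "indices 1 and 2 correspond to y and x"? I did it that way because that is how imagesc wants its input, but maybe that was a bad idea. In that case, a transpose should fix it.
Also, your hist3 solution produces a 143 x 143 matrix while K1 = K2 = 142, and my function accordingly produces a 142 x 142 one.
@francesco, I changed my function to give output in the "natural" coordinate order. The remaining difference arises because hist3 with 'Edges' ignores data points that lie outside, whereas my function counts them in the margin bins. Its output is identical to that of hist3 when called like this: hist3(xy, 'Ctrs', {xs ys}), where xs and ys are the bin centers returned by my function. Thanks for pointing out these inconsistencies!
@A.Donda That is a nice approach. Rather than counting with sparse, MATLAB's accumarray is well suited to accumulating binned data like this. For completeness, I posted a second solution in my answer.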